Discussion of NoSQL concepts, products, and vendors.
- I’ve suggested in the past that multi-data-center capabilities are important for “data sovereignty”/geo-compliance.
- The need for geo-compliance just got a lot stronger, with the abolition of the European Union’s Safe Harbour rule for the US. If you collect data in multiple countries, you should be at least thinking about geo-compliance.
- Cassandra is an established leader in multi-data-center operation.
But when I made that connection and checked in accordingly with my client Patrick McFadin at DataStax, I discovered that I’d been a little confused about how multi-data-center Cassandra works. The basic idea holds water, but the details are not quite what I was envisioning.
The story starts:
- Cassandra groups nodes into logical “data centers” (i.e. token rings).
- As a best practice, each physical data center can contain one or more logical data center, but not vice-versa.
- There are two levels of replication — within a single logical data center, and between logical data centers.
- Replication within a single data center is planned in the usual way, with the principal data center holding a database likely to have a replication factor of 3.
- However, copies of the database held elsewhere may have different replication factors …
- … and can indeed have different replication factors for different parts of the database.
In particular, a remote replication factor for Cassandra can = 0. When that happens, then you have data sitting in one geographical location that is absent from another geographical location; i.e., you can be in compliance with laws forbidding the export of certain data. To be clear (and this contradicts what I previously believed and hence also implied in this blog):
- General multi-data-center operation is not what gives you geo-compliance, because the default case is that the whole database is replicated to each data center.
- Instead, you get that effect by tweaking your specific replication settings.
|Categories: Cassandra, Clustering, DataStax, HBase, NoSQL, Open source, Specific users, Surveillance and privacy||3 Comments|
Basho was on my (very short) blacklist of companies with whom I refuse to speak, because they have lied about the contents of previous conversations. But Tony Falco et al. are long gone from the company. So when Basho’s new management team reached out, I took the meeting.
- Basho management turned over significantly 1-2 years ago. The main survivors from the old team are 1 each in engineering, sales, and services.
- Basho moved its headquarters to Bellevue, WA. (You get one guess as to where the new CEO lives.) Engineering operations are very distributed geographically.
- Basho claims that it is much better at timely product shipments than it used to be. Its newest product has a planned (or at least hoped-for) 8-week cadence for point releases.
- Basho’s revenue is ~90% subscription.
- Basho claims >200 enterprise clients, vs. 100-120 when new management came in. Unfortunately, I forgot to ask the usual questions about divisions vs. whole organizations, OEM sell-through vs. direct, etc.
- Basho claims an average contract value of >$100K, typically over 2-3 years. $9 million of that (which would be close to half the total, actually), comes from 2 particular deals of >$4 million each.
Basho’s product line has gotten a bit confusing, but as best I understand things the story is:
- There’s something called Riak Core, which isn’t even a revenue-generating product. However, it’s an open source project with some big users (e.g. Goldman Sachs, Visa), and included in pretty much everything else Basho promotes.
- Riak KV is the key-value store previously known as Riak. It generates the lion’s share of Basho’s revenue.
- Riak S2 is an emulation of Amazon S3. Basho thinks that Riak KV loses efficiency when objects get bigger than 1 MB or so, and that’s when you might want to use Riak S2 in addition or instead.
- Riak TS is for time series, and just coming out now.
- Also in the mix are some (extra charge) connectors for Redis and Spark. Presumably, there are more of these to come.
- There’s an umbrella marketing term of “Basho Data Platform”.
Technical notes on some of that include: Read more
|Categories: Aerospike, Basho and Riak, Cassandra, Clustering, Couchbase, Databricks, Spark and BDAS, DataStax, HBase, Health care, Log analysis, MapR, Market share and customer counts, MongoDB, NoSQL, Pricing, Specific users, Splunk||Leave a Comment|
I last wrote about Couchbase in November, 2012, around the time of Couchbase 2.0. One of the many new features I mentioned then was secondary indexing. Ravi Mayuram just checked in to tell me about Couchbase 4.0. One of the important new features he mentioned was what I think he said was Couchbase’s “first version” of secondary indexing. Obviously, I’m confused.
Now that you’re duly warned, let me remind you of aspects of Couchbase timeline.
- 2 corporate name changes ago, Couchbase was organized to commercialize memcached. memcached, of course, was internet companies’ default way to scale out short-request processing before the rise of NoSQL, typically backed by manually sharded MySQL.
- Couchbase’s original value proposition, under the name Membase, was to provide persistence and of course support for memcached. This later grew into a caching-oriented pitch even to customers who weren’t already memcached users.
- A merger with the makers of CouchDB ensued, with the intention of replacing Membase’s SQLite back end with CouchDB at the same time as JSON support was introduced. This went badly.
- By now, however, Couchbase sells for more than distributed cache use cases. Ravi rattled off a variety of big-name customer examples for system-of-record kinds of use cases, especially in session logging (duh) and also in travel reservations.
- Couchbase 4.0 has been in beta for a few months.
Technical notes on Couchbase 4.0 — and related riffs — start: Read more
|Categories: Cache, Clustering, Couchbase, Data models and architecture, Databricks, Spark and BDAS, Exadata, Hadoop, MarkLogic, MongoDB, MySQL, NoSQL, Open source, Schema on need, Structured documents, Web analytics||1 Comment|
MongoDB isn’t the only company I reached out to recently for an update. Another is DataStax. I chatted mainly with Patrick McFadin, somebody with whom I’ve had strong consulting relationships at a user and vendor both. But Rachel Pedreschi contributed the marvelous phrase “twinkling dashboard”.
It seems fair to say that in most cases:
- Cassandra is adopted for operational applications, specifically ones with requirements for extreme uptime and/or extreme write speed. (Of course, it should also be the case that NoSQL data structures are a good fit.)
- Spark, including SparkSQL, and Solr are seen primarily as ways to navigate or analyze the resulting data.
Those generalities, in my opinion, make good technical sense. Even so, there are some edge cases or counterexamples, such as:
- DataStax trumpets British Gas‘ plans collecting a lot of sensor data and immediately offering it up for analysis.*
- Safeway uses Cassandra for a mobile part of its loyalty program, scoring customers and pushing coupons at them.
- A large title insurance company uses Cassandra-plus-Solr to manage a whole lot of documents.
*And so a gas company is doing lightweight analysis on boiler temperatures, which it regards as hot data.
While most of the specifics are different, I’d say similar things about MongoDB, Cassandra, or any other NoSQL DBMS that comes to mind: Read more
|Categories: Business intelligence, Cassandra, Databricks, Spark and BDAS, DataStax, NoSQL, Open source, Petabyte-scale data management, Predictive modeling and advanced analytics, Specific users, Text||6 Comments|
One pleasure in talking with my clients at MongoDB is that few things are NDA. So let’s start with some numbers:
- >2,000 named customers, the vast majority of which are unique organizations who do business with MongoDB directly.
- ~75,000 users of MongoDB Cloud Manager.
- Estimated ~1/4 million production users of MongoDB total.
Also >530 staff, and I think that number is a little out of date.
MongoDB lacks many capabilities RDBMS users take for granted. MongoDB 3.2, which I gather is slated for early November, narrows that gap, but only by a little. Features include:
- Some JOIN capabilities.
- Specifically, these are left outer joins, so they’re for lookup but not for filtering.
- JOINs are not restricted to specific shards of data …
- … but do benefit from data co-location when it occurs.
- A BI connector. Think of this as a MongoDB-to- SQL translator. Using this does require somebody to go in and map JSON schemas and relational tables to each other. Once that’s done, the flow is:
- Basic SQL comes in.
- Filters and GroupBys are pushed down to MongoDB. A result set … well, it results.
- The result set is formatted into a table and returned to the system — for example a business intelligence tool — that sent the SQL.
- Database-side document validation, in the form of field-specific rules that combine into a single expression against which to check a document.
- This is fairly simple stuff — no dependencies among fields in the same document, let alone foreign key relationships.
- MongoDB argues, persuasively, that this simplicity makes it unlikely to recreate the spaghetti code maintenance nightmare that was 1990s stored procedures.
- MongoDB concedes that, for performance, it will ordinarily be a good idea to still do your validation on the client side.
- MongoDB points out that enforcement can be either strict (throw errors) or relaxed (just note invalid documents to a log). The latter option is what makes it possible to install this feature without breaking your running system.
There’s also a closed-source database introspection tool coming, currently codenamed MongoDB Scout. Read more
|Categories: Business intelligence, EAI, EII, ETL, ELT, ETLT, Market share and customer counts, MongoDB, NoSQL, Open source, Structured documents, Text||6 Comments|
Let’s start with some terminology biases:
- I dislike the term “big data” but like the Vs that define it — Volume, Velocity, Variety and Variability.
- Though I think it’s silly, I understand why BI innovators flee from the term “business intelligence” (they’re afraid of not sounding new).
So when my clients at Zoomdata told me that they’re in the business of providing “the fastest visual analytics for big data”, I understood their choice, but rolled my eyes anyway. And then I immediately started to check how their strategy actually plays against the “big data” Vs.
It turns out that:
- Zoomdata does its processing server-side, which allows for load-balancing and scale-out. Scale-out and claims of great query speed are relevant when data is of high volume.
- Zoomdata depends heavily on Spark.
- Zoomdata’s UI assumes data can be a mix of historical and streaming, and that if looking at streaming data you might want to also check history. This addresses velocity.
- Zoomdata assumes data can be in a variety of data stores, including:
- Relational (operational RDBMS, analytic RDBMS, or SQL-on-Hadoop).
- Files (generic HDFS — Hadoop Distributed File System or S3).*
- NoSQL (MongoDB and HBase were mentioned).
- Search (Elasticsearch was mentioned among others).
- Zoomdata also tries to detect data variability.
- Zoomdata is OEM/embedding-friendly.
*The HDFS/S3 aspect seems to be a major part of Zoomdata’s current story.
Core aspects of Zoomdata’s technical strategy include: Read more
It’s difficult to project the rate of IT change in health care, because:
- Health care is suffused with technology — IT, medical device and biotech alike — and hence has the potential for rapid change. However, it is also the case that …
- … health care is heavily bureaucratic, political and regulated.
Timing aside, it is clear that health care change will be drastic. The IT part of that starts with vastly comprehensive electronic health records, which will be accessible (in part or whole as the case may be) by patients, care givers, care payers and researchers alike. I expect elements of such records to include:
- The human-generated part of what’s in ordinary paper health records today, but across a patient’s entire lifetime. This of course includes notes created by doctors and other care-givers.
- Large amounts of machine-generated data, including:
- The results of clinical tests. Continued innovation can be expected in testing, for reasons that include:
- Most tests exploit electronic technology. Progress in electronics is intense.
- Biomedical research is itself intense.
- In particular, most research technologies (for example gene sequencing) can be made cheap enough over time to be affordable clinically.
- The output of consumer health-monitoring devices — e.g. Fitbit and its successors. The buzzword here is “quantified self”, but what it boils down to is that every moment of our lives will be measured and recorded.
- The results of clinical tests. Continued innovation can be expected in testing, for reasons that include:
These vastly greater amounts of data cited above will allow for greatly changed analytics.
I hear much discussion of shortfalls in analytic technology, especially from companies that want to fill in the gaps. But how much do these gaps actually matter? In many cases, that depends on what the analytic technology is being used for. So let’s think about some different kinds of analytic task, and where they each might most stress today’s available technology.
In separating out the task areas, I’ll focus first on the spectrum “To what extent is this supposed to produce novel insights?” and second on the dimension “To what extent is this supposed to be integrated into a production/operational system?” Issues of latency, algorithmic novelty, etc. can follow after those. In particular, let’s consider the tasks: Read more
|Categories: Business intelligence, Data warehousing, Databricks, Spark and BDAS, Hadoop, Netezza, NoSQL, Predictive modeling and advanced analytics, Tableau Software||1 Comment|
1. Continuing from last week’s HBase post, the Cloudera folks were fairly proud of HBase’s features for performance and scalability. Indeed, they suggested that use cases which were a good technical match for HBase were those that required fast random reads and writes with high concurrency and strict consistency. Some of the HBase architecture for query performance seems to be:
- Everything is stored in sorted files. (I didn’t probe as to what exactly the files were sorted on.)
- Files have indexes and optional Bloom filters.
- Files are marked with min/max field values and time stamp ranges, which helps with data skipping.
Notwithstanding that a couple of those features sound like they might help with analytic queries, the base expectation is that you’ll periodically massage your HBase data into a more analytically-oriented form. For example — I was talking with Cloudera after all — you could put it into Parquet.
2. The discussion of which kinds of data are originally put into HBase was a bit confusing.
- HBase is commonly used to receive machine-generated data. Everybody knows that.
- Cloudera drew a distinction between:
- Straightforward time series, which should probably just go into HDFS (Hadoop Distributed File System) rather than HBase.
- Data that is bucketed by entity, which likely should go into HBase. Examples of entities are specific users or devices.
- Cloudera also reminded me that OpenTSDB, a popular time series data store, runs over HBase.
OpenTSDB, by the way, likes to store detailed data and aggregates side-by-side, which resembles a pattern I discussed in my recent BI for NoSQL post.
3. HBase supports caching, tiered storage, and so on. Cloudera is pretty sure that it is publicly known (I presume from blog posts or conference talks) that: Read more
|Categories: Cloudera, eBay, Facebook, Hadoop, HBase, Market share and customer counts, NoSQL, Open source, Petabyte-scale data management, Specific users, Yahoo||4 Comments|
Over the past couple years, there have been various quick comments and vague press releases about “BI for NoSQL”. I’ve had trouble, however, imagining what it could amount to that was particularly interesting, with my confusion boiling down to “Just what are you aggregating over what?” Recently I raised the subject with a few leading NoSQL companies. The result is that my confusion was expanded. Here’s the small amount that I have actually figured out.
As I noted in a recent post about data models, many databases — in particular SQL and NoSQL ones — can be viewed as collections of <name, value> pairs.
- In a relational database, a record is a collection of <name, value> pairs with a particular and predictable — i.e. derived from the table definition — sequence of names. Further, a record usually has an identifying key (commonly one of the first values).
- Something similar can be said about structured-document stores — i.e. JSON or XML — except that the sequence of names may not be consistent from one document to the next. Further, there’s commonly a hierarchical relationship among the names.
- For these purposes, a “wide-column” NoSQL store like Cassandra or HBase can be viewed much as a structured-document store, albeit with different performance optimizations and characteristics and a different flavor of DML (Data Manipulation Language).
Consequently, a NoSQL database can often be viewed as a table or a collection of tables, except that:
- The NoSQL database is likely to have more null values.
- The NoSQL database, in a naive translation toward relational, may have repeated values. So a less naive translation might require extra tables.
That’s all straightforward to deal with if you’re willing to write scripts to extract the NoSQL data and transform or aggregate it as needed. But things get tricky when you try to insist on some kind of point-and-click. And by the way, that last comment pertains to BI and ETL (Extract/Transform/Load) alike. Indeed, multiple people I talked with on this subject conflated BI and ETL, and they were probably right to do so.
|Categories: Business intelligence, Cassandra, EAI, EII, ETL, ELT, ETLT, HBase, MongoDB, NoSQL, Structured documents||6 Comments|