Petabyte-scale data management – DBMS 2 : DataBase Management System Services

Interana

Curt Monash — Mon, 17 Apr 2017 10:10:41 +0000

Interana has an interesting story, in technology and business model alike. For starters:

Interana does ad-hoc event series analytics, which they call “interactive behavioral analytics solutions”.
Interana has a full-stack analytic offering, including:
- Its own columnar DBMS …
- … which has a non-SQL DML (Data Manipulation Language) meant to handle event series a lot more fluently than SQL does, but which the user is never expected to learn because …
- … there also are BI-like visual analytics tools that support plenty of drilldown.
Interana sells all this to “product” departments rather than marketing, because marketing doesn’t sufficiently value Interana’s ad-hoc query flexibility.
Interana boasts >40 customers, with annual subscription fees ranging from high 5 figures to low 7 figures.

And to be clear — if we leave aside any questions of marketing-name sizzle, this really is business intelligence. The closest Interana comes to helping with predictive modeling is giving its ad-hoc users inspiration as to where they should focus their modeling attention.

Interana also has an interesting twist in its business model, which I hope can be used successfully by other enterprise software startups as well.

For now, at no extra charge, Interana will operate its software for you as a managed service. (A majority of Interana’s clients run the software on Amazon or Azure, where that kind of offering makes sense.)
However, presumably in connection with greater confidence in its software’s ease of administration, Interana will move this year toward unbundling the service as an extra-charge offering on top of the software itself.

The key to understanding Interana is its DML. Notes on that include:

Interana’s DML is focused on path analytics …
- … but Interana doesn’t like to use that phrase because it sounds too math-y and difficult.
- Interana may be the first company that’s ever told me it’s focused on providing a better nPath.
Primitives in Interana’s language — notwithstanding the company’s claim that it never ever intended to sell to marketing departments — include familiar web analytics concepts such as “session”, “funnel” and so on. (However, these are being renamed to more neutral terms such as “flow” in an upcoming version of the product.)
As typical example questions or analytic subjects, Interana offered:
- “Which are the most common products in shopping carts where time-to-checkout was greater than 30 minutes?”
- Exactly which steps in the onboarding process result in the greatest user frustration?
The Interana folks and I agree that Splunk is the most recent example of a new DML kicking off a significant company.
The most recent example I can think of in which a vendor hung its hat on a new DML that was a “visual programming language” is StreamBase, with EventFlow. That didn’t go all that well.
To use Founder/CTO Bobby Johnson’s summary term, the real goal of the Interana language is to describe a state machine, specifically one that produces (sets of) sequences of events (and the elapsed time between them).
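
To make the state-machine idea concrete, here is a minimal sketch in plain Python. It is not Interana's DML, which I haven't seen in detail; the `Event` type, the `match_funnel` function and the step names are all hypothetical. Each user carries a tiny state (which step is expected next), and the output is the sequence of elapsed times between matched steps.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Event:
    user: str
    name: str
    ts: float  # seconds since an arbitrary epoch (assumption)


def match_funnel(events: Iterable[Event], steps: List[str]) -> Dict[str, List[float]]:
    """For each user who hit every step in order, return the elapsed
    seconds between consecutive steps."""
    # Per-user state: (index of the next expected step, timestamps matched so far).
    state: Dict[str, Tuple[int, List[float]]] = {}
    for ev in sorted(events, key=lambda e: e.ts):
        idx, stamps = state.get(ev.user, (0, []))
        # Advance the user's state machine only when the expected step arrives.
        if idx < len(steps) and ev.name == steps[idx]:
            state[ev.user] = (idx + 1, stamps + [ev.ts])
    result: Dict[str, List[float]] = {}
    for user, (idx, stamps) in state.items():
        if idx == len(steps):  # reached the final state
            result[user] = [b - a for a, b in zip(stamps, stamps[1:])]
    return result


events = [
    Event("u1", "view", 0), Event("u1", "add_to_cart", 60),
    Event("u2", "view", 10), Event("u2", "checkout", 50),
    Event("u1", "checkout", 2000),
]
flows = match_funnel(events, ["view", "add_to_cart", "checkout"])
# Users whose total time-to-checkout exceeded 30 minutes.
slow_users = [u for u, gaps in flows.items() if sum(gaps) > 30 * 60]
```

A question like the shopping-cart one above then becomes a filter over the elapsed times that the state machine emits.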

Notes on Interana speeds & feeds include:

Interana only promises data freshness up to micro-batch latencies — i.e., a few minutes. (Obviously, this shuts them out of most networking monitoring and devops use cases.)
Interana thinks it’s very important for query response time to max out at a low number of seconds. If necessary, the software will return approximate results rather than exact ones so as to meet this standard.
Interana installations and workloads to date have gotten as large as:
- 1-200 nodes.
- Trillions of rows, equating to 100s of TBs of data after compression/ >1 PB uncompressed.
- Billions of rows/events received per day.
- 100s of 1000s of (very sparse) columns.
- 1000s of named users.
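
The approximate-results point can be illustrated with a small sketch. This is generic uniform random sampling in Python, not a description of how Interana actually samples; `approximate_count` and its parameters are hypothetical names. The idea is to scan only a sample, scale the answer up, and report a rough error bar alongside it.

```python
import random
from typing import Callable, Sequence, Tuple


def approximate_count(
    rows: Sequence, predicate: Callable[[object], bool],
    sample_size: int, seed: int = 0,
) -> Tuple[float, float]:
    """Estimate how many rows satisfy `predicate` from a uniform random sample.

    Returns (estimate, approximate standard error of the estimate)."""
    n = len(rows)
    if sample_size >= n:
        # Small enough to answer exactly.
        return float(sum(1 for r in rows if predicate(r))), 0.0
    rng = random.Random(seed)
    sample = rng.sample(rows, sample_size)
    p = sum(1 for r in sample if predicate(r)) / sample_size
    # Binomial standard error, scaled to the full row count. This ignores the
    # finite-population correction, so it slightly overstates the error.
    stderr = n * (p * (1 - p) / sample_size) ** 0.5
    return n * p, stderr


rows = list(range(100_000))
estimate, stderr = approximate_count(rows, lambda x: x % 10 == 0, sample_size=5_000)
```

The trade-off is the one described above: the sample size bounds the work per query, so response time stays low at the cost of an error bar on the result.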

Although Interana’s original design point was spinning disk, most customers store their Interana data on flash.

Interana architecture choices include:

They’re serious about micro-batching.
- If the user’s data is naturally micro-batched — e.g. a new S3 bucket every few minutes — Interana works with that.
- Even if the customer’s data is streamed — e.g. via Kafka — Interana insists on micro-batching it.
They’re casual about schemas.
- Interana assumes data arrives with some kind of recognizable structure, via JSON, CSV or whatever.
  - Interana observes, correctly, that log data often is decently structured.
    - For example, if you’re receiving “phone home” pings from products you originally manufactured, you know what data structures to expect.
    - Interana calls this “logging with intent”.
  - Interana is fine with a certain amount of JSON (for example) schema change over time.
  - If your arriving data truly is a mess, then you need to calm it down via a pass through Splunk or whatever before sending it to Interana.
- JSON hierarchies turn into multi-part column names in the usual way.
- Interana supports one level of true nesting, and one level only; column values can be “lists”, but list values can’t be lists themselves.
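
A minimal sketch of that flattening, under stated assumptions: the `.` separator and the `flatten` function are my own illustration rather than Interana's actual naming scheme, and rejecting dicts inside lists is an extra assumption beyond the "no lists of lists" rule.

```python
import json
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn a JSON hierarchy into multi-part column names.

    Lists of scalars are kept as list-valued columns (one level of nesting)."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))  # nested object -> dotted names
        elif isinstance(value, list):
            if any(isinstance(item, (list, dict)) for item in value):
                raise ValueError(f"column {name!r}: list elements must be scalars")
            flat[name] = list(value)
        else:
            flat[name] = value
    return flat


event = json.loads('{"user": {"id": 7, "geo": {"country": "US"}}, "tags": ["a", "b"]}')
columns = flatten(event)
```

Since every event may carry a different subset of keys, a scheme like this naturally yields the very wide, very sparse column sets mentioned earlier.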

Finally, other Interana tech notes include:

Compression is a central design consideration …
- … especially but not only compression algorithms designed to deal with great sparseness, such as run-length encoding (RLE).
- Dictionary compression, in a strategy that is rarer than I once expected it to be, uses a global rather than shard-by-shard dictionary. The data Interana expects is of low-enough cardinality for this to be the better choice.
- Column data is sorted. A big part of the reason is of course to aid compression.
- Compression strategies are chosen automatically for each segment. Wholly automatically, I gather; you can’t tune the choice manually.
As you would think, Interana technically includes multiple data stores.
- Data first hits a write-optimized store. Unlike the case of Vertica, this WOS never is involved in answering queries.
- Asynchronously, the data is broken into columns, and banged to “disk”.
- Asynchronously again, the data is sorted.
- Queries run against sorted data, sorting recent blocks on-the-fly if necessary.
Interana lets you shard different replicas of the data according to different shard keys.
Interana is proud of the random sampling it does when serving approximate query results.
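
To show why sorting and a global dictionary help compression, here is a toy sketch. It is generic dictionary encoding plus run-length encoding in Python, not Interana's storage format; the function names and the two-shard setup are hypothetical.

```python
from itertools import groupby
from typing import Dict, Iterable, List, Tuple


def dictionary_encode(values: Iterable[str], dictionary: Dict[str, int]) -> List[int]:
    """Replace each value with a small integer code, growing the shared
    (global) dictionary whenever a new value appears."""
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return codes


def rle_encode(codes: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]


def rle_decode(runs: Iterable[Tuple[int, int]]) -> List[int]:
    return [code for code, length in runs for _ in range(length)]


global_dict: Dict[str, int] = {}
# Sorting each shard's column first makes the runs as long as possible.
shard_a = rle_encode(dictionary_encode(sorted(["US", "US", "DE", "US", "DE"]), global_dict))
shard_b = rle_encode(dictionary_encode(sorted(["US", "FR", "US"]), global_dict))
```

Because both shards share one dictionary, the code for "US" means the same thing everywhere, so results from different shards can be combined without translating codes. That only stays cheap while cardinality is low, which matches the assumption noted above.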

CDH 5.5

Curt Monash — Thu, 19 Nov 2015 11:52:01 +0000

I talked with Cloudera shortly ahead of today’s announcement of Cloudera 5.5. Much of what we talked about had something or other to do with SQL data management. Highlights include:

Impala and Kudu are being donated to Apache. This actually was already announced Tuesday. (Due to Apache’s rules, if I had any discussion with Cloudera speculating on the likelihood of Apache accepting the donations, I would not be free to relay it.)
Cloudera is introducing SQL extensions so that Impala can query nested data structures. More on that below.
- The basic idea for the nested datatype support is that there are SQL extensions with a “dot” notation to let you get at the specific columns you need.
- From a feature standpoint, we’re definitely still in the early days.
  - When I asked about indexes on these quasi-columns, I gathered that they’re not present in beta but are hoped for by the time of general availability.
  - Basic data skipping, also absent in beta, seems to be more confidently expected in GA.
- This is for Parquet first, Avro next, and presumably eventually native JSON as well.
- This is said to be Dremel-like, at least in the case of Parquet. I must confess that I’m not familiar enough with Apache Drill to compare the two efforts.
Cloudera is increasing its coverage of Spark in several ways.
- Cloudera is adding support for MLlib.
- Cloudera is adding support for SparkSQL. More on that below.
- Cloudera is adding support for Spark going against S3. The short answer to “How is this different from the Databricks service?” is:
  - More “platform” stuff from the Hadoop stack (e.g. for data ingest).
  - Less in the way of specific Spark usability stuff.
- Cloudera is putting into beta what it got in the Xplain.io acquisition, which it unfortunately is naming Cloudera Navigator Optimizer. More on that in a separate post.
- Impala and Hive are getting column-level security via Apache Sentry.
- There are other security enhancements.
- Some policy-based information lifecycle management is being added as well.

While I had Cloudera on the phone, I asked a few questions about Impala adoption, specifically focused on concurrency. There was mention of:

Petabyte scale databases — at least one clear case for Impala/business intelligence only, and the likelihood that the Impala/BI part of other bigger installations was also in that range.
Hundreds of nodes.
10s of simultaneous queries in dashboard use cases.
1 – 3 million queries/month as a common figure.

Cloudera also expressed the opinions that:

An “overwhelming majority” of Cloudera customers have adopted Impala. (I imagine there’s a bit of hyperbole in that — for one thing, Cloudera has a pricing option in which Impala is not included.)
It is common for Impala customers to use Hive for “data preparation”.
SparkSQL has “order of magnitude” less performance than Impala, but a little more performance than Hive running over either Spark or Tez.
SparkSQL’s main use cases are (and these overlap heavily):
- As part of an analytic process (as opposed to straightforwardly DBMS-like use).
- To persist data outside the confines of a single Spark job.

Rocana’s world

Curt Monash — Thu, 17 Sep 2015 11:49:21 +0000

For starters:

My client Rocana is the renamed ScalingData, where Rocana is meant to signify ROot Cause ANAlysis.
Rocana was founded by Omer Trajman, who I’ve referenced numerous times in the past, and who I gather is a former boss of …
… cofounder Eric Sammer.
Rocana recently told me it had 35 people.
Rocana has a very small number of quite large customers.

Rocana portrays itself as offering next-generation IT operations monitoring software. As you might expect, this has two main use cases:

Actual operations — figuring out exactly what isn’t working, ASAP.
Security.

Rocana’s differentiation claims boil down to fast and accurate anomaly detection on large amounts of log data, including but not limited to:

The sort of network data you’d generally think of — “everything” except packet-inspection stuff.
Firewall output.
Database server logs.
Point-of-sale data (at a retailer).
“Application data”, whatever that means. (Edit: See Tom Yates’ clarifying comment below.)

In line with segment leader Splunk’s pricing, data volumes in this area tend to be described in terms of new data/day. Rocana seems to start around 3 TB/day, which not coincidentally is a range that would generally be thought of as:

Challenging for Splunk, and for the budgets of Splunk customers.
Not a big problem for well-implemented Hadoop.

And so part of Rocana’s pitch, familiar to followers of analytic RDBMS and Hadoop alike, is “We keep and use all your data, unlike the legacy guys who make you throw some of it away up front.”

Since Rocana wants you to keep all your data, 3 TB/day is about 1 PB/year.
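
The arithmetic is simple enough to check in a few lines. The function below is a hypothetical helper; the replication and compression parameters are my own additions for what-if purposes (3 is the HDFS default replication factor), not figures from Rocana.

```python
def yearly_petabytes(tb_per_day: float, replication: int = 1,
                     compression_ratio: float = 1.0, days: int = 365) -> float:
    """Storage consumed after one year of keeping everything, in decimal PB."""
    raw_tb = tb_per_day * days
    return raw_tb * replication / compression_ratio / 1000


raw = yearly_petabytes(3)                      # 3 * 365 / 1000 = 1.095 PB of raw data
replicated = yearly_petabytes(3, replication=3)  # 3.285 PB if stored three times, uncompressed
```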

But really, that’s just saying that Rocana is an analytic stack built on Hadoop, using Hadoop for what people correctly think it’s well-suited for, done by guys who know a lot about Hadoop.

The cooler side of Rocana, to my tastes, is the actual analytics. Truth be told, I find almost any well thought out event-series analytics story cool. It’s an area much less mature than relational business intelligence, and accordingly with much more scope for innovation. On the visualization side, crucial aspects start:

Charting over time (duh).
Comparing widely disparate time intervals (e.g., current vs. historical/baseline).
Whichever good features from relational BI apply to your use case as well.

Other important elements may be more data- or application-specific — and the fact that I don’t have a long list of particulars illustrates just how immature the area really is.

Even cooler is Rocana’s integration of predictive modeling and BI, about which I previously remarked:

The idea goes something like this:

Suppose we have lots of logs about lots of things. Machine learning can help:

Notice what’s an anomaly.

Group together things that seem to be experiencing similar anomalies.

That can inform a BI-plus interface for a human to figure out what is happening.

Makes sense to me.

So far as I can tell, predictive modeling is used to notice aberrant data (raw or derived). This is quickly used to define a subset of data to drill down to (e.g., certain kinds of information from certain machines in a certain period of time). Event-series BI/visualization then lets you see the flows that led to the aberrant result, which with any luck will allow you to find the exact place where the data first goes wrong. And that, one hopes, is something that the ops guys can quickly fix.
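
As a rough illustration of the notice-then-drill-down flow, here is a Python sketch. It uses a simple per-host z-score, which is my stand-in and not Rocana's actual modeling; `Sample`, `find_anomalies` and `group_by_minute` are hypothetical names, and grouping by minute is only a crude proxy for "similar anomalies".

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, List, NamedTuple


class Sample(NamedTuple):
    host: str
    minute: int
    errors: int


def find_anomalies(baseline: Iterable[Sample], current: Iterable[Sample],
                   threshold: float = 3.0) -> List[Sample]:
    """Flag current samples whose error count sits more than `threshold`
    standard deviations above that host's baseline mean."""
    history: Dict[str, List[int]] = defaultdict(list)
    for s in baseline:
        history[s.host].append(s.errors)
    flagged = []
    for s in current:
        past = history.get(s.host)
        if not past or len(past) < 2:
            continue  # not enough history to judge this host
        mu = mean(past)
        sigma = pstdev(past) or 1.0  # assumption: treat a flat baseline as unit spread
        if (s.errors - mu) / sigma > threshold:
            flagged.append(s)
    return flagged


def group_by_minute(flagged: Iterable[Sample]) -> Dict[int, List[str]]:
    """Group hosts that went anomalous at the same time, as a drill-down subset."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for s in flagged:
        groups[s.minute].append(s.host)
    return {minute: sorted(hosts) for minute, hosts in groups.items()}


baseline = (
    [Sample("web1", m, e) for m, e in enumerate([2, 3, 2, 3])]
    + [Sample("web2", m, e) for m, e in enumerate([1, 2, 1, 2])]
    + [Sample("db1", m, e) for m, e in enumerate([0, 1, 0, 1])]
)
current = [Sample("web1", 10, 40), Sample("web2", 10, 35), Sample("db1", 10, 1)]
drill_down = group_by_minute(find_anomalies(baseline, current))
```

The output is exactly the kind of subset described above (certain machines, certain period of time), which a human can then explore in the BI layer.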

I think similar approaches could make sense in numerous application segments.

Related links

Rocana’s Hadoop stack presumably includes both Kafka and Spark Streaming.
Back when Splunk still answered my email, I wrote about its inverted-list data management architecture.
Ursula Le Guin’s debut novel Rocannon’s World has nothing to do with this post (although it does start with a really lousy bit of temporal analysis). I just like making allusions to her work.

DataStax and Cassandra update

Curt Monash — Mon, 14 Sep 2015 06:02:59 +0000

MongoDB isn’t the only company I reached out to recently for an update. Another is DataStax. I chatted mainly with Patrick McFadin, somebody with whom I’ve had strong consulting relationships at both a user and a vendor. But Rachel Pedreschi contributed the marvelous phrase “twinkling dashboard”.

It seems fair to say that in most cases:

Cassandra is adopted for operational applications, specifically ones with requirements for extreme uptime and/or extreme write speed. (Of course, it should also be the case that NoSQL data structures are a good fit.)
Spark, including SparkSQL, and Solr are seen primarily as ways to navigate or analyze the resulting data.

Those generalities, in my opinion, make good technical sense. Even so, there are some edge cases or counterexamples, such as:

DataStax trumpets British Gas’ plans for collecting a lot of sensor data and immediately offering it up for analysis.*
Safeway uses Cassandra for a mobile part of its loyalty program, scoring customers and pushing coupons at them.
A large title insurance company uses Cassandra-plus-Solr to manage a whole lot of documents.

*And so a gas company is doing lightweight analysis on boiler temperatures, which it regards as hot data.

While most of the specifics are different, I’d say similar things about MongoDB, Cassandra, or any other NoSQL DBMS that comes to mind:

You can get any kind of data into them very fast; indeed, that’s a central part of what they were designed for.
In the general case, getting it back out for low-latency analytics is problematic …
… but there’s an increasing list of exceptions.

For DataStax Enterprise, exceptions start:

Formally, you can do almost anything in at least one of Solr or Spark/SparkSQL. So if volumes are low enough, you’re fine. In particular, Spark offers the potential to do many things at in-memory speeds.
Between Spark, the new functions, and general scripting, there are several ways to do low-latency aggregations. This can lead to “twinkling dashboards”.*
DataStax is alert to the need to stream data into Cassandra.
- That’s central to the NoSQL expectation of ingesting internet data very quickly.
- Kafka, Storm and Spark Streaming all seem to be in the mix.
Solr over Cassandra has a searchable RAM buffer, which can give the effect of real-time text indexing within a second or so of ingest.

*As much as I love the “twinkling dashboard” term — it reminds me of my stock analyst days — it does raise some concerns. In many use cases, human real-time BI should be closely integrated with the more historical kind.

DataStax Enterprise:

Is based on Cassandra 2.1.
Will probably never include Cassandra 2.2, waiting instead for …
… Cassandra 3.0, which will feature a storage engine rewrite …
… and will surely include Cassandra 2.2 features of note.

This connects to what I said previously in that Cassandra 2.2 adds some analytic features, specifically in the area of user-defined functions. Notes on Cassandra 2.2 UDFs include:

These are functions — not libraries, a programming language, or anything like that.
The “user-defined” moniker notwithstanding, the capability has been used to implement COUNT, SUM, AVG, MAX and so on.
You are meant to run user-defined functions on data in a single Cassandra partition; run them across partitions at your own performance risk.
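
A sketch of the underlying model may help. As I understand it, a Cassandra user-defined aggregate combines a state function, an initial condition and an optional final function; the Python below mimics that shape and is not CQL. The table layout and all names are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Toy "table": partition key -> rows in that partition.
Table = Dict[str, List[Dict[str, float]]]


def avg_state(state: Tuple[int, float], value: float) -> Tuple[int, float]:
    """State function: fold one value into a running (count, total)."""
    count, total = state
    return count + 1, total + value


def avg_final(state: Tuple[int, float]) -> Optional[float]:
    """Final function: turn the accumulated state into the result."""
    count, total = state
    return total / count if count else None


def aggregate_partition(table: Table, partition_key: str, column: str,
                        sfunc: Callable[[Any, float], Any], initcond: Any,
                        finalfunc: Callable[[Any], Any] = lambda s: s) -> Any:
    """Run an aggregate over the rows of a single partition only."""
    state = initcond
    for row in table.get(partition_key, []):
        state = sfunc(state, row[column])
    return finalfunc(state)


readings: Table = {
    "sensor-1": [{"temp": 20.0}, {"temp": 22.0}, {"temp": 24.0}],
    "sensor-2": [{"temp": 5.0}],
}
avg_temp = aggregate_partition(readings, "sensor-1", "temp", avg_state, (0, 0.0), avg_final)
max_temp = aggregate_partition(
    readings, "sensor-1", "temp",
    lambda s, v: v if s is None or v > s else s, None,
)
```

Restricting the loop to one partition is the cheap case; doing the same across partitions means touching many nodes, which is where the performance risk comes from.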

And finally, some general tidbits:

A while ago, Apple said it had >75,000 Cassandra nodes. The figure is surely bigger now.
There are at least several other petabyte range Cassandra installations, and several more half-petabyte ones.
Netflix is not one of those. Instead, it has many 10s of smaller Cassandra clusters.
There are Cassandra users with >1 million reads+writes per second.

Finally a couple of random notes:

One of the text search use cases for Solr/Cassandra is to — in one query — get at information that originated in multiple places, e.g. for reasons of time period or geography. (I hear this about text search across lots of database technologies, relational and non-relational alike.)
As big a change as Cassandra 3.0 will be, it will not require that you take down your applications for an upgrade. That hasn’t been necessary since Cassandra 0.7.

Teradata will support Presto

Curt Monash — Mon, 08 Jun 2015 09:32:16 +0000

At the highest level:

Presto is, roughly speaking, Facebook’s replacement for Hive, at least for queries that are supposed to run at interactive speeds.
Teradata is announcing support for Presto with a classic open source pricing model.
Presto will also become, roughly speaking, Teradata’s replacement for Hive.
Teradata’s Presto efforts are being conducted by the former Hadapt.

Now let’s make that all a little more precise.

Regarding Presto (and I got most of this from Teradata):

To a first approximation, Presto is just another way to write SQL queries against HDFS (Hadoop Distributed File System). However …
… Presto queries other data stores too, such as various kinds of RDBMS, and federates query results.
Facebook at various points in time created both Hive and now Presto.
Facebook started the Presto project in 2012 and now has 10 engineers on it.
Teradata has named 16 engineers – all from Hadapt – who will be contributing to Presto.
Known serious users of Presto include Facebook, Netflix, Groupon and Airbnb. Airbnb likes Presto well enough to have 1/3 of its employees using it, via an Airbnb-developed tool called Airpal.
Facebook is known to have a cluster cited at 300 petabytes and 4000 users where Presto is presumed to be a principal part of the workload.

Daniel Abadi said that Presto satisfies what he sees as some core architectural requirements for a modern parallel analytic RDBMS project:

Data is pipelined between operators, with no gratuitous writing to disk the way you might have in something MapReduce-based. This is different from the sense of “pipelining” in which one query might keep an intermediate result set hanging around because another query is known to need those results as well.
Presto processing is vectorized; functions don’t need to be re-invoked a tuple at a time. This is different from the sense of vectorization in which several tuples are processed at once, exploiting SIMD (Single Instruction Multiple Data). Dan thinks SIMD is useful mainly for column stores, and Presto tries to be storage-architecture-agnostic.
Presto query operators and hence query plans are dynamically compiled, down to byte code.
Although it is generally written in Java, Presto uses direct memory management rather than relying on what Java provides. Dan believes that, despite being written in Java, Presto performs as if it were written in C.
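
The first two points can be sketched with Python generators. This is purely illustrative and says nothing about Presto's actual Java operators or its bytecode generation; `scan`, `filter_op` and `sum_op` are hypothetical names. Each operator is invoked once per batch rather than once per tuple, and batches flow from one operator to the next without being materialized in between.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]
Batch = List[Row]


def scan(rows: List[Row], batch_size: int) -> Iterator[Batch]:
    """Source operator: hand out rows one batch at a time."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]


def filter_op(batches: Iterable[Batch], predicate: Callable[[Row], bool]) -> Iterator[Batch]:
    """Filter operator: called per batch, streams survivors downstream."""
    for batch in batches:
        kept = [row for row in batch if predicate(row)]
        if kept:
            yield kept


def sum_op(batches: Iterable[Batch], column: str) -> int:
    """Terminal operator: consumes batches as they arrive."""
    total = 0
    for batch in batches:
        total += sum(row[column] for row in batch)
    return total


rows = [{"region": "EU" if i % 2 == 0 else "US", "amount": i} for i in range(10)]
# No intermediate result set is written anywhere; each batch is pulled through the chain.
eu_total = sum_op(filter_op(scan(rows, batch_size=4), lambda r: r["region"] == "EU"), "amount")
```

The flip side, as noted for long-running jobs, is that nothing is persisted mid-query, so a failure means starting over.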

More precisely, this is a checklist for interactive-speed parallel SQL. There are some query jobs long enough that Dan thinks you need the fault-tolerance obtained from writing intermediate results to disk, à la HadoopDB (which was of course the MapReduce-based predecessor to Hadapt).

That said, Presto is a newish database technology effort, there’s lots of stuff missing from it, and there still will be lots of stuff missing from Presto years from now. Teradata has announced contribution plans to Presto for, give or take, the next year, in three phases:

Phase 1 (released immediately, and hence in particular already done):
- An installer.
- More documentation, especially around installation.
- Command-line monitoring and management.
Phase 2 (later in 2015)
- Integrations with YARN, Ambari and soon thereafter Cloudera Manager.
- Expanded SQL coverage.
Phase 3 (some time in 2016)
- An ODBC driver, which is of course essential for business intelligence tool connectivity.
- Other connectors (e.g. more targets for query federation).
- Security.
- Further SQL coverage.

Absent from any specific plans that were disclosed to me was anything about optimization or other performance hacks, and anything about workload management beyond what can be gotten from YARN. I also suspect that much SQL coverage will still be lacking after Phase 3.

Teradata’s basic business model for Presto is:

Teradata is selling subscriptions, for which the principal benefit is support.
Teradata reserves the right to make some of its Presto-enhancing code subscription-only, but has no immediate plans to do so.
Teradata being Teradata, it would love to sell you Presto-related professional services. But you’re absolutely welcome to consume Presto on the basis of license-plus-routine-support-only.

And of course Presto is usurping Hive’s role wherever that makes sense in Teradata’s data connectivity story, e.g. Teradata QueryGrid.

Finally, since I was on the phone with Justin Borgman and Dan Abadi, discussing a project that involved 16 former Hadapt engineers, I asked about Hadapt’s status. That may be summarized as:

There are currently no new Hadapt sales.
Only a few large Hadapt customers are still being supported by Teradata.
The former Hadapt folks would love Hadapt or Hadapt-like technology to be integrated with Presto, but no such plans have been finalized at this time.

More notes on HBase

Curt Monash — Tue, 17 Mar 2015 18:13:50 +0000

1. Continuing from last week’s HBase post, the Cloudera folks were fairly proud of HBase’s features for performance and scalability. Indeed, they suggested that use cases which were a good technical match for HBase were those that required fast random reads and writes with high concurrency and strict consistency. Some of the HBase architecture for query performance seems to be:

Everything is stored in sorted files. (I didn’t probe as to what exactly the files were sorted on.)
Files have indexes and optional Bloom filters.
Files are marked with min/max field values and time stamp ranges, which helps with data skipping.
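
Here is a toy Python sketch of how those three features combine on a point lookup. It is not HBase's file format: the classes are hypothetical, the Bloom filter is a deliberately simple one, and I use row-key ranges for the min/max check as a simplification.

```python
import bisect
import hashlib
from typing import Dict, Iterator, List, Optional, Tuple


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, key: str) -> Iterator[int]:
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SortedFile:
    """Immutable, non-empty file of key/value pairs kept in key order."""

    def __init__(self, items: Dict[str, str]) -> None:
        self.keys = sorted(items)
        self.values = [items[k] for k in self.keys]
        self.min_key, self.max_key = self.keys[0], self.keys[-1]
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)

    def get(self, key: str) -> Optional[str]:
        i = bisect.bisect_left(self.keys, key)  # binary search on sorted keys
        return self.values[i] if i < len(self.keys) and self.keys[i] == key else None


def lookup(files: List[SortedFile], key: str) -> Tuple[Optional[str], int]:
    """Return (value, number of files actually read)."""
    files_read = 0
    for f in files:
        if key < f.min_key or key > f.max_key:
            continue  # skipped via min/max metadata
        if not f.bloom.might_contain(key):
            continue  # skipped via Bloom filter
        files_read += 1
        value = f.get(key)
        if value is not None:
            return value, files_read
    return None, files_read


files = [SortedFile({"user:100": "a", "user:150": "b"}),
         SortedFile({"user:200": "c", "user:300": "d"})]
found = lookup(files, "user:300")
```

Most of the benefit comes from the files never read: the range check and the Bloom filter are both cheap in-memory tests done before any file is opened.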

Notwithstanding that a couple of those features sound like they might help with analytic queries, the base expectation is that you’ll periodically massage your HBase data into a more analytically-oriented form. For example — I was talking with Cloudera after all — you could put it into Parquet.

2. The discussion of which kinds of data are originally put into HBase was a bit confusing.

HBase is commonly used to receive machine-generated data. Everybody knows that.
Cloudera drew a distinction between:
- Straightforward time series, which should probably just go into HDFS (Hadoop Distributed File System) rather than HBase.
- Data that is bucketed by entity, which likely should go into HBase. Examples of entities are specific users or devices.
Cloudera also reminded me that OpenTSDB, a popular time series data store, runs over HBase.

OpenTSDB, by the way, likes to store detailed data and aggregates side-by-side, which resembles a pattern I discussed in my recent BI for NoSQL post.

3. HBase supports caching, tiered storage, and so on. Cloudera is pretty sure that it is publicly known (I presume from blog posts or conference talks) that:

Pinterest has a large HBase database on SSDs (Solid-State Drives), a large fraction of which is actually in RAM.
eBay has an HBase database largely on spinning disk, used to inform its search engine.

Cloudera also told me of a Very Famous Company that has many 100s of HBase nodes managing petabytes of mobile device data. That sounds like multiple terabytes per node even before considering a replication factor, so I presume it’s disk-based as well. The takeaway from those examples, other than general big-use-case impressiveness, is that storage choices for HBase can vary greatly by user and application.

4. HBase has master/master geographically remote replication. I gather that Yahoo replicates between a couple of 1000-node clusters, on behalf of its Flurry operation. HBase also has the technical capability to segment data across geographies — i.e., the geo-partitioning feature essential to data sovereignty compliance — but no actual implementations came to mind.

5. Besides the ones already mentioned, and famed HBase user Facebook, a few other users came up.

It seems to be common for ad-tech companies to store in HBase the data that arrives from many different computers and mobile devices.
An agency that Cloudera didn’t name, but which is obviously something like the SEC or CFTC, stores all trade data in HBase.
Cerner — or perhaps its software — stores data in HBase on a patient-by-patient basis.

In general, Cloudera suggested that HBase was used in a fair number of OEM situations.

6. Finally, I have one number: As of January, 2014 there were 20,000 HBase nodes managed by Cloudera software. Obviously, that number is growing very quickly, and of course there are many HBase nodes that Cloudera has nothing to do with.

Related link

A lot of this echoes what I hear from DataStax (December, 2013), notwithstanding the consensus that HBase and Cassandra rarely compete in the marketplace.

Databricks and Spark update

Curt Monash — Sat, 28 Feb 2015 11:06:48 +0000

I chatted last night with Ion Stoica, CEO of my client Databricks, for an update both on his company and Spark. Databricks’ actual business is Databricks Cloud, about which I can say:

Databricks Cloud is:
- Spark-as-a-Service.
- Currently running on Amazon only.
- Not dependent on Hadoop.
Databricks Cloud, despite having a 1.0 version number, is not actually in general availability.
Even so, there are a non-trivial number of paying customers for Databricks Cloud. (Ion gave me an approximate number, but is keeping it NDA until Spark Summit East.)
Databricks Cloud gets at data from S3 (most commonly), Redshift, Elastic MapReduce, and perhaps other sources I’m forgetting.
Databricks Cloud was initially focused on ad-hoc use. A few days ago the capability was added to schedule jobs and so on.
Unsurprisingly, therefore, Databricks Cloud has been used to date mainly for data exploration/visualization and ETL (Extract/Transform/Load). Visualizations tend to be scripted/programmatic, but there’s also an ODBC driver used for Tableau access and so on.
Databricks Cloud customers are concentrated (but not unanimously so) in the usual-suspect internet-centric business sectors.
The low end of the amount of data Databricks Cloud customers are working with is 100s of gigabytes. This isn’t surprising.
The high end of the amount of data Databricks Cloud customers are working with is petabytes. That did surprise me, and in retrospect I should have pressed for details.

I do not expect all of the above to remain true as Databricks Cloud matures.

Ion also said that Databricks is over 50 people, and has moved its office from Berkeley to San Francisco. He also offered some Spark numbers, such as:

15 certified distributions.
~40 certified applications.
2000 people trained last year by Databricks alone.

Please note that certification of a Spark distribution is a free service from Databricks, and amounts to checking that the API works against a test harness. Speaking of certification, Ion basically agrees with my views on ODP, although like many — most? — people he expresses himself more politely than I do.

We talked briefly about several aspects of Spark or related projects. One was DataFrames. Per Databricks:

In Spark, a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

I gather this is modeled on Python pandas, and extends an earlier Spark capability for RDDs (Resilient Distributed Datasets) to carry around metadata that was tantamount to a schema.

SparkR is also on the rise, although it has the usual parallel R story to the effect:

You can partition data, run arbitrary R on every partition, and aggregate the results.
A handful of algorithms are truly parallel.
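
The first of those two points is the familiar partition/apply/combine pattern. A minimal sketch follows, in Python rather than R and with threads standing in for a cluster; the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def partition(data: Sequence[float], n: int) -> List[Sequence[float]]:
    """Split data into n roughly equal partitions."""
    return [data[i::n] for i in range(n)]


def partial_stats(chunk: Sequence[float]) -> Tuple[int, float]:
    """The 'arbitrary code' run independently on each partition."""
    return len(chunk), sum(chunk)


def parallel_mean(data: Sequence[float], n_partitions: int = 4) -> float:
    if not data:
        raise ValueError("data must not be empty")
    chunks = [c for c in partition(data, n_partitions) if c]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_stats, chunks))
    # Aggregate step: combine the per-partition results.
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return total / count


overall = parallel_mean(list(range(1, 101)))
```

A mean works because counts and sums combine cleanly. Something like a median does not (the median of per-partition medians is not the overall median), which is why only a handful of algorithms get a truly parallel implementation.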

So of course is Spark Streaming. And then there are Spark Packages, which are — and I’m speaking loosely here — a kind of user-defined function.

Thankfully, Ion did not give me the usual hype about how a public repository of user-created algorithms is a Great Big Deal.
Ion did point out that providing an easy way for people to publish their own algorithms is a lot easier than evaluating every candidate contribution to the Spark project itself.

I’ll stop here. However, I have a couple of other Spark-related posts in the research pipeline.

Where the innovation is

Curt Monash — Mon, 19 Jan 2015 08:27:57 +0000

I hoped to write a reasonable overview of current- to medium-term future IT innovation. Yeah, right. But if we abandon any hope that this post could be comprehensive, I can at least say:

1. Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. That said, when it comes to data management and movement, solutions to the V problems have generally been sketched out.

Volume has been solved. There are Hadoop installations with 100s of petabytes of data, analytic RDBMS with 10s of petabytes, general-purpose Exadata sites with petabytes, and 10s/100s of petabytes of analytic Accumulo at the NSA. Further examples abound.
Velocity is being solved. My recent post on Hadoop-based streaming suggests how. In other use cases, velocity is addressed via memory-centric RDBMS.
Variety and Variability have been solved. MongoDB, Cassandra and perhaps others are strong NoSQL choices. Schema-on-need is in earlier days, but may help too.

2. Even so, there’s much room for innovation around data movement and management. I’d start with:

Product maturity is a huge issue for all the above, and will remain one for years.
Hadoop and Spark show that application execution engines:
- Have a lot of innovation ahead of them.
- Are tightly entwined with data management, and with data movement as well.
Hadoop is due for another refactoring, focused on both in-memory and persistent storage.
There are many issues in storage that can affect data technologies as well, including but not limited to:
- Solid-state (flash or post-flash) vs. spinning disk.
- Networked vs. direct-attached.
- Virtualized vs. identifiable-physical.
- Object/file/block.
Graph analytics and data management are still confused.

3. As I suggested last year, data transformation is an important area for innovation.

MapReduce was invented for data transformation, which is still a large part of what goes on in Hadoop.
The smart data preparation crowd is deservedly getting attention.
The more different data models — NoSQL and so on — that are used, the greater are the demands on data transformation.

4. There’s a lot going on in investigative analytics. Besides the “platform” technologies already mentioned, in areas such as fast-query, data preparation, and general execution engines, there’s also great innovation higher in the stack. Most recently I’ve written about multiple examples in predictive modeling, such as:

Mathematically (more) complex models that are at once more accurate and more easily arrived at than (nearly) linear ones.
Similarly, more complex clustering.
Predictive experimentation.
The use of business intelligence and predictive modeling to inform each other.

Beyond that:

Event-series analytics is another exciting area. (At least on the BI side, I frankly expected it to sweep through the relevant vertical markets more quickly than it has.)
I’ve long been disappointed in the progress in text analytics. But sentiment analysis is doing fairly well, many more languages are analyzed than before, and I occasionally hear rumblings of text analytic sophistication inching back towards that already available in the previous decade.
While I don’t write about it much, modern BI navigation is an impressive and wonderful thing.

5. Back in 2013, in what was perhaps my previous most comprehensive post on innovation, I drew a link between innovation and refactoring, where what was being refactored was “everything”. Even so, I’ve been ignoring a biggie. Security is a mess, and I don’t see how it can ever be solved unless systems are much more modular from the ground up. By that I mean:

“Fencing” processes and resources away from each other improves system quality, in that it defends against both deliberate attacks and inadvertent error.
Fencing is costly, both in terms of context-switching and general non-optimization. Nonetheless, I suspect that …
… the cost of such process isolation may need to be borne.
Object-oriented programming and its associated contracts are good things in this context. But it’s obvious they’re not getting the job done on their own.

More specifically,

It is cheap to give single-purpose intelligent devices more computing power than they know what to do with. There is really no excuse for allowing them to be insecure.
It is rare for a modern PC to go much above 25% CPU usage, simply because most PC programs are still single-core. This illustrates that — assuming some offsetting improvements in multi-core parallelism — desktop software could take a security performance hit without much pain to users’ wallets.
On servers, we may in many cases be talking about lightweight virtual machines.

And to be clear:

What I’m talking about would do little to help the authentication/authorization aspects of security, but …
… those will never be perfect in any case (because they depend upon fallible humans) …
… which is exactly why other forms of security will always be needed.

6. You’ve probably noticed the fuss around an open letter about artificial intelligence, with some press coverage suggesting that AI is a Terminator-level threat to humanity. Underlying all that is a fairly interesting paper summarizing some needs for future research and innovation in AI. In particular, reading the paper reminded me of the previous point about security.

7. Three areas of software innovation that, even though they’re pretty much in my wheelhouse, I have little to say about right now are:

Application development technology, languages, frameworks, etc.
The integration of analytics into old-style operational apps.
The never-ending attempts to make large-enterprise-class application functionality available to outfits with small-enterprise sophistication and budgets.

8. There is, of course, tremendous innovation in robots and other kinds of device. But this post is already long enough, so I’ll address those areas some other time.

Related links

In many cases, I think that innovations will prove more valuable — or at least much easier to monetize — when presented to particular vertical markets.
Edit: I followed up on the last point with a post about soft robots.

MongoDB is growing up

Curt Monash — Thu, 17 Apr 2014 08:56:09 +0000

I caught up with my clients at MongoDB to discuss the recent MongoDB 2.6, along with some new statements of direction. The biggest takeaway is that the MongoDB product, along with the associated MMS (MongoDB Management Service), is growing up. Aspects include:

An actual automation and management user interface, as opposed to the current management style, which is almost entirely via scripts (except for the monitoring UI).
- That’s scheduled for public beta in May, and general availability later this year.
- It will include some kind of integrated provisioning with VMware, OpenStack, et al.
- One goal is to let you apply database changes, software upgrades, etc. without taking the cluster down.
A reasonable backup strategy.
- A snapshot copy is made of the database.
- A copy of the log is streamed somewhere.
- Periodically — the default seems to be 6 hours — the log is applied to create a new current snapshot.
- For point-in-time recovery, you take the last snapshot prior to the point, and roll forward to the desired point.
A reasonable locking strategy!
- Document-level locking is all-but-promised for MongoDB 2.8.
- That means what it sounds like. (I mention this because sometimes an XML database winds up being one big document, which leads to confusing conversations about what’s going on.)
Security. My eyes glaze over at the details, but several major buzzwords have been checked off.
A general code rewrite to allow for (more) rapid addition of future features.
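
The backup scheme described above (a snapshot, a streamed log, and roll-forward to a chosen point) can be sketched roughly as follows. This is a toy illustration rather than MongoDB's or MMS's actual code: `LogEntry`, `Snapshot`, `apply_log` and `restore` are hypothetical names, timestamps are plain integers, and a value of `None` is assumed to mean a delete.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    ts: int        # logical time of the write (assumed to be an integer)
    key: str
    value: object  # None is treated as a delete in this sketch


@dataclass
class Snapshot:
    ts: int        # the snapshot reflects every log entry with ts <= this value
    data: dict


def apply_log(data, entries, start_ts, end_ts):
    """Return a copy of data with log entries in (start_ts, end_ts] applied in order."""
    result = dict(data)
    for entry in sorted(entries, key=lambda e: e.ts):
        if start_ts < entry.ts <= end_ts:
            if entry.value is None:
                result.pop(entry.key, None)
            else:
                result[entry.key] = entry.value
    return result


def restore(snapshots, log, target_ts):
    """Point-in-time recovery: newest snapshot not after target_ts, then roll forward."""
    candidates = [s for s in snapshots if s.ts <= target_ts]
    if not candidates:
        raise ValueError("no snapshot at or before the requested time")
    base = max(candidates, key=lambda s: s.ts)
    return apply_log(base.data, log, base.ts, target_ts)
```

Creating a new current snapshot periodically is the same operation as a restore: apply the log accumulated since the previous snapshot and store the result with the later timestamp.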

Of course, when a DBMS vendor rewrites its code, that’s a multi-year process. (I think of it at Oracle as spanning 6 years and 2 main-number releases.) With that caveat, the MongoDB rewrite story is something like:

Updating has been reworked. Most of the benefits are coming later.
Query optimization and execution have been reworked. Most of the benefits are coming later, except that …
… you can now directly filter on multiple indexes in one query; previously you could only simulate doing that by pre-building a compound index.
One of those future benefits is more index types, for example R-trees or inverted lists.
Concurrency improvements are down the road.
So are rewrites of the storage layer, including the introduction of compression.
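
To make the multiple-index point concrete, here is a minimal sketch of the general technique of index intersection. It is not MongoDB's implementation or API; `build_index` and `find_with_intersection` are hypothetical helpers, documents are plain dictionaries, and only equality predicates are handled.

```python
from collections import defaultdict


def build_index(docs, field):
    """Single-field index: maps each value of `field` to the set of matching doc ids."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        if field in doc:
            index[doc[field]].add(doc_id)
    return index


def find_with_intersection(docs, indexes, criteria):
    """Answer an equality query on several fields by intersecting per-field indexes."""
    id_sets = [indexes[f].get(value, set()) for f, value in criteria.items()]
    if not id_sets:
        return [docs[i] for i in sorted(docs)]
    # Starting from the smallest candidate set keeps the intersection cheap.
    id_sets.sort(key=len)
    matching = set.intersection(*id_sets)
    return [docs[i] for i in sorted(matching)]
```

The pre-built alternative would be one index keyed on the combined (field1, field2) value, which answers that particular pair of predicates directly but has to be created in advance for each combination of fields you want to query.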

Also, you can now straightforwardly transform data in a MongoDB database and write it into new datasets, something that evidently wasn’t easy to do before.

One thing that MongoDB is not doing is offering any ODBC/JDBC or other SQL interfaces. Rather, there’s some other API (I don’t know the details) whereby business intelligence tools or other systems can extract views, and a few BI vendors evidently are doing just that. In particular, MicroStrategy and QlikView were named, as well as a couple of open source usual-suspects.

As of 2.6, MongoDB seems to have a basic integrated text search capability — which however does not rise to the search functionality level that was in Oracle 7.3.2. In particular:

15 Western languages are supported with stopwords, tokenization, etc.
Search predicates can be mixed into MongoDB queries.
The search language isn’t very rich; for example, it lacks WHERE NEAR semantics.
You can’t tweak the lexicon yourself.

And finally, some business and pricing notes:

Two big aspects of the paid-versus-free version of MongoDB (the product line) are:
- Security.
- Management tools.
Well, actually, you can get the management tools for free, but only on a SaaS basis from MongoDB (the company).
- If you want them on premises or in your part of the cloud, you need to pay.
- If you want MongoDB (the company) to maintain your backups for you, you need to pay.
Customer counts include:
- At least 1000 or so subscribers (counting by organization).
- Over 500 (additional?) customers for remote backup.
- 30 of the Fortune 100.

And finally, MongoDB did something many companies should, which is aggregate user success stories for which they may not be allowed to publish full details. Tidbits include:

Over 100 organizations run clusters with more than 100 nodes. Some clusters exceed 1,000 nodes.

Many clusters deliver hundreds of thousands of operations per second (combined read and write).

MongoDB clusters routinely store hundreds of terabytes, and some store multiple petabytes of data. Over 150 clusters exceed 1 billion documents in size. Many manage more than 100 billion documents.

DataStax/Cassandra update

Curt Monash — Sun, 08 Dec 2013 18:06:01 +0000

Cassandra’s reputation in many quarters is:

World-leading in the geo-distribution feature.
Impressively scalable.
Hard to use.

This has led competitors to use, and get away with, sales claims along the lines of “Well, if you really need geo-distribution and can’t wait for us to catch up — which we soon will! — you should use Cassandra. But otherwise, there are better choices.”

My friends at DataStax, naturally, don’t think that’s quite fair. And so I invited them — specifically Billy Bosworth and Patrick McFadin — to educate me. Here are some highlights of that exercise.

DataStax and Cassandra have some very impressive accounts, which don’t necessarily revolve around geo-distribution. Netflix, probably the flagship Cassandra user — since Cassandra inventor Facebook adopted HBase instead — actually hasn’t been using the geo-distribution feature. Confidential accounts include:

A petabyte or so of data at a very prominent company, geo-distributed, with 800+ nodes, in a kind of block storage use case.
A messaging application at a very prominent company, anticipated to grow to multiple data centers and a petabyte or so of data, across 1000s of nodes.
A 300 terabyte single-data-center telecom account (which I can’t find on DataStax’s extensive customer list).
A huge health records deal.
A Fortune 10 company.

DataStax and Cassandra won’t necessarily win customer-brag wars versus MongoDB, Couchbase, or even HBase, but at least they’re strongly in the competition.

DataStax claims that simplicity is now a strength. There are two main parts to that surprising assertion.

DataStax claims that operation is simple, that operators are “bored”, that large users appreciate the ease of operation, and so on. These claims become a lot more plausible if you recall:
- Cassandra isn’t used for databases that resemble relational schemas with 1000s of tables, lots of foreign keys, and so on.
- Performance and capacity problems in Cassandra don’t necessarily require sophisticated operational solutions; you can throw hardware at them instead.
DataStax claims that CQL (Cassandra Query Language) makes Cassandra programming and data modeling much easier than they were before. More on that below.

DataStax claims that Cassandra excels at time series use cases, where “time series” seems to equate to collections of short records with timestamps. This seems borne out by, for example, the first three use cases on my bulleted list above. Actually, it’s not just timestamps, but rather any data that is naturally ordered by a sequential field, such as packet IDs from a packet-switching network.

Finally, DataStax claims that Cassandra is good for high-velocity applications in general. A generic example that DataStax supported with some Very Big Names — whether those were of customers or prospects wasn’t entirely clear — was in retailing, to actually serve accurate information as to whether inventory is in stock, something Walmart failed at as recently as last year.

Now let’s talk a bit about Cassandra technology. I’ll start with an example. Imagine a “phone-home” use case in which many devices emit many records each in the form of (DeviceID, TimeStamp, MeterReading) triples.

A relational database would store that as a bunch of rows, 3 columns wide.
A Cassandra database, however, would have a single row for each DeviceID; each row would contain two columns for each (TimeStamp, MeterReading) pair.
The column names are composite, in a way that shows the different column pairs are each recording the same kind of thing.
Cassandra Query Language (CQL) lets you query (or insert) as if the data were in the relational-table logical format. But of course you can also reference Cassandra in a way that takes its actual (row, column) structure at face value.
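
The two layouts in the phone-home example can be sketched as below. This is only an illustration of the idea, not Cassandra's storage format or the CQL implementation: the composite column names are modeled as hypothetical `(timestamp, field)` tuples, and `to_wide_rows` and `to_table` are made-up helper names.

```python
from collections import defaultdict

# Relational-style view: one 3-column row per (DeviceID, TimeStamp, MeterReading).
readings = [
    ("dev-1", 1000, 20.5),
    ("dev-1", 1060, 20.7),
    ("dev-2", 1000, 18.1),
]


def to_wide_rows(rows):
    """Group readings into one wide row per DeviceID.

    Each (TimeStamp, MeterReading) pair becomes two columns whose composite
    names share the timestamp, marking them as one logical record.
    """
    wide = defaultdict(dict)
    for device_id, ts, reading in rows:
        wide[device_id][(ts, "TimeStamp")] = ts
        wide[device_id][(ts, "MeterReading")] = reading
    return dict(wide)


def to_table(wide):
    """Present wide rows in the relational-table logical format again."""
    table = []
    for device_id in sorted(wide):
        columns = wide[device_id]
        for ts in sorted({name[0] for name in columns}):
            table.append(
                (device_id, columns[(ts, "TimeStamp")], columns[(ts, "MeterReading")])
            )
    return table
```

Here `to_wide_rows` stands in for the physical (row, column) structure and `to_table` for the tabular view that CQL presents over it. A device's row simply grows by two columns with every new reading.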

So in essence, you have schemas that at once are dynamic and tabular. The big downside vs. a relational DBMS is that — duh! — you can’t have the benefits of normalization.

For clarity, I should note that much of Cassandra’s logical architecture is shared by fellow BigTable-architecture data store HBase; it’s not a coincidence that Facebook invented Cassandra to support messaging, nor that when Facebook changed its mind about that, it adopted HBase as the alternative. Accumulo has similar characteristics as well.

Physically, what’s going on in Cassandra is something like this:

Each Cassandra row is maintained in memory, and in most cases sorted on timestamp (or some other comparator), in either order. This is the basis for the claims of great Cassandra performance and general suitability specifically in time series use cases. (E.g., “Last 10 events” kinds of reads are very easy.)
Once rows are flushed to disk, they are immutable … except that of course they eventually are compacted, typically via a merge sort. (When you do need to do a database update, last write wins.)
Rows are organized into files on disk. There’s a “key cache” that in many cases will tell you exactly which file contains the row you’re looking for. If you have a cache miss …
… each file has a Bloom filter predicting which keys it contains, and you interrogate those. Those Bloom filters are also maintained in memory (and copied on disk just for the sake of persistence).
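
The read path in the last two bullets (key cache first, then per-file Bloom filters) can be sketched as follows. This is a simplified model, not Cassandra's code: `BloomFilter`, `DataFile` and `Store` are hypothetical classes, the filter size and hash count are arbitrary, and each row is assumed to live in exactly one file.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class DataFile:
    """An immutable file of rows plus the Bloom filter over its row keys."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


class Store:
    def __init__(self, files):
        self.files = list(files)
        self.key_cache = {}   # row key -> file known to contain it
        self.files_read = 0   # counts simulated disk reads

    def get(self, key):
        cached = self.key_cache.get(key)
        if cached is not None:
            self.files_read += 1
            return cached.rows[key]
        for data_file in self.files:
            if not data_file.bloom.might_contain(key):
                continue  # definitely absent, so the file is never read
            self.files_read += 1
            if key in data_file.rows:
                self.key_cache[key] = data_file
                return data_file.rows[key]
        return None
```

The point of the filter is the `continue` branch: a negative answer is always correct, so most files that lack the key are skipped without any disk access, and only an occasional false positive costs a wasted read.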

Cassandra has few indexes, and no physical concept of datatype.

The benefits I see to this physical architecture are mainly:

Plays nicely with Cassandra’s logical architecture.
Plays nicely with scale-out.
Seems to have been designed RAM-first, which matches how databases are actually used.
Is fast for range queries on the comparator (e.g. timestamp).
Doesn’t have a lot of knobs to twiddle, which makes it plausible that a relatively immature product can be easy to administer.

For some use cases, that’s not a bad list of advantages. Not bad at all.
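
As a rough illustration of why keeping each row sorted on its comparator makes range and "last 10 events" reads cheap, here is a small sketch. It models a single in-memory row only, under assumed names (`SortedRow`, `insert`, `range`, `last`), and says nothing about how Cassandra actually lays the data out.

```python
import bisect


class SortedRow:
    """One row whose columns are kept ordered by timestamp (the comparator)."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def insert(self, ts, value):
        i = bisect.bisect_left(self._timestamps, ts)
        if i < len(self._timestamps) and self._timestamps[i] == ts:
            self._values[i] = value  # same column written again: last write wins
        else:
            self._timestamps.insert(i, ts)
            self._values.insert(i, value)

    def range(self, start, end):
        """All (ts, value) pairs with start <= ts <= end, found by binary search."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_right(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))

    def last(self, n):
        """The n most recent entries, oldest first."""
        if n <= 0:
            return []
        return list(zip(self._timestamps[-n:], self._values[-n:]))
```

Both reads reduce to a binary search plus a contiguous slice, which is the property the sorted layout buys.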

Related link

I covered some real basics in a Cassandra technical overview 3 1/2 years ago.
WibiData Kiji’s most fundamental goal — there are others too — is to tame HBase data modeling much as CQL tames Cassandra’s.