Market share and customer counts – DBMS 2 : DataBase Management System Services

Imanis Data

Curt Monash — Tue, 22 Aug 2017 12:46:20 +0000

I talked recently with the folks at Imanis Data. For starters:

The point of Imanis is to make copies of your databases, for purposes such as backup/restore, test/analysis, or compliance-driven archiving. (That’s in declining order of current customer activity.) Another use is migration via restoring to a different cluster than the one that created the data in the first place.
The data can come from NoSQL database managers, from Hadoop, or from Vertica. (Again, that’s in declining order.)
As you might imagine, Imanis makes incremental backups; the only full backup is the first one you do for that database.
“Imanis” is a new name; the previous name was “Talena”.

Also:

Imanis has ~35 subscription customers, a significant majority of which are in the Fortune 1000.
Customer industries, in roughly declining order, include:
- Financial services other than insurance.
- Insurance.
- Retail.
- “Technology”.
~40% of Imanis customers are in the public cloud.
Imanis is focused on the North American market at this time.
Imanis has ~45 employees.
The Imanis product just hit Version 3.

Imanis correctly observes that there are multiple reasons you might want to recover from backup, including:

General disaster/system failure.
Bug in an application that writes data.
Malicious acts, including encryption-by-ransomware.

Imanis uses the phrase “point-in-time backup” to emphasize its flexibility in letting you choose your favorite time-version of your rolling backup.

Imanis also correctly draws the inference that the right backup strategy is some version of:

Make backups very frequently. This boils down to “Do a great job of making incremental backups (and restoring from them when necessary). This is where Imanis has spent the bulk of its technical effort to date.
In case recovery is needed, identify the last clean (or provably/confidently clean) version of the database and restore from that. The identification part boils down to letting the backup databases be queried directly. That’s largely a roadmap item.
- Imanis has recently added the capability to build its own functionality querying the backup database.
- JDBC/whatever general access is still in the future.

Note: When Imanis backups offer direct query access, the possibility will of course exist to use the backup data for general query processing. But while that kind of capability sounds great in theory, I’m not aware of it being a big deal (on technology stacks that already offer it) in practice.

The most technically notable other use cases Imanis mentioned are probably:

Data science dataset generation. Imanis lets you generate a partial copy of the database for analytic or test purposes.
- You can project, select or sample your data, which suggests use of the current query capabilities.
- There’s an API to let you mask Personally Identifiable Information by writing your own data transformations.
Archiving/tiering/ILM (Information Lifecycle Management). Imanis lets you divide data according to its hotness.

Imanis views its competition as:

Native utilities of the data stores.
Hand-coded scripts.
Datos.io, principally in the Cassandra market (so far).

Beyond those, the obvious comparison to Imanis is Delphix. I haven’t spoken with Delphix for a few years, but I believe that key differences between Delphix and Imanis start:

Delphix is focused on widely-installed RDBMS such as Oracle.
Delphix actually tries to have different production logical copies of your database run off of the same physical copy. Imanis, in contrast, offers technology to help you copy your databases quickly and effectively, but the copies you actually use will indeed be separate from each other.

Imanis software runs on its own cluster, based on hacked Hadoop. A lot of the hacking seems to relate to a metadata store, which supports things like:

Understanding which (incrementally backed up) blocks need to be pulled together to make a specific copy of the database.
Putting data in different places for ILM/tiering.

Another piece of Imanis tech is machine-learning-based anomaly detection.

As incrementally backed-up blocks arrive, Imanis flags anomalous ones and states a reason for them.
A flag is given a reason.
You can denounce the flag as a false alert, and hopefully similar flags won’t be raised in the future.

The technology for this seems rather basic:

Random forests for the flagging.
No drilldown w/in the Imanis system for follow-up.

But in general concept this is something a lot more systems should be doing.

Most of the rest of Imanis’ tech story is straightforward — support various alternatives for computing platforms, offer the usual security choices, etc. One exception that was new to me was the use of erasure codes, which seem to be a generalization of the concept of parity bits. Allegedly, when used in a storage context these have the near-magical property of offering 4X replication safety with only a 1.5X expansion of data volume. I won’t claim to have understood the subject well enough to see how that could make sense, or what tradeoffs it would entail.

Generally available Kudu

Curt Monash — Fri, 16 Jun 2017 15:52:45 +0000

I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:

Security is an ever bigger deal.
There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
- Prospects for that respond well to the actual term “data warehouse”, at least when preceded by some modifier to suggest that it’s modern/low-latency/non-batch or whatever.
- Flash is often — but not yet always — preferred over disk for that kind of use.
- Sometimes these data stores are greenfield. When they’re migrations, they come more commonly from analytic RDBMS or data warehouse appliance (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that’s perhaps just due to those product lines’ market share), rather than from general purpose DBMS such as Oracle or SQL Server.
Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.

Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:

A data storage system introduced by Cloudera (and subsequently open-sourced).
Columnar.
Updatable in human real-time.
Meant to serve as the data storage tier for Impala and Spark.

Kudu’s adoption and roll-out story starts:

Kudu went to general availability on January 31. I gather this spawned an uptick in trial activity.
A subsequent release with some basic security features spawned another uptick.
I don’t think Cloudera will mind my saying that there are many hundreds of active Kudu clusters.
But Cloudera believes that, this soon after GA, very few Kudu users are in actual production.

Early Kudu interest is focused on 2-3 kinds of use case. The biggest is the kind of “data warehousing” highlighted above. Cloudera characterizes the others by the kinds of data stored, specifically the overlapping categories of time series — including financial trading — and machine-generated data. A lot of early Kudu use is with Spark, even ahead of (or in conjunction with) Impala. A small amount has no relational front-end at all.

Other notes on Kudu include:

Solid-state storage is recommended, with a few terabytes per node.
You can also use spinning disk. If you do, your write-ahead logs can still go to flash.
Cloudera said Kudu compression ratios can be as low as 2-5X, or as high as 10-20X. With that broad a range, I didn’t drill down into specifics of what they meant.
There seem to be a number of Kudu clusters with 50+ nodes each. By way of contrast, a “typical” Cloudera customer has 100s of nodes overall.
As you might imagine from their newness, Kudu security features — Kerberos-based — are at the database level rather than anything more granular.

And finally, the Cloudera folks woke me up to some issues around streaming data ingest. If you stream data in, there will be retries resulting in duplicate delivery. So your system needs to deal with those one way or another. Kudu’s way is:

Primary keys will be unique. (Note: This is not obvious in a system that isn’t an entire RDBMS in itself.)
You can configure the uniqueness to be guaranteed either through an upsert mechanism or just by simply rejecting duplicates.
Alternatively, you can write code to handle duplication errors, e.g. via Spark.

Interana

Curt Monash — Mon, 17 Apr 2017 10:10:41 +0000

Interana has an interesting story, in technology and business model alike. For starters:

Interana does ad-hoc event series analytics, which they call “interactive behavioral analytics solutions”.
Interana has a full-stack analytic offering, include:
- Its own columnar DBMS …
- … which has a non-SQL DML (Data Manipulation Language) meant to handle event series a lot more fluently than SQL does, but which the user is never expected to learn because …
- … there also are BI-like visual analytics tools that support plenty of drilldown.
Interana sells all this to “product” departments rather than marketing, because marketing doesn’t sufficiently value Interana’s ad-hoc query flexibility.
Interana boasts >40 customers, with annual subscription fees ranging from high 5 figures to low 7 digits.

And to be clear — if we leave aside any questions of marketing-name sizzle, this really is business intelligence. The closest Interana comes to helping with predictive modeling is giving its ad-hoc users inspiration as to where they should focus their modeling attention.

Interana also has an interesting twist in its business model, which I hope can be used successfully by other enterprise software startups as well.

For now, at no extra charge, Interana will operate its software for you as a managed service. (A majority of Interana’s clients run the software on Amazon or Azure, where that kind of offering makes sense.)
However, presumably in connection with greater confidence in its software’s ease of administration, Interana will move this year toward unbundling the service as an extra-charge offering on top of the software itself.

The key to understanding Interana is its DML. Notes on that include:

Interana’s DML is focused on path analytics …
- … but Interana doesn’t like to use that phrase because it sounds too math-y and difficult.
- Interana may be the first company that’s ever told me it’s focused on providing a better nPath.
Primitives in Interana’s language — notwithstanding the company’s claim that it never ever intended to sell to marketing departments — include familiar web analytics concepts such as “session”, “funnel” and so on. (However, these are being renamed to more neutral terms such as “flow” in an upcoming version of the product.)
As typical example questions or analytic subjects, Interana offered:
- “Which are the most common products in shopping carts where time-to-checkout was greater than 30 minutes?”
- Exactly which steps in the onboarding process result in the greatest user frustration?
The Interana folks and I agree that Splunk is the most recent example of a new DML kicking off a significant company.
The most recent example I can think of in which a vendor hung its hat on a new DML that was a “visual programming language” is StreamBase, with EventFlow. That didn’t go all that well.
To use Founder/CTO Bobby Johnson’s summary term, the real goal of the Interana language is to describe a state machine, specifically one that produces (sets of) sequences of events (and the elapsed time between them).

Notes on Interana speeds & feeds include:

Interana only promises data freshness up to micro-batch latencies — i.e., a few minutes. (Obviously, this shuts them out of most networking monitoring and devops use cases.)
Interana thinks it’s very important for query response time to max out at a low number of seconds. If necessary, the software will return approximate results rather than exact ones so as to meet this standard.
Interana installations and workloads to date have gotten as large as:
- 1-200 nodes.
- Trillions of rows, equating to 100s of TBs of data after compression/ >1 PB uncompressed.
- Billions of rows/events received per day.
- 100s of 1000s of (very sparse) columns.
- 1000s of named users.

Although Interana’s original design point was spinning disk, most customers store their Interana data on flash.

Interana architecture choices include:

They’re serious about micro-batching.
- If the user’s data is naturally micro-batched — e.g. a new S3 bucket every few minutes — Interana works with that.
- Even if the customer’s data is streamed — e.g. via Kafka — Interana insists on micro-batching it.
They’re casual about schemas.
- Interana assumes data arrives with some kind of recognizable structure, via JSON, CSV or whatever.
  - Interana observes, correctly, that log data often is decently structured.
    - For example, if you’re receiving “phone home” pings from products you originally manufactured, you know what data structures to expect.
    - Interana calls this “logging with intent”.
  - Interana is fine with a certain amount of JSON (for example) schema change over time.
  - If your arriving data truly is a mess, then you need to calm it down via a pass through Splunk or whatever before sending it to Interana.
- JSON hierarchies turn into multi-part column names in the usual way.
- Interana supports one level of true nesting, and one level only; column values can be “lists”, but list values can’t be list themselves.

Finally, other Interana tech notes include:

Compression is a central design consideration …
- … especially but not only compression algorithms designed to deal with great sparseness, such as run-length encoding (RLE).
- Dictionary compression, in a strategy that is rarer than I once expected it to be, uses a global rather than shard-by-shard dictionary. The data Interana expects is of low-enough cardinality for this to be the better choice.
- Column data is sorted. A big part of the reason is of course to aid compression.
- Compression strategies are chosen automatically for each segment. Wholly automatically, I gather; you can’t tune the choice manually.
As you would think, Interana technically includes multiple data stores.
- Data first hits a write-optimized store. Unlike the case of Vertica, this WOS never is involved in answering queries.
- Asynchronously, the data is broken into columns, and banged to “disk”.
- Asynchronously again, the data is sorted.
- Queries run against sorted data, sorting recent blocks on-the-fly if necessary.
Interana lets you shard different replicas of the data according to different shard keys.
Interana is proud of the random sampling it does when serving approximate query results.

Cloudera’s Data Science Workbench

Curt Monash — Mon, 20 Mar 2017 00:41:15 +0000

0. Matt Brandwein of Cloudera briefed me on the new Cloudera Data Science Workbench. The problem it purports to solve is:

One way to do data science is to repeatedly jump through the hoops of working with a properly-secured Hadoop cluster. This is difficult.
Another way is to extract data from a Hadoop cluster onto your personal machine. This is insecure (once the data arrives) and not very parallelized.
A third way is needed.

Cloudera’s idea for a third way is:

You don’t run anything on your desktop/laptop machine except a browser.
The browser connects you to a Docker container that holds (and isolates) a kind of virtual desktop for you.
The Docker container runs on your Cloudera cluster, so connectivity-to-Hadoop and security are handled rather automagically.

In theory, that’s pure goodness … assuming that the automagic works sufficiently well. I gather that Cloudera Data Science Workbench has been beta tested by 5 large organizations and many 10s of users. We’ll see what is or isn’t missing as more customers take it for a spin.

1. Recall that Cloudera installations have 4 kinds of nodes. 3 are obvious:

Hadoop worker nodes.
Hadoop master nodes.
Nodes that run Cloudera Manager.

The fourth kind are edge/gateway nodes. Those handle connections to the outside world, and can also run selected third-party software. They also are where Cloudera Data Science Workbench lives.

2. One point of this architecture is to let each data scientist run the languages and tools of her choice. Docker isolation is supposed to make that practical and safe.

And so we have a case of the workbench metaphor actually being accurate! While a “workbench” is commonly just an integrated set of tools, in this case it’s also a place for you to use other tools your personally like and bring in.

Surely there are some restrictions as to which tools you can use, but I didn’t ask for those to be spelled out.

3. Matt kept talking about security, to an extent I recall in almost no other analytics-oriented briefing. This had several aspects.

As noted above, a lot of the hassle of Hadoop-based data science relates to security.
As also noted above, evading the hassle by extracting data is a huge security risk. (If you lose customer data, you’re going to have a very, very bad day.)
According to Matt, standard uses of notebook tools such as Jupyter or Zeppelin wind up having data stored wherever code is. Cloudera’s otherwise similar notebook-style interface evidently avoids that flaw. (Presumably, it you want to see the output, you rerun the script against the data store yourself.)

4. To a first approximation, the target users of Cloudera Data Science Workbench can be characterized the same way BI-oriented business analysts are. They’re people with:

Sufficiently good quantitative skills to do the analysis.
Sufficiently good computer skills to do SQL queries and so on, but not a lot more than that.

Of course, “sufficiently good quantitative skills” can mean something quite different in data science than it does for the glorified arithmetic of ordinary business intelligence.

5. Cloudera Data Science Workbench doesn’t have any special magic in parallelization. It just helps you access the parallelism that’s already out there. Some algorithms are easy to parallelize. Some libraries have parallelized a few algorithms beyond that. Otherwise, you’re on your own.

6. When I asked whether Cloudera Data Science Workbench was open source (like most of what Cloudera provides) or closed source (like Cloudera Manager), I didn’t get the clearest of answers. On the one hand, it’s a Cloudera-specific product, as the name suggests; on the other, it’s positioned as having been stitched together almost entirely from a collection of open source projects.

Introduction to SequoiaDB and SequoiaCM

Curt Monash — Sun, 12 Mar 2017 18:19:49 +0000

For starters, let me say:

SequoiaDB, the company, is my client.
SequoiaDB, the product, is the main product of SequoiaDB, the company.
SequoiaDB, the company, has another product line SequoiaCM, which subsumes SequoiaDB in content management use cases.
SequoiaDB, the product, is fundamentally a JSON data store. But it has a relational front end …
… and is usually sold for RDBMS-like use cases …
… except when it is sold as part of SequoiaCM, which adds in a large object/block store and a content-management-oriented library.
SequoiaDB’s products are open source.
SequoiaDB’s largest installation seems to be 2 PB across 100 nodes; that includes block storage.
Figures for DBMS-only database sizes aren’t as clear, but the sweet spot of the cluster-size range for such use cases seems to be 6-30 nodes.

Also:

SequoiaDB, the company, was founded in Toronto, by former IBM DB2 folks.
Even so, it’s fairly accurate to view SequoiaDB as a Chinese company. Specifically:
- SequoiaDB’s founders were Chinese nationals.
- Most of them went back to China.
- Other employees to date have been entirely Chinese.
- Sales to date have been entirely in China, but SequoiaDB has international aspirations
SequoiaDB has >100 employees, a large majority of which are split fairly evenly between “engineering” and “implementation and technical support”.
SequoiaDB’s marketing (as opposed to sales) department is astonishingly tiny.
SequoiaDB cites >100 subscription customers, including 10 in the global Fortune 500, a large fraction of which are in the banking sector. (Other sectors mentioned repeatedly are government and telecom.)

Unfortunately, SequoiaDB has not captured a lot of detailed information about unpaid open source production usage.

While I usually think that the advantages of open source are overstated, in SequoiaDB’s case open source will have* an additional benefit when SequoiaDB does go international — it addresses any concerns somebody might have about using Chinese technology.

*Edit: Actually, this claim is overstated based on SequoiaDB’s current open source practices. Please see the comment thread below.

SequoiaDB’s technology story starts:

SequoiaDB is a layered DBMS.
It manages JSON via update-in-place. MVCC (Multi-Version Concurrency Control) is on the roadmap.
Indexes are B-tree.
Transparent sharding and elasticity happen in what by now is the industry-standard/best-practices way:
- There are many (typically 4096) logical partitions, many of which are assigned to each physical partition.
- If the number of physical partitions changes, logical partitions are reassigned accordingly.
Relational OLTP (OnLine Transaction Processing) functionality is achieved by using a kind of PostgreSQL front end.
Relational batch processing is done via SparkSQL.
There also is a block/LOB (Large OBject) storage engine meant for content management applications.
SequoiaCM boils down technically to:
- SequoiaDB, which is used to store JSON metadata about the LOBs …
- … and whose generic-DBMS coordination capabilities are also used over the block/LOB engine.
- A Java library focused on content management.

SequoiaDB’s relationship with PostgreSQL is complicated, but as best I understand SequoiaDB’s relational operations:

SQL parsing, optimization, and so on rely mainly on PostgreSQL code. (Of course, there are some hacks, such as to the optimizer’s cost functions.)
Actual data storage is done via SequoiaDB’s JSON store, using PostgreSQL Foreign Data Wrappers. Each record goes in a separate JSON document. Locks, commits and so on — i.e. “write prevention” — are handled by the JSON store.
PostgreSQL’s own storage engine is actually part of the stack, but only to manage temp space and the like.

PostgreSQL stored procedures are already in the SequoiaDB product. Triggers and referential integrity are not. Neither, so far as I can tell, are PostgreSQL’s datatype extensibility capabilities.

I neglected to ask how much of that remains true when SparkSQL is invoked.

SequoiaDB’s use cases to date seem to fall mainly into three groups:

Content management via SequoiaCM.
“Operational data lakes”.
Pretty generic replacement of legacy RDBMS.

Internet back-ends, however — and this is somewhat counter-intuitive for an open-source JSON store — are rare, at least among paying subscription customers. But SequoiaDB did tell me of one classic IoT (Internet of Things) application, with lots of devices “phoning home” and the results immediately feeding a JSON-based dashboard.

To understand SequoiaDB’s “operational data lake” story, it helps to understand the typical state of data warehousing at SequoiaDB’s customers and prospects, which isn’t great:

2-3 years of data, and not all the data even from that time period.
Only enough processing power to support structured business intelligence …
… and hence little opportunity for ad-hoc query.

SequoiaDB operational data lakes offer multiple improvements over that scenario:

They hold as much relational data as customers choose to dump there.
That data can be simply copied from operational stores, with no transformation.
Or if data arrives via JSON — from external organizations or micro-services as the case may be — the JSON can be stored unmodified as well.
Queries can be run straight against this data soup.
Of course, views can also be set up in advance to help with querying.

Views are particularly useful with what might be called slowly changing schemas. (I didn’t check whether what SequoiaDB is talking about matches precisely with the more common term “slowly changing dimensions”.) Each time the schema changes, a new table is created in SequoiaDB to receive copies of the data. If one wants to query against the parts of the database structure that didn’t change — well, a view can be establish to allow for that.

Finally, it seems that SequoiaCM uses are concentrated in what might be called “security and checking-up” areas, such:

Photographs as part of an authentication process.
Video of in-person banking transactions, both for fraud prevention and for general service quality assurance.
Storage of security videos (for example from automated teller machines).

SequoiaCM deals seem to be bigger than other SequoiaDB ones, surely in part because the amounts of data managed are larger.

Introduction to data Artisans and Flink

Curt Monash — Sun, 21 Aug 2016 21:15:59 +0000

data Artisans and Flink basics start:

Flink is an Apache project sponsored by the Berlin-based company data Artisans.
Flink has been viewed in a few different ways, all of which are similar to how Spark is seen. In particular, per co-founder Kostas Tzoumas:
- Flink’s original goal was “Hadoop done right”.
- Now Flink is focused on streaming analytics, as an alternative to Spark Streaming, Samza, et al.
Kostas seems to see Flink as a batch-plus-streaming engine that’s streaming-first.

Like many open source projects, Flink seems to have been partly inspired by a Google paper.

To this point, data Artisans and Flink have less maturity and traction than Databricks and Spark. For example:

The first line of Flink code dates back to 2010.
data Artisans and the Flink open source project both started in 2014.
When I met him in late June, Kostas told me that Data Artisans had raised $7 million and had 15 employees.
Flink’s current SQL support is very minor.

Per Kostas, about half of Flink committers are at data Artisans; others are at Cloudera, Hortonworks, Confluent, Intel, at least one production user, and some universities. Kostas provided about 5 examples of production Flink users, plus a couple of very big names that were sort-of-users (one was using a forked version of Flink, while another is becoming a user “soon”).

The technical story at data Artisans/Flink revolves around the assertion “We have the right architecture for streaming.” If I understood data Artisans co-founder Stephan Ewen correctly on a later call, the two key principles in support of that seem to be:

The key is to keep data “transport” running smoothly without interruptions, delays or bottlenecks, where the relevant sense of “transport” is movement from one operator/operation to the next.
In this case, the Flink folks feel that modularity supports efficiency.

In particular:

Anything that relates to consistency/recovery is kept almost entirely separate from basic processing, with minimal overhead and nothing that resembles a lock.
Windowing and so on operate separately from basic “transport” as well.
The core idea is that special markers — currently in the ~20 byte range in size — are injected into the streams. When the marker gets to an operator, the operator snapshots the then-current state of its part of the stream.
Should recovery ever be needed, consistency is achieved by assembling all the snapshots corresponding to a single marker, and replaying any processing that happened after those snapshots were taken.
- Actually, this is oversimplified, in that it assumes there’s only a single input stream.
- A lot of Flink’s cleverness, I gather, is involved in assembling a consistent snapshot despite the realities of multiple input streams.

The upshot, Flink partisans believe, is to match the high throughput of Spark Streaming while also matching the low latency of Storm.

The Flink folks naturally have a rich set of opinions about streaming. Besides the points already noted, these include:

“Exactly once” semantics are best in almost all use cases, as opposed to “at least once”, or to turning off fault tolerance altogether. (Exceptions might arise in extreme performance scenarios, or because of legacy systems’ expectations.)
Repetitive, scheduled batch jobs are often “streaming processes in disguise”. Besides any latency benefits, reimplementing them using streaming technology might simplify certain issues that can occur around the boundaries of batch windows. (The phrase “continuous processing” could reasonably be used here.)

We discussed joins quite a bit, but this was before I realized that Flink didn’t have much SQL support. Let’s just say they sounded rather primitive even when I assumed they were done via SQL.

Our discussion of windowing was more upbeat. Flink supports windows based either on timestamps or data arrival time, and these can be combined as needed. Stephan thinks this flexibility is important.

As for Flink use cases, they’re about what you’d expect:

Plenty of data transformation, because that’s how all these systems start out. Indeed, the earliest Flink adoption was for batch transformation.
Plenty of stream processing.

But Flink doesn’t have all the capabilities one would want for the kinds of investigative analytics commonly done on Spark.

Related links

My recent series of Spark posts offer comparison or background to this one.
I surveyed Spark Streaming, Storm et al. in January.
How you factor things is always important.
data Artisans has a non-obvious URL.

More about Databricks and Spark

Curt Monash — Sun, 21 Aug 2016 20:36:15 +0000

Databricks CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world includes:

What I called Databricks’ “secondary business” of “licensing stuff to Spark distributors” was really about second/third tier support. Fair enough. But distributors of stacks including Spark, for whatever combination of on-premise and cloud as the case may be, may in many cases be viewed as competitors to Databricks cloud-only service. So why should Databricks help them?
Databricks’ investment in Spark Summit and similar evangelism is larger than I realized.
Ali suggests that the fraction of Databricks’ engineering devoted to open source Spark is greater than I understood during my recent visit.

Ali also walked me through customer use cases and adoption in wonderful detail. In general:

A large majority of Databricks customers have machine learning use cases.
Predicting and preventing user/customer churn is a huge issue across multiple market sectors.

The story on those sectors, per Ali, is:

First, Databricks penetrated ad-tech, for use cases such as ad selection.
Databricks’ second market was “mass media”.
- Disclosed examples include Viacom and NBC/Universal.
- There are “many” specific use cases. Personalization is a big one.
- Conviva-style video operations optimization is a use case for several customers, naturally including Conviva. (Reminder: Conviva was Ion Stoica’s previous company.)
Health care came third.
- Use cases here seem to be concentrated on a variety of approaches to predict patient outcomes.
- Analytic techniques often combine machine learning with traditional statistics.
- Security is a major requirement in this sector; fortunately, Databricks believes it excels at that.
Next came what he calls “industrial IT”. This group includes cool examples such as:
- Finding oil.
- Predictive maintenance of wind turbines.
- Predicting weather based on sensor data.
Finally (for now), there’s financial services. Of course, “financial services” comprises a variety of quite different business segments. Example use cases include:
- Credit card marketing.
- Investment analysis (based on expensive third-party data sets that are already in the cloud).
- Anti-fraud.

At an unspecified place in the timeline is national security, for a use case very similar to anti-fraud — identifying communities of bad people. Graph analytics plays a big role here.

And finally, of course we discussed some technical stuff, in philosophy, futures and usage as the case may be. In particular, Ali stressed that Spark 2.0 is the first that “breaks”/changes the APIs; hence the release number. It is now the case that:

There’s a single API for batch and streaming alike, and for machine learning “too”. This is DataFrames/DataSets. In this API …
… everything is a table. That said:
- Tables can be nested.
- Tables can be infinitely large, in which case you’re doing streaming.
Based on this, Ali thinks Spark 2.0 is now really a streaming engine.

Other tidbits included:

Ali said that every Databricks customer uses SQL. No exceptions.
- Indeed, a “number” of customers are using business intelligence tools. Therefore …
- … Databricks is licensing connector technology from Simba.
They’re working on model serving, with a REST API, rather than just model building. This was demoed at the recent Spark Summit, but is still in the “nascent” stage.
Ali insists that every streaming system with good performance does some kind of micro-batching under the hood. But the Spark programmers no longer need to take that directly into account. (In earlier versions, programmatic window sizes needed to be integer multiples of the low-level system’s chosen interval.)
In the future, when Databricks runs on more than just the Amazon cloud, Databricks customers will of course have cloud-to-cloud portability.

Notes on Spark and Databricks — generalities

Curt Monash — Sun, 31 Jul 2016 14:29:46 +0000

I visited Databricks in early July to chat with Ion Stoica and Reynold Xin. Spark also comes up in a large fraction of the conversations I have. So let’s do some catch-up on Databricks and Spark. In a nutshell:

Spark is indeed the replacement for Hadoop MapReduce.
Spark is becoming the default platform for machine learning.
SparkSQL (nee’ Shark) is puttering along predictably.
Databricks reports good success in its core business of cloud-based machine learning support.
Spark Streaming has strong adoption, but its position is at risk.
Databricks, the original authority on Spark, is not keeping a tight grip on that role.

I shall explain below. I also am posting separately about Spark evolution, especially Spark 2.0. I’ll also talk a bit in that post about Databricks’ proprietary/closed-source technology.

Spark is the replacement for Hadoop MapReduce.

This point is so obvious that I don’t know what to say in its support. The trend is happening, as originally decreed by Cloudera (and me), among others. People are rightly fed up with the limitations of MapReduce, and — niches perhaps aside — there are no serious alternatives other than Spark.

The greatest use for Spark seems to be the same as the canonical first use for MapReduce: data transformation. Also in line with the Spark/MapReduce analogy:

Data-transformation-only use cases are important, but they don’t dominate.
Most other use cases typically have a data transformation element as well …
… which has to be started before any other work can be done.

And so it seems likely that, at least for as long as Spark is growing rapidly, data transformation will appear to be the biggest Spark use case.

Spark is becoming the default platform for machine learning.

Largely, this is a corollary of:

The previous point.
The fact that Spark was originally designed with machine learning as its principal use case.

To do machine learning you need two things in your software:

A collection of algorithms. Spark, I gather, is one of numerous good alternatives there.
Support for machine learning workflows. That’s where Spark evidently stands alone.

And thus I have conversations like:

“Are you doing anything with Spark?”
“We’ve gotten more serious about machine learning, so yes.”

SparkSQL (nee’ Shark) is puttering along.

SparkSQL is pretty much following the Hive trajectory.

Useful from Day One as an adjunct to other kinds of processing.
A tease and occasionally useful as a SQL engine for its own sake, but really not very good, pending years to mature.

Databricks reports good success in its core business of cloud-based machine learning support.

Databricks, to an even greater extent than I previously realized, is focused on its cloud business, for which there are well over 200 paying customers. Notes on that include:

As you might expect based on my comments above, the majority of usage is for data transformation, but a lot of that is in anticipation of doing machine learning/predictive modeling in the near future.
Databricks customers typically already have their data in the Amazon cloud.
Naturally, a lot of Databricks customers are internet companies — ad tech startups and the like. Databricks also reports “strong” traction in the segments:
- Media
- Financial services (especially but not only insurance)
- Health care/pharma
The main languages Databricks customers use are R and Python. Ion said that Python was used more on the West Coast, while R was used more in the East.

Databricks’ core marketing concept seems to be “just-in-time data platform”. I don’t know why they picked that, as opposed to something that emphasizes Spark’s flexibility and functionality.

Spark Streaming’s long-term success is not assured.

To a first approximation, things look good for Spark Streaming.

Spark Streaming is definitely the leading companion to Kafka, and perhaps also to cloud equivalents (e.g. Amazon Kinesis).
The “traditional” alternatives of Storm and Samza are pretty much done.
Newer alternatives from Twitter, Confluent and Flink aren’t yet established.
Cloudera is a big fan of Spark Streaming.
Even if Spark Streaming were to generally decline, it might keep substantial “good enough” usage, analogously to Hive and SparkSQL.
Cool new Spark Streaming technology is coming out.

But I’m also hearing rumbles and grumbles about Spark Streaming. What’s more, we know that Spark Streaming wasn’t a core part of Spark’s design; the use case just happened to emerge. Demanding streaming use cases typically involve a lot of short-request inserts (or updates/upserts/whatever). And if you were designing a system to handle those … would it really be based on Spark?

Databricks is not keeping a tight grip on Spark leadership.

For starters:

Databricks’ main business, as noted above, is its cloud service. That seems to be going well.
Databricks’ secondary business is licensing stuff to Spark distributors. That doesn’t seem to amount to much; it’s too easy to go straight to the Apache distribution and bypass Databricks. No worries; this never seemed it would be a big revenue opportunity for Databricks.

At the moment, Databricks is pretty clearly the general leader of Spark. Indeed:

If you want the story on where Spark is going, you do what I did — you ask Databricks.
Similarly, if you’re thinking of pushing the boundaries on Spark use, and you have access to the Databricks folks, that’s who you’ll probably talk to.
Databricks employs ~1/3 of Spark committers.
Databricks organizes the Spark Summit.

But overall, Databricks doesn’t seem to care much about keeping Spark leadership. Its marketing efforts in that respect are minimal. Word-of-mouth buzz paints a similar picture. My direct relationship with the company gives the same impression. Oh, I’m sure Databricks would like to remain the Spark leader. But it doesn’t seem to devote much energy toward keeping the role.

Related links

Starting with my introduction to Spark, previous overview posts include those in:

Notes from a long trip, July 19, 2016

Curt Monash — Wed, 20 Jul 2016 01:34:31 +0000

For starters:

I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.

A running list of recent posts is:

As a companion to this post, I’m publishing a very long one on vendor lock-in.
Spark and Databricks are both prospering, and of course enhancing their technology as well.
Ditto DataStax.
Flink is interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need.

Subjects I’d like to add to that list include:

MemSQL, Zoomdata, and Neo Technology (also prospering).
Cloudera (multiple topics, as usual).
Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
Microsoft’s reinvention (it feels real).
Metadata (it’s ever more of a thing).
Machine learning (it’s going to be a big portion of my research going forward).
Transitions to the cloud — this subject affects almost everything else.

I’ll edit these lists as appropriate when further posts go up. Last update: August 23, 2016.

Let’s cover some other subjects right here.

1. While Kafka is widely agreed to be the universal delivery mechanism for streams, the landscape for companion technologies is confused.

Back in January I wrote that the leaders were mainly Spark Streaming, followed by Storm.
I overlooked the fact that Storm creator Twitter was replacing Storm with something called Heron.*
If there’s any buzz about Confluent’s replacement for distant-third-place contender Samza, I missed it.
Opinions about Spark Streaming are mixed. Some folks want to get away from it; others like it just fine.

And of course Flink is hoping to blow everybody else in the space away.

*But that kind of thing is not necessarily a death knell. Cassandra inventor Facebook soon replaced Cassandra with HBase, yet Cassandra is doing just fine.

As for the “lambda architecture” — that has always felt like a kludge, and various outfits are trying to obsolete it in various ways. As just one example, Cloudera described that to me during my visit as one of the main points of Kudu.

2. The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.

Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.
Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.

I had a moment of clarity on this point while visiting my clients at DataStax, and discussing their goal — shared by numerous companies — of being properly appreciated for the management tools they provide. In the room with me were CEO Billy Bosworth and chief evangelist Patrick McFadin — both of whom are former DBAs themselves.

3. I visited ClearStory, and Sharmila Mulligan showed me her actual sales database, as well as telling me some things about funding. The details are all confidential, but ClearStory is clearly doing better than rumor might suggest.

4. Platfora insisted on meeting circumstances in which it was inconvenient for me to take notes. So I have no details to share. But they sounded happy.

Edit: On July 22, it was announced that Workday is acquiring Platfora. Now I understand why Platfora gave me a bit of a runaround.

5. Pneubotics — with a cool new video on its home page — has found its first excellent product/market fit. Traditional heavy metallic robots are great at painting and related tasks when they can remain stationary, or move on rigid metal rails. Neither of those options works well, however, for large curved or irregular surfaces as might be found in the aerospace industry. Customer success for the leading soft robot company has ensued.

This all seems pretty close to the inspection/maintenance/repair area that I previously suggested could be a good soft robotics fit.

Kafka and more

Curt Monash — Mon, 25 Jan 2016 11:28:02 +0000

In a companion introduction to Kafka post, I observed that Kafka at its core is remarkably simple. Confluent offers a marchitecture diagram that illustrates what else is on offer, about which I’ll note:

The red boxes — “Ops Dashboard” and “Data Flow Audit” — are the initial closed-source part. No surprise that they sound like management tools; that’s the traditional place for closed source add-ons to start.
“Schema Management”
- Is used to define fields and so on.
- Is not equivalent to what is ordinarily meant by schema validation, in that …
- … it allows schemas to change, but puts constraints on which changes are allowed.
- Is done in plug-ins that live with the producer or consumer of data.
- Is based on the Hadoop-oriented file format Avro.

Kafka offers little in the way of analytic data transformation and the like. Hence, it’s commonly used with companion products.

Per Confluent/Kafka honcho Jay Kreps, the companion is generally Spark Streaming, Storm or Samza, in declining order of popularity, with Samza running a distant third.
Jay estimates that there’s such a companion product at around 50% of Kafka installations.
Conversely, Jay estimates that around 80% of Spark Streaming, Storm or Samza users also use Kafka. On the one hand, that sounds high to me; on the other, I can’t quickly name a counterexample, unless Storm originator Twitter is one such.
Jay’s views on the Storm/Spark comparison include:
- Storm is more mature than Spark Streaming, which makes sense given their histories.
- Storm’s distributed processing capabilities are more questionable than Spark Streaming’s.
- Spark Streaming is generally used by folks in the heavily overlapping categories of:
  - Spark users.
  - Analytics types.
  - People who need to share stuff between the batch and stream processing worlds.
- Storm is generally used by people coding up more operational apps.

If we recognize that Jay’s interests are obviously streaming-centric, this distinction maps pretty well to the three use cases Cloudera recently called out.

Complicating this discussion further is Confluent 2.1, which is expected late this quarter. Confluent 2.1 will include, among other things, a stream processing layer that works differently from any of the alternatives I cited, in that:

It’s a library running in client applications that can interrogate the core Kafka server, rather than …
… a separate thing running on a separate cluster.

The library will do joins, aggregations and so on, and while relying on core Kafka for information about process health and the like. Jay sees this as more of a competitor to Storm in operational use cases than to Spark Streaming in analytic ones.

We didn’t discuss other Confluent 2.1 features much, and frankly they all sounded to me like items from the “You mean you didn’t have that already??” list any young product has.

Related links

My October, 2014 post on Streaming for Hadoop is a sort of predecessor to this two-post series.