Open source – DBMS 2 : DataBase Management System Services

Generally available Kudu

Curt Monash — Fri, 16 Jun 2017 15:52:45 +0000

I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:

Security is an ever bigger deal.
There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
- Prospects for that respond well to the actual term “data warehouse”, at least when preceded by some modifier to suggest that it’s modern/low-latency/non-batch or whatever.
- Flash is often — but not yet always — preferred over disk for that kind of use.
- Sometimes these data stores are greenfield. When they’re migrations, they come more commonly from analytic RDBMS or data warehouse appliance (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that’s perhaps just due to those product lines’ market share), rather than from general purpose DBMS such as Oracle or SQL Server.
Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.

Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:

A data storage system introduced by Cloudera (and subsequently open-sourced).
Columnar.
Updatable in human real-time.
Meant to serve as the data storage tier for Impala and Spark.

Kudu’s adoption and roll-out story starts:

Kudu went to general availability on January 31. I gather this spawned an uptick in trial activity.
A subsequent release with some basic security features spawned another uptick.
I don’t think Cloudera will mind my saying that there are many hundreds of active Kudu clusters.
But Cloudera believes that, this soon after GA, very few Kudu users are in actual production.

Early Kudu interest is focused on 2-3 kinds of use case. The biggest is the kind of “data warehousing” highlighted above. Cloudera characterizes the others by the kinds of data stored, specifically the overlapping categories of time series — including financial trading — and machine-generated data. A lot of early Kudu use is with Spark, even ahead of (or in conjunction with) Impala. A small amount has no relational front-end at all.

Other notes on Kudu include:

Solid-state storage is recommended, with a few terabytes per node.
You can also use spinning disk. If you do, your write-ahead logs can still go to flash.
Cloudera said Kudu compression ratios can be as low as 2-5X, or as high as 10-20X. With that broad a range, I didn’t drill down into specifics of what they meant.
There seem to be a number of Kudu clusters with 50+ nodes each. By way of contrast, a “typical” Cloudera customer has 100s of nodes overall.
As you might imagine from their newness, Kudu security features — Kerberos-based — are at the database level rather than anything more granular.

And finally, the Cloudera folks woke me up to some issues around streaming data ingest. If you stream data in, there will be retries resulting in duplicate delivery. So your system needs to deal with those one way or another. Kudu’s way is:

Primary keys will be unique. (Note: This is not obvious in a system that isn’t an entire RDBMS in itself.)
You can configure the uniqueness to be guaranteed either through an upsert mechanism or just by simply rejecting duplicates.
Alternatively, you can write code to handle duplication errors, e.g. via Spark.

Introduction to SequoiaDB and SequoiaCM

Curt Monash — Sun, 12 Mar 2017 18:19:49 +0000

For starters, let me say:

SequoiaDB, the company, is my client.
SequoiaDB, the product, is the main product of SequoiaDB, the company.
SequoiaDB, the company, has another product line SequoiaCM, which subsumes SequoiaDB in content management use cases.
SequoiaDB, the product, is fundamentally a JSON data store. But it has a relational front end …
… and is usually sold for RDBMS-like use cases …
… except when it is sold as part of SequoiaCM, which adds in a large object/block store and a content-management-oriented library.
SequoiaDB’s products are open source.
SequoiaDB’s largest installation seems to be 2 PB across 100 nodes; that includes block storage.
Figures for DBMS-only database sizes aren’t as clear, but the sweet spot of the cluster-size range for such use cases seems to be 6-30 nodes.

Also:

SequoiaDB, the company, was founded in Toronto, by former IBM DB2 folks.
Even so, it’s fairly accurate to view SequoiaDB as a Chinese company. Specifically:
- SequoiaDB’s founders were Chinese nationals.
- Most of them went back to China.
- Other employees to date have been entirely Chinese.
- Sales to date have been entirely in China, but SequoiaDB has international aspirations
SequoiaDB has >100 employees, a large majority of which are split fairly evenly between “engineering” and “implementation and technical support”.
SequoiaDB’s marketing (as opposed to sales) department is astonishingly tiny.
SequoiaDB cites >100 subscription customers, including 10 in the global Fortune 500, a large fraction of which are in the banking sector. (Other sectors mentioned repeatedly are government and telecom.)

Unfortunately, SequoiaDB has not captured a lot of detailed information about unpaid open source production usage.

While I usually think that the advantages of open source are overstated, in SequoiaDB’s case open source will have* an additional benefit when SequoiaDB does go international — it addresses any concerns somebody might have about using Chinese technology.

*Edit: Actually, this claim is overstated based on SequoiaDB’s current open source practices. Please see the comment thread below.

SequoiaDB’s technology story starts:

SequoiaDB is a layered DBMS.
It manages JSON via update-in-place. MVCC (Multi-Version Concurrency Control) is on the roadmap.
Indexes are B-tree.
Transparent sharding and elasticity happen in what by now is the industry-standard/best-practices way:
- There are many (typically 4096) logical partitions, many of which are assigned to each physical partition.
- If the number of physical partitions changes, logical partitions are reassigned accordingly.
Relational OLTP (OnLine Transaction Processing) functionality is achieved by using a kind of PostgreSQL front end.
Relational batch processing is done via SparkSQL.
There also is a block/LOB (Large OBject) storage engine meant for content management applications.
SequoiaCM boils down technically to:
- SequoiaDB, which is used to store JSON metadata about the LOBs …
- … and whose generic-DBMS coordination capabilities are also used over the block/LOB engine.
- A Java library focused on content management.

SequoiaDB’s relationship with PostgreSQL is complicated, but as best I understand SequoiaDB’s relational operations:

SQL parsing, optimization, and so on rely mainly on PostgreSQL code. (Of course, there are some hacks, such as to the optimizer’s cost functions.)
Actual data storage is done via SequoiaDB’s JSON store, using PostgreSQL Foreign Data Wrappers. Each record goes in a separate JSON document. Locks, commits and so on — i.e. “write prevention” — are handled by the JSON store.
PostgreSQL’s own storage engine is actually part of the stack, but only to manage temp space and the like.

PostgreSQL stored procedures are already in the SequoiaDB product. Triggers and referential integrity are not. Neither, so far as I can tell, are PostgreSQL’s datatype extensibility capabilities.

I neglected to ask how much of that remains true when SparkSQL is invoked.

SequoiaDB’s use cases to date seem to fall mainly into three groups:

Content management via SequoiaCM.
“Operational data lakes”.
Pretty generic replacement of legacy RDBMS.

Internet back-ends, however — and this is somewhat counter-intuitive for an open-source JSON store — are rare, at least among paying subscription customers. But SequoiaDB did tell me of one classic IoT (Internet of Things) application, with lots of devices “phoning home” and the results immediately feeding a JSON-based dashboard.

To understand SequoiaDB’s “operational data lake” story, it helps to understand the typical state of data warehousing at SequoiaDB’s customers and prospects, which isn’t great:

2-3 years of data, and not all the data even from that time period.
Only enough processing power to support structured business intelligence …
… and hence little opportunity for ad-hoc query.

SequoiaDB operational data lakes offer multiple improvements over that scenario:

They hold as much relational data as customers choose to dump there.
That data can be simply copied from operational stores, with no transformation.
Or if data arrives via JSON — from external organizations or micro-services as the case may be — the JSON can be stored unmodified as well.
Queries can be run straight against this data soup.
Of course, views can also be set up in advance to help with querying.

Views are particularly useful with what might be called slowly changing schemas. (I didn’t check whether what SequoiaDB is talking about matches precisely with the more common term “slowly changing dimensions”.) Each time the schema changes, a new table is created in SequoiaDB to receive copies of the data. If one wants to query against the parts of the database structure that didn’t change — well, a view can be establish to allow for that.

Finally, it seems that SequoiaCM uses are concentrated in what might be called “security and checking-up” areas, such:

Photographs as part of an authentication process.
Video of in-person banking transactions, both for fraud prevention and for general service quality assurance.
Storage of security videos (for example from automated teller machines).

SequoiaCM deals seem to be bigger than other SequoiaDB ones, surely in part because the amounts of data managed are larger.

Introduction to Crate.io and CrateDB

Curt Monash — Sun, 18 Dec 2016 05:27:15 +0000

Crate.io and CrateDB basics include:

Crate.io makes CrateDB.
CrateDB is a quasi-RDBMS designed to receive sensor data and similar IoT (Internet of Things) inputs.
CrateDB’s creators were perhaps a little slow to realize that the “R” part was needed, but are playing catch-up in that regard.
Crate.io is an outfit founded by Austrian guys, headquartered in Berlin, that is turning into a San Francisco company.
Crate.io says it has 22 employees and 5 paying customers.
Crate.io cites bigger numbers than that for confirmed production users, clearly active clusters, and overall product downloads.

In essence, CrateDB is an open source and less mature alternative to MemSQL. The opportunity for MemSQL and CrateDB alike exists in part because analytic RDBMS vendors didn’t close it off.

CrateDB’s not-just-relational story starts:

A column can contain ordinary values (of usual-suspect datatypes) or “objects”, …
… where “objects” presumably are the kind of nested/hierarchical structures that are common in the NoSQL/internet-backend world, …
… except when they’re just BLOBs (Binary Large OBjects).
There’s a way to manually define “strict schemas” on the structured objects, and a syntax for navigating their structure in WHERE clauses.
There’s also a way to automagically infer “dynamic schemas”, but it’s simplistic enough to be more suitable for development/prototyping than for serious production.

Crate gave an example of data from >800 kinds of sensors being stored together in a single table. This leads to significant complexity in the FROM clauses. But querying the same data in a relational schema would be at least as complicated, and probably worse.

One key to understanding Crate’s architectural choices is to note that they’re willing to have different latency/consistency standards for:

Writes and single-row look-ups.
Aggregates and joins.

And so it makes sense that:

Data is banged into CrateDB in a NoSQL-ish kind of way as it arrives, with RYW consistency.
The indexes needed for SQL functionality are updated in microbatches as soon as possible thereafter. (Think 100 milliseconds as a base case.) Crate.io characterizes the consistency for this part as “eventual”.

CrateDB will never have real multi-statement transactions, but it has simpler levels of isolation that may be called “transactions” in some marketing contexts.

CrateDB technical highlights include:

CrateDB records are stored as JSON documents. (Actually, I didn’t ask whether this was true JSON or rather something “JSON-like”.)
- In the purely relational case, the documents may be regarded as glorified text strings.
- I got the impression that BLOB storage was somewhat separate from the rest.
CrateDB’s sharding story starts with consistent hashing.
- Shards are physical-only. CrateDB lacks the elasticity-friendly feature of there being many logical shards for each physical shard.
- However, you can change your shard count, and any future inserts will go into the new set of shards.
In line with its two consistency models, CrateDB also has two indexing strategies.
- Single-row/primary-key lookups have a “forward lookup” index, whatever that is.
- Tables also have a columnar index.
  - More complex queries and aggregations are commonly done straight against the columnar index, rather than the underlying data.
  - CrateDB’s principal columnar indexing strategy sounds a lot like inverted-list, which in turn is a lot like standard text indexing.
  - Specific datatypes — e.g. geospatial — can be indexed in different ways.
- The columnar index is shard-specific, and located at the same node as the shard.
- At least the hotter parts of the columnar index will commonly reside in memory. (I didn’t ask whether this was via straightforward caching or some more careful strategy.)
While I didn’t ask about CrateDB’s replication model in detail, I gathered that:
- Data is written synchronously to all nodes. (That’s sort of implicit in RYW consistency anyway.)
- Common replication factors are either 1 or 3, depending on considerations such as the value of the data. But as is usual, some tables can be replicated across all nodes.
- Data can be read from all replicas, for obvious reasons of performance.
Where relevant — e.g. the wire protocol or various SQL syntax specifics — CrateDB tends to emulate Postgres.
The CrateDB stack includes Elasticsearch and Lucene, both of which make sense in connection with Crate’s text/document orientation.

Crate.io is proud of its distributed/parallel story.

Any CrateDB node can plan a query. Necessary metadata for that is replicated across the cluster.
Execution starts on a shard-by-shard basis. Data is sorted at each shard before being sent onward.
Crate.io encourages you to run Spark and CrateDB on the same nodes.
- This is supported by parallel Spark-CrateDB integration of the obvious kind.
- Crate.io notes a happy synergy to this plan, in that Spark stresses CPU while CrateDB is commonly I/O-bound.

The CrateDB-Spark integration was the only support I could find for various marketing claims about combining analytics with data management.

Given how small and young Crate.io is, there are of course many missing features in CrateDB. In particular:

A query can only reshuffle data once. Hence, CrateDB isn’t currently well-designed for queries that join more than 2 tables together.
The only join strategy currently implemented is nested loop. Others are in the future.
CrateDB has most of ANSI SQL 92, but little or nothing specific to SQL 99. In particular, SQL windowing is under development.
Geo-distribution is still under development (even though most CrateDB data isn’t actually about people).
I imagine CrateDB administrative tools are still rather primitive.

In any case, creating a robust DBMS is an expensive and time-consuming process. Crate has a long road ahead of it.

Edit: For some clarification and even correction, please see the first comment below.

MongoDB 3.4 and “multimodel” query

Curt Monash — Wed, 23 Nov 2016 12:01:25 +0000

“Multimodel” database management is a hot new concept these days, notwithstanding that it’s been around since at least the 1990s. My clients at MongoDB of course had to join the train as well, but they’ve taken a clear and interesting stance:

A query layer with multiple ways to query and analyze data.
A separate data storage layer in which you have a choice of data storage engines …
… each of which has the same logical (JSON-based) data structure.

When I pointed out that it would make sense to call this “multimodel query” — because the storage isn’t “multimodel” at all — they quickly agreed.

To be clear: While there are multiple ways to read data in MongoDB, there’s still only one way to write it. Letting that sink in helps clear up confusion as to what about MongoDB is or isn’t “multimodel”. To spell that out a bit further:

In query, MongoDB mixes multiple paradigms for DML (Data Manipulation Language). The main one is of course JSON.
When writing, the DML paradigm is unmixed — it’s just JSON.

Further, MongoDB query DML statements can be mixed with analytic functions rooted in Spark.

The main ways to query data in MongoDB, to my knowledge, are:

Native/JSON. Duh.
SQL.
- MongoDB has used MySQL as a guide to what SQL coverage they think the market is calling for.
- More to the point, they’re trying to provide enough SQL so that standard business intelligence tools work well (enough) against MongoDB.
- I neglected to ask why this changed from MongoDB’s adamantly non-SQL approach of 2 1/2 years ago.
Search.
- MongoDB has been adding text search features for a few releases.
- MongoDB’s newest search feature revolves around “facets”, in the Endeca sense of the term. MongoDB characterizes as a kind of text-oriented GroupBy.
Graph. MongoDB just introduced a kind of recursive join capability, which is useful for detecting multi-hop relationships (e.g. ancestor/descendant rather than just parent/child). MongoDB declares that the “graph” box is thereby checked.

Three years ago, in an overview of layered and multi-DML architectures, I suggested:

Layered DBMS and multimodel functionality fit well together.
Both carried performance costs.
In most cases, the costs could be affordable.

MongoDB seems to have bought strongly into that view on the query side — which is, of course, exactly the right way for them to have started.

Are analytic RDBMS and data warehouse appliances obsolete?

Curt Monash — Mon, 29 Aug 2016 01:28:31 +0000

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.

Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”

If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.

Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.

And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.

My technical observations around these trends include:

Advanced analytics commonly require flexible, iterative processing.
Spark is much better at such processing than earlier Hadoop …
… which in turn is better than anything that’s been built into an analytic RDBMS.
Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.

And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.

They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.

But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer the trade-offs:

They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (data base administration can be just-in-time) or to OLTP integration (less or no re-loading).
Other integrations, as noted above, can also be weak.

Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.

Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.

Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.

Introduction to data Artisans and Flink

Curt Monash — Sun, 21 Aug 2016 21:15:59 +0000

data Artisans and Flink basics start:

Flink is an Apache project sponsored by the Berlin-based company data Artisans.
Flink has been viewed in a few different ways, all of which are similar to how Spark is seen. In particular, per co-founder Kostas Tzoumas:
- Flink’s original goal was “Hadoop done right”.
- Now Flink is focused on streaming analytics, as an alternative to Spark Streaming, Samza, et al.
Kostas seems to see Flink as a batch-plus-streaming engine that’s streaming-first.

Like many open source projects, Flink seems to have been partly inspired by a Google paper.

To this point, data Artisans and Flink have less maturity and traction than Databricks and Spark. For example:

The first line of Flink code dates back to 2010.
data Artisans and the Flink open source project both started in 2014.
When I met him in late June, Kostas told me that Data Artisans had raised $7 million and had 15 employees.
Flink’s current SQL support is very minor.

Per Kostas, about half of Flink committers are at data Artisans; others are at Cloudera, Hortonworks, Confluent, Intel, at least one production user, and some universities. Kostas provided about 5 examples of production Flink users, plus a couple of very big names that were sort-of-users (one was using a forked version of Flink, while another is becoming a user “soon”).

The technical story at data Artisans/Flink revolves around the assertion “We have the right architecture for streaming.” If I understood data Artisans co-founder Stephan Ewen correctly on a later call, the two key principles in support of that seem to be:

The key is to keep data “transport” running smoothly without interruptions, delays or bottlenecks, where the relevant sense of “transport” is movement from one operator/operation to the next.
In this case, the Flink folks feel that modularity supports efficiency.

In particular:

Anything that relates to consistency/recovery is kept almost entirely separate from basic processing, with minimal overhead and nothing that resembles a lock.
Windowing and so on operate separately from basic “transport” as well.
The core idea is that special markers — currently in the ~20 byte range in size — are injected into the streams. When the marker gets to an operator, the operator snapshots the then-current state of its part of the stream.
Should recovery ever be needed, consistency is achieved by assembling all the snapshots corresponding to a single marker, and replaying any processing that happened after those snapshots were taken.
- Actually, this is oversimplified, in that it assumes there’s only a single input stream.
- A lot of Flink’s cleverness, I gather, is involved in assembling a consistent snapshot despite the realities of multiple input streams.

The upshot, Flink partisans believe, is to match the high throughput of Spark Streaming while also matching the low latency of Storm.

The Flink folks naturally have a rich set of opinions about streaming. Besides the points already noted, these include:

“Exactly once” semantics are best in almost all use cases, as opposed to “at least once”, or to turning off fault tolerance altogether. (Exceptions might arise in extreme performance scenarios, or because of legacy systems’ expectations.)
Repetitive, scheduled batch jobs are often “streaming processes in disguise”. Besides any latency benefits, reimplementing them using streaming technology might simplify certain issues that can occur around the boundaries of batch windows. (The phrase “continuous processing” could reasonably be used here.)

We discussed joins quite a bit, but this was before I realized that Flink didn’t have much SQL support. Let’s just say they sounded rather primitive even when I assumed they were done via SQL.

Our discussion of windowing was more upbeat. Flink supports windows based either on timestamps or data arrival time, and these can be combined as needed. Stephan thinks this flexibility is important.

As for Flink use cases, they’re about what you’d expect:

Plenty of data transformation, because that’s how all these systems start out. Indeed, the earliest Flink adoption was for batch transformation.
Plenty of stream processing.

But Flink doesn’t have all the capabilities one would want for the kinds of investigative analytics commonly done on Spark.

Related links

My recent series of Spark posts offer comparison or background to this one.
I surveyed Spark Streaming, Storm et al. in January.
How you factor things is always important.
data Artisans has a non-obvious URL.

More about Databricks and Spark

Curt Monash — Sun, 21 Aug 2016 20:36:15 +0000

Databricks CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world includes:

What I called Databricks’ “secondary business” of “licensing stuff to Spark distributors” was really about second/third tier support. Fair enough. But distributors of stacks including Spark, for whatever combination of on-premise and cloud as the case may be, may in many cases be viewed as competitors to Databricks cloud-only service. So why should Databricks help them?
Databricks’ investment in Spark Summit and similar evangelism is larger than I realized.
Ali suggests that the fraction of Databricks’ engineering devoted to open source Spark is greater than I understood during my recent visit.

Ali also walked me through customer use cases and adoption in wonderful detail. In general:

A large majority of Databricks customers have machine learning use cases.
Predicting and preventing user/customer churn is a huge issue across multiple market sectors.

The story on those sectors, per Ali, is:

First, Databricks penetrated ad-tech, for use cases such as ad selection.
Databricks’ second market was “mass media”.
- Disclosed examples include Viacom and NBC/Universal.
- There are “many” specific use cases. Personalization is a big one.
- Conviva-style video operations optimization is a use case for several customers, naturally including Conviva. (Reminder: Conviva was Ion Stoica’s previous company.)
Health care came third.
- Use cases here seem to be concentrated on a variety of approaches to predict patient outcomes.
- Analytic techniques often combine machine learning with traditional statistics.
- Security is a major requirement in this sector; fortunately, Databricks believes it excels at that.
Next came what he calls “industrial IT”. This group includes cool examples such as:
- Finding oil.
- Predictive maintenance of wind turbines.
- Predicting weather based on sensor data.
Finally (for now), there’s financial services. Of course, “financial services” comprises a variety of quite different business segments. Example use cases include:
- Credit card marketing.
- Investment analysis (based on expensive third-party data sets that are already in the cloud).
- Anti-fraud.

At an unspecified place in the timeline is national security, for a use case very similar to anti-fraud — identifying communities of bad people. Graph analytics plays a big role here.

And finally, of course we discussed some technical stuff, in philosophy, futures and usage as the case may be. In particular, Ali stressed that Spark 2.0 is the first that “breaks”/changes the APIs; hence the release number. It is now the case that:

There’s a single API for batch and streaming alike, and for machine learning “too”. This is DataFrames/DataSets. In this API …
… everything is a table. That said:
- Tables can be nested.
- Tables can be infinitely large, in which case you’re doing streaming.
Based on this, Ali thinks Spark 2.0 is now really a streaming engine.

Other tidbits included:

Ali said that every Databricks customer uses SQL. No exceptions.
- Indeed, a “number” of customers are using business intelligence tools. Therefore …
- … Databricks is licensing connector technology from Simba.
They’re working on model serving, with a REST API, rather than just model building. This was demoed at the recent Spark Summit, but is still in the “nascent” stage.
Ali insists that every streaming system with good performance does some kind of micro-batching under the hood. But the Spark programmers no longer need to take that directly into account. (In earlier versions, programmatic window sizes needed to be integer multiples of the low-level system’s chosen interval.)
In the future, when Databricks runs on more than just the Amazon cloud, Databricks customers will of course have cloud-to-cloud portability.

Notes on vendor lock-in

Curt Monash — Wed, 20 Jul 2016 01:35:32 +0000

Vendor lock-in is an important subject. Everybody knows that. But few of us realize just how complicated the subject is, nor how riddled it is with paradoxes. Truth be told, I wasn’t fully aware either. But when I set out to write this post, I found that it just kept growing longer.

1. The most basic form of lock-in is:

You do application development for a target set of platform technologies.
Your applications can’t run without those platforms underneath.
Hence, you’re locked into those platforms.

2. Enterprise vendor standardization is closely associated with lock-in. The core idea is that you have a mandate or strong bias toward having different apps run over the same platforms, because:

That simplifies your environment, requiring less integration and interoperability.
That simplifies your staffing; the same skill sets apply to multiple needs and projects.
That simplifies your vendor support relationships; there’s “one throat to choke”.
That simplifies your price negotiation.

3. That last point is double-edged; you have more power over suppliers to whom you give more business, but they also have more power over you. The upshot is often an ELA (Enterprise License Agreement), which commonly works:

For a fixed period of time, the enterprise may use as much of a given product set as they want, with costs fixed in advance.
A few years later, the price is renegotiated, based on then-current levels of usage.

Thus, doing an additional project using ELAed products may appear low-cost.

Incremental license and maintenance fees may be zero in the short-term.
Incremental personnel costs may be controlled because the needed skills are already in-house.

Often those appearances are substantially correct. That’s a big reason why incumbent software is difficult to supplant unless the upstart substitute is superior in fundamental and important ways.

4. Subscriptions are closely associated with lock-in.

Most obviously, the traditional software industry gets its profits from high-margin support/maintenance services.
Cloud lock-in has rapidly become a big deal.
The open source vendors meeting lock-in resistance, noted below, have subscription business models.

Much of why customers care about lock-in is the subscription costs it’s likely to commit them to.

5. Also related to lock-in are thick single-vendor technology stacks. If you run Oracle applications, you’re going to run the Oracle DBMS too. And if you run that, you’re likely to run other Oracle software, and perhaps use Exadata hardware as well. The cloud ==> lock-in truism is an example of this point as well.

6. There’s a lot of truth to the generality that central IT cares about overall technology architecture, while line-of-business departments just want to get the job done. This causes departments to both:

Oppose standardization.
Like thick technology stacks.

Thus, departmental influence on IT both encourages and discourages lock-in.

7. IBM is all about lock-in. IBM’s support for Linux, Eclipse and so on don’t really contradict that. IBM’s business model is to ~~squeeze~~ serve its still-large number of strongly loyal customers as well as it can.

8. Microsoft’s business model over the decades has also greatly depended on lock-in.

Indeed, it exploited Windows/Office lock-in so vigorously as to incur substantial anti-trust difficulties.
Server-side Windows tends to be involved in thick stacks — DBMS, middleware, business intelligence, SharePoint and more. Many customers (smaller enterprises or in some cases departments) are firmly locked into these stacks.
Microsoft is making a strong cloud push with Azure, which inherently involves lock-in.

Yet sometimes, Microsoft is more free and open.

Office for Macintosh allowed the Mac to be a viable Windows competitor. (And Microsoft was well-paid for that, generating comparable revenue per Mac to what it got for each Windows PC.)
Visual Studio is useful for writing apps to run against multiple DBMS.
Just recently, Microsoft SQL Server was ported to Linux.

9. SAP applications run over several different DBMS, including its own cheap MaxDB. That counteracts potential DBMS lock-in. But some of its newer apps are HANA-specific. That, of course, has the opposite effect.

10. And with that as background, we can finally get to what led me to finally write this post. Multiple clients have complaints that may be paraphrased as:

Customers are locked into expensive traditional DBMS such as Oracle.
Yet they’re so afraid of lock-in now that they don’t want to pay for our vendor-supplied versions of open source database technologies; they prefer to roll their own.
Further confusing matters, they also are happy to use cloud technologies, including the associated database technologies (e.g. . Redshift or other Amazon offerings), creating whole new stacks of lock-in.

So open source vendors of NoSQL data managers and similar technologies felt like they were the only kind of vendor suffering from fear of lock-in.

I agree with them that enterprises who feel this way are getting it wrong. Indeed:

The management of even NoSQL DBMS is a big issue, and help in that area has high cash value for customers.
Serious users need support.
Support and management tools happen to be synergistic with each other.

This is the value proposition that propelled Cloudera. It’s also a strong reason to give money to whichever of MongoDB, DataStax, Neo Technology et al. sponsors open source technology that you use.

General disclosure: My fingerprints have been on this industry strategy since before the term “NoSQL” was coined. It’s been an aspect of many different consulting relationships.

Some enterprises push back, logically or emotionally as the case may be, by observing that the best internet companies — e.g., Facebook — are allergic to paying for software, even open source. My refutations of that argument include:

Facebook has more and better engineers than you do.
Facebook has a lot more servers than you do, and would presumably face much higher prices than you would if you each chose to forgo the in-house alternative.
Facebook pays for open source software in a different way than through subscription fees — it invents and enhances it. Multiple important projects have originated at Facebook, and it contributes to many others. Are you in a position to do the same thing?

And finally — most of Facebook’s users get its service for free. (Advertisers are the ones who pay cash; all others just pay in attention to the ads.) So if getting its software for free actually does screw up its SLAs (Service Level Agreements) — well, free generally comes with poorer SLAs than paid. But if you’re in the business of serving paying customers, then you might want to have paying-customer kinds of SLAs, even on the parts of your technology — e.g. websites urging people to do business with you — that you provide for free yourself.

Related links

The technology underlying packaged applications (November, 2015, but it has a historical focus)
Topics in migration (January, 2015)
Much of the vendor advice on Strategic Messaging.

Notes from a long trip, July 19, 2016

Curt Monash — Wed, 20 Jul 2016 01:34:31 +0000

For starters:

I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.

A running list of recent posts is:

As a companion to this post, I’m publishing a very long one on vendor lock-in.
Spark and Databricks are both prospering, and of course enhancing their technology as well.
Ditto DataStax.
Flink is interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need.

Subjects I’d like to add to that list include:

MemSQL, Zoomdata, and Neo Technology (also prospering).
Cloudera (multiple topics, as usual).
Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
Microsoft’s reinvention (it feels real).
Metadata (it’s ever more of a thing).
Machine learning (it’s going to be a big portion of my research going forward).
Transitions to the cloud — this subject affects almost everything else.

I’ll edit these lists as appropriate when further posts go up. Last update: August 23, 2016.

Let’s cover some other subjects right here.

1. While Kafka is widely agreed to be the universal delivery mechanism for streams, the landscape for companion technologies is confused.

Back in January I wrote that the leaders were mainly Spark Streaming, followed by Storm.
I overlooked the fact that Storm creator Twitter was replacing Storm with something called Heron.*
If there’s any buzz about Confluent’s replacement for distant-third-place contender Samza, I missed it.
Opinions about Spark Streaming are mixed. Some folks want to get away from it; others like it just fine.

And of course Flink is hoping to blow everybody else in the space away.

*But that kind of thing is not necessarily a death knell. Cassandra inventor Facebook soon replaced Cassandra with HBase, yet Cassandra is doing just fine.

As for the “lambda architecture” — that has always felt like a kludge, and various outfits are trying to obsolete it in various ways. As just one example, Cloudera described that to me during my visit as one of the main points of Kudu.

2. The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.

Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.
Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.

I had a moment of clarity on this point while visiting my clients at DataStax, and discussing their goal — shared by numerous companies — of being properly appreciated for the management tools they provide. In the room with me were CEO Billy Bosworth and chief evangelist Patrick McFadin — both of whom are former DBAs themselves.

3. I visited ClearStory, and Sharmila Mulligan showed me her actual sales database, as well as telling me some things about funding. The details are all confidential, but ClearStory is clearly doing better than rumor might suggest.

4. Platfora insisted on meeting circumstances in which it was inconvenient for me to take notes. So I have no details to share. But they sounded happy.

Edit: On July 22, it was announced that Workday is acquiring Platfora. Now I understand why Platfora gave me a bit of a runaround.

5. Pneubotics — with a cool new video on its home page — has found its first excellent product/market fit. Traditional heavy metallic robots are great at painting and related tasks when they can remain stationary, or move on rigid metal rails. Neither of those options works well, however, for large curved or irregular surfaces as might be found in the aerospace industry. Customer success for the leading soft robot company has ensued.

This all seems pretty close to the inspection/maintenance/repair area that I previously suggested could be a good soft robotics fit.

Kafka and Confluent

Curt Monash — Mon, 25 Jan 2016 11:27:13 +0000

For starters:

Kafka has gotten considerable attention and adoption in streaming.
Kafka is open source, out of LinkedIn.
Folks who built it there, led by Jay Kreps, now have a company called Confluent.
Confluent seems to be pursuing a fairly standard open source business model around Kafka.
Confluent seems to be in the low to mid teens in paying customers.
Confluent believes 1000s of Kafka clusters are in production.
Confluent reports 40 employees and $31 million raised.

At its core Kafka is very simple:

Kafka accepts streams of data in substantially any format, and then streams the data back out, potentially in a highly parallel way.
Any producer or consumer of data can connect to Kafka, via what can reasonably be called a publish/subscribe model.
Kafka handles various issues of scaling, load balancing, fault tolerance and so on.

So it seems fair to say:

Kafka offers the benefits of hub vs. point-to-point connectivity.
Kafka acts like a kind of switch, in the telecom sense. (However, this is probably not a very useful metaphor in practice.)

Jay also views Kafka as something like a file system. Kafka doesn’t actually have a file-system-like interface for managing streams, but he acknowledges that as a need and presumably a roadmap item.

The most noteworthy technical point for me was that Kafka persists data, for reasons of buffering, fault-tolerance and the like. The duration of the persistence is configurable, and can be different for different feeds, with two main options:

Guaranteed to have the last update of anything.
Complete for the past N days.

Jay thinks this is a major difference vs. messaging systems that have come before. As you might expect, given that data arrives in timestamp order and then hangs around for a while:

Kafka can offer strong guarantees of delivering data in the correct order.
Persisted data is automagically broken up into partitions.

Technical tidbits include:

Data is generally fresh to within 1.5 milliseconds.
100s of MB/sec/server is claimed. I didn’t ask how big a server was.
LinkedIn runs >1 trillion messages/day through Kafka.
Others in that throughput range include but are not limited to Microsoft and Netflix.
A message is commonly 1 KB or less.
At a guesstimate, 50%ish of messages are in Avro. JSON is another frequent format.

Jay’s answer to any concern about performance overhead for current or future features is usually to point out that anything other than the most basic functionality:

Runs in different processes from core Kafka …
… if it doesn’t actually run on a different cluster.

For example, connectors have their own pools of processes.

I asked the natural open source question about who contributes what to the Apache Kafka project. Jay’s quick answers were:

Perhaps 80% of Kafka code comes from Confluent.
LinkedIn has contributed most of the rest.
However, as is typical in open source, the general community has contributed some connectors.
The general community also contributes “esoteric” bug fixes, which Jay regards as evidence that Kafka is in demanding production use.

Jay has a rather erudite and wry approach to naming and so on.

Kafka got its name because it was replacing something he regarded as Kafkaesque. OK.
Samza is an associated project that has something to do with transformations. Good name. (The central character of The Metamorphosis was Gregor Samsa, and the opening sentence of the story mentions a transformation.)
In his short book about logs, Jay has a picture caption “ETL in Ancient Greece. Not much has changed.” The picture appears to be of Sisyphus. I love it.
I still don’t know why he named a key-value store Voldemort. Perhaps it was something not to be spoken of.

What he and his team do not yet have is a clear name for their product category. Difficulties in naming include:

Kafka is limited and simple. But of course Confluent has plans to broaden its capabilities.
It’s long been hard to decide whether to talk about “events”, “streams” or both.
“Streaming” has another tech meaning, in the context of video, songs, etc.
One candidate name, “event hub”, has already been grabbed by IBM and Microsoft for their specific offerings.
Naming is always hard in general.

Confluent seems to be using “stream data platform” as a placeholder. As per the link above, I once suggested Data Stream Management System, or more concisely Datastream Manager. “Event”, “event stream” or “event series” could perhaps be mixed in as well. I don’t really have an opinion yet, and probably won’t until I’ve studied the space in a little more detail.

And on that note, I’ll end this post for reasons of length, and discuss Kafka-related technology separately.

Related links

My October, 2014 post on Streaming for Hadoop is a sort of predecessor to this two-post series.