Data models and architecture – DBMS 2 : DataBase Management System Services

Some stuff that’s always on my mind

Curt Monash — Sun, 20 May 2018 19:27:21 +0000

I have a LOT of partially-written blog posts, but am struggling to get any of them finished (obviously). Much of the problem is that they have so many dependencies on each other. Clearly, then, I should consider refactoring my writing plans.

So let’s start with this. Here, in no particular order, is a list of some things that I’ve said in the past, and which I still think are or should be of interest today. It’s meant to be background for numerous posts I write in the near future, and indeed a few hooks for such posts are included below.

1. Data(base) management technology is progressing pretty much as I expected.

Vendors generally recognize that maturing a data store is an important, many-years-long process.
Multiple kinds of data model are viable …
… but it’s usually helpful to be able to do some kind of JOIN.
To deal with the variety of hardware/network/storage arrangements out there, layering/tiering is on the rise. (An amazing number of vendors each seem to think they invented the idea.)

2. Rightly or wrongly, enterprises are often quite sloppy about analytic accuracy.

My two central examples have long been inaccurate metrics and false-positive alerts.
In predictive analytics, it’s straightforward to quantify how much additional value you’re leaving on the table with your imperfect accuracy.
Enterprise search and other text technologies are still often terrible.
After years of “real-time” overhype, organizations have seemingly swung to under-valuing real-time analytics.

3. Outside traditional enterprises, the accuracy problem can be even worse, and the consequences of analytic inaccuracy can be severe. In some cases this is well understood; autonomous vehicle researchers, for example, seem properly attentive to the challenge of not-killing-pedestrians. But in others it’s a mess. For example, I don’t think the “fake news on social media” challenge will be resolved without new technical approaches that, to my knowledge, aren’t yet even being tried.

4. More generally, I’ve long argued that the technology industry would someday have to deal with a variety of public policy and social concerns. That day has come. In anticipation, I wrote at length about privacy/surveillance, and a little about some other areas, including net neutrality, patents, economic development, and public technology spending. Missing subjects include censorship (private and public alike), and perhaps also the efforts to tie data ownership into anti-trust policy.

5. Given all the tech-specific public policy work that’s needed, I’m pulling back from some my broader political efforts. However, I stand by my overview opinions of last February, and I delivered on some of its IOUs in a two-part series on persuasion.

6. The ongoing rise of “edge computing” and the “Internet of Things” fit into the general trend that in 2013 I summarized as appliances, clusters and clouds.

7. I continue to think that a huge fraction of analytics is properly characterized as monitoring. That ties into a number of areas of interest. For example:

Platform technologies — including distributed data management — are often compete on the maturity of their built-in monitoring.
My complaints about BI inaccuracy commonly relate to use cases in monitoring.
Privacy/surveillance issues are commonly about monitoring. It’s common to worry that such monitoring is actually too accurate.
But I also worry that privacy/surveillance monitoring isn’t accurate enough … and hence that it leads to people being discriminated against who absolutely don’t need to be.
Edge computing involves a lot of devices that need to be monitored.
Censorship obviously has a lot to do with monitoring.

8. And finally for now, my core precepts for strategic messaging haven’t changed.

Related link

As you may have already guessed, the title of this post is based on a classic song.

Interana

Curt Monash — Mon, 17 Apr 2017 10:10:41 +0000

Interana has an interesting story, in technology and business model alike. For starters:

Interana does ad-hoc event series analytics, which they call “interactive behavioral analytics solutions”.
Interana has a full-stack analytic offering, include:
- Its own columnar DBMS …
- … which has a non-SQL DML (Data Manipulation Language) meant to handle event series a lot more fluently than SQL does, but which the user is never expected to learn because …
- … there also are BI-like visual analytics tools that support plenty of drilldown.
Interana sells all this to “product” departments rather than marketing, because marketing doesn’t sufficiently value Interana’s ad-hoc query flexibility.
Interana boasts >40 customers, with annual subscription fees ranging from high 5 figures to low 7 digits.

And to be clear — if we leave aside any questions of marketing-name sizzle, this really is business intelligence. The closest Interana comes to helping with predictive modeling is giving its ad-hoc users inspiration as to where they should focus their modeling attention.

Interana also has an interesting twist in its business model, which I hope can be used successfully by other enterprise software startups as well.

For now, at no extra charge, Interana will operate its software for you as a managed service. (A majority of Interana’s clients run the software on Amazon or Azure, where that kind of offering makes sense.)
However, presumably in connection with greater confidence in its software’s ease of administration, Interana will move this year toward unbundling the service as an extra-charge offering on top of the software itself.

The key to understanding Interana is its DML. Notes on that include:

Interana’s DML is focused on path analytics …
- … but Interana doesn’t like to use that phrase because it sounds too math-y and difficult.
- Interana may be the first company that’s ever told me it’s focused on providing a better nPath.
Primitives in Interana’s language — notwithstanding the company’s claim that it never ever intended to sell to marketing departments — include familiar web analytics concepts such as “session”, “funnel” and so on. (However, these are being renamed to more neutral terms such as “flow” in an upcoming version of the product.)
As typical example questions or analytic subjects, Interana offered:
- “Which are the most common products in shopping carts where time-to-checkout was greater than 30 minutes?”
- Exactly which steps in the onboarding process result in the greatest user frustration?
The Interana folks and I agree that Splunk is the most recent example of a new DML kicking off a significant company.
The most recent example I can think of in which a vendor hung its hat on a new DML that was a “visual programming language” is StreamBase, with EventFlow. That didn’t go all that well.
To use Founder/CTO Bobby Johnson’s summary term, the real goal of the Interana language is to describe a state machine, specifically one that produces (sets of) sequences of events (and the elapsed time between them).

Notes on Interana speeds & feeds include:

Interana only promises data freshness up to micro-batch latencies — i.e., a few minutes. (Obviously, this shuts them out of most networking monitoring and devops use cases.)
Interana thinks it’s very important for query response time to max out at a low number of seconds. If necessary, the software will return approximate results rather than exact ones so as to meet this standard.
Interana installations and workloads to date have gotten as large as:
- 1-200 nodes.
- Trillions of rows, equating to 100s of TBs of data after compression/ >1 PB uncompressed.
- Billions of rows/events received per day.
- 100s of 1000s of (very sparse) columns.
- 1000s of named users.

Although Interana’s original design point was spinning disk, most customers store their Interana data on flash.

Interana architecture choices include:

They’re serious about micro-batching.
- If the user’s data is naturally micro-batched — e.g. a new S3 bucket every few minutes — Interana works with that.
- Even if the customer’s data is streamed — e.g. via Kafka — Interana insists on micro-batching it.
They’re casual about schemas.
- Interana assumes data arrives with some kind of recognizable structure, via JSON, CSV or whatever.
  - Interana observes, correctly, that log data often is decently structured.
    - For example, if you’re receiving “phone home” pings from products you originally manufactured, you know what data structures to expect.
    - Interana calls this “logging with intent”.
  - Interana is fine with a certain amount of JSON (for example) schema change over time.
  - If your arriving data truly is a mess, then you need to calm it down via a pass through Splunk or whatever before sending it to Interana.
- JSON hierarchies turn into multi-part column names in the usual way.
- Interana supports one level of true nesting, and one level only; column values can be “lists”, but list values can’t be list themselves.

Finally, other Interana tech notes include:

Compression is a central design consideration …
- … especially but not only compression algorithms designed to deal with great sparseness, such as run-length encoding (RLE).
- Dictionary compression, in a strategy that is rarer than I once expected it to be, uses a global rather than shard-by-shard dictionary. The data Interana expects is of low-enough cardinality for this to be the better choice.
- Column data is sorted. A big part of the reason is of course to aid compression.
- Compression strategies are chosen automatically for each segment. Wholly automatically, I gather; you can’t tune the choice manually.
As you would think, Interana technically includes multiple data stores.
- Data first hits a write-optimized store. Unlike the case of Vertica, this WOS never is involved in answering queries.
- Asynchronously, the data is broken into columns, and banged to “disk”.
- Asynchronously again, the data is sorted.
- Queries run against sorted data, sorting recent blocks on-the-fly if necessary.
Interana lets you shard different replicas of the data according to different shard keys.
Interana is proud of the random sampling it does when serving approximate query results.

Introduction to SequoiaDB and SequoiaCM

Curt Monash — Sun, 12 Mar 2017 18:19:49 +0000

For starters, let me say:

SequoiaDB, the company, is my client.
SequoiaDB, the product, is the main product of SequoiaDB, the company.
SequoiaDB, the company, has another product line SequoiaCM, which subsumes SequoiaDB in content management use cases.
SequoiaDB, the product, is fundamentally a JSON data store. But it has a relational front end …
… and is usually sold for RDBMS-like use cases …
… except when it is sold as part of SequoiaCM, which adds in a large object/block store and a content-management-oriented library.
SequoiaDB’s products are open source.
SequoiaDB’s largest installation seems to be 2 PB across 100 nodes; that includes block storage.
Figures for DBMS-only database sizes aren’t as clear, but the sweet spot of the cluster-size range for such use cases seems to be 6-30 nodes.

Also:

SequoiaDB, the company, was founded in Toronto, by former IBM DB2 folks.
Even so, it’s fairly accurate to view SequoiaDB as a Chinese company. Specifically:
- SequoiaDB’s founders were Chinese nationals.
- Most of them went back to China.
- Other employees to date have been entirely Chinese.
- Sales to date have been entirely in China, but SequoiaDB has international aspirations
SequoiaDB has >100 employees, a large majority of which are split fairly evenly between “engineering” and “implementation and technical support”.
SequoiaDB’s marketing (as opposed to sales) department is astonishingly tiny.
SequoiaDB cites >100 subscription customers, including 10 in the global Fortune 500, a large fraction of which are in the banking sector. (Other sectors mentioned repeatedly are government and telecom.)

Unfortunately, SequoiaDB has not captured a lot of detailed information about unpaid open source production usage.

While I usually think that the advantages of open source are overstated, in SequoiaDB’s case open source will have* an additional benefit when SequoiaDB does go international — it addresses any concerns somebody might have about using Chinese technology.

*Edit: Actually, this claim is overstated based on SequoiaDB’s current open source practices. Please see the comment thread below.

SequoiaDB’s technology story starts:

SequoiaDB is a layered DBMS.
It manages JSON via update-in-place. MVCC (Multi-Version Concurrency Control) is on the roadmap.
Indexes are B-tree.
Transparent sharding and elasticity happen in what by now is the industry-standard/best-practices way:
- There are many (typically 4096) logical partitions, many of which are assigned to each physical partition.
- If the number of physical partitions changes, logical partitions are reassigned accordingly.
Relational OLTP (OnLine Transaction Processing) functionality is achieved by using a kind of PostgreSQL front end.
Relational batch processing is done via SparkSQL.
There also is a block/LOB (Large OBject) storage engine meant for content management applications.
SequoiaCM boils down technically to:
- SequoiaDB, which is used to store JSON metadata about the LOBs …
- … and whose generic-DBMS coordination capabilities are also used over the block/LOB engine.
- A Java library focused on content management.

SequoiaDB’s relationship with PostgreSQL is complicated, but as best I understand SequoiaDB’s relational operations:

SQL parsing, optimization, and so on rely mainly on PostgreSQL code. (Of course, there are some hacks, such as to the optimizer’s cost functions.)
Actual data storage is done via SequoiaDB’s JSON store, using PostgreSQL Foreign Data Wrappers. Each record goes in a separate JSON document. Locks, commits and so on — i.e. “write prevention” — are handled by the JSON store.
PostgreSQL’s own storage engine is actually part of the stack, but only to manage temp space and the like.

PostgreSQL stored procedures are already in the SequoiaDB product. Triggers and referential integrity are not. Neither, so far as I can tell, are PostgreSQL’s datatype extensibility capabilities.

I neglected to ask how much of that remains true when SparkSQL is invoked.

SequoiaDB’s use cases to date seem to fall mainly into three groups:

Content management via SequoiaCM.
“Operational data lakes”.
Pretty generic replacement of legacy RDBMS.

Internet back-ends, however — and this is somewhat counter-intuitive for an open-source JSON store — are rare, at least among paying subscription customers. But SequoiaDB did tell me of one classic IoT (Internet of Things) application, with lots of devices “phoning home” and the results immediately feeding a JSON-based dashboard.

To understand SequoiaDB’s “operational data lake” story, it helps to understand the typical state of data warehousing at SequoiaDB’s customers and prospects, which isn’t great:

2-3 years of data, and not all the data even from that time period.
Only enough processing power to support structured business intelligence …
… and hence little opportunity for ad-hoc query.

SequoiaDB operational data lakes offer multiple improvements over that scenario:

They hold as much relational data as customers choose to dump there.
That data can be simply copied from operational stores, with no transformation.
Or if data arrives via JSON — from external organizations or micro-services as the case may be — the JSON can be stored unmodified as well.
Queries can be run straight against this data soup.
Of course, views can also be set up in advance to help with querying.

Views are particularly useful with what might be called slowly changing schemas. (I didn’t check whether what SequoiaDB is talking about matches precisely with the more common term “slowly changing dimensions”.) Each time the schema changes, a new table is created in SequoiaDB to receive copies of the data. If one wants to query against the parts of the database structure that didn’t change — well, a view can be establish to allow for that.

Finally, it seems that SequoiaCM uses are concentrated in what might be called “security and checking-up” areas, such:

Photographs as part of an authentication process.
Video of in-person banking transactions, both for fraud prevention and for general service quality assurance.
Storage of security videos (for example from automated teller machines).

SequoiaCM deals seem to be bigger than other SequoiaDB ones, surely in part because the amounts of data managed are larger.

Introduction to Crate.io and CrateDB

Curt Monash — Sun, 18 Dec 2016 05:27:15 +0000

Crate.io and CrateDB basics include:

Crate.io makes CrateDB.
CrateDB is a quasi-RDBMS designed to receive sensor data and similar IoT (Internet of Things) inputs.
CrateDB’s creators were perhaps a little slow to realize that the “R” part was needed, but are playing catch-up in that regard.
Crate.io is an outfit founded by Austrian guys, headquartered in Berlin, that is turning into a San Francisco company.
Crate.io says it has 22 employees and 5 paying customers.
Crate.io cites bigger numbers than that for confirmed production users, clearly active clusters, and overall product downloads.

In essence, CrateDB is an open source and less mature alternative to MemSQL. The opportunity for MemSQL and CrateDB alike exists in part because analytic RDBMS vendors didn’t close it off.

CrateDB’s not-just-relational story starts:

A column can contain ordinary values (of usual-suspect datatypes) or “objects”, …
… where “objects” presumably are the kind of nested/hierarchical structures that are common in the NoSQL/internet-backend world, …
… except when they’re just BLOBs (Binary Large OBjects).
There’s a way to manually define “strict schemas” on the structured objects, and a syntax for navigating their structure in WHERE clauses.
There’s also a way to automagically infer “dynamic schemas”, but it’s simplistic enough to be more suitable for development/prototyping than for serious production.

Crate gave an example of data from >800 kinds of sensors being stored together in a single table. This leads to significant complexity in the FROM clauses. But querying the same data in a relational schema would be at least as complicated, and probably worse.

One key to understanding Crate’s architectural choices is to note that they’re willing to have different latency/consistency standards for:

Writes and single-row look-ups.
Aggregates and joins.

And so it makes sense that:

Data is banged into CrateDB in a NoSQL-ish kind of way as it arrives, with RYW consistency.
The indexes needed for SQL functionality are updated in microbatches as soon as possible thereafter. (Think 100 milliseconds as a base case.) Crate.io characterizes the consistency for this part as “eventual”.

CrateDB will never have real multi-statement transactions, but it has simpler levels of isolation that may be called “transactions” in some marketing contexts.

CrateDB technical highlights include:

CrateDB records are stored as JSON documents. (Actually, I didn’t ask whether this was true JSON or rather something “JSON-like”.)
- In the purely relational case, the documents may be regarded as glorified text strings.
- I got the impression that BLOB storage was somewhat separate from the rest.
CrateDB’s sharding story starts with consistent hashing.
- Shards are physical-only. CrateDB lacks the elasticity-friendly feature of there being many logical shards for each physical shard.
- However, you can change your shard count, and any future inserts will go into the new set of shards.
In line with its two consistency models, CrateDB also has two indexing strategies.
- Single-row/primary-key lookups have a “forward lookup” index, whatever that is.
- Tables also have a columnar index.
  - More complex queries and aggregations are commonly done straight against the columnar index, rather than the underlying data.
  - CrateDB’s principal columnar indexing strategy sounds a lot like inverted-list, which in turn is a lot like standard text indexing.
  - Specific datatypes — e.g. geospatial — can be indexed in different ways.
- The columnar index is shard-specific, and located at the same node as the shard.
- At least the hotter parts of the columnar index will commonly reside in memory. (I didn’t ask whether this was via straightforward caching or some more careful strategy.)
While I didn’t ask about CrateDB’s replication model in detail, I gathered that:
- Data is written synchronously to all nodes. (That’s sort of implicit in RYW consistency anyway.)
- Common replication factors are either 1 or 3, depending on considerations such as the value of the data. But as is usual, some tables can be replicated across all nodes.
- Data can be read from all replicas, for obvious reasons of performance.
Where relevant — e.g. the wire protocol or various SQL syntax specifics — CrateDB tends to emulate Postgres.
The CrateDB stack includes Elasticsearch and Lucene, both of which make sense in connection with Crate’s text/document orientation.

Crate.io is proud of its distributed/parallel story.

Any CrateDB node can plan a query. Necessary metadata for that is replicated across the cluster.
Execution starts on a shard-by-shard basis. Data is sorted at each shard before being sent onward.
Crate.io encourages you to run Spark and CrateDB on the same nodes.
- This is supported by parallel Spark-CrateDB integration of the obvious kind.
- Crate.io notes a happy synergy to this plan, in that Spark stresses CPU while CrateDB is commonly I/O-bound.

The CrateDB-Spark integration was the only support I could find for various marketing claims about combining analytics with data management.

Given how small and young Crate.io is, there are of course many missing features in CrateDB. In particular:

A query can only reshuffle data once. Hence, CrateDB isn’t currently well-designed for queries that join more than 2 tables together.
The only join strategy currently implemented is nested loop. Others are in the future.
CrateDB has most of ANSI SQL 92, but little or nothing specific to SQL 99. In particular, SQL windowing is under development.
Geo-distribution is still under development (even though most CrateDB data isn’t actually about people).
I imagine CrateDB administrative tools are still rather primitive.

In any case, creating a robust DBMS is an expensive and time-consuming process. Crate has a long road ahead of it.

Edit: For some clarification and even correction, please see the first comment below.

Cloudera in the cloud(s)

Curt Monash — Fri, 22 Jan 2016 07:46:34 +0000

Cloudera released Version 2 of Cloudera Director, which is a companion product to Cloudera Manager focused specifically on the cloud. This led to a discussion about — you guessed it! — Cloudera and the cloud.

Making Cloudera run in the cloud has three major aspects:

Cloudera’s usual software, ported to run on the cloud platform(s).
Cloudera Director, which for example launches cloud instances.
Points of integration, e.g. taking information about security-oriented roles from the platform and feeding then to the role-based security that is specific to Cloudera Enterprise.

Features new in this week’s release of Cloudera Director include:

An API for job submission.
Support for spot and preemptable instances.
High availability.
Kerberos.
Some cluster repair.
Some cluster cloning.

I.e., we’re talking about some pretty basic/checklist kinds of things. Cloudera Director is evidently working for Amazon AWS and Google GCP, and planned for Windows Azure, VMware and OpenStack.

As for porting, let me start by noting:

Shared-nothing analytic systems, RDBMS and Hadoop alike, run much better in the cloud than they used to.
Even so, it seems that the future of Hadoop in the cloud is to rely on object storage, such as Amazon S3.

That makes sense in part because:

The applications where shared nothing most drastically outshines object storage are probably the ones in which data can just be filtered from disk — spinning-rust or solid-state as the case may be — and processed in place.
By way of contrast, if data is being redistributed a lot then the shared nothing benefit applies to a much smaller fraction of the overall workload.
The latter group of apps are probably the harder ones to optimize for.

But while it makes sense, much of what’s hardest about the ports involves the move to object storage. The status of that is roughly:

Cloudera already has a lot of its software running on Amazon S3, with Impala/Parquet in beta.
Object storage integration for Windows Azure is “in progress”.
Object storage integration for Google GCP it is “to be determined”.
Security for object storage — e.g. encryption — is a work in progress.
Cloudera Navigator for object storage is a roadmap item.

When I asked about particularly hard parts of porting to object storage, I got three specifics. Two of them sounded like challenges around having less detailed control, specifically in the area of consistency model and capacity planning. The third I frankly didn’t understand,* which was the semantics of move operations, relating to the fact that they were constant time in HDFS, but linear in size on object stores.

*It’s rarely obvious to me why something is o(1) until it is explained to me.

Naturally, we talked about competition, differentiation, adoption and all that stuff. Highlights included:

In general, Cloudera’s three big marketing messages these days can be summarized as “Fast”, “Easy”, and “Secure”.
Notwithstanding the differences as to which parts of the Cloudera stack run on premises, on Amazon AWS, on Microsoft Azure or on Google GCP, Cloudera thinks it’s important that its offering is the “same” on all platforms, which allows “hybrid” deployment.
In general, Cloudera still sees Hortonworks as a much bigger competitor than MapR or IBM.
Cloudera fondly believes that Cloudera Manager is a significant competitive advantage vs. Ambari. (This would presumably be part of the “Easy” claim.)
In particular, Cloudera asserts it has better troubleshooting/monitoring than the cloud alternatives do, because of superior drilldown into details.
Cloudera’s big competitor on the Amazon platform is Elastic MapReduce (EMR). Cloudera points out that EMR lacks various capabilities that are in the Cloudera stack. Of course, versions of these capabilities are sometimes found in other Amazon offerings, such as Redshift.
Cloudera’s big competitor on Azure is HDInsight. Cloudera sells against that via:
- General Cloudera vs. Hortonworks distinctions.
- “Hybrid”/portability.

Cloudera also offered a distinction among three types of workload:

ETL (Extract/Transform/Load) and “modeling” (by which Cloudera seems to mean predictive modeling).
- Cloudera pitches this as batch work.
- Cloudera tries to deposition competitors as being good mainly at these kinds of jobs.
- This can be reasonably said to be the original sweet spot of Hadoop and MapReduce — which fits with Cloudera’s attempt to portray competitors as technical laggards.
- Cloudera observes that these workloads tend to call for “transient” jobs. Lazier marketers might trot out the word “elasticity”.
BI (Business Intelligence) and “analytics”, by which Cloudera seems to mainly mean Impala and Spark.
“Application delivery”, by which Cloudera means operational stuff that can’t be allowed to go down. Presumably, this is a rough match to what I — and by now a lot of other folks as well — call short-request processing.

While I don’t agree with terminology that says modeling is not analytics, the basic distinction being drawn here make considerable sense.

Readings in Database Systems

Curt Monash — Thu, 10 Dec 2015 12:26:40 +0000

Mike Stonebraker and Larry Ellison have numerous things in common. If nothing else:

They’re both titanic figures in the database industry.
They both gave me testimonials on the home page of my business website.
They both have been known to use the present tense when the future tense would be more accurate.

I mention the latter because there’s a new edition of Readings in Database Systems, aka the Red Book, available online, courtesy of Mike, Joe Hellerstein and Peter Bailis. Besides the recommended-reading academic papers themselves, there are 12 survey articles by the editors, and an occasional response where, for example, editors disagree. Whether or not one chooses to tackle the papers themselves — and I in fact have not dived into them — the commentary is of great interest.

But I would not take every word as the gospel truth, especially when academics describe what they see as commercial market realities. In particular, as per my quip in the first paragraph, the data warehouse market has not yet gone to the extremes that Mike suggests,* if indeed it ever will. And while Joe is close to correct when he says that the company Essbase was acquired by Oracle, what actually happened is that Arbor Software, which made Essbase, merged with Hyperion Software, and the latter was eventually indeed bought by the giant of Redwood Shores.**

*When it comes to data warehouse market assessment, Mike seems to often be ahead of the trend.

**Let me interrupt my tweaking of very smart people to confess that my own commentary on the Oracle/Hyperion deal was not, in retrospect, especially prescient.

Mike pretty much opened the discussion with a blistering attack against hierarchical data models such as JSON or XML. To a first approximation, his views might be summarized as:

Logical hierarchical models can be OK in certain cases. In particular, JSON could be a somewhat useful datatype in an RDBMS.
Physical hierarchical models are horrible.
Rather, you should implement the logical hierarchical model over a columnar RDBMS.

My responses start:

Nested data structures are more important than Mike’s discussion seems to suggest.
Native XML and JSON stores are apt to have an index on every field. If you squint, that index looks a lot like a column store.
Even NoSQL stores should and I think in most cases will have some kind of SQL-like DML (Data Manipulation Language). In particular, there should be some ability to do joins, because total denormalization is not always a good choice.

In no particular order, here are some other thoughts about or inspired by the survey articles in Readings in Database Systems, 5th Edition.

I agree that OLTP (OnLine Transaction Processing) is transitioning to main memory.
I agree with the emphasis on “data in motion”.
While I needle him for overstating the speed of the transition, Mike is right that columnar architectures are winning for analytics. (Or you could say they’ve won, if you recognize that mop-up from the victory will still take 1 or 2 decades.)
The guys seem to really hate MapReduce, which is an old story for Mike, but a bit of a reversal for Joe.
MapReduce is many things, but it’s not a data model, and it’s also not something that Hadoop 1.0 was an alternative to. Saying each of those things was sloppy writing.
The guys characterize consistency/transaction isolation as a rather ghastly mess. That part was an eye-opener.
Mike is a big fan of arrays. I suspect he’s right in general, although I also suspect he’s overrating SciDB. I also think he’s somewhat overrating the market penetration of cube stores, aka MOLAP.
The point about Hadoop (in particular) and modern technologies in general showing the way to modularization of DBMS is an excellent one.
Joe and Mike disagreed about analytics; Joe’s approach rang truer for me. My own opinion is:
- Business intelligence has been important for quite a while, and won’t stop.
- Machine learning is becoming ever more important.
- It’s still early days for the integration of the two areas, but much more will come.
The challenge of whether anybody wants to do machine learning (or other advanced analytics) over a DBMS is sidestepped in part by the previously mentioned point about the modularization of a DBMS. Hadoop, for example, can be both an OK analytic DBMS (although not fully competitive with mature, dedicated products) and of course also an advanced analytics framework.
Similarly, except in the short-term I’m not worried about the limitations of Spark’s persistence mechanisms. Almost every commercial distribution of Spark I can think of is part of a package that also contains a more mature data store.
Versatile DBMS and analytic frameworks suffer strategic contention for memory, with different parts of the system wanting to use it in different ways. Raising that as a concern about the integration of analytic DBMS with advanced analytic frameworks is valid.
I used to overrate the importance of abstract datatypes, in large part due to Mike’s influence. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. A big part of the problem is what I mentioned in the previous point — different parts of a versatile DBMS would prefer to do different things with memory.
I used to overrate the importance of user-defined functions in an analytic RDBMS. Mike had nothing to do with my error. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. Looser coupling between analytics and data management seems more flexible.
Excellent points are made about the difficulties of “First we build the perfect schema” data warehouse projects and, similarly, MDM (Master Data Management).
There’s an interesting discussion that helps explain why optimizer progress is so slow (both for the industry in general and for each individual product).

Related links

I did a deep dive into MarkLogic’s indexing strategy in 2008, which informed my comment about XML/JSON stores above.
Again with MarkLogic as the focus, in 2010 I was skeptical about document stores not offering joins. MarkLogic has since capitulated.
I’m not current on SciDB, but I did write a bit about it in 2010.
I’m surprised that I can’t find a post to point to about modularization of DBMS. I’ll leave this here as a placeholder until I can.
Edit: As promised, I’ve now posted about the object-relational/abstract datatype boom of the 1990s.

Couchbase 4.0 and related subjects

Curt Monash — Thu, 15 Oct 2015 15:17:44 +0000

I last wrote about Couchbase in November, 2012, around the time of Couchbase 2.0. One of the many new features I mentioned then was secondary indexing. Ravi Mayuram just checked in to tell me about Couchbase 4.0. One of the important new features he mentioned was what I think he said was Couchbase’s “first version” of secondary indexing. Obviously, I’m confused.

Now that you’re duly warned, let me remind you of aspects of Couchbase timeline.

2 corporate name changes ago, Couchbase was organized to commercialize memcached. memcached, of course, was internet companies’ default way to scale out short-request processing before the rise of NoSQL, typically backed by manually sharded MySQL.
Couchbase’s original value proposition, under the name Membase, was to provide persistence and of course support for memcached. This later grew into a caching-oriented pitch even to customers who weren’t already memcached users.
A merger with the makers of CouchDB ensued, with the intention of replacing Membase’s SQLite back end with CouchDB at the same time as JSON support was introduced. This went badly.
By now, however, Couchbase sells for more than distributed cache use cases. Ravi rattled off a variety of big-name customer examples for system-of-record kinds of use cases, especially in session logging (duh) and also in travel reservations.
Couchbase 4.0 has been in beta for a few months.

Technical notes on Couchbase 4.0 — and related riffs — start:

There’s a new SQL-like language called N1QL (pronounced like “nickel”). I’m hearing a lot about SQL-on-NoSQL these days. More on that below.
“Index”, “data” and “query” are three different services/tiers.
- You can run them all on the same nodes or separately. Couchbase doesn’t have enough experience yet with the technology to know which choice will wind up as a best practice.
- I’m hearing a lot about heterogeneous-node/multi-tier DBMS architectures these days, and would no longer stand by my 2009 statement that they are unusual. Other examples include Oracle Exadata, MySQL, MongoDB (now that it has pluggable storage engines), MarkLogic, and of course the whole worlds of Hadoop and Spark.
To be clear — the secondary indexes are global, and not tied to the same nodes as the data they index.
There’s a new back end called ForestDB, but if I understood correctly, it’s used just for the indexes, not for the underlying data.
ForestDB represents Couchbase indexes in something that resembles b-trees, but also relies on tries. Indeed, if I’m reading the relevant poster correctly, it’s based on a trie of b-trees.
In another increasingly common trend, Couchbase uses Bloom filters to help decide which partitions to retrieve for any particular query.

Up to a point, SQL-on-NoSQL stories can be fairly straightforward.

You define some kind of a table,* perhaps in a SQL-like DDL (Data Description Language).
SELECT, FROM and WHERE clauses work in the usual way.
Hopefully, if a column is going to have a lot of WHERE clauses on it, it also has an index.

For example, I think that’s the idea behind most ODBC/JDBC drivers for NoSQL systems. I think it’s also the idea behind most “SQL-like” languages that NoSQL vendors ship.

*Nobody I talk to about this ever wants to call it a “view”, but it sure sounds like a view to me — not a materialized view, of course, but a view nonetheless.

JOIN syntax can actually be straightforward as well under these assumptions. As for JOIN execution, Couchbase pulls all the data into the relevant tier, and nested loop execution there. My new clients at SequoiaDB have a similar strategy, by the way, although in their case there’s a hash join option as well.

But if things stopped there, they would miss an important complication: NoSQL has nested data. I.e., a value can actually be an array, whose entries are arrays themselves, and so on. That said, the “turtles all the way down” joke doesn’t quite apply, because at some point there are actual scalar or string values, and those are the ones SQL wants to actually operate on.

Most approaches I know of to that problem boil down to identifying particular fields as table columns, with or without aliases/renaming; I think that’s the old Hadapt/Vertica strategy, for example. Couchbase claims to be doing something a little different however, with a SQL-extending operator called UNNEST. Truth be told, I’m finding the N1QL language reference a bit terse, and haven’t figured out what the practical differences vs. the usual approach are, if any. But it sounds like there may be some interesting ideas in there somewhere.

Multi-model database managers

Curt Monash — Mon, 24 Aug 2015 08:07:00 +0000

I’d say:

Multi-model database management has been around for decades. Marketers who say otherwise are being ridiculous.
Thus, “multi-model”-centric marketing is the last refuge of the incompetent. Vendors who say “We have a great DBMS, and by the way it’s multi-model (now/too)” are being smart. Vendors who say “You need a multi-model DBMS, and that’s the reason you should buy from us” are being pathetic.
Multi-logical-model data management and multi-latency-assumption data management are greatly intertwined.

Before supporting my claims directly, let me note that this is one of those posts that grew out of a Twitter conversation. The first round went:

Merv Adrian: 2 kinds of multimodel from DBMS vendors: multi-model DBMSs and multimodel portfolios. The latter create more complexity, not less.

Me: “Owned by the same vendor” does not imply “well integrated”. Indeed, not a single example is coming to mind.

Merv: We are clearly in violent agreement on that one.

Around the same time I suggested that Intersystems Cache’ was the last significant object-oriented DBMS, only to get the pushback that they were “multi-model” as well. That led to some reasonable-sounding justification — although the buzzwords of course aren’t from me — namely:

Caché supports #SQL, #NoSQL. Interchange across tables, hierarchical, document storage.

Along the way, I was reminded that some of the marketing claims around “multi-model” are absurd. For example, at the time I am writing this, the Wikipedia article on “multi-model database” claims that “The first multi-model database was OrientDB, created in 2010…” In fact, however, by the definitions used in that article, multi-model DBMS date back to the 1980s, when relational functionality was grafted onto pre-relational systems such as TOTAL and IDMS.

What’s more, since the 1990s, multi-model functionality has been downright common, specifically in major products such as Oracle, DB2 and Informix, not to mention PostgreSQL. (But not so much Microsoft or Sybase.) Indeed, there was significant SQL standards work done around datatype extensions, especially in the contexts of SQL/MM and SQL3.

I tackled this all in 2013, when I argued:

One database to rule them all systems aren’t very realistic, but even so, …
… single-model systems will become increasingly obsolete.

Developments since then have been in line with my thoughts. For example, Spark added DataFrames, which promise substantial data model flexibility for Spark use cases, but more mature products have progressed in a more deliberate way.

What’s new in all this is a growing desire to re-integrate short-request and analytic processing — hence Gartner’s new-ish buzzword of HTAP (Hybrid Transactional/Analytic Processing). The more sensible reasons for this trend are:

Operational applications have always needed to accept immediate writes. (Losing data is bad.)
Operational applications have always needed to serve small query result sets based on the freshest data. (If you write something into a database, you might need to immediately retrieve it to finish the business operation.)
It is increasingly common for predictive decisions to be made at similar speeds. (That’s what recommenders and personalizers do.) Ideally, such decisions can be based on fresh and historical data alike.
The long-standing desire for business intelligence to operate on super-fresh data is, increasingly, making sense, as we get ever more stuff to monitor. However …
… most such analysis should look at historical data as well.
Streaming technology is supplying ever more fresh data.

But here’s the catch — the best models for writing data are the worst for reading it, and vice-versa, because you want to write data as a lightly-structured document or log, but read it from a Ted-Codd-approved RDBMS or MOLAP system. And if you don’t have the time to move data among multiple stores, then you want one store to do a decent job of imitating both kinds of architecture. The interesting new developments in multi-model data management will largely be focused on that need.

Related links

The two-policemen joke seems ever more relevant.
My April, 2015 post on indexing technology reminds us that one DBMS can do multiple things.
Back in 2009 integrating OLTP and data warehousing was clearly a bad idea.

A new logical data layer?

Curt Monash — Mon, 23 Mar 2015 05:36:44 +0000

I’m skeptical of data federation. I’m skeptical of all-things-to-all-people claims about logical data layers, and in particular of Gartner’s years-premature “Logical Data Warehouse” buzzphrase. Still, a reasonable number of my clients are stealthily trying to do some kind of data layer middleware, as are other vendors more openly, and I don’t think they’re all crazy.

Here are some thoughts as to why, and also as to challenges that need to be overcome.

There are many things a logical data layer might be trying to facilitate — writing, querying, batch data integration, real-time data integration and more. That said:

When you’re writing data, you want it to be banged into a sufficiently-durable-to-acknowledge condition fast. If acknowledgements are slow, performance nightmares can ensue. So writing is the last place you want an extra layer, perhaps unless you’re content with the durability provided by an in-memory data grid.
Queries are important. Also, they formally are present in other tasks, such as data transformation and movement. That’s why data manipulation packages (originally Pig, now Hive and fuller SQL) are so central to Hadoop.

Trivial query routing or federation is … trivial.

Databases have or can be given some kind of data catalog interface. Of course, this is easier for databases that are tabular, whether relational or MOLAP (Multidimensional OnLine Analytic Processing), but to some extent it can be done for anything.
Combining the catalogs can be straightforward. So can routing queries through the system to the underlying data stores.

In fact, what I just described is Business Objects’ original innovation — the semantic layer — two decades ago.

Careless query routing or federation can be a performance nightmare. Do a full scan. Move all the data to some intermediate server that lacks capacity or optimization to process it quickly. Wait. Wait. Wait. Wait … hmmm, maybe this wasn’t the best data-architecture strategy.

Streaming goes well with federation. Some data just arrived, and you want to analyze it before it ever gets persisted. You want to analyze it in conjunction with data that’s been around longer. That’s a form of federation right there.

There are ways to navigate schema messes. Sometimes they work.

Polishing one neat relational schema for all your data is exactly what people didn’t want to do when they decided to store a lot of the data non-relationally instead. Still, memorializing some schema after that fact may not be terribly painful.
Even so, text search can help you navigate the data wilds. So can collaboration tools. Neither helps all the time, however.

Neither extreme view here — “It’s easy!” or “It will never work!” — seems right. Rather, I think there’s room for a lot of effort and differentiation in exposing cross-database schema information.

I’m leaving out one part of the story on purpose — how these data layers are going to be packaged, and specifically what other functionality they will be bundled with. Confidentially would screw up that part of the discussion; so also would my doubts as to whether some of those plans are fully baked yet. That said, there’s an aspect of logical data layer to CDAP, and to Kiji as well. And of course it’s central to BI (Business Intelligence) and ETL (Extract/Transform/Load) alike.

One way or another, I don’t think the subject of logical data layers is going away any time soon.

Related link

Implicit in this post is the belief that enterprises should and do use many different data stores (June, 2014)

Notes on HBase

Curt Monash — Tue, 10 Mar 2015 18:24:40 +0000

I talked with a couple of Cloudera folks about HBase last week. Let me frame things by saying:

The closest thing to an HBase company, ala MongoDB/MongoDB or DataStax/Cassandra, is Cloudera.
Cloudera still uses a figure of 20% of its customers being HBase-centric.
HBaseCon and so on notwithstanding, that figure isn’t really reflected in Cloudera’s marketing efforts. Cloudera’s marketing commitment to HBase has never risen to nearly the level of MongoDB’s or DataStax’s push behind their respective core products.
With Cloudera’s move to “zero/one/many” pricing, Cloudera salespeople have little incentive to push HBase hard to accounts other than HBase-first buyers.

Also:

Cloudera no longer dominates HBase development, if it ever did.
- Cloudera is the single biggest contributor to HBase, by its count, but doesn’t make a majority of the contributions on its own.
- Cloudera sees Hortonworks as having become a strong HBase contributor.
- Intel is also a strong contributor, as are end user organizations such as Chinese telcos. Not coincidentally, Intel was a major Hadoop provider in China before the Intel/Cloudera deal.
As far as Cloudera is concerned, HBase is just one data storage technology of several, focused on high-volume, high-concurrency, low-latency short-request processing. Cloudera thinks this is OK because of HBase’s strong integration with the rest of the Hadoop stack.
Others who may be inclined to disagree are in several cases doing projects on top of HBase to extend its reach. (In particular, please see the discussion below about Apache Phoenix and Trafodion, both of which want to offer relational-like functionality.)

Cloudera’s views on HBase history — in response to the priorities I brought to the conversation — include:

HBase initially favored consistency over performance/availability, while Cassandra initially favored the opposite choice. Both products, however, have subsequently become more tunable in those tradeoffs.
Cloudera’s initial contributions to HBase focused on replication, disaster recovery and so on. I guess that could be summarized as “scaling”.
Hortonworks’ early HBase contributions included (but were not necessarily limited to):
- Making recovery much faster (10s of seconds or less, rather than minutes or more).
- Some of that consistency vs. availability tuning.
“Coprocessors” were added to HBase ~3 years ago, to add extensibility, with the first use being in security/permissions.
With more typical marketing-oriented version numbers:
- HBase .90, the first release that did a good job on durability, could have been 1.0.
- HBase .92 and .94, which introduced coprocessors, could have been Version 2.
- HBase .96 and .98 could have been Version 3.
- The recent HBase 1.0 could have been 4.0.

The HBase roadmap includes:

A kind of BLOB/CLOB (Binary/Character Large OBject) support.
- Intel is heavily involved in this feature.
- The initial limit is 10 megabytes or so, due to some limitations in the API (I didn’t ask why that made sense). This happens to be all the motivating Chinese customer needs for the traffic photographs it wants to store.
Various kinds of “multi-tenancy” support (multi-tenancy is one of those terms whose meaning is getting stretched beyond recognition), including:
- Mixed workload support (short-request and analytic) on the same nodes.
- Mixed workload support on different nodes in the same cluster.
- Security between different apps in the same cluster.
(Still in the design phase) Bottleneck Whack-A-Mole, with goals including but not limited to:
- Scale-out beyond the current assumed limit of ~1200 nodes.
- More predictable performance, based on smaller partition sizes.
(Possibly) Multi-data-center fail-over.

Not on the HBase roadmap per se are global/secondary indexes. Rather, we talked about projects on top of HBase which are meant to provide those. One is Apache Phoenix, which supposedly:

Makes it simple to manage compound keys. (E.g., City/State/ZipCode)
Provides global secondary indexes (but not in a fully ACID way).
Offers some very basic JOIN support.
Provides a JDBC interface.
Offers efficiencies in storage utilization, scan optimizations, and aggregate calculations.

Another such project is Trafodion — supposedly the Welsh word for “transaction” — open sourced by HP. This seems to be based on NonStop SQL and Neoview code, which counter-intuitively have always been joined at the hip.

There was a lot more to the conversation, but I’ll stop here for two reasons:

This post is pretty long already.
I’m reserving some of the discussion until after I’ve chatted with vendors of other NoSQL systems.

Related link

My July 2011 post on HBase offers context, as do the comments on it.