MarkLogic – DBMS 2 : DataBase Management System Services

Readings in Database Systems

Curt Monash — Thu, 10 Dec 2015 12:26:40 +0000

Mike Stonebraker and Larry Ellison have numerous things in common. If nothing else:

They’re both titanic figures in the database industry.
They both gave me testimonials on the home page of my business website.
They both have been known to use the present tense when the future tense would be more accurate.

I mention the latter because there’s a new edition of Readings in Database Systems, aka the Red Book, available online, courtesy of Mike, Joe Hellerstein and Peter Bailis. Besides the recommended-reading academic papers themselves, there are 12 survey articles by the editors, and an occasional response where, for example, editors disagree. Whether or not one chooses to tackle the papers themselves — and I in fact have not dived into them — the commentary is of great interest.

But I would not take every word as the gospel truth, especially when academics describe what they see as commercial market realities. In particular, as per my quip in the first paragraph, the data warehouse market has not yet gone to the extremes that Mike suggests,* if indeed it ever will. And while Joe is close to correct when he says that the company Essbase was acquired by Oracle, what actually happened is that Arbor Software, which made Essbase, merged with Hyperion Software, and the latter was eventually indeed bought by the giant of Redwood Shores.**

*When it comes to data warehouse market assessment, Mike seems to often be ahead of the trend.

**Let me interrupt my tweaking of very smart people to confess that my own commentary on the Oracle/Hyperion deal was not, in retrospect, especially prescient.

Mike pretty much opened the discussion with a blistering attack against hierarchical data models such as JSON or XML. To a first approximation, his views might be summarized as:

Logical hierarchical models can be OK in certain cases. In particular, JSON could be a somewhat useful datatype in an RDBMS.
Physical hierarchical models are horrible.
Rather, you should implement the logical hierarchical model over a columnar RDBMS.

My responses start:

Nested data structures are more important than Mike’s discussion seems to suggest.
Native XML and JSON stores are apt to have an index on every field. If you squint, that index looks a lot like a column store.
Even NoSQL stores should and I think in most cases will have some kind of SQL-like DML (Data Manipulation Language). In particular, there should be some ability to do joins, because total denormalization is not always a good choice.

In no particular order, here are some other thoughts about or inspired by the survey articles in Readings in Database Systems, 5th Edition.

I agree that OLTP (OnLine Transaction Processing) is transitioning to main memory.
I agree with the emphasis on “data in motion”.
While I needle him for overstating the speed of the transition, Mike is right that columnar architectures are winning for analytics. (Or you could say they’ve won, if you recognize that mop-up from the victory will still take 1 or 2 decades.)
The guys seem to really hate MapReduce, which is an old story for Mike, but a bit of a reversal for Joe.
MapReduce is many things, but it’s not a data model, and it’s also not something that Hadoop 1.0 was an alternative to. Saying each of those things was sloppy writing.
The guys characterize consistency/transaction isolation as a rather ghastly mess. That part was an eye-opener.
Mike is a big fan of arrays. I suspect he’s right in general, although I also suspect he’s overrating SciDB. I also think he’s somewhat overrating the market penetration of cube stores, aka MOLAP.
The point about Hadoop (in particular) and modern technologies in general showing the way to modularization of DBMS is an excellent one.
Joe and Mike disagreed about analytics; Joe’s approach rang truer for me. My own opinion is:
- Business intelligence has been important for quite a while, and won’t stop.
- Machine learning is becoming ever more important.
- It’s still early days for the integration of the two areas, but much more will come.
The challenge of whether anybody wants to do machine learning (or other advanced analytics) over a DBMS is sidestepped in part by the previously mentioned point about the modularization of a DBMS. Hadoop, for example, can be both an OK analytic DBMS (although not fully competitive with mature, dedicated products) and of course also an advanced analytics framework.
Similarly, except in the short-term I’m not worried about the limitations of Spark’s persistence mechanisms. Almost every commercial distribution of Spark I can think of is part of a package that also contains a more mature data store.
Versatile DBMS and analytic frameworks suffer strategic contention for memory, with different parts of the system wanting to use it in different ways. Raising that as a concern about the integration of analytic DBMS with advanced analytic frameworks is valid.
I used to overrate the importance of abstract datatypes, in large part due to Mike’s influence. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. A big part of the problem is what I mentioned in the previous point — different parts of a versatile DBMS would prefer to do different things with memory.
I used to overrate the importance of user-defined functions in an analytic RDBMS. Mike had nothing to do with my error. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. Looser coupling between analytics and data management seems more flexible.
Excellent points are made about the difficulties of “First we build the perfect schema” data warehouse projects and, similarly, MDM (Master Data Management).
There’s an interesting discussion that helps explain why optimizer progress is so slow (both for the industry in general and for each individual product).

Related links

I did a deep dive into MarkLogic’s indexing strategy in 2008, which informed my comment about XML/JSON stores above.
Again with MarkLogic as the focus, in 2010 I was skeptical about document stores not offering joins. MarkLogic has since capitulated.
I’m not current on SciDB, but I did write a bit about it in 2010.
I’m surprised that I can’t find a post to point to about modularization of DBMS. I’ll leave this here as a placeholder until I can.
Edit: As promised, I’ve now posted about the object-relational/abstract datatype boom of the 1990s.

Couchbase 4.0 and related subjects

Curt Monash — Thu, 15 Oct 2015 15:17:44 +0000

I last wrote about Couchbase in November, 2012, around the time of Couchbase 2.0. One of the many new features I mentioned then was secondary indexing. Ravi Mayuram just checked in to tell me about Couchbase 4.0. One of the important new features he mentioned was what I think he said was Couchbase’s “first version” of secondary indexing. Obviously, I’m confused.

Now that you’re duly warned, let me remind you of aspects of Couchbase timeline.

2 corporate name changes ago, Couchbase was organized to commercialize memcached. memcached, of course, was internet companies’ default way to scale out short-request processing before the rise of NoSQL, typically backed by manually sharded MySQL.
Couchbase’s original value proposition, under the name Membase, was to provide persistence and of course support for memcached. This later grew into a caching-oriented pitch even to customers who weren’t already memcached users.
A merger with the makers of CouchDB ensued, with the intention of replacing Membase’s SQLite back end with CouchDB at the same time as JSON support was introduced. This went badly.
By now, however, Couchbase sells for more than distributed cache use cases. Ravi rattled off a variety of big-name customer examples for system-of-record kinds of use cases, especially in session logging (duh) and also in travel reservations.
Couchbase 4.0 has been in beta for a few months.

Technical notes on Couchbase 4.0 — and related riffs — start:

There’s a new SQL-like language called N1QL (pronounced like “nickel”). I’m hearing a lot about SQL-on-NoSQL these days. More on that below.
“Index”, “data” and “query” are three different services/tiers.
- You can run them all on the same nodes or separately. Couchbase doesn’t have enough experience yet with the technology to know which choice will wind up as a best practice.
- I’m hearing a lot about heterogeneous-node/multi-tier DBMS architectures these days, and would no longer stand by my 2009 statement that they are unusual. Other examples include Oracle Exadata, MySQL, MongoDB (now that it has pluggable storage engines), MarkLogic, and of course the whole worlds of Hadoop and Spark.
To be clear — the secondary indexes are global, and not tied to the same nodes as the data they index.
There’s a new back end called ForestDB, but if I understood correctly, it’s used just for the indexes, not for the underlying data.
ForestDB represents Couchbase indexes in something that resembles b-trees, but also relies on tries. Indeed, if I’m reading the relevant poster correctly, it’s based on a trie of b-trees.
In another increasingly common trend, Couchbase uses Bloom filters to help decide which partitions to retrieve for any particular query.

Up to a point, SQL-on-NoSQL stories can be fairly straightforward.

You define some kind of a table,* perhaps in a SQL-like DDL (Data Description Language).
SELECT, FROM and WHERE clauses work in the usual way.
Hopefully, if a column is going to have a lot of WHERE clauses on it, it also has an index.

For example, I think that’s the idea behind most ODBC/JDBC drivers for NoSQL systems. I think it’s also the idea behind most “SQL-like” languages that NoSQL vendors ship.

*Nobody I talk to about this ever wants to call it a “view”, but it sure sounds like a view to me — not a materialized view, of course, but a view nonetheless.

JOIN syntax can actually be straightforward as well under these assumptions. As for JOIN execution, Couchbase pulls all the data into the relevant tier, and nested loop execution there. My new clients at SequoiaDB have a similar strategy, by the way, although in their case there’s a hash join option as well.

But if things stopped there, they would miss an important complication: NoSQL has nested data. I.e., a value can actually be an array, whose entries are arrays themselves, and so on. That said, the “turtles all the way down” joke doesn’t quite apply, because at some point there are actual scalar or string values, and those are the ones SQL wants to actually operate on.

Most approaches I know of to that problem boil down to identifying particular fields as table columns, with or without aliases/renaming; I think that’s the old Hadapt/Vertica strategy, for example. Couchbase claims to be doing something a little different however, with a SQL-extending operator called UNNEST. Truth be told, I’m finding the N1QL language reference a bit terse, and haven’t figured out what the practical differences vs. the usual approach are, if any. But it sounds like there may be some interesting ideas in there somewhere.

21st Century DBMS success and failure

Curt Monash — Mon, 14 Jul 2014 06:37:31 +0000

As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.

DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.

In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.

Buyer inertia is a greater concern.

A significant minority of enterprises are highly committed to their enterprise DBMS standards.
Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.

A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.

First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
Meanwhile, Hadoop has emerged as serious competitor for at least some analytic data management, especially but not only at internet companies.

Otherwise I’d say:

At large enterprises, their internet operations perhaps excepted:
- Short-request/general-purpose SQL alternatives to the behemoths — e.g. MySQL, PostgreSQL, NewSQL — have had tremendous difficulty getting established. The last big success was the rise of Microsoft SQL Server in the 1990s. That’s why I haven’t mentioned the term mid-range DBMS in years.
- NoSQL/non-SQL has penetrated large enterprises mainly for a few specific use cases, for example the lists I posted for MongoDB or graph databases.
Internet-only companies have few inertia issues when it comes to database managers. They’ll consider anything they regard as being in their price ballpark (which is however often restricted to open source). I think part of the reason is that as quickly as they rewrite their applications, DBMS are vastly less “strategic” to them than they are to most larger enterprises.
The internet operations of large companies — especially large retailers — in many cases behave like internet-only companies, but in many other cases behave like the rest of the enterprise.

The major reasons for DBMS categories to get established in the first place are:

Performance and/or scalability (many examples).
Developer features (for example dynamic schema).
License/maintenance cost (for example several open source categories).
Ease of installation and administration (for example open source again, and also data warehouse appliances).

Those same characteristics are major bases for competition among members of a new category, although as noted above behemoth-loyalty can also come into play.

Cool-vs.-weird tradeoffs are somewhat secondary among SQL DBMS.

There’s not much of a “cool” factor, because new products aren’t that different in what they do vs. older ones.
There’s not a terrible “weird” factor either, but of course any smaller offering faces FUD, and also …
… appliances are anti-strategic for many buyers, especially ones who demand a smooth path to the cloud.)

They’re huge, however, in the non-SQL world. Most non-SQL data managers have a major “weird” factor. Fortunately, NoSQL and Hadoop both have huge “cool” cred to offset it. XML/XQuery unfortunately did not.

Finally, in most DBMS categories there are massive issues with product completeness, more in the area of maturity than that of whole product. The biggest whole product issues are concentrated on the matter of interoperating with other software — business intelligence tools, packaged applications (if relevant to the category), etc. Most notably, the handful of DBMS that are certified to run SAP share a huge market that other DBMS can’t touch. But BI tools are less of a differentiator — I yawn when vendors tell me they are certified for/partnered with MicroStrategy, Tableau, Pentaho and Jaspersoft, and I’m surprised at any product that isn’t.

DBMS maturity has a lot of aspects, but the toughest challenges are concentrated in two main areas:

Reliability, especially but not only in short-request use cases.
Performance across a great variety of use cases. I observe frequently that performance in best-case scenarios, performance in the lab and performance in real-world environments are much further apart than vendors like to think.

In particular:

Maturity demands seem to be much higher for SQL DBMS than for NoSQL.
- I think this is one of several reasons NoSQL has been much more successful than NewSQL.
- It’s why I think MarkLogic’s “Enterprise NoSQL” positioning is a mistake.
As for MySQL:
- MySQL wasn’t close to reliable enough for enterprises to trust it until InnoDB became the default storage engine.
- MySQL 5 point releases have added major features, or decent performance for major features. I’ll confess to having lost track of what’s been fixed and what’s still missing.
- In saying all that I’m holding MySQL to a much higher maturity standard than I’m holding NoSQL — because that’s what I think enterprise customers do.
PostgreSQL “should” be doing a lot better than it is. I have an extremely low opinion of its promoters, and not just for personal reasons. (That said, the personal reasons don’t just apply to EnterpriseDB anymore. I’ve also run out of patience waiting for Josh Berkus to retract untruths he posted about me years ago.)
SAP HANA checks boxes for performance (In-memory rah rah rah!!) and whole product (Runs SAP!!). That puts it well ahead of most other newish SQL DBMS, purely analytic ones perhaps excepted.
Any other new short-request SQL DBMS that sounds like is has traction is also memory-centric.
Analytic RDBMS are in most respects held to lower maturity standards than DBMS used for write-intensive workloads. Even so, products in the category are still frequently tripped up by considerations of concurrent performance and mixed workload management.

Related links

There have been 1,470 previous posts in the 9-year history of this blog, many of which could serve as background material for this one. A couple that seem particularly germane and didn’t get already get linked above are:

The drive for uninterrupted DBMS operation.
Short-request DBMS trade-offs and alternatives.

Comments on the 2013 Gartner Magic Quadrant for Operational Database Management Systems

Curt Monash — Fri, 08 Nov 2013 16:46:46 +0000

The 2013 Gartner Magic Quadrant for Operational Database Management Systems is out. “Operational” seems to be Gartner’s term for what I call short-request, in each case the point being that OLTP (OnLine Transaction Processing) is a dubious term when systems omit strict consistency, and when even strictly consistent systems may lack full transactional semantics. As is usually the case with Gartner Magic Quadrants:

I admire the raw research.
The opinions contained are generally reasonable (especially since Merv Adrian joined the Gartner team).
Some of the details are questionable.
There’s generally an excessive focus on Gartner’s perception of vendors’ business skills, and on vendors’ willingness to parrot all the buzzphrases Gartner wants to hear.
The trends Gartner highlights are similar to those I see, although our emphasis may be different, and they may leave some important ones out. (Big omission — support for lightweight analytics integrated into operational applications, one of the more genuine forms of real-time analytics.)

Anyhow:

The 2013 Gartner Magic Quadrant for Operational Database Management Systems puts Oracle in the lead, closely followed in some order Microsoft, SAP, and IBM, with everybody else way behind. That’s reasonable, harkening back to the time when Oracle, IBM, Microsoft and to some extent Sybase were seemingly secure oligopolists, and most of the other vendors mentioned didn’t yet exist.
Gartner seems to view a proprietary appliance strategy as good for customers, without mentioning that it’s also a way to sell hardware at ridiculous prices.
Gartner evidently likes memory-centric positioning. SAP, Aerospike, VoltDB and McObject all get surprisingly high marks.
Gartner gives Intersystems pretty high marks, while Progress Software isn’t even mentioned. Despite Progress’ recent restructuring, I’d think the core Progress OpenEdge business — arguably Intersystems’ closest rival — deserves more respect than that. (But given how rarely I write about it myself, perhaps I shouldn’t criticize.)
Gartner has long been oddly positive on Actian, which is a floundering hodgepodge of half a dozen database also-rans. I like Mike Hoskins a lot too, but just how much has Actian’s supposedly “energized” “strong leadership” accomplished in the recent past, at Actian or elsewhere?
Gartner has brutally low “vision” rankings for NuoDB and Clustrix. I think scaling out SQL effectively is more impressive than that. Gartner also omits to mention Clustrix’s past as an appliance vendor.
Gartner refers to Oracle’s multi-tenancy support as if … well, as if it supported multi-tenancy.
I don’t understand Gartner’s rankings of or comments about NoSQL vendors. For example:
- Three “strengths” are mentioned for MongoDB, yet none reference MongoDB’s developer outreach, which may be second only to prime Microsoft’s.
- HBase is discussed as if the Hadoop vendors were still pushing it hard, or if it were showing up in a lot of enterprise evaluations.
- Geo-distribution is mentioned as a strength for Riak, yet not for Cassandra.
Every Gartner Magic Quadrant (or Forrester Wave) features one or more outright brain cramps. In this one:
- Gartner writes “the Clustrix database offers no support for data types beyond traditional relational types,” when in fact Clustrix was one of the early indicators of a trend toward relational DBMS JSON support.
- Gartner suggests that EnterpriseDB’s Oracle compatibility is something new, when it was actually the company’s whole strategy 6-7 years ago.

Finally, since I’ve struggled with the definition of “DBMS”, I’ll finish by quoting with approval the start of Gartner’s:

We define a DBMS as a complete software system used to define, create, manage, update and query a database.

Related links

Comments on the most recent Gartner Magic Quadrant for Data Warehouse Database Management Systems
My definition of operational analytics

Some notes on new-era data management, March 31, 2013

Curt Monash — Mon, 01 Apr 2013 08:44:42 +0000

Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.

Performance confusion

Discussions of DBMS performance are always odd, for starters because:

Workloads and use cases vary greatly.
In particular, benchmarks such as the YCSB or TPC-H aren’t very helpful.

But in NoSQL/NewSQL short-request processing performance claims seem particularly confused. Reasons include but are not limited to:

It’s common for databases or at least working sets to be entirely in RAM — but it’s not always required.
Consistency and durability models vary. What’s more, in some systems — e.g. MongoDB — there’s considerable flexibility as to which model you use.
In particular, there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, then asynchronously to disk on each of them. Performance in these cases can be quite different from when all writes need to be committed to disk. Of course, you need sufficient disk I/O to keep up, so SSDs (Solid-State Drives) can come in handy.
Many workloads are inherently single node (replication aside). Others are not.

MongoDB and 10gen

I caught up with Ron Avnur at 10gen. Technical highlights included:

MongoDB’s tunable consistency seems really interesting, with numerous choices available at the program-statement level.
All rumored performance problems notwithstanding, Ron asserts that MongoDB often “kicks butt” in actual proof-of-concept (POC) bake-offs.
Ron cites “12 different language bindings” as a key example of developer functionality giving 10gen an advantage vs. Ron’s previous employer MarkLogic.
10gen is working hard on management tools, security, and so on.
Ron claims that the “MongoDB loses data” knock is a relic of the distant — i.e. 1-2 years ago — past.
We had the same “Who needs joins?” discussion that I used to have with MarkLogic — Ron’s former company — and which MarkLogic has since disavowed.
There’s nothing special about MongoDB’s b-tree indexes. (I mention that because Tokutek thinks it offers a faster MongoDB indexing option.)

While this wasn’t a numbers-oriented conversation, business highlights included:

A lot of MongoDB’s competition is RDBMS — Oracle, SQL Server, MySQL, etc.
MongoDB’s top NoSQL competitor is Cassandra. 10gen sees less Couchbase than before, and also less HBase than Cassandra.
There’s yet another favorable MongoDB soft metric — 50,000 registrants for free online education, 2/3 outside the US.

I can add that anecdotal evidence from other industry participants suggests there’s a lot of MongoDB mindshare.

Specific traditional-enterprise use cases we discussed focused on combining data from heterogeneous systems. Specifically mentioned were:

Reference data/360-degree customer view.
Reference data about securities.
Aggregation of analytic results from various analytic systems across an enterprise. (For risk management).

DBAs’ roles in development

A lot of marketing boils down to “We don’t need no stinking DBAs!!!” I’m thinking in particular of:

NoSQL.
Hadoop and/or exploratory BI* messaging that positions against the alleged badness of “traditional data warehousing”.

*See in particular the comments to that post.

The worst-case data warehousing scenario is indeed pretty bad. It could feature:

Much internal discussion and politicking to determine the One True Way to view various data fields, with …
… lots of ongoing bureaucratic safeguards in the area of data governance.
Long additional efforts in the area of performance tuning.
Data integration projects up the wazoo.

But if the goal is just to grab some data from an existing data warehouse, perhaps add in some additional data from the outside, and start analyzing it — well, then there are many attempted solutions to that problem, including from within the analytic RDBMS world. The question is whether the data warehouse administrators try to help — which usually means “Here’s your data; now go away and stop bothering me!” — or whether they focus on “business prevention”.

Meanwhile, on the NoSQL side:

The smart folks at WibiData felt the need for schema-definition tools over HBase.
Per Ron Avnur, MongoDB users are clamoring for consistency-rule specification via an administrative (rather than programmatic) UI.

It’s the old loose-/tight-coupling trade-off. Traditional relational practices offer a clean interface between database and code, but bundle the database characteristics for different applications tightly together. NoSQL tends to tie the database for any one app tightly to that app, at the cost of difficulties if multiple applications later try to use the same data. Either can make sense, depending on (for example):

How it seems natural to organize your development and data administration talent.
Whether the app is likely to survive long enough that you’ll want to run many other applications against the same database.

DBMS development and other subjects

Curt Monash — Mon, 18 Mar 2013 05:29:42 +0000

The cardinal rules of DBMS development

Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.

That’s if things go extremely well.

Rule 2: You aren’t an exception to Rule 1.

In particular:

Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
Mixed workload management is harder than you’re assuming it is.
Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.

DBMS with Hadoop underpinnings …

… aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well aware of this fact.

But note that the HadoopDB prototype — on which Hadapt was based — was completed and the paper presented in 2009.

MarkLogic …

… has been around long enough to make a good DBMS. It used to make a solid XML DBMS. Now SQL and JSON are also in the mix. The SQL part is a reversal of MarkLogic’s long-time stance. The JSON part gets MarkLogic out of the usually-losing side of the XML/JSON debate.

As for MarkLogic’s Enterprise NoSQL messaging — it basically equates “NoSQL” to “short-request dynamic-schema“, and in 2013 I have little quarrel with that definition.

RDBMS-oriented Hadoop file formats are confusing

I’ve recently tried asking both Cloudera and Hortonworks about the “columnar” file formats beneath their respective better-Hive efforts, each time getting the response “Let me set you up with a call with the right person.” Cloudera also emailed over a link to Parquet, evidently the latest such project.

Specific areas about which I’m confused (and the same questions apply to any of these projects, as they seem similarly-intended) include but are not limited to:

Is it truly columnar (doesn’t seem so, based on the verbiage), or more PAX-like, or something else entirely?
What’s the nested data structure story? (It seems there is one.)
What’s the compression story?

Come to think of it, the name “Parquet” suggests that either:

Rows and columns are mixed together.
Somebody has the good taste to be a Celtics fan.

Whither analytic platforms?

I’ve been a big advocate of analytic platform technology, but interest hasn’t increased as much as I expected. Teradata Aster seems to be doing well, but not so extremely well that IBM Netezza, Sybase IQ, et al. feel the need to be aggressive in their responses. Vendors have, for the most part, put decent capabilities in place; but the energy I’d looked for isn’t there.

I think that problems include:

Analytic platforms are marketed too purely as a development play. Selling six-to-seven figure application development deals is hard.
But selling analytic performance — the other main benefit — is harder than it used to be. Good enough is often good enough. In particular …
… a lot of analytic work is being conceded, rightly or wrongly, to Hadoop.
More generally, selling advanced analytic tools is commonly a tough, niche-oriented business.

Also, some of the investigative analytics energy has been absorbed by business intelligence tools, specifically ones with “discovery” interfaces — Tableau, QlikView, and so on.

Related links

One database to rule them all? (February, 2013)
NewSQL thoughts (January, 2013)
Bottleneck Whack-A-Mole (August, 2009)

One database to rule them all?

Curt Monash — Thu, 21 Feb 2013 05:52:05 +0000

Perhaps the single toughest question in all database technology is: Which different purposes can a single data store serve well? — or to phrase it more technically — Which different usage patterns can a single data store support efficiently? Ted Codd was on multiple sides of that issue, first suggesting that relational DBMS could do everything and then averring they could not. Mike Stonebraker too has been on multiple sides, first introducing universal DBMS attempts with Postgres and Illustra/Informix, then more recently suggesting the world needs 9 or so kinds of database technology. As for me — well, I agreed with Mike both times.

Since this is MUCH too big a subject for a single blog post, what I’ll do in this one is simply race through some background material. To a first approximation, this whole discussion is mainly about data layouts — but only if we interpret that concept broadly enough to comprise:

Every level of storage (disk, RAM, etc.).
Indexes, aggregates and raw data alike.

To date, nobody has ever discovered a data layout that is efficient for all usage patterns. As a general rule, simpler data layouts are often faster to write, while fancier ones can boost query performance. Specific tradeoffs include, but hardly are limited to:

Big blocks of data compress better, and can be also be faster to retrieve than a number of smaller blocks holding the same amount of data. Small blocks of data can be less wasteful to write. And different kinds of storage have different minimum block sizes.
Operating on compressed data offers multiple significant efficiencies. But you have to spend cycles (de)compressing it, and it’s only practical for some compression schemes.
Fixed-length tabular records can let you compute addresses rather than looking them up in indexes. Yay! But they also waste space.
Tokenization can help with the fixed-/variable-length tradeoff.
Pointers are wonderfully efficient for some queries, at least if you’re not using spinning disk. But they can create considerable overhead to write and update.
Indexes, materialized views, etc. speed query performance, but can be costly to write and maintain.
Storing something as a BLOB (Binary Large OBject), key-value payload, etc. is super-fast — but if you want to look at it, you usually have to pay for retrieving the whole thing.

What’s more, different data layouts can have different implications for logging, locking, replication, backup and more.

So what would happen if somebody tried to bundle all conceivable functionality into a single DBMS, with a plan to optimize the layout of any particular part of the database as appropriate? I think the outcome would be tears — for the development effort would be huge, while the benefits would be scanty. The most optimistic cost estimates could run in the 100s of millions of dollars, with more realistic ones adding a further order of magnitude. But no matter what the investment, the architects would be on the horns of a nasty dilemma:

If there’s much commonality among the component DBMS, each one would be sub-optimal.
If there’s little commonality among them, then there’s also little benefit to the combination.

Adding insult to injury, all the generality would make it hard to select optimum hardware for this glorious DBMS — unless, of course, a whole other level of development effort made it work well across very heterogeneous clusters.

Less megalomaniacally, there have been many attempts to combine two or more alternate data layouts in a single DBMS, with varying degrees of success. In the relational-first world:

Analytic DBMS have combined row and column data models so fluidly that I’ve made fun of Oracle for not being able to pull it off. SAP HANA sort of does the same thing, but perhaps with a columnar bias, and not just for analytics.
Relational DBMS can also have a variety of index types, suitable for different relational use cases. This is especially true for analytic uses of general-purpose RDBMS.
Oracle, DB2, PostgreSQL, and Informix have had full extensibility architectures since the 1990s. That said:
- Almost all the extensions come from the DBMS vendors themselves.
- Extensions that resemble (or are) a tabular datatype — for example geospatial or financial-date — are often technically well-regarded.
- Others are usually not so strong technically, but in a few cases sell well anyway (e.g. Oracle Text).
- While Microsoft never went through the trouble of offering full extensibility, otherwise the SQL Server story is similar.
- Sybase’s extensibility projects went badly in the 1990s, and Sybase doesn’t seem to have tried hard in that area since.
IBM DB2, Microsoft SQL Server, and Oracle added XML capabilities around the middle of the last decade.
Analytic platforms can wind up with all sorts of temporary data structures.
Analytic DBMS have various ways to reach out and touch Hadoop.

Further:

Non-relational DBMS commonly have indexes that at least support relational-like SELECTs. JOINs can be more problematic, but MarkLogic finally has them. Tokutek even offers a 3rd-party indexing option for MongoDB.
Hadoop is growing into what is in effect is a family of DBMS and other data stores — generic HDFS, HBase, generic Hive, Impala, and so on. At the moment, however, none of them is very mature. BDAS/Spark/Shark ups the ante further, but of course that’s less mature yet.
Hadapt combines Hadoop and PostgreSQL.
DataStax combines Cassandra, Hadoop, and Solr.
Akiban fondly thinks its data layouts are well-suited for relational tables, JSON, and XML alike. (But business at Akiban may be in flux.)
GenieDB (Version 1 only) and NuoDB are both implemented over key-value stores. GenieDB Version 2 is implemented over Berkeley DB or MySQL.
Membase/Couchbase was first implemented over SQLite, then over (a forked version of) CouchDB.

Related links

A taxonomy of database use cases (July, 2012)
An early form of this discussion in the single domain of analytic RDBMS (February, 2009)

Our clients, and where they are located

Curt Monash — Sat, 31 Mar 2012 20:36:32 +0000

From time to time, I disclose our vendor client lists. Another iteration is below, the first since a little over a year ago. To be clear:

This is a list of Monash Advantage members.
All our vendor clients are Monash Advantage members, unless …
… we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
We do not usually disclose our user clients.
We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
Excluded from this round of disclosure is one vendor I have never written about.
Included in this round of disclosure is one client paying for services partly in stock. All our other clients are cash-only.

For reasons explained below, I’ll group the clients geographically. Obviously, companies often have multiple locations, but this is approximately how it works from the standpoint of their interactions with me.

City of San Francisco

KXEN
Metamarkets
PivotLink
salesforce.com
Splunk
WibiData

Other San Francisco area

10gen
ClearStory
Cloudera
Couchbase
DataStax
Hortonworks
MarketShare
MarkLogic
Schooner
Sybase, an SAP company
VMware
Yarcdata, a division of Cray

Boston and Cambridge

Akiban
Cloudant
Hadapt
Vertica, an HP company

Other Boston area

Netezza, an IBM company
StreamBase

Everywhere else

CodeFutures
Infobright
SAND
Syncsort
Tableau
Teradata

For most of the companies listed above, you can find coverage here, and specifically a blog category in the list on the right. The exceptions, for now, are:

Cloudant
MarketShare
Metamarkets
PivotLink
VMware
Yarcdata

The main reason I threw in the geographical notes is to support the idea that there’s a real suburb-to-urban shift in the startup tech industry. Mike Arrington made the point recently about the San Francisco area, primarily with respect to the mass/consumer tech areas he focuses on, and of course it’s echoed by the rise of the New York City tech sector. My point is to add that it’s also true for system and enterprise technology, at least in the areas I cover.

In particular, the re-urbanization of the Boston-area software industry is striking:

Akiban and Cloudant are in the same office complex in the city of Boston. I was surprised to find even one tech startup in the city of Boston proper, but it seems there is a two-building complex packed with them.
Hadapt moved from Connecticut to the Kendall Square area of Cambridge. Edit: Actually, see the comments below.
Vertica moved from Burlington to the Alewife area of Cambridge. That’s (evidently deliberately) at the boundary of what might be regarded as the “urban” and “suburban” parts of metro Boston.

The cluster in the city of San Francisco — which also half-includes Cloudera — is relatively new as well.

Big data terminology and positioning

Curt Monash — Mon, 09 Jan 2012 01:35:57 +0000

Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions:

Bigness — Volume, Velocity, size
Structure — Variety, Variability, Complexity

given that

High-velocity “big data” problems are usually high-volume as well.*
Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction.

But the conflation should stop there.

*Low-volume/high-velocity problems are commonly referred to as “event processing” and/or “streaming”.

When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2×2 matrix of possibilities. For want of better alternatives, my suggestions are:

Relational big data is data of high volume that fits well into a relational DBMS.
Multi-structured big data is data of high volume that doesn’t fit well into a relational DBMS. Alternative: Poly-structured big data.
Conventional relational data is data of not-so-high volume that fits well into a relational DBMS. Alternatives: Ordinary/normal/smaller relational data.
Smaller poly-structured data is data for which dynamic schema capabilities are important, but which doesn’t rise to “big data” volume.

Notes on all this include:

“Relational big data” is commonly what you need a scalable analytic relational DBMS for. But there are non-analytic use cases as well.
The paradigmatic example of “multi-structured big data” is log files. Thus, multi-structured big data is commonly what you need a big bit bucket for.
One might want to equate non-analytic relational big data technology to “NewSQL”. However, I’m struggling to think of a database size range in which the entire NewSQL industry can match Oracle’s market share alone.
One might want to equate non-analytic multi-structured big data technology to “NoSQL”. However:
- “NoSQL” is also used to encompass not-so-big-data use cases, such as prototyping in MongoDB.
- “NoSQL” has non-ACID/low(er)-data-integrity connotations that aren’t appropriate for all non-relational systems.
Up to a point, you can analyze relational big data in a conventional relational DBMS, but an analytic RDBMS will usually win on TCO (Total Cost of Ownership). In particular, reasonable thresholds for moving an analytic database off Oracle might be:
- 1-2 terabytes if you’ve never bought anything past Oracle Standard Edition.
- 5-10 terabytes if you’re already paying for Oracle Enterprise Edition.
- A lot higher than that if you actually find Oracle Exadata to be cost-effective.
Depending on how big one acknowledges as “big”, the market share leader in “big bit bucket” use cases is either Splunk or Hadoop.
If we look at multi-structured big data management overall, MarkLogic joins the list of market share contenders, as do various NoSQL alternatives.
It is wrong to say that the large web companies invented “big data” technology. But it is more reasonable to say they invented much of “multi-structured big data” management. In particular (and this is just a partial list), Google, Amazon, Yahoo, Facebook, et al. can reasonably be credited with Hadoop, Cassandra, HBase and various predecessors to same.

MarkLogic’s Hadoop connector

Curt Monash — Fri, 04 Nov 2011 00:58:06 +0000

It’s time to circle back to a subject I skipped when I otherwise wrote about MarkLogic 5: MarkLogic’s new Hadoop connector.

Most of what’s confusing about the MarkLogic Hadoop Connector lies in two pairs of options it presents you:

Hadoop can talk XQuery to MarkLogic. But alternatively, Hadoop can use a long-established simple(r) Java API for streaming documents into or out of a MarkLogic database.
Hadoop can make requests to MarkLogic in MarkLogic’s normal mode of operation, namely to address any node in the MarkLogic cluster, which then serves as a “head” node for the duration of that particular request. But alternatively, Hadoop can use a long-standing MarkLogic option to circumvent the whole DBMS cluster and only talk to one specific MarkLogic node.

Otherwise, the whole thing is just what you would think:

Hadoop can read from and write to MarkLogic, in parallel at both ends.
If Hadoop is just writing to MarkLogic, there’s a good chance the process is properly called “ETL.”
If Hadoop is reading a lot from MarkLogic, there’s a good chance the process is properly called “batch analytics.”

MarkLogic said that it wrote this Hadoop connector itself.

When I realized MarkLogic was claiming the ability to seamlessly integrate short-request and batch analytic processing, I asked about workload management. I gathered that:

MarkLogic believes that MarkLogic 5 does a great job of granular workload monitoring.
However, MarkLogic doesn’t have a strong workload management administrative interface. Rather, you may have to do workload management programmatically.

Overall, I think the MarkLogic Hadoop connector could prove pretty useful. The first question I ask somebody who wants to process relational data in Hadoop is “Why not just an analytic RDBMS?” But the natural use cases for MarkLogic are often ones in which you might as well do your analytics in Hadoop, including a 4 billion Word/PDF/image document insurance-industry example I recently encountered, and for which I favor MarkLogic over MongoDB or straight Hadoop alike.