The cardinal rules of DBMS development
Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.
That’s if things go extremely well.
Rule 2: You aren’t an exception to Rule 1.
- Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
- Mixed workload management is harder than you’re assuming it is.
- Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
DBMS with Hadoop underpinnings …
… aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well aware of this fact.
But note that the HadoopDB prototype — on which Hadapt was based — was completed and the paper presented in 2009.
MarkLogic …
… has been around long enough to make a good DBMS. For years it offered a solid XML DBMS; now SQL and JSON are also in the mix. The SQL part is a reversal of MarkLogic’s long-time stance, and the JSON part gets MarkLogic out of the usually-losing side of the XML/JSON debate.
RDBMS-oriented Hadoop file formats are confusing
I’ve recently tried asking both Cloudera and Hortonworks about the “columnar” file formats beneath their respective better-Hive efforts, each time getting the response “Let me set you up with a call with the right person.” Cloudera also emailed over a link to Parquet, evidently the latest such project.
Specific areas about which I’m confused (and the same questions apply to any of these projects, as they seem similarly intended) include but are not limited to:
- Is it truly columnar (doesn’t seem so, based on the verbiage), or more PAX-like, or something else entirely?
- What’s the nested data structure story? (It seems there is one.)
- What’s the compression story?
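To make the first question concrete: “truly columnar” and “PAX-like” describe different physical layouts of the same logical table. A toy sketch of the distinction, using hypothetical data purely for illustration:

```python
# Four rows of a two-column toy table (values are hypothetical).
rows = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]

# Row store: all values of each row stored together.
row_layout = [v for row in rows for v in row]

# Pure columnar: each column stored contiguously across the whole file.
col_layout = [r[0] for r in rows] + [r[1] for r in rows]

# PAX-like: rows are partitioned into groups; within each group,
# values are laid out column by column. Readers still get columnar
# scans, but a whole row can be reassembled from one group.
group_size = 2
pax_layout = []
for i in range(0, len(rows), group_size):
    group = rows[i:i + group_size]
    for col in range(2):
        pax_layout.extend(r[col] for r in group)

print(row_layout)  # ['a', 1, 'b', 2, 'c', 3, 'd', 4]
print(col_layout)  # ['a', 'b', 'c', 'd', 1, 2, 3, 4]
print(pax_layout)  # ['a', 'b', 1, 2, 'c', 'd', 3, 4]
```

The hybrid layout at the bottom is why the “columnar” label can be ambiguous: within a row group the data is columnar, but across the file it is not.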
Come to think of it, the name “Parquet” suggests that either:
- Rows and columns are mixed together.
- Somebody has the good taste to be a Celtics fan.
Whither analytic platforms?
I’ve been a big advocate of analytic platform technology, but interest hasn’t increased as much as I expected. Teradata Aster seems to be doing well, but not so extremely well that IBM Netezza, Sybase IQ, et al. feel the need to be aggressive in their responses. Vendors have, for the most part, put decent capabilities in place; but the energy I’d looked for isn’t there.
I think that problems include:
- Analytic platforms are marketed too purely as a development play. Selling six-to-seven-figure application development deals is hard.
- But selling analytic performance — the other main benefit — is harder than it used to be. Good enough is often good enough. In particular …
- … a lot of analytic work is being conceded, rightly or wrongly, to Hadoop.
- More generally, selling advanced analytic tools is commonly a tough, niche-oriented business.
Also, some of the investigative analytics energy has been absorbed by business intelligence tools, specifically ones with “discovery” interfaces — Tableau, QlikView, and so on.