Kognitio – DBMS 2 : DataBase Management System Services

Are analytic RDBMS and data warehouse appliances obsolete?

Curt Monash — Mon, 29 Aug 2016 01:28:31 +0000

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.

Simply reciting all that, however, raises the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”

If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.

Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.
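To make the ELT idea concrete, here is a minimal sketch that loads raw rows first and only then transforms them with SQL inside the engine. It uses Python's built-in sqlite3 purely as a stand-in for an analytic RDBMS; the table names, column names, and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical raw rows, as they might arrive from a source system (all text).
RAW_ORDERS = [
    ("2016-08-01", "widget", "3", "4.50"),
    ("2016-08-01", "gadget", "1", "20.00"),
    ("2016-08-02", "widget", "2", "4.50"),
]

def run_elt(conn, rows):
    """Load untransformed rows, then transform them with SQL inside the engine."""
    conn.execute(
        "CREATE TABLE staging_orders "
        "(order_date TEXT, product TEXT, qty TEXT, unit_price TEXT)"
    )
    # Extract/Load: rows go in as-is, with no cleanup beforehand.
    conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?, ?)", rows)
    # Transform: type casting and aggregation happen in the database.
    conn.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT order_date,
               SUM(CAST(qty AS INTEGER) * CAST(unit_price AS REAL)) AS revenue
        FROM staging_orders
        GROUP BY order_date
        """
    )
    return conn.execute(
        "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
    ).fetchall()

conn = sqlite3.connect(":memory:")
daily = run_elt(conn, RAW_ORDERS)
```

In a classic ETL pipeline the casting and aggregation would instead run in a separate engine before anything reached the database.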

And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.

My technical observations around these trends include:

Advanced analytics commonly require flexible, iterative processing.
Spark is much better at such processing than earlier Hadoop …
… which in turn is better than anything that’s been built into an analytic RDBMS.
Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.

And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.

They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.

But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer these trade-offs:

They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (database administration can be just-in-time) or to OLTP integration (less or no re-loading).
Other integrations, as noted above, can also be weak.

Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.

Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.

Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.

Notes on memory-centric data management

Curt Monash — Fri, 03 Jan 2014 09:35:44 +0000

I first wrote about in-memory data management a decade ago. But I long declined to use that term — because there’s almost always a persistence story outside of RAM — and coined “memory-centric” as an alternative. Then I relented 1 1/2 years ago, and defined in-memory DBMS as

DBMS designed under the assumption that substantially all database operations will be performed in RAM (Random Access Memory)

By way of contrast:

Hybrid memory-centric DBMS is our term for a DBMS that has two modes:

- In-memory.
- Querying and updating (or loading into) persistent storage.

These definitions, while a bit rough, seem to fit most cases. One awkward exception is Aerospike, which assumes semiconductor memory, but is happy to persist onto flash (just not spinning disk). Another is Kognitio, which is definitely lying when it claims its product was in-memory all along, but may or may not have redesigned its technology over the decades to have become more purely in-memory. (But if they have, what happened to all the previous disk-based users??)

Two other sources of confusion are:

The broad variety of memory-centric data management approaches.
The over-enthusiastic marketing of SAP HANA.

With all that said, here’s a little update on in-memory data management and related subjects.

I maintain my opinion that traditional databases will eventually wind up in RAM.
At conventional large enterprises — as opposed to for example pure internet companies — production deployments of HANA are probably comparable in number and investment to production deployments of Hadoop. (I’m sorry, but much of my supporting information for that is confidential.)
Cloudera is emphatically backing Spark. And a key aspect of Spark is that, unlike most of Hadoop, it’s memory-centric.
It has become common for disk-based DBMS to persist data through a “log-structured” architecture. That’s a whole lot like what you do for persistence in a fundamentally in-memory system.
I’m also sensing increasing comfort with the strategy of committing writes as soon as they’ve been acknowledged by two or more nodes in RAM.

And finally,

I’ve never heard a story about an in-memory DBMS actually losing data. It’s surely happened, but evidently not often.
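The commit strategy mentioned above, in which a write counts as committed once two or more nodes have acknowledged it in RAM, can be sketched roughly as follows. This is a toy model under stated assumptions, not any vendor's protocol: the `Node` class, the `replicated_write` function, and the `required_acks` parameter are hypothetical names, and there is no networking, durability, or rollback.

```python
class Node:
    """A hypothetical replica that holds writes in RAM only."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.ram = {}

    def apply(self, key, value):
        # Returns True as an acknowledgement once the write is in this node's RAM.
        if not self.up:
            return False
        self.ram[key] = value
        return True


def replicated_write(nodes, key, value, required_acks=2):
    """Report the write as committed once `required_acks` nodes hold it in RAM."""
    acks = 0
    for node in nodes:
        if node.apply(key, value):
            acks += 1
        if acks >= required_acks:
            # Committed; this toy version does not contact remaining replicas.
            return True
    # Too few live replicas acknowledged. Nodes that did apply the write
    # still hold it, since this sketch has no rollback.
    return False


cluster = [Node("a"), Node("b", up=False), Node("c")]
committed = replicated_write(cluster, "order:1", "paid")
```

The point of the design is that the client's wait depends on RAM writes on two machines rather than on a disk flush; the risk being accepted is the simultaneous loss of every node holding the write.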

Comments on Gartner’s 2012 Magic Quadrant for Data Warehouse Database Management Systems — evaluations

Curt Monash — Tue, 05 Feb 2013 13:25:15 +0000

To my taste, the most glaring mis-rankings in the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management are that it is too positive on Kognitio and too negative on Infobright. Secondarily, it is too negative on HP Vertica, and too positive on ParAccel and Actian/VectorWise. So let’s consider those vendors first.

Gartner seems confused about Kognitio’s products and history alike.

Gartner calls Kognitio an “in-memory” DBMS, which is not accurate.
Gartner doesn’t remark on Kognitio’s worst-in-class* compression.
Gartner gives Kognitio oddly high marks for a late, me-too Hadoop integration strategy.
Gartner writes as if Kognitio’s next attempt at the US market will be the first one, which is not the case.
Gartner says that Kognitio pioneered data warehouse SaaS (Software as a Service), which actually has existed since the pre-relational 1970s.

Gartner is correct, however, to note that Kognitio doesn’t sell much stuff overall.

* non-existent

In the cases of HP Vertica, Infobright, ParAccel, and Actian/VectorWise, the 2012 Gartner Magic Quadrant for Data Warehouse Database Management’s facts are fairly accurate, but I dispute Gartner’s evaluation. When it comes to Vertica:

I think HP’s troubles are less relevant to HP Vertica than Gartner does.
In particular, Vertica’s lack of integration with Autonomy isn’t a big deal. Many relational DBMS vendors don’t even own a text search engine to not-integrate with, and the number of vendors with seriously effective analytic RDBMS/text search integration strategies is zero.
Gartner is correct to note that Vertica’s integration with the rest of HP, for example the hardware side, has been slow — but again, so what?
Gartner correctly praises Vertica’s analytic platform capabilities, but then seems to criticize Vertica’s capabilities in user-defined functions — notwithstanding that Vertica’s analytic platform capabilities are implemented via UDFs.
Gartner seems to criticize Vertica’s “volume credentials”, even though Vertica’s number of petabyte-scale analytic RDBMS customers may be second only to Teradata’s.

That said, I defer to Gartner’s opinion that HP Vertica’s sales momentum has disappointed, even if against higher expectations than one might have for vendors with 1/10 of Vertica’s installed base.

2 years ago, I simply said “What Gartner said in connection with Ingres is too inaccurate to deserve detailed attention.” This year’s Gartner Magic Quadrant for Data Warehouse Database Management isn’t that bad on the subject of Actian,* but it’s not great either. Writing mainly about Actian’s VectorWise, Gartner dings it for both features and bugginess, and correctly notes that VectorWise is only suitable for fairly small data warehouses. But Gartner gives VectorWise higher marks than Exasol even so. Gartner also writes that VectorWise has a “long tradition of having loyal supporters”, notwithstanding that VectorWise’s initial release was less than 3 years ago.

*Ingres’ new name, in honor of a 2011 pivot that seems to already have been deprecated

What the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management says about ParAccel isn’t too different on the facts from what I wrote in December, but Gartner is more enthused than I am. Basically:

ParAccel is playing catch-up in features and company stability, and Gartner somehow sees that as a strength.
Gartner dings ParAccel for a variety of product weaknesses.
Gartner is breathless about ParAccel being used on a MicroStrategy SaaS site. (Hint: Deals like that go to vendors willing to accept very low prices.)
Gartner is also impressed with ParAccel’s Amazon involvement. (Deals like that aren’t lucrative either, except insofar as Amazon bought some ParAccel stock.)

That Gartner ranks ParAccel ahead of HP Vertica baffles me. Perhaps Gartner views benchmarks as more significant than I do, or is otherwise judging ParAccel to have an important advantage in performance.

I also am in substantial agreement with the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management about Infobright facts, but this time I’m the one with the more favorable interpretation. I agree that Infobright is a bit limited in features, in the areas Gartner cites and in analytic platform capabilities as well. Even so, Infobright is far ahead of VectorWise in — also low-priced — sales, and ahead in product stability and features too. Yet Gartner gives Infobright vastly lower marks than Actian. I suspect that the essence of our disagreement is that Gartner sees Infobright’s focus on machine-generated data as something that “limits market expansion”, while I see machine-generated data as something that is by every measure* growing to be a majority of the whole.

*By raw volume that’s been true for a while. But if we adjust for value and so on, the crossover is arguably still a way off.

I’ll run through the other vendors cited in the 2012 Gartner Magic Quadrant for Data Warehouse Database Management more quickly, in approximate declining order of Gartner’s rankings.

Gartner loves Teradata, but has some concerns over TCO (Total Cost of Ownership). Makes sense to me.
Gartner is more impressed with Oracle’s technology than I am. I find it hard to take seriously a data warehouse RDBMS vendor that can’t deliver a true columnar storage option.
Gartner’s write-up of IBM gets lost in IBM’s vast sea of products. I sympathize.
Gartner’s discussion of SAP/Sybase IQ was overly brief. But given my own difficulties staying up to speed on my clients over there, I sympathize. Please stay tuned.
Gartner’s discussion of Microsoft gets overly caught up in “logical data warehouse” foofarah, but basically it makes sense. Pending some promised briefings, I’m more optimistic about Microsoft’s analytic DBMS offerings than I’ve been for a long time. Please stay tuned.
Gartner notes difficulties validating EMC Greenplum’s customer claims. I sympathize. Gartner also notes a bunch of product issues that make me wonder why EMC Greenplum’s overall rating isn’t even lower.
Gartner’s view of Exasol seems similar to mine.
I’d evaluate 1010data on the basis of its spreadsheet-like analytic tools, not its DBMS technology.
Gartner seems to have difficulty finding non-trivial “Strengths” for Calpont. I sympathize.
Gartner notes difficulties contacting SAND. I sympathize, since SAND’s senior management resigned en masse during the July, 2012 quarter. (Page 30 of that link.)

Notes on some basic database terminology

Curt Monash — Tue, 07 Aug 2012 10:25:42 +0000

In a call Monday with a prominent company, I was told:

Teradata, Netezza, Greenplum and Vertica aren’t relational.
Teradata, Netezza, Greenplum and Vertica are all data warehouse appliances.

That, to put it mildly, is not accurate. So I shall try, yet again, to set the record straight.

In an industry where people often call a DBMS just a “database” — so that a database is something that manages a database! — one may wonder why I bother. Anyhow …

1. The products commonly known as Oracle, Exadata, DB2, Sybase, SQL Server, Teradata, Sybase IQ, Netezza, Vertica, Greenplum, Aster, Infobright, SAND, ParAccel, Exasol, Kognitio et al. all either are or incorporate relational database management systems, aka RDBMS or relational DBMS.

2. In principle, there can be difficulties in judging whether or not a DBMS is “relational”. In practice, those difficulties don’t arise — yet. Every significant DBMS still falls into one of two categories:

Relational:
- Was designed to do relational stuff* from the get-go, even if it now does other things too.
- Supports a lot of SQL.
Non-relational:
- Was designed primarily to do non-relational things.*
- Doesn’t support all that much SQL.

*I expect the distinction to get more confusing soon, at which point I’ll adopt terms more precise than “relational things” and “relational stuff”.

3. There are two chief kinds of relational DBMS:

RDBMS that are designed for, among other things, online transaction processing (OLTP). Examples include Oracle, DB2, SQL Server, Sybase ASE, PostgreSQL, and MySQL. It is reasonable to refer to these as general-purpose or OLTP RDBMS.*
RDBMS that are designed strictly for analytic uses. Examples include Sybase IQ, Vertica, Greenplum, Aster, Infobright, SAND, ParAccel, Exasol, Kognitio and the DBMS software inside systems from Teradata and Netezza. It is most accurate to refer to these as analytic RDBMS or just analytic DBMS (sometimes abbreviated ADBMS).

* “General-purpose” is usually a better term than “OLTP”; most OLTP DBMS can handle at least basic reporting, and the leading ones go well beyond that.

4. Some analytic RDBMS were designed to be columnar. Some were designed to be row-based. Multiple systems from both groups now offer both column- and row-based storage options. But they’re all equally relational.

And once again, I remind you that columnar storage and columnar compression are not the same thing.
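A small sketch may help keep the two ideas apart. Here `to_columns` only changes the storage layout, while `run_length_encode` is one simple example of a compression scheme that could then be applied to a column. Both function names and the sample rows are hypothetical, and real products use far more elaborate techniques.

```python
ROWS = [
    ("2012-08-01", "US", 10),
    ("2012-08-01", "US", 12),
    ("2012-08-01", "UK", 7),
    ("2012-08-02", "UK", 9),
]

def to_columns(rows):
    """Columnar storage: keep each column's values together."""
    return [list(col) for col in zip(*rows)]

def run_length_encode(values):
    """One simple compression scheme: (value, run length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

columns = to_columns(ROWS)                      # layout only, nothing compressed
encoded_dates = run_length_encode(columns[0])   # compression applied to one column
```

The first step can be used without the second, which is the sense in which storage and compression are separate design decisions.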

5. An appliance can include a DBMS, and indeed exist for no purpose other than to run a DBMS; but a DBMS is not an appliance. At a minimum, a data warehouse appliance is a computing system (hardware, storage, operating system, etc.) with an analytic RDBMS preinstalled.

Occasionally somebody suggests that a “virtual appliance” doesn’t have to have hardware included, but they usually draw little attention.

However, reasonable people can disagree about pickier questions, such as:

Does appliance hardware have to be in any way purpose-built? I lean to a No — but I prefer those “appliance” stories that include an actual hardware advantage.
Does appliance hardware have to have custom silicon, or at least FPGAs (Field-Programmable Gate Arrays)? My answer is an emphatic No.
Does an appliance have to be super-easy to install and administer? I lean to a No — but two of the top appliance benefits are ease of deployment and administration.

For example, I think:

All hardware systems Teradata makes are appliances, even the ones it thinks aren’t.
Similarly, Oracle Exadata systems are appliances.
IBM Netezza is the classic line of data warehouse appliances.
IBM’s “Smart Analytic Systems” can justifiably be called appliances if IBM wishes — but IBM would be wise to save that word for its Netezza line.

Again, reasonable people can disagree — just so long as they don’t slap the label “appliance” onto software-only analytic RDBMS.

Why I recommend avoiding Kognitio

Curt Monash — Wed, 18 Jul 2012 01:39:55 +0000

Since my recent post about Kognitio, things have gotten worse. The company is insistently pushing the marketing message that Kognitio has always been an in-memory product, and at one point went so far as to publicly pretend that I had agreed.

I do not agree. Yes, it’s fair to say — as I did in 2008 — that Kognitio is very RAM-centric, but that’s not at all the same thing. In particular:

I did due diligence for Warburg Pincus’ original investment in Kognitio in the 1990s (it was then called White Cross). I have no memory of an in-memory positioning, nor of discussing same with anybody.
I checked my notes from a 2006 briefing, which included Kognitio CTO Roger Gaskell. There was no claim that Kognitio was an in-memory product.
Indeed, as I also posted in 2008, Kognitio keeps indexes on disk. If you use indexes on disk, you’re not an in-memory product.

The truth is that Kognitio offers a disk-based DBMS that has long been worked on by a small team. I believe that the team really has put considerable effort into how Kognitio uses RAM. But there’s no basis to give Kognitio credit for being “really” in-memory vs. a variety of other analytic RDBMS alternatives. And a row-based product that doesn’t currently offer compression is at a large disadvantage versus, say, columnar products that already do.*

*Columnar systems don’t clobber row-based ones in-memory as extremely as they do in some disk-based use cases. But even in-memory it’s good not to have to move around data that isn’t relevant to your query.

Until Kognitio gets at least somewhat more honest in its marketing, I recommend avoiding Kognitio like the plague. It’s simply not a big enough company to buy from unless you have some level of trust in the management team.

Kognitio’s story today

Curt Monash — Wed, 23 May 2012 01:36:17 +0000

I had dinner tonight with the Kognitio folks. So far as I can tell:

Branding has been mercifully simplified. Everything is now called “Kognitio” (as opposed to, for example, “WX2”).
Notwithstanding its long history of selling disk-based DBMS and denigrating memory-only configurations, Kognitio now says that in fact it’s always been an in-memory DBMS vendor.
Notwithstanding its long history of selling (or attempting to sell) analytic DBMS, Kognitio wants to be viewed as an accelerator to your existing DBMS. This is apparently inspired in part by SAP HANA, notwithstanding that HANA’s direction is to evolve into a hybrid OLTP/analytic general-purpose DBMS.
Notwithstanding its lack of analytic platform features, Kognitio wants to be viewed as selling an analytic platform.
Notwithstanding its memory-centric focus, Kognitio doesn’t want to compress data. Kognitio’s opinion — which to my knowledge is shared by few people outside Kognitio — seems to be that the CPU cost of compression/decompression isn’t justified by the RAM savings from compression.
Kognitio still is pushing a cloud/SaaS (Software as a Service) story. Even if you want to use Kognitio (the product) on-premises, Kognitio (the company) calls that “private cloud” and offers to let you pay annually.

Kognitio believes that this story is appealing, especially to smaller venture-capital-backed companies, and backs that up with some frieNDA pipeline figures.

Between that success claim and SAP’s HANA figures, it seems that the idea of using an in-memory DBMS to accelerate analytics has legs. This makes sense, as the BI vendors — Qlik Tech excepted — don’t seem to be accomplishing much with their proprietary in-memory alternatives. But I’m not sure that Kognitio would be my first choice to fill that role. Rather, if I wanted to buy an unsuccessful analytic RDBMS to use as an in-memory accelerator, I might consider ParAccel, which is columnar, has an associated compression story, has always had a hybrid memory-centric flavor much as Kognitio has, and is well ahead of Kognitio in the analytic platform derby. That said, I’ll confess to not having talked with or heard much about ParAccel for a while, so I don’t know if they’ve been able to maintain technical momentum any more than Kognitio has.
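The compression trade-off in the list above (CPU spent compressing and decompressing versus RAM saved) is something one can measure rather than argue about. The sketch below uses Python's standard zlib module on a made-up, highly repetitive sample; `compression_tradeoff` is a hypothetical helper, and zlib is only a stand-in for whatever scheme a DBMS would actually use, so the numbers say nothing about any particular product.

```python
import time
import zlib

def compression_tradeoff(data: bytes, level: int = 6):
    """Return (compression ratio, seconds) for one compress/decompress round trip."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    restored = zlib.decompress(packed)
    elapsed = time.perf_counter() - start
    assert restored == data  # the round trip must be lossless
    return len(data) / len(packed), elapsed

# Hypothetical sample resembling a low-cardinality column of country codes.
sample = b"".join(code * 1000 for code in (b"US,", b"UK,", b"DE,"))
ratio, seconds = compression_tradeoff(sample)
```

A ratio well above 1 means the same RAM holds proportionally more data; whether that is worth the measured CPU time depends on the workload.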

Many kinds of memory-centric data management

Curt Monash — Sun, 08 Apr 2012 01:33:31 +0000

I’m frequently asked to generalize in some way about in-memory or memory-centric data management. I can start:

The desire for human real-time interactive response naturally leads to keeping data in RAM.
Many databases will be ever cheaper to put into RAM over time, thanks to Moore’s Law. (Most) traditional databases will eventually wind up in RAM.
However, there will be exceptions, mainly on the machine-generated side. Where data creation and RAM data storage are getting cheaper at similar rates … well, the overall cost of RAM storage may not significantly decline.
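The arithmetic behind that last point can be sketched as follows. The `ram_cost` function and every number in it are illustrative assumptions, chosen only so that volume growth and price decline cancel exactly.

```python
def ram_cost(years, start_gb, volume_growth, start_price_per_gb, price_decline):
    """Yearly cost of holding a data set in RAM under constant yearly rates."""
    return [
        start_gb * (1 + volume_growth) ** t
        * start_price_per_gb * (1 - price_decline) ** t
        for t in range(years + 1)
    ]

# A fixed-size database: the cost halves every year as RAM gets cheaper.
static_db = ram_cost(3, start_gb=1000, volume_growth=0.0,
                     start_price_per_gb=10.0, price_decline=0.5)

# Data volume doubles yearly while the price per GB halves: the cost stays flat.
growing_db = ram_cost(3, start_gb=1000, volume_growth=1.0,
                      start_price_per_gb=10.0, price_decline=0.5)
```

With these made-up rates the fixed-size database costs 10000, 5000, 2500, and 1250 over four years, while the growing one costs 10000 every year.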

Getting more specific than that is hard, however, because:

The possibilities for in-memory data storage are as numerous and varied as those for disk.
The individual technologies and products for in-memory storage are much less mature than those for disk.
Solid-state options such as flash just confuse things further.

Consider, for example, some of the in-memory data management ideas kicking around.

In many cases there is essentially an in-memory DBMS, trying for as much ACIDity as RAM reasonably allows, then (usually) also copying data synchronously to persistent storage. These can have many different architectures. For example:
- SAP HANA is an in-memory columnar DBMS, with text indexing/inverted-list antecedents, except when it uses one of a couple of approaches to in-memory row-based data management.
- solidDB, now an IBM product, is an RDBMS that relies on Patricia tries. It is actually a hybrid memory/disk product, but optimized for in-memory operation.
- eXtremeDB is an OODBMS, but relies on B-trees.
- H-Store and its commercialization VoltDB are row-based RDBMS that make drastic assumptions about the nature of your workload, but in return get to drop much of the overhead other DBMS need.
- Oracle TimesTen is a row-based RDBMS, oriented to OLTP (OnLine Transaction Processing), which stores its data persistently via another RDBMS. (Oracle was the default choice even before Oracle bought the company.)
- Oracle’s answer to SAP HANA is to take TimesTen and do analytics on it, via the Exalytics appliance.
Some disk-based DBMS just happen to be architected in ways so that for good performance you’re going to want to keep all the data in RAM. Often, their in-memory architecture is a lot like their on-disk architecture, with memory mapping for I/O. This is done in very different kinds of DBMS.
- MongoDB is one visible example. In general, scale-out web databases (whether NoSQL or MySQL) often keep all their data in RAM, whether or not that plan is baked into the DBMS architecture.
- Various analytic DBMS vendors have at times been memory-oriented. At the moment, I think:
  - Exasol (columnar) isn’t quite as extreme about wanting to be in-memory as it used to be.
  - ParAccel (columnar) and its memory-mapped architecture can be happily used either in-memory or on disk.
  - Kognitio (row-based), which used to be portrayed as a disk-based system that’s smart about using RAM, is currently being marketed as an in-memory system.
My last technical briefing on Applix TM1 (now an IBM Cognos product) was in September, 2005. (The product itself dates back to 1984.) At the time TM1 had an interesting sparse MOLAP (Multi-Dimensional OnLine Analytic Processing) story, the point being that the system worked hard to isolate what was actually non-zero. Loading of raw data seemed to be batch, but you could update models with derived data, and there was a transaction log for confident persistence.
Alternatively, you can use a caching layer, typically on a separate set of servers from your DBMS, which has no responsibility for managing data persistence. For example:
- TimesTen and solidDB are used, respectively, as relational caches for Oracle and DB2.
- Peter Zencke told me years ago that SAP had a purpose-built caching layer that kept over 99% of requests from touching disk.
- The key-value store memcached is central to many of the world’s largest web sites, typically backed by a MySQL cluster.
- ScaleArc has a key-value cache that stores the entire TCP string sent by an RDBMS in response to a particular SQL query, rather than individual records.
Some systems manage data in memory in one kind of structure, then ensure persistence via a very different structure on disk. Examples include:
- Workday’s architecture — object-oriented in RAM, MySQL (really key-value) on disk. Edit: Workday thinks “key-value” is a slightly misleading way to put it. Stay tuned for more.
- Oracle Coherence (formerly Tangosol) — object-oriented in RAM, Oracle on disk. Edit: Actually, Coherence isn’t really a write-through ORM (Object-Relational Mapper). It functions more like memcached, albeit with a very different data model.
- Couchbase — memcached (key-value) in-memory, evolving from SQLite to CouchDB on disk.
Similarly, business intelligence suites can manage data in-memory that comes from some other kind of data store (usually an RDBMS, sometimes Hadoop or whatever). I haven’t had a lot of luck in getting details, with one exception — QlikView, which uses a simple tabular data structure.
Stream processors — i.e. CEP engines — are a whole other sort of in-memory engine, doing something that’s a lot like data management.

And that, kiddies, is why I hesitate to generalize in too much detail about “in-memory database management.”

Despite its length, this is still a very partial list of memory-centric data management approaches. I encourage you to use the comments to add other examples that I might have left out.
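As one illustration of the caching-layer approach in the list above, here is a minimal sketch of a cache keyed by the query itself rather than by individual records. It is loosely inspired by the ScaleArc description but is not its implementation: it caches Python result rows instead of raw TCP responses, uses sqlite3 as a stand-in backend, has no invalidation, and the `QueryResultCache` class is hypothetical.

```python
import sqlite3

class QueryResultCache:
    """Caches whole query results keyed by the exact SQL text and parameters."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}
        self.hits = 0

    def query(self, sql, params=()):
        key = (sql, tuple(params))
        if key in self.cache:
            # Served from RAM; the DBMS is not touched.
            self.hits += 1
            return self.cache[key]
        rows = self.conn.execute(sql, params).fetchall()
        self.cache[key] = rows
        return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

cached = QueryResultCache(conn)
first = cached.query("SELECT name FROM users WHERE id = ?", (1,))   # goes to the DBMS
second = cached.query("SELECT name FROM users WHERE id = ?", (1,))  # served from cache
```

Because the cache has no responsibility for persistence, the hard part in practice is deciding when a cached answer has gone stale, which this sketch ignores entirely.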

Related link

I did a simpler overview of memory-centric alternatives in 2005.

Comments on the analytic DBMS industry and Gartner’s Magic Quadrant for same

Curt Monash — Wed, 08 Feb 2012 17:17:32 +0000

This year’s Gartner Magic Quadrant for Data Warehouse Database Management Systems is out.* I shall now comment, just as I did on the 2010, 2009, 2008, 2007, and 2006 Gartner Data Warehouse Database Management System Magic Quadrants, to varying extents. To frame the discussion, let me start by saying:

In general, I regard Gartner Magic Quadrants as a bad use of good research.
Illustrating the uselessness of — or at least poor execution on — the overall quadrant metaphor, a large majority of the vendors covered are lined up near the line x = y, each outpacing the one below in both of the quadrant’s dimensions.
I find fewer specifics to disagree with in this Gartner Magic Quadrant than in previous years’ versions. Two factors jump to mind as possible reasons:
- This year’s Gartner Magic Quadrant for Data Warehouse Database Management Systems is somewhat less ambitious than others; while it gives as much company detail as its predecessors, it doesn’t add as much discussion of overall trends. So there’s less to (potentially) disagree with.
- Merv Adrian is now at Gartner.
Whatever the problems may be with Gartner’s approach, the whole thing comes out better than do Forrester’s failed imitations.

*As of February, 2012 — and surely for many months thereafter — Teradata is graciously paying for a link to the report.

Specific company comments, roughly in line with Gartner’s rough single-dimensional rank ordering, include:

The Gartner Magic Quadrant’s comments on Teradata seem pretty fair. I don’t think I’m much in disagreement when I say:
- Teradata has the richest, most mature analytic DBMS offering.
- Teradata has an outstanding track record both for managing large data volumes and for high-concurrency mixed workloads.
- Aster Data was a cool Teradata acquisition, even if Teradata/Aster synergies or integration have been nominal to date.
- Teradata still needs to get out of its own way in marketing, positioning, packaging, and/or defining its premium-priced system vs. its more moderately-priced alternatives. Indeed, as necessary as this approach may have been to fending off encroachments by Netezza and others, what Teradata really needs to do is evolve to a more pick-your-own-node-combination mix-match kind of offering.
Gartner has talked with a lot of Oracle Exadata users who say that the product works; Gartner also has stopped beating Oracle up for its previous policy of almost never doing onsite POCs (Proofs of Concept); both parts of that ring true with me. But Gartner also rightly dings Oracle for various issues in cost and cumbersomeness. Overall, while I agree there are organizations for which Oracle should indeed be a top-ranked choice, there are many others who shouldn’t put Oracle on their short list.
Third in the Gartner MQ rankings is IBM.
- Gartner gets so caught up in reciting the names of various IBM product offerings that it neglects to say much good about DB2 itself. (I tend to have a similar problem.)
- But Gartner does mention concurrency as a strength. I agree, especially if we presume that that was a reference to DB2 rather than Netezza.
- Gartner cites Netezza’s post-acquisition annual growth rate as 30%. Gartner seems to think this is a good number. I disagree, but in Netezza’s defense, it has had to endure IBM’s post-acquisition on-boarding process.
Arguably fourth in the Gartner Data Warehouse Magic Quadrant rankings is EMC/Greenplum.
- In general, Gartner likes the taste of Greenplum Kool-Aid.
- Gartner neglects to ding Greenplum for concurrency challenges, which I view as an oversight given Gartner’s general stress on that area.
- Gartner does ding Greenplum for support challenges.
- Gartner neglects to praise Greenplum for true hybrid row/columnar data management, a feature shared by Teradata and Vertica, among others, but not by Oracle, DB2, or Netezza.
- Gartner located a half-petabyte Greenplum database. This doesn’t surprise me, even though Greenplum has frequently made exaggerated claims about large-size database successes in the past.
- Gartner reports a >400 figure for Greenplum customers, which is plausible.
In its first deviation from strict one-dimensional rank ordering, the Gartner Magic Quadrant ranks Sybase ahead of Greenplum in completeness of vision but behind in “ability to execute”.
- If that were the other way around, it might make more sense. Greenplum promises anything and everything you might ever want for analytic data management or the associated analysis; but Sybase has vastly more analytic DBMS users than Greenplum does, running a variety of demanding workloads.
- Gartner appears to think that Sybase IQ requires less database administration than I think it does.
- Gartner seems concerned that SAP will position HANA and Sybase ASE as, between them, the only DBMS you’ll ever need, casting doubt on Sybase IQ’s future. I wouldn’t worry about that if you have a problem you want to solve today.
The Gartner Magic Quadrant for Data Warehouse Database Management Systems ranks Microsoft sixth overall, despite noting that there isn’t a single production reference for Microsoft’s Parallel Data Warehouse. In support of this ranking it cites, for example, the compression feature, which distinguishes Microsoft SQL Server from no other product on the list except Kognitio. If you have such an undemanding data warehousing problem that many different analytic DBMS could meet your needs, there’s a good chance Microsoft SQL Server can also do the job; and if you’ve bought into the Microsoft technology stack, you might as well keep going down that path. Otherwise, I don’t know why somebody should adopt Microsoft’s offering at this time.
Seventh along the main diagonal path in the Gartner Magic Quadrant is HP Vertica. I’d rank Vertica higher than that, but in fairness I note two execution concerns. First, HP has a lousy track record, both in acquisitions and in data warehousing/analytics. Second, Vertica is bad about answering my email. Anyhow, Gartner doesn’t seem to have given Vertica credit either for its full customer count or for the multiple petabyte-scale databases Vertica runs.
1010data is an outlier, with Gartner noting that it only partly fits in with other “Data Warehousing Database Management” companies, and hence kind of confessing that 1010data’s specific location on the Magic Quadrant is somewhat arbitrary. Stuff like that is bound to happen, given the inherent difficulties of defining market categories. Anyhow, my thoughts on 1010data include:
- I’m nervous about the fact that 1010data doesn’t actually control its own DBMS technology, but rather relies on old code from the small private company KX Systems.
- There are three main reasons to consider 1010data:
  - You want to enter the data mart outsourcing business in a casual way, and you like its SaaS offering.
  - You want to engage in stakeholder-facing analytics in a casual way, and you like its SaaS offering.
  - You love 1010data’s particular set of interactive analytic features and performance.
Back to the main path winding along the Gartner Magic Quadrant main diagonal — next up is ParAccel. While I question some of the peripheral comments, I agree with Gartner’s core messages that:
- ParAccel, the product, is blazingly fast in certain use cases.
- ParAccel, the company, is dangerously small.
Eighth on the Gartner MQ’s main path is Kognitio. This is too high. Kognitio positions itself as offering in-memory DBMS, yet stubbornly refuses to do any kind of data compression. That’s an awful combination of choices. As for using Kognitio’s data warehousing SaaS offering — why would you do that, when more modern products are available on a SaaS/cloud basis as well?
Ninth in the Gartner Magic Quadrant main rankings is SAND.
- The SAND section is not a triumph of Gartner accuracy. For example:
  - Gartner completely missed the errors in SAND’s reported customer counts.
  - Gartner refers to SAND as being “in existence for approximately nine years”, which is too low by at least a factor of 2.
  - Gartner says “SAND is a privately held company”, even though Merv knows better than that.
- Otherwise, Gartner’s opinion on SAND seems to boil down to “Interesting technology and ideas, but dangerously small company.” I agree.
Tenth and too low in the Gartner MQ main rankings is Infobright.
- At least by some metrics (e.g. customer count), Infobright isn’t as dangerously small as ParAccel, SAND, Kognitio, et al.
- That said, Infobright is small and focused on machine-generated data. So I wouldn’t be confident in Infobright’s future technology path for human-generated data use cases.
- Infobright’s performance is uneven — blazing in cases where the Knowledge Grid helps, but not necessarily stellar by analytic DBMS standards when full table scans are called for.
- I agree with Gartner that the possibility of Oracle/MySQL future shenanigans is a concern. But while the energy behind MySQL forking efforts doesn’t seem too great right now, I’d expect them to revive and offer a successful escape path if it seemed Oracle was going to indeed play hardball.
- Also, given that it’s already an open source vendor, there are various kinds of assurances Infobright could give that would also help alleviate customer concerns.
Actian, formerly Ingres, took a big tumble in Gartner’s rankings versus last year, when I simply wrote “What Gartner said in connection with Ingres is too inaccurate to deserve detailed attention.” I’m even a little harsher about Ingres/Actian’s DBMS products and prospects than Gartner is, but at least now we’re in the same ballpark.
Along with Infobright, ParAccel, and SAND, Exasol appears to be another of the “good columnar technology/small company” crowd. As with other such products, one should be careful about fit-and-finish features that are missing today, as there is no assurance they’ll be added in a timely manner going forward.
illuminate Solutions, which was on last year’s Gartner list, now appears to be an ex-company.

Database implications if IBM acquires Sun

Curt Monash — Wed, 18 Mar 2009 14:48:13 +0000

Reported or rumored merger discussions between IBM and Sun are generating huge amounts of discussion today (some links below). Here are some quick thoughts around the subject of how the IBM/Sun deal — if it happens — might affect the database management system industry.

IBM is already serious about supporting multiple database management systems. DB2 on open systems is IBM’s flagship DBMS. But DB2 on mainframes and at least one flavor of Informix seem to be getting maintained and enhanced fairly seriously as well. And IBM has further DBMS products as well (e.g., DB/2 on the AS/400). There’s little reason to think IBM would orphan MySQL or any other DBMS product.
IBM is very open-source-friendly. For a company that grew up for decades on proprietary software — and still is a huge software products vendor — IBM is very serious about open source. If you doubt that, I have two words for you: “Linux” and “Eclipse”.
MySQL might finally get its industrial-strength act together. IBM is good at database management and good at open source. MySQL becoming a no-apologies transactional DBMS would obviously put pressure on Ingres, PostgreSQL, and EnterpriseDB, although there surely would be lots of happy talk about the open source DBMS market being validated, lifting all the vendors and so on. Also, a better MySQL could be bad news for Microsoft SQL Server too.
Sun has a lot of DBMS partnerships right now. Obviously, Sun owns MySQL, and has partnerships with MySQL storage engine vendors such as Infobright and Kickfire. Sun also has a substantial partnership with Greenplum, and a Barneyesque* one with ParAccel. And of course Sun has strong working relationships with major database vendors such as Oracle and Sybase. What’s more, on a case-by-case basis, Sun may cooperate in the field with yet other DBMS sellers. E.g., I’ve confirmed at least one instance of a Sun sales rep recommending a Kognitio DBMS.
IBM partners with outside DBMS vendors too. You’d think IBM’s gazillion DBMS product lines would be enough. But nooooo. I frequently hear rumblings of IBM’s hardware or services operations working with other DBMS products as well. (This is, of course, actually to their credit.)
Short-term, there probably would be little effect on partnerships. Greenplum runs on Sun’s Thumper/Thor line of boxes. DB2 doesn’t, and certainly isn’t optimized for same. In the short term, to sell Thors, Sun would presumably continue to sell Greenplum.
Longer-term, there could be a DBMS rationalization. DB2, Informix, MySQL + storage engines, and big independent vendors such as Oracle and Sybase would surely always get attention. That’s a lot. There might not be room for much mind share for many database products and vendors beyond that list.

*A Barney partnership is one in which two or more vendors get on stage and do a song and dance about how much they love each other, with little substance beyond that.

Related links

Larry Dignan thinks the IBM/Sun deal is sensible and ripe to happen.
Dana Gardner thinks otherwise.
Matt Asay seems to agree that IBM understands the open source business.
Before IBM acquired it, solidDB was scheduled to provide a serious MySQL transaction processing engine.

One vendor’s trash is another’s treasure

Curt Monash — Mon, 02 Feb 2009 07:05:44 +0000

A few months ago, CEO Mayank Bawa of Aster Data commented to me on his surprise at how “profound” the relationship was between design choices in one aspect of a data warehouse DBMS and choices in other parts. The word choice in that was all Mayank, but the underlying thought is one I’ve long shared, and that I’m certain architects of many analytic DBMS share as well.

For that matter, the observation is no doubt true in many other product categories as well. But in the analytic database management arena, where there are literally 10-20+ competitors with different, non-stupid approaches, it seems most particularly valid. Here are some examples of what I mean.

Hash distribution. In shared-nothing or shared-not-very-much database architectures, multiple processors pull data off disk in parallel. Ideally, it will be the case that for each long-running query, the amount of data retrieved at each node is almost identical. That way, each node is done at the same time, with no wasteful waiting.

Consequently, data should be distributed more or less randomly across the nodes. That can be done through “round-robin” allocation — each node takes a turn in strict order receiving new records or blocks. Or it can be done by hashing on a particular key — in essence, by assigning data to different disks depending on the value in some particular field or combination of fields.
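
The two placement schemes just described can be sketched in a few lines of Python. This is only an illustration, not any vendor's implementation: the function names, the node count, and the use of CRC32 as the hash function are all assumptions made for the example.

```python
import zlib
from collections import Counter


def round_robin_node(row_number: int, num_nodes: int) -> int:
    """Round-robin: each node takes a turn in strict order."""
    return row_number % num_nodes


def hash_node(key: str, num_nodes: int) -> int:
    """Hashing: the node depends only on the key value.

    CRC32 is an assumed stand-in for whatever hash a real system uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_nodes


# Hypothetical data: 1,000 rows keyed by a made-up customer id.
NUM_NODES = 4
keys = [f"cust{i}" for i in range(1000)]

round_robin_counts = Counter(round_robin_node(i, NUM_NODES) for i in range(len(keys)))
hash_counts = Counter(hash_node(k, NUM_NODES) for k in keys)

print("round-robin rows per node:", dict(round_robin_counts))
print("hashed rows per node:     ", dict(hash_counts))
```

Round-robin gives perfectly even counts by construction. Hashing is only as even as the key values allow, but in exchange every row with a given key value always lands on the same node.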

Hash distribution is a wonderful optimization. For most large tables, there’s an obvious join key that will be relevant to a significant fraction of all long-running queries. Pre-hashing on that key saves a huge step in the execution of hash joins involving that key, and hence can provide a significant reduction in the total query processing workload. Nor is this benefit confined to single-fact-table or single-primary-key schemas. When different kinds of data are stored in the same warehouse, each large fact table can be hash distributed on its own key.
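
To make the join benefit concrete, here is a minimal sketch, assuming two toy tables that are both placed on nodes by hashing the same join key. Matching rows then already sit on the same node, so each node can run an ordinary local hash join with no data shuffled between nodes. The table contents, function names, node count, and CRC32 hash are all hypothetical.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # assumed cluster size for the example


def node_for(key) -> int:
    """Assumed placement rule: CRC32 of the key, modulo the node count."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_NODES


def distribute(rows):
    """Place each row on a node according to its first column (the join key)."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[0])].append(row)
    return nodes


def local_hash_join(left_rows, right_rows):
    """Ordinary in-memory hash join on the first column of each row."""
    build = defaultdict(list)
    for left in left_rows:
        build[left[0]].append(left)
    return [left + right[1:] for right in right_rows for left in build.get(right[0], [])]


def colocated_join(left, right):
    """Join node by node; no row ever needs to move to another node."""
    left_nodes, right_nodes = distribute(left), distribute(right)
    result = []
    for node in range(NUM_NODES):
        result.extend(local_hash_join(left_nodes.get(node, []), right_nodes.get(node, [])))
    return result


customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(1, "o-10"), (3, "o-11"), (1, "o-12"), (4, "o-13")]

# The per-node joins together produce the same rows as one global join.
assert sorted(colocated_join(customers, orders)) == sorted(local_hash_join(customers, orders))
```

If the two tables were instead placed round-robin, a customer and its orders could sit on different nodes, and one side would have to be redistributed before the join could run.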

For almost all databases on almost all shared-nothing vendors’ systems, hash distribution is the way to go. Even so, a couple of products don’t even bother supporting it. Oracle Exadata isn’t going to perform joins of that kind anyway until data is moved from the storage to the database tier, so hash distribution has no benefit in Exadata’s multi-tier architecture. Kognitio, while not having such a clean proof of why hash distribution is utterly beside the point, thinks the costs of violating strict randomness outweigh the benefits in its silicon-centric approach.

Indexing alternatives. More generally, analytic DBMS differ from OLTP DBMS in that they’re optimized to run more table scans and fewer updates and pinpoint queries. I’ve written about that many times, even coining the phrase index-light to encapsulate the story. The general idea is that if you’re retrieving a lot of rows per query, it becomes inefficient to keep spinning the disk to ensure you get only the rows you want. You get a lot more bytes/second doing sequential than random reads, so if a sufficiently large fraction of the rows are ones you actually want, it’s better to just scan them all.
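
The scan-versus-seek tradeoff can be put into a back-of-the-envelope cost model. The sketch below is a deliberate simplification: it assumes every wanted row costs one full disk seek plus the transfer of that row, and it ignores block granularity, caching, and clustering of wanted rows. The row size, seek time, and transfer rate are illustrative assumptions, not measurements of any product.

```python
def scan_seconds(num_rows: int, row_bytes: int, seq_bytes_per_sec: float) -> float:
    """Time to read every row sequentially."""
    return num_rows * row_bytes / seq_bytes_per_sec


def random_read_seconds(num_rows: int, fraction_wanted: float, seek_seconds: float,
                        row_bytes: int, seq_bytes_per_sec: float) -> float:
    """Time to fetch only the wanted rows, paying one seek per row (assumed)."""
    wanted_rows = num_rows * fraction_wanted
    return wanted_rows * (seek_seconds + row_bytes / seq_bytes_per_sec)


def break_even_fraction(seek_seconds: float, row_bytes: int,
                        seq_bytes_per_sec: float) -> float:
    """Fraction of rows wanted above which a full scan is the cheaper plan."""
    transfer = row_bytes / seq_bytes_per_sec
    return transfer / (seek_seconds + transfer)


# Assumed figures: 200-byte rows, 5 ms per seek, 100 MB/s sequential reads.
fraction = break_even_fraction(seek_seconds=0.005, row_bytes=200, seq_bytes_per_sec=100e6)
print(f"full scan wins once more than {fraction:.4%} of rows are wanted")
```

Under these assumed numbers the break-even point is well below one percent of the rows. Real systems fetch whole blocks rather than single rows, which moves the threshold, but the shape of the argument is the same.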

If you’re going to follow an extreme form of that approach (e.g. Netezza, DATAllegro), you might as well have huge block sizes for your data (1 megabyte+). If you think indexes of various kinds will actually be useful a reasonable fraction of the time, you might go with smaller sizes, such as 128K, which is what Teradata and HP (Neoview) favor.

Meanwhile, columnar vendor Vertica recreates some of the benefits of indexes by storing the same column in multiple sort orders. And that leads me to the next point.

High availability/failover alternatives. Most analytic DBMS mirror the data on-the-fly. But strategies differ. Some just rely on a storage vendor’s technology; others build in their own forms of redundancy.

Particularly interesting is Vertica’s approach. Not only does Vertica allow multiple copies of the data to each be used for querying; it encourages the storage of the same columns in different sort orders, with the optimizer obviously choosing to query the copy that’s sorted in the way most useful for a specific query’s execution plan.
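
The idea of keeping redundant copies in different sort orders, and querying whichever copy suits the predicate, can be sketched as follows. This is a toy model and not Vertica's actual design: the `Projection` class, the `choose_projection` rule, and the sample rows are all hypothetical.

```python
import bisect


class Projection:
    """One stored copy of the table, kept sorted on a single column."""

    def __init__(self, rows, sort_column):
        self.sort_column = sort_column
        self.rows = sorted(rows, key=lambda r: r[sort_column])
        self.keys = [r[sort_column] for r in self.rows]

    def range_lookup(self, low, high):
        """Binary-search the sorted copy instead of scanning all of it."""
        start = bisect.bisect_left(self.keys, low)
        end = bisect.bisect_right(self.keys, high)
        return self.rows[start:end]


def choose_projection(projections, filter_column):
    """Assumed optimizer rule: prefer the copy sorted on the filtered column."""
    for projection in projections:
        if projection.sort_column == filter_column:
            return projection
    return projections[0]


def query(projections, filter_column, low, high):
    """Return rows with low <= row[filter_column] <= high."""
    chosen = choose_projection(projections, filter_column)
    if chosen.sort_column == filter_column:
        return chosen.range_lookup(low, high)
    # No copy is sorted usefully, so fall back to scanning one copy.
    return [r for r in chosen.rows if low <= r[filter_column] <= high]


rows = [{"day": 3, "store": 7}, {"day": 1, "store": 9}, {"day": 2, "store": 7}]
copies = [Projection(rows, "day"), Projection(rows, "store")]

print(query(copies, "day", 2, 3))    # served by the copy sorted on "day"
print(query(copies, "store", 7, 7))  # served by the copy sorted on "store"
```

Either copy holds every row, so each can also stand in for the other after a failure; the second copy is paying for redundancy and for index-like access at the same time.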

Redundancy and failover strategies are tightly tied to other administration issues too. For example, Aster Data and other vendors brag, with varying degrees of emphasis, that a new node can be added to a system, and the whole thing reconfigures itself automagically with zero down time. Similarly, different systems respond differently to node failure, in terms of metrics such as time to reestablish normal operation, performance hit (if any) after normal operation resumes, performance hit before normal operation resumes, and time window (if any) during which redundancy is lost, so that a second failure would crash the whole system.

Bottom line: There never will be an analytic DBMS that simultaneously possesses all highly desirable architectural attributes for the product category.