Exasol – DBMS 2 : DataBase Management System Services

Are analytic RDBMS and data warehouse appliances obsolete?

Curt Monash — Mon, 29 Aug 2016 01:28:31 +0000

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.

Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”

If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.

Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.
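To make the ELT idea concrete, here is a minimal sketch in Python, using the standard-library sqlite3 module as a stand-in for an analytic RDBMS. The table names (raw_orders, orders_by_region), the column names, and the sample rows are all invented for illustration. The raw data is loaded untransformed, and the transformation then runs as SQL inside the database.

```python
import sqlite3

# Hypothetical raw extract: (order_id, region, amount as text), invented for illustration.
RAW_ROWS = [
    (1, " east ", "10.50"),
    (2, "WEST", "4.25"),
    (3, "East", "2.00"),
]


def run_elt(rows):
    """Load raw rows first, then transform them with SQL inside the database."""
    conn = sqlite3.connect(":memory:")  # sqlite3 stands in for an analytic RDBMS
    # Load step: store the data exactly as extracted, with no cleanup.
    conn.execute("CREATE TABLE raw_orders (order_id INTEGER, region TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
    # Transform step: normalize and aggregate in SQL, after loading.
    conn.execute(
        """
        CREATE TABLE orders_by_region AS
        SELECT lower(trim(region)) AS region,
               sum(CAST(amount AS REAL)) AS total
        FROM raw_orders
        GROUP BY lower(trim(region))
        """
    )
    result = conn.execute(
        "SELECT region, total FROM orders_by_region ORDER BY region"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(run_elt(RAW_ROWS))
```

In a classic ETL flow, the trim, lowercase, and cast logic would instead run in the ETL engine before the INSERT, so the database would only ever see cleaned rows.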

And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.

My technical observations around these trends include:

Advanced analytics commonly require flexible, iterative processing.
Spark is much better at such processing than earlier Hadoop …
… which in turn is better than anything that’s been built into an analytic RDBMS.
Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.

And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.

They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
Even so, almost everybody who uses Hadoop or Spark uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.

But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer these trade-offs:

They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (database administration can be just-in-time) or to OLTP integration (less or no re-loading).
Other integrations, as noted above, can also be weak.

Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.

Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.

Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.

Comments on Gartner’s 2012 Magic Quadrant for Data Warehouse Database Management Systems — evaluations

Curt Monash — Tue, 05 Feb 2013 13:25:15 +0000

To my taste, the most glaring mis-rankings in the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management are that it is too positive on Kognitio and too negative on Infobright. Secondarily, it is too negative on HP Vertica, and too positive on ParAccel and Actian/VectorWise. So let’s consider those vendors first.

Gartner seems confused about Kognitio’s products and history alike.

Gartner calls Kognitio an “in-memory” DBMS, which is not accurate.
Gartner doesn’t remark on Kognitio’s worst-in-class* compression.
Gartner gives Kognitio oddly high marks for a late, me-too Hadoop integration strategy.
Gartner writes as if Kognitio’s next attempt at the US market will be the first one, which is not the case.
Gartner says that Kognitio pioneered data warehouse SaaS (Software as a Service), which actually has existed since the pre-relational 1970s.

Gartner is correct, however, to note that Kognitio doesn’t sell much stuff overall.

* non-existent

In the cases of HP Vertica, Infobright, ParAccel, and Actian/VectorWise, the 2012 Gartner Magic Quadrant for Data Warehouse Database Management’s facts are fairly accurate, but I dispute Gartner’s evaluation. When it comes to Vertica:

I think HP’s troubles are less relevant to HP Vertica than Gartner does.
In particular, Vertica’s lack of integration with Autonomy isn’t a big deal. Many relational DBMS vendors don’t even own a text search engine to not-integrate with, and the number of vendors with seriously effective analytic RDBMS/text search integration strategies is zero.
Gartner is correct to note that Vertica’s integration with the rest of HP, for example the hardware side, has been slow — but again, so what?
Gartner correctly praises Vertica’s analytic platform capabilities, but then seems to criticize Vertica’s capabilities in user-defined functions — notwithstanding that Vertica’s analytic platform capabilities are implemented via UDFs.
Gartner seems to criticize Vertica’s “volume credentials”, even though Vertica’s number of petabyte-scale analytic RDBMS customers may be second only to Teradata’s.

That said, I defer to Gartner’s opinion that HP Vertica’s sales momentum has disappointed, even if against higher expectations than one might have for vendors with 1/10 of Vertica’s installed base.

2 years ago, I simply said “What Gartner said in connection with Ingres is too inaccurate to deserve detailed attention.” This year’s Gartner Magic Quadrant for Data Warehouse Database Management isn’t that bad on the subject of Actian,* but it’s not great either. Writing mainly about Actian’s VectorWise, Gartner dings it for both features and bugginess, and correctly notes that VectorWise is only suitable for fairly small data warehouses. But Gartner gives VectorWise higher marks than Exasol even so. Gartner also writes that VectorWise has a “long tradition of having loyal supporters”, notwithstanding that VectorWise’s initial release was less than 3 years ago.

*Ingres’ new name, in honor of a 2011 pivot that seems to already have been deprecated

What the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management says about ParAccel isn’t too different on the facts from what I wrote in December, but Gartner is more enthused than I am. Basically:

ParAccel is playing catch-up in features and company stability, and Gartner somehow sees that as a strength.
Gartner dings ParAccel for a variety of product weaknesses.
Gartner is breathless about ParAccel being used on a MicroStrategy SaaS site. (Hint: Deals like that go to vendors willing to accept very low prices.)
Gartner is also impressed with ParAccel’s Amazon involvement. (Deals like that aren’t lucrative either, except insofar as Amazon bought some ParAccel stock.)

That Gartner ranks ParAccel ahead of HP Vertica baffles me. Perhaps Gartner views benchmarks as more significant than I do, or is otherwise judging ParAccel to have an important advantage in performance.

I also am in substantial agreement with the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management about Infobright facts, but this time I’m the one with the more favorable interpretation. I agree that Infobright is a bit limited in features, in the areas Gartner cites and in analytic platform capabilities as well. Even so, Infobright is far ahead of VectorWise in — also low-priced — sales, and ahead in product stability and features too. Yet Gartner gives Infobright vastly lower marks than Actian. I suspect that the essence of our disagreement is that Gartner sees Infobright’s focus on machine-generated data as something that “limits market expansion”, while I see machine-generated data as something that is by every measure* growing to be a majority of the whole.

*By raw volume that’s been true for a while. But if we adjust for value and so on, the crossover is arguably still a way off.

I’ll run through the other vendors cited in the 2012 Gartner Magic Quadrant for Data Warehouse Database Management more quickly, in approximate declining order of Gartner’s rankings.

Gartner loves Teradata, but has some concerns over TCO (Total Cost of Ownership). Makes sense to me.
Gartner is more impressed with Oracle’s technology than I am. I find it hard to take seriously a data warehouse RDBMS vendor that can’t deliver a true columnar storage option.
Gartner’s write-up of IBM gets lost in IBM’s vast sea of products. I sympathize.
Gartner’s discussion of SAP/Sybase IQ was overly brief. But given my own difficulties staying up to speed on my clients over there, I sympathize. Please stay tuned.
Gartner’s discussion of Microsoft gets overly caught up in “logical data warehouse” foofarah, but basically it makes sense. Pending some promised briefings, I’m more optimistic about Microsoft’s analytic DBMS offerings than I’ve been for a long time. Please stay tuned.
Gartner notes difficulties validating EMC Greenplum’s customer claims. I sympathize. Gartner also notes a bunch of product issues that make me wonder why EMC Greenplum’s overall rating isn’t even lower.
Gartner’s view of Exasol seems similar to mine.
I’d evaluate 1010data on the basis of its spreadsheet-like analytic tools, not its DBMS technology.
Gartner seems to have difficulty finding non-trivial “Strengths” for Calpont. I sympathize.
Gartner notes difficulties contacting SAND. I sympathize, since SAND’s senior management resigned en masse during the July, 2012 quarter. (Page 30 of that link.)

Notes on some basic database terminology

Curt Monash — Tue, 07 Aug 2012 10:25:42 +0000

In a call Monday with a prominent company, I was told:

Teradata, Netezza, Greenplum and Vertica aren’t relational.
Teradata, Netezza, Greenplum and Vertica are all data warehouse appliances.

That, to put it mildly, is not accurate. So I shall try, yet again, to set the record straight.

In an industry where people often call a DBMS just a “database” — so that a database is something that manages a database! — one may wonder why I bother. Anyhow …

1. The products commonly known as Oracle, Exadata, DB2, Sybase, SQL Server, Teradata, Sybase IQ, Netezza, Vertica, Greenplum, Aster, Infobright, SAND, ParAccel, Exasol, Kognitio et al. all either are or incorporate relational database management systems, aka RDBMS or relational DBMS.

2. In principle, there can be difficulties in judging whether or not a DBMS is “relational”. In practice, those difficulties don’t arise — yet. Every significant DBMS still falls into one of two categories:

Relational:
- Was designed to do relational stuff* from the get-go, even if it now does other things too.
- Supports a lot of SQL.
Non-relational:
- Was designed primarily to do non-relational things.*
- Doesn’t support all that much SQL.

*I expect the distinction to get more confusing soon, at which point I’ll adopt terms more precise than “relational things” and “relational stuff”.

3. There are two chief kinds of relational DBMS:

RDBMS that are designed for, among other things, online transaction processing (OLTP). Examples include Oracle, DB2, SQL Server, Sybase ASE, PostgreSQL, and MySQL. It is reasonable to refer to these as general-purpose or OLTP RDBMS.*
RDBMS that are designed strictly for analytic uses. Examples include Sybase IQ, Vertica, Greenplum, Aster, Infobright, SAND, ParAccel, Exasol, Kognitio and the DBMS software inside systems from Teradata and Netezza. It is most accurate to refer to these as analytic RDBMS or just analytic DBMS (sometimes abbreviated ADBMS).

* “General-purpose” is usually a better term than “OLTP”; most OLTP DBMS can handle at least basic reporting, and the leading ones go well beyond that.

4. Some analytic RDBMS were designed to be columnar. Some were designed to be row-based. Multiple systems from both groups now offer both column- and row-based storage options. But they’re all equally relational.

And once again, I remind you that columnar storage and columnar compression are not the same thing.
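As a rough illustration of point 4 and the reminder that follows it, this Python sketch takes a small hypothetical table, regroups it column-wise, and then applies run-length encoding to one column. The data and function names are made up for this example, and run-length encoding is just one simple kind of columnar compression. The point is only that the storage layout and the compression step are separate choices.

```python
from itertools import groupby

# Hypothetical table, invented for illustration: (id, state, status).
ROWS = [
    (1, "CA", "open"),
    (2, "CA", "open"),
    (3, "CA", "closed"),
    (4, "NY", "closed"),
]


def to_columns(rows):
    """Columnar storage: keep each column's values together."""
    return [list(col) for col in zip(*rows)]


def run_length_encode(values):
    """Columnar compression (one simple kind): collapse runs of equal values."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Invert run_length_encode."""
    return [value for value, count in pairs for _ in range(count)]


if __name__ == "__main__":
    ids, states, statuses = to_columns(ROWS)
    print(states)                     # stored by column, not compressed
    print(run_length_encode(states))  # stored by column and compressed
```

A system can store data by column without compressing it, as `to_columns` does here, and the table is no less relational either way.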

5. An appliance can include a DBMS, and indeed exist for no purpose other than to run a DBMS; but a DBMS is not an appliance. At a minimum, a data warehouse appliance is a computing system (hardware, storage, operating system, etc.) with an analytic RDBMS preinstalled.

Occasionally somebody suggests that a “virtual appliance” doesn’t have to have hardware included, but they usually draw little attention.

However, reasonable people can disagree about pickier questions, such as:

Does appliance hardware have to be in any way purpose-built? I lean to a No, but I prefer those “appliance” stories that include an actual hardware advantage.
Does appliance hardware have to have custom silicon, or at least FPGAs (Field-Programmable Gate Arrays)? My answer is an emphatic No.
Does an appliance have to be super-easy to install and administer? I lean to a No — but two of the top appliance benefits are ease of deployment and administration.

For example, I think:

All hardware systems Teradata makes are appliances, even the ones it thinks aren’t.
Similarly, Oracle Exadata systems are appliances.
IBM Netezza is the classic line of data warehouse appliances.
IBM’s “Smart Analytic Systems” can justifiably be called appliances if IBM wishes — but IBM would be wise to save that word for its Netezza line.

Again, reasonable people can disagree — just so long as they don’t slap the label “appliance” onto software-only analytic RDBMS.

Many kinds of memory-centric data management

Curt Monash — Sun, 08 Apr 2012 01:33:31 +0000

I’m frequently asked to generalize in some way about in-memory or memory-centric data management. I can start:

The desire for human real-time interactive response naturally leads to keeping data in RAM.
Many databases will be ever cheaper to put into RAM over time, thanks to Moore’s Law. (Most) traditional databases will eventually wind up in RAM.
However, there will be exceptions, mainly on the machine-generated side. Where data creation and RAM data storage are getting cheaper at similar rates … well, the overall cost of RAM storage may not significantly decline.

Getting more specific than that is hard, however, because:

The possibilities for in-memory data storage are as numerous and varied as those for disk.
The individual technologies and products for in-memory storage are much less mature than those for disk.
Solid-state options such as flash just confuse things further.

Consider, for example, some of the in-memory data management ideas kicking around.

In many cases there is essentially an in-memory DBMS, trying for as much ACIDity as RAM reasonably allows, then (usually) also copying data synchronously to persistent storage. These can have many different architectures. For example:
- SAP HANA is an in-memory columnar DBMS, with text indexing/inverted-list antecedents, except when it uses one of a couple of approaches to in-memory row-based data management.
- solidDB, now an IBM product, is an RDBMS that relies on Patricia tries. It is actually a hybrid memory/disk product, but optimized for in-memory operation.
- eXtremeDB is an OODBMS, but relies on B-trees.
- H-Store and its commercialization VoltDB are row-based RDBMS that make drastic assumptions about the nature of your workload, but in return get to drop much of the overhead other DBMS need.
- Oracle TimesTen is a row-based RDBMS, oriented to OLTP (OnLine Transaction Processing), which stores its data persistently via another RDBMS. (Oracle was the default choice even before Oracle bought the company.)
- Oracle’s answer to SAP HANA is to take TimesTen and do analytics on it, via the Exalytics appliance.
Some disk-based DBMS just happen to be architected in ways so that for good performance you’re going to want to keep all the data in RAM. Often, their in-memory architecture is a lot like their on-disk architecture, with memory mapping for I/O. This is done in very different kinds of DBMS.
- MongoDB is one visible example. In general, scale-out web databases (whether NoSQL or MySQL) often keep all their data in RAM, whether or not that plan is baked into the DBMS architecture.
- Various analytic DBMS vendors have at times been memory-oriented. At the moment, I think:
  - Exasol (columnar) isn’t quite as extreme about wanting to be in-memory as it used to be.
  - ParAccel (columnar) and its memory-mapped architecture can be happily used either in-memory or on disk.
  - Kognitio (row-based), which used to be portrayed as a disk-based system that’s smart about using RAM, is currently being marketed as an in-memory system.
My last technical briefing on Applix TM1 (now an IBM Cognos product) was in September, 2005. (The product itself dates back to 1984.) At the time TM1 had an interesting sparse MOLAP (Multi-Dimensional OnLine Analytic Processing) story, the point being that the system worked hard to isolate what was actually non-zero. Loading of raw data seemed to be batch, but you could update models with derived data, and there was a transaction log for confident persistence.
Alternatively, you can use a caching layer, typically on a separate set of servers from your DBMS, which has no responsibility for managing data persistence. For example:
- TimesTen and solidDB are used, respectively, as relational caches for Oracle and DB2.
- Peter Zencke told me years ago that SAP had a purpose-built caching layer that kept over 99% of requests from touching disk.
- The key-value store memcached is central to many of the world’s largest web sites, typically backed by a MySQL cluster.
- ScaleArc has a key-value cache that stores, rather than individual records, the entire TCP string sent by an RDBMS in response to a particular SQL query.
Some systems manage data in memory in one kind of structure, then ensure persistence via a very different structure on disk. Examples include:
- Workday’s architecture — object-oriented in RAM, MySQL (really key-value) on disk. Edit: Workday thinks “key-value” is a slightly misleading way to put it. Stay tuned for more.
- Oracle Coherence (formerly Tangosol) — object-oriented in RAM, Oracle on disk. Edit: Actually, Coherence isn’t really a write-through ORM (Object-Relational Mapper). It functions more like memcached, albeit with a very different data model.
- Couchbase — memcached (key-value) in-memory, evolving from SQLite to CouchDB on disk.
Similarly, business intelligence suites can manage data in-memory that comes from some other kind of data store (usually an RDBMS, sometimes Hadoop or whatever). I haven’t had a lot of luck in getting details, with one exception — QlikView, which uses a simple tabular data structure.
Stream processors — i.e. CEP engines — are a whole other sort of in-memory engine, doing something that’s a lot like data management.

And that, kiddies, is why I hesitate to generalize in too much detail about “in-memory database management.”

Despite its length, this is still a very partial list of memory-centric data management approaches. I encourage you to use the comments to add other examples that I might have left out.
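As one concrete illustration of the caching-layer approach listed above, here is a minimal Python sketch of a cache keyed by the SQL text, which stores the whole result of a query rather than individual records. It is not ScaleArc’s implementation, and it uses an in-process dict rather than memcached. The class name QueryResultCache, the users table, and the sample rows are assumptions made for the example, and sqlite3 stands in for the backing RDBMS.

```python
import sqlite3


class QueryResultCache:
    """Cache entire query results, keyed by the SQL text and its parameters."""

    def __init__(self, conn):
        self.conn = conn
        self.store = {}  # (sql, params) -> list of result rows
        self.hits = 0
        self.misses = 0

    def query(self, sql, params=()):
        key = (sql, tuple(params))
        if key in self.store:
            # Served from memory; the database is not touched.
            self.hits += 1
            return self.store[key]
        self.misses += 1
        rows = self.conn.execute(sql, params).fetchall()
        self.store[key] = rows
        return rows

    def invalidate(self):
        """Drop all cached results; persistence stays the database's job."""
        self.store.clear()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])
    cache = QueryResultCache(conn)
    first = cache.query("SELECT name FROM users WHERE id = ?", (1,))
    second = cache.query("SELECT name FROM users WHERE id = ?", (1,))
    return first, second, cache.hits, cache.misses


if __name__ == "__main__":
    print(demo())
```

A real deployment would also need an expiry or invalidation policy for when the underlying tables change. The `invalidate` method here is the crudest possible version of that.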

Related link

I did a simpler overview of memory-centric alternatives in 2005.

Comments on the analytic DBMS industry and Gartner’s Magic Quadrant for same

Curt Monash — Wed, 08 Feb 2012 17:17:32 +0000

This year’s Gartner Magic Quadrant for Data Warehouse Database Management Systems is out.* I shall now comment, just as I did on the 2010, 2009, 2008, 2007, and 2006 Gartner Data Warehouse Database Management System Magic Quadrants, to varying extents. To frame the discussion, let me start by saying:

In general, I regard Gartner Magic Quadrants as a bad use of good research.
Illustrating the uselessness of — or at least poor execution on — the overall quadrant metaphor, a large majority of the vendors covered are lined up near the line x = y, each outpacing the one below in both of the quadrant’s dimensions.
I find fewer specifics to disagree with in this Gartner Magic Quadrant than in previous years’ versions. Two factors jump to mind as possible reasons:
- This year’s Gartner Magic Quadrant for Data Warehouse Database Management Systems is somewhat less ambitious than others; while it gives as much company detail as its predecessors, it doesn’t add as much discussion of overall trends. So there’s less to (potentially) disagree with.
- Merv Adrian is now at Gartner.
Whatever the problems may be with Gartner’s approach, the whole thing comes out better than do Forrester’s failed imitations.

*As of February, 2012 — and surely for many months thereafter — Teradata is graciously paying for a link to the report.

Specific company comments, roughly in line with Gartner’s rough single-dimensional rank ordering, include:

The Gartner Magic Quadrant’s comments on Teradata seem pretty fair. I don’t think I’m much in disagreement when I say:
- Teradata has the richest, most mature analytic DBMS offering.
- Teradata has an outstanding track record both for managing large data volumes and for high-concurrency mixed workloads.
- Aster Data was a cool Teradata acquisition, even if Teradata/Aster synergies or integration have been nominal to date.
- Teradata still needs to get out of its own way in marketing, positioning, packaging, and/or defining its premium-priced system vs. its more moderately-priced alternatives. Indeed, as necessary as this approach may have been to fending off encroachments by Netezza and others, what Teradata really needs to do is evolve to a more pick-your-own-node-combination mix-match kind of offering.
Gartner has talked with a lot of Oracle Exadata users who say that the product works; Gartner also has stopped beating Oracle up for its previous policy of almost never doing onsite POCs (Proofs of Concept); both parts of that ring true with me. But Gartner also rightly dings Oracle for various issues in cost and cumbersomeness. Overall, while I agree there are organizations for which Oracle should indeed be a top-ranked choice, there are many others who shouldn’t put Oracle on their short list.
Third in the Gartner MQ rankings is IBM.
- Gartner gets so caught up in reciting the names of various IBM product offerings that it neglects to say much good about DB2 itself. (I tend to have a similar problem.)
- But Gartner does mention concurrency as a strength. I agree, especially if we presume that that was a reference to DB2 rather than Netezza.
- Gartner cites Netezza’s post-acquisition annual growth rate as 30%. Gartner seems to think this is a good number. I disagree, but in Netezza’s defense, it has had to endure IBM’s post-acquisition on-boarding process.
Arguably fourth in the Gartner Data Warehouse Magic Quadrant rankings is EMC/Greenplum.
- In general, Gartner likes the taste of Greenplum Kool-Aid.
- Gartner neglects to ding Greenplum for concurrency challenges, which I view as an oversight given Gartner’s general stress on that area.
- Gartner does ding Greenplum for support challenges.
- Gartner neglects to praise Greenplum for true hybrid row/columnar data management, a feature shared by Teradata and Vertica, among others, but not by Oracle, DB2, or Netezza.
- Gartner located a half-petabyte Greenplum database. This doesn’t surprise me, even though Greenplum has frequently made exaggerated claims about large-size database successes in the past.
- Gartner reports a >400 figure for Greenplum customers, which is plausible.
In its first deviation from strict one-dimensional rank ordering, the Gartner Magic Quadrant ranks Sybase ahead of Greenplum in completeness of vision but behind in “ability to execute”.
- If that were the other way around, it might make more sense. Greenplum promises anything and everything you might ever want for analytic data management or the associated analysis; but Sybase has vastly more analytic DBMS users than Greenplum does, running a variety of demanding workloads.
- Gartner appears to think that Sybase IQ requires less database administration than I think it does.
- Gartner seems concerned that SAP will position HANA and Sybase ASE as, between them, the only DBMS you’ll ever need, casting doubt on Sybase IQ’s future. I wouldn’t worry about that if you have a problem you want to solve today.
The Gartner Magic Quadrant for Data Warehouse Database Management Systems ranks Microsoft sixth overall, despite noting that there isn’t a single production reference for Microsoft’s Parallel Data Warehouse. In support of this ranking, it for example cites the compression feature, which distinguishes Microsoft SQL Server from no other product on the list except Kognitio. If you have such an undemanding data warehousing problem that many different analytic DBMS could meet your needs, there’s a good chance Microsoft SQL Server can also do the job; and if you’ve bought into the Microsoft technology stack, you might as well keep going down that path. Otherwise, I don’t know why somebody should adopt Microsoft’s offering at this time.
Seventh along the main diagonal path in the Gartner Magic Quadrant is HP Vertica. I’d rank Vertica higher than that, but in fairness I note two execution concerns. First, HP has a lousy track record, both in acquisitions and in data warehousing/analytics. Second, Vertica is bad about answering my email. Anyhow, Gartner doesn’t seem to have given Vertica credit either for its full customer count or for the multiple petabyte-scale databases Vertica runs.
1010data is an outlier, with Gartner noting that it only partly fits in with other “Data Warehousing Database Management” companies, and hence kind of confessing that 1010data’s specific location on the Magic Quadrant is somewhat arbitrary. Stuff like that is bound to happen, given the inherent difficulties of defining market categories. Anyhow, my thoughts on 1010data include:
- I’m nervous about the fact that 1010data doesn’t actually control its own DBMS technology, but rather relies on old code from the small private company KX Systems.
- There are three main reasons to consider 1010data:
  - You want to enter the data mart outsourcing business in a casual way, and you like its SaaS offering.
  - You want to engage in stakeholder-facing analytics in a casual way, and you like its SaaS offering.
  - You love 1010data’s particular set of interactive analytic features and performance.
Back to the main path winding along the Gartner Magic Quadrant main diagonal — next up is ParAccel. While I question some of the peripheral comments, I agree with Gartner’s core messages that:
- ParAccel, the product, is blazingly fast in certain use cases.
- ParAccel, the company, is dangerously small.
Eighth on the Gartner MQ’s main path is Kognitio. This is too high. Kognitio positions itself as offering in-memory DBMS, yet stubbornly refuses to do any kind of data compression. That’s an awful combination of choices. As for using Kognitio’s data warehousing SaaS offering — why would you do that, when more modern products are available on a SaaS/cloud basis as well?
Ninth in the Gartner Magic Quadrant main rankings is SAND.
- The SAND section is not a triumph of Gartner accuracy. For example:
  - Gartner completely missed the errors in SAND’s reported customer counts.
  - Gartner refers to SAND as being “in existence for approximately nine years”, which is too low by at least a factor of 2.
  - Gartner says “SAND is a privately held company”, even though Merv knows better than that.
- Otherwise, Gartner’s opinion on SAND seems to boil down to “Interesting technology and ideas, but dangerously small company.” I agree.
Tenth and too low in the Gartner MQ main rankings is Infobright.
- At least by some metrics (e.g. customer count), Infobright isn’t as dangerously small as ParAccel, SAND, Kognitio, et al.
- That said, Infobright is small and focused on machine-generated data. So I wouldn’t be confident in Infobright’s future technology path for human-generated data use cases.
- Infobright’s performance is uneven — blazing in cases where the Knowledge Grid helps, but not necessarily stellar by analytic DBMS standards when full table scans are called for.
- I agree with Gartner that the possibility of Oracle/MySQL future shenanigans is a concern. But while the energy behind MySQL forking efforts doesn’t seem too great right now, I’d expect them to revive and offer a successful escape path if it seemed Oracle was going to indeed play hardball.
- Also, given that it’s already an open source vendor, there are various kinds of assurances Infobright could give that would also help alleviate customer concerns.
Actian, formerly Ingres, took a big tumble in Gartner’s rankings versus last year, when I simply wrote “What Gartner said in connection with Ingres is too inaccurate to deserve detailed attention.” I’m even a little harsher about Ingres/Actian’s DBMS products and prospects than Gartner is, but at least now we’re in the same ballpark.
Along with Infobright, ParAccel, and SAND, Exasol appears to be another of the “good columnar technology/small company” crowd. As with other such products, one should be careful about fit-and-finish features that are missing today, as there is no assurance they’ll be added in a timely manner going forward.
illuminate Solutions, which was on last year’s Gartner list, now appears to be an ex-company.

Exasol update

Curt Monash — Sun, 13 Nov 2011 02:37:13 +0000

I last wrote about Exasol in 2008. After talking with the team Friday, I’m fixing that now. The general theme was as you’d expect: Since last we talked, Exasol has added some new management, put some effort into sales and marketing, got some customers, kept enhancing the product and so on.

Top-level points included:

Exasol’s technical philosophy is substantially the same as before, albeit not with as extreme a focus on fitting everything in RAM.
Exasol believes its flagship DBMS EXASolution has great performance on a load-and-go basis.
Exasol has 25 EXASolution customers, all in Germany.*
5 of those are “cloud” customers, at hosting providers engaged by Exasol.
EXASolution database sizes now range from the low 100s of gigabytes up to 30 terabytes.
Pretty much the whole company is in Nuremberg.

*That excludes some money from Hitachi. Exasol’s Hitachi partnership is still in limbo, an apparent casualty of the world economic crisis.

On the technical side:

As noted in my 2008 post, EXASolution is a columnar, no-head-node MPP (Massively Parallel Processing) DBMS.
The main way EXASolution compresses data is via dictionary/tokenization. 5:1 is a typical compression ratio before mirroring and so on, out of a 2-10:1 range.
EXASolution writes data to blocks in memory that are smaller than what is otherwise its preferred size (1/2 to 5 megabytes). These are sent to disk, where merge eventually happens. Exasol insists that write performance has always been fully satisfactory to customers to date.
EXASolution doesn’t have much in the way of performance tuning knobs. Exasol says they aren’t needed, and says that one really can start an EXASolution POC (Proof of Concept) in a day or so.
EXASolution doesn’t have much in the way of workload management capabilities, except what’s automagic (e.g., short query bias). However, it does collect statistics you can query via your favorite BI tool.
EXASolution doesn’t have much in the way of analytic platform capabilities, although there is some Lua-based scripting. However, there’s something NDA in the analytic platform area Coming Soon.*
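
To make the dictionary/tokenization idea concrete, here is a minimal Python sketch. It is an illustration only, not EXASolution's actual implementation; the function names, the sample column, and the one-byte token width are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each value with a small integer token that indexes a dictionary."""
    dictionary: List[str] = []      # distinct values, in order of first appearance
    token_of: Dict[str, int] = {}   # value -> token
    tokens: List[int] = []          # the compressed column
    for value in values:
        if value not in token_of:
            token_of[value] = len(dictionary)
            dictionary.append(value)
        tokens.append(token_of[value])
    return dictionary, tokens


def dictionary_decode(dictionary: List[str], tokens: List[int]) -> List[str]:
    """Rebuild the original column from the dictionary and the tokens."""
    return [dictionary[t] for t in tokens]


def estimated_ratio(values: List[str], dictionary: List[str],
                    tokens: List[int], token_bytes: int = 1) -> float:
    """Rough raw-size / compressed-size estimate (token_bytes is an assumption)."""
    raw = sum(len(v.encode("utf-8")) for v in values)
    packed = sum(len(v.encode("utf-8")) for v in dictionary) + token_bytes * len(tokens)
    return raw / packed
```

The ratio depends heavily on how often values repeat, which is consistent with a quoted range as wide as 2-10:1.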

In general, the whole thing sounds somewhat like ParAccel, at least at a high level.

*Exasol is not and never has been our client, but we can keep secrets for them even so.

Naturally, Exasol believes EXASolution has fine concurrency, with at least one customer routinely running 2000 concurrent users, 200 concurrent sessions (via connection pooling), and 5-10 concurrent queries. Another customer has 3500 Cognos users. 1-200 concurrent queries appears to be the record peak load. Anyhow, Exasol says that plans to offer real workload management could be accelerated if a need were discovered.

Exasol says it almost never loses POCs, but admits that it competes fairly rarely against Vertica and ParAccel, no doubt for reasons of geography. Exasol boasts one visible Sybase IQ replacement (Sony Music).

While Exasol’s sales to date have been in Germany, there are plans to change that soon. At least one sales cycle is well underway in Eastern Europe. Offices in other Germanic countries are planned. Existing customers are planning to deploy additional copies outside Germany. Discussions are underway regarding other geographies, e.g. English-speaking ones.

Draft slides on how to select an analytic DBMS

Curt Monash — Wed, 04 Feb 2009 22:44:12 +0000

I need to finalize an already-too-long slide deck on how to select an analytic DBMS by late Thursday night. Anybody see something I’m overlooking, or just plain got wrong?

Edit: The slides have now been finalized.

Dividing the data warehousing work among MPP nodes

Curt Monash — Fri, 05 Sep 2008 08:48:47 +0000

I talk with lots of vendors of MPP data warehouse DBMS. I’ve now heard enough different approaches to MPP architecture that I think it might be interesting to contrast some of the alternatives.

The base-case MPP DBMS architecture is one in which there are two kinds of nodes:

A boss node, whose jobs include:
- Receiving and parsing queries
- Optimizing queries, determining execution plans, and sending execution plans to the nodes
- Receiving result sets and sending them back to the querier
Worker nodes, which do their part of the query execution job and eventually ship data back to the boss (or “head”) node

In primitive forms of this architecture, there’s a “fat head” that does altogether too much aggregation and query resolution. In more mature versions, data is shipped intelligently from worker nodes to their peers, reducing or eliminating “fat head” bottlenecks.
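
The difference between a “fat head” and a more mature design can be sketched in a few lines of Python. This is a toy model under stated assumptions, not any vendor's code: the (region, amount) schema and the function names are hypothetical, and the workers run sequentially here where real nodes would run in parallel.

```python
from collections import Counter
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (region, sales_amount); a hypothetical schema


def worker_partial_sum(rows: List[Row]) -> Dict[str, int]:
    """Each worker aggregates only its own slice of the data."""
    partial: Counter = Counter()
    for region, amount in rows:
        partial[region] += amount
    return dict(partial)


def boss_run_query(worker_slices: List[List[Row]]) -> Dict[str, int]:
    """The boss hands the same plan to every worker, then merges small partial results."""
    # On a real cluster these calls would execute concurrently on separate nodes.
    partials = [worker_partial_sum(rows) for rows in worker_slices]
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts instead of overwriting
    return dict(total)
```

Here each worker ships at most one entry per region to the boss rather than every raw row. A fat head would instead receive all the rows and do the summing itself.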

Exceptions to the base case include Vertica and Exasol. In their systems, all nodes run identical software. At the other extreme, some vendors use dedicated nodes for particular purposes. For example, Aster Data famously has special nodes for bulk data loading and export. Greenplum has a logical split between nodes that execute queries and nodes that talk to storage, and is considering offering the option of physically separating them in a future release.

The basic tradeoffs between these schemes go something like this:

If there are more kinds of dedicated nodes, real-time load-balancing is harder; you’re more likely to have idle capacity.
If there are more kinds of dedicated nodes, you can optimize hardware better, by using different kinds of hardware for different kinds of nodes. Potentially, this is a bigger factor if some kinds of nodes have dedicated disks attached and some don’t.

Calpont, which hasn’t actually shipped a DBMS yet, has an interesting twist. They’re building a columnar DBMS in which the querying work is split between a kind of worker node, which does the query processing, and a storage node, which talks to disk. These nodes are not in any kind of one-to-one correspondence; any worker node can talk with any storage node. Calpont believes that in the future some of the storage node logic can migrate into storage systems themselves, in almost a Netezza-like strategy, but on more standard equipment.

The Calpont story may actually make more sense in a shared-disk storage-area-network implementation than for a fully shared-nothing MPP, but that’s a subject for a different post.

Exasol technical briefing

Curt Monash — Sun, 17 Aug 2008 00:14:56 +0000

It took 5 ½ months after my non-technical introduction, but I finally got a briefing from Exasol’s technical folks (specifically, the very helpful Mathias Golombek and Carsten Weidmann). Here are some highlights:

Like Vertica and ParAccel, Exasol is in the business of MPP shared-nothing software-only columnar data warehouse database management.
Exasol has no concept of a “head” or “master” node, with different software than the others. Instead, all nodes are peers. For example, any node’s IP address can be given to an application; that node will then parse the SQL and distribute it appropriately to the other nodes.
Exasol is ACID-compliant, swapping blocks to disk when there’s an update. And one certainly can query data that’s on disk …
… however, Exasol’s memory structures are totally optimized for in-memory operation. Exasol is perfectly happy to swap in different parts of the database on a scheduled basis every few hours, but sending queries straight to disk isn’t optimal. Exasol’s recommended hardware configurations always are designed so that most queries can be executed against data already in RAM. However, if for example only the last 30 days of data are in RAM and a few queries go against full-year data, that’s OK.
Exasol has a compression story typical for a columnar DBMS vendor – heavy use of dictionary/token compression, other unspecified compression algorithms as well, data kept compressed in RAM, etc.
Like most other MPP data warehousing vendors, Exasol partitions data among nodes via a hash key. This is the industry’s most common scheme, because it has the parallelization benefits of random/equal distribution of data, yet still lets you get a head start on some hairy hash joins for extra performance.
Like Vertica, Exasol replicates small tables (e.g., dimension tables) across each node.
Exasol’s optimizer creates and maintains join indexes automagically on the fly. Exasol disagreed when I said “Oh, like a materialized view?” But I suspect this is the kind of join index that Teradata says privately is a special case of materialized view, and says publicly is a lot like a materialized view.
Generally, Exasol describes its optimizer as being “very MPP-aware.”
Exasol mainly wrote its own code from scratch. Right now they seem to have a kind of distributed operating system called EXACluster running over Linux, but they seem to be replacing the Linux underpinnings with their own stuff. E.g., disk access is going into EXACluster.
EXACluster already handles high availability/failover between nodes.
Exasol replicates data among nodes to allow for failover. That sounds similar to Vertica’s approach. Also, if you add nodes and restart Exasol, the database will automagically be repartitioned.
The biggest deployed Exasol system mentioned has 3 terabytes of user data. It is running on 5 nodes w/ 32 GB of RAM each.
For any given amount of total RAM a user is willing to deploy, Exasol recommends more nodes with less RAM/node. I didn’t probe directly as to why.
Exasol doesn’t have stored procedures. They assert that stored procedures would be useful mainly for ELT/ETL, and that alternatives perform well enough.
Like many data warehouse specialists, Exasol recommends ELT (Extract/Load/Transform) over ETL (Extract/Transform/Load).
Exasol has user-defined functions (UDFs).
Exasol is working on BLOB support. Geospatial data is also on the radar (no pun intended), but it didn’t sound as if there was a currently active project.
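
The hash-key partitioning point above can be illustrated with a small Python sketch. It assumes a CRC32-based placement function and made-up orders/customers tables, all chosen for illustration; nothing here reflects Exasol's real hashing or join code.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, ...]


def node_for(key: str, node_count: int) -> int:
    """Deterministic hash, so equal keys always land on the same node."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def distribute(rows: List[Record], key_index: int, node_count: int) -> Dict[int, List[Record]]:
    """Assign every row to a node based on the hash of its distribution key."""
    placement: Dict[int, List[Record]] = defaultdict(list)
    for row in rows:
        placement[node_for(row[key_index], node_count)].append(row)
    return placement


def local_join(orders: List[Record], customers: List[Record]) -> List[Record]:
    """Join (order_id, customer_id) to (customer_id, name) using only the rows given."""
    by_id = {c[0]: c for c in customers}
    return [(o[0], o[1], by_id[o[1]][1]) for o in orders if o[1] in by_id]
```

If both tables are distributed on customer_id, each node can run local_join on its own rows, and the union of the per-node results equals the full join with no rows moved between nodes. That is the “head start” on joins. Replicating a small dimension table to every node gets the same effect without requiring a shared distribution key.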

We also talked about concurrency, which is always a confusing subject. Exasol said that to date there were no more than 50 concurrent “log-ins,” which they equate to there being 1000s of named users (because queries execute so quickly). They also say they’ve tested up to 400 concurrent queries internally. I didn’t probe about what they’d do to balance short-running and long-running queries, in part because Exasol gives the impression that on their systems, there is no such thing as a long-running query. But obviously this is all somewhat fuzzy.

In a related point, Exasol says that overall throughput is higher when there is at least a certain number of concurrent users. The supporting evidence offered was, of all things, TPC-H benchmarks. Apparently (I haven’t checked this myself), Exasol (and also ParAccel, which of course has a similar architecture) chose to run the benchmark with more than the minimum number of simultaneous users required. SMP systems, Exasol believes, don’t exhibit similar behavior.

Finally, a couple of less technical highlights:

Licensing is per-gigabyte of RAM. (This fits with the whole memory-centric orientation.) 100 gigabytes of RAM are 120,000 Euros list price. Price doesn’t scale linearly with amount of RAM.
The partner whose name was redacted in February is now officially disclosed. Exasol is partnering in Japan with the services side of Hitachi. Exasol says Hitachi has 15-20 people working on introducing Exasol to Japan. Target customers are not primarily Hitachi’s hardware installed base.

Compare/contrast of Vertica, ParAccel, and Exasol

Curt Monash — Tue, 12 Aug 2008 22:31:21 +0000

I talked with Exasol today (at 5:00 am!) and of course want to blog about it. For clarity, I’d like to start by comparing/contrasting the fundamental data structures at Vertica, ParAccel, and Exasol. And it feels like that should be a separate post. So here goes.

Exasol, Vertica, and ParAccel all store data in columnar formats.
Exasol, Vertica, and ParAccel all compress data heavily.
Exasol, Vertica, and ParAccel all operate on in-memory data in compressed formats, perhaps to varying extents.
ParAccel and Exasol write data to what amounts to the in-memory part of their basic data structures; the data then gets persisted to disk. Vertica, however, has a separate in-memory data structure to accept data and write it to disk.
Vertica is a disk-centric system that doesn’t rely on there being a lot of RAM.
ParAccel can be described that way too; however, in some cases (including on the TPC-H benchmarks), ParAccel recommends loading all your data into RAM for maximum performance.
Exasol is totally optimized for the assumption that queries will be run against data that had already been previously loaded into RAM.
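
What “operating on in-memory data in compressed formats” can mean is easiest to see with an equality filter. The sketch below assumes a dictionary-encoded column (a list of distinct values plus one integer token per row). The function name and the data layout are hypothetical, and none of the three vendors is claimed to work exactly this way.

```python
from typing import List


def count_matches_compressed(dictionary: List[str], tokens: List[int], wanted: str) -> int:
    """Count rows equal to `wanted` by comparing integer tokens, never rebuilding the strings."""
    try:
        wanted_token = dictionary.index(wanted)  # one dictionary lookup per query
    except ValueError:
        return 0  # value is not in the dictionary, so no row can match
    return sum(1 for t in tokens if t == wanted_token)
```

The predicate is translated into token space once, and the scan then touches only small integers. A system that decompresses first would have to materialize every string before comparing.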

Beyond the above, I plan to discuss in a separate post how Exasol does MPP shared-nothing software-only columnar data warehouse database management differently than Vertica and ParAccel do shared-nothing software-only columnar data warehouse database management.