SAND Technology – DBMS 2 : DataBase Management System Services

Comments on Gartner’s 2012 Magic Quadrant for Data Warehouse Database Management Systems — evaluations

Curt Monash — Tue, 05 Feb 2013 13:25:15 +0000

To my taste, the most glaring mis-rankings in the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management are that it is too positive on Kognitio and too negative on Infobright. Secondarily, it is too negative on HP Vertica, and too positive on ParAccel and Actian/VectorWise. So let’s consider those vendors first.

Gartner seems confused about Kognitio’s products and history alike.

Gartner calls Kognitio an “in-memory” DBMS, which is not accurate.
Gartner doesn’t remark on Kognitio’s worst-in-class* compression.
Gartner gives Kognitio oddly high marks for a late, me-too Hadoop integration strategy.
Gartner writes as if Kognitio’s next attempt at the US market will be the first one, which is not the case.
Gartner says that Kognitio pioneered data warehouse SaaS (Software as a Service), which actually has existed since the pre-relational 1970s.

Gartner is correct, however, to note that Kognitio doesn’t sell much stuff overall.

* non-existent

In the cases of HP Vertica, Infobright, ParAccel, and Actian/VectorWise, the 2012 Gartner Magic Quadrant for Data Warehouse Database Management’s facts are fairly accurate, but I dispute Gartner’s evaluation. When it comes to Vertica:

I think HP’s troubles are less relevant to HP Vertica than Gartner does.
In particular, Vertica’s lack of integration with Autonomy isn’t a big deal. Many relational DBMS vendors don’t even own a text search engine to not-integrate with, and the number of vendors with seriously effective analytic RDBMS/text search integration strategies is zero.
Gartner is correct to note that Vertica’s integration with the rest of HP, for example the hardware side, has been slow — but again, so what?
Gartner correctly praises Vertica’s analytic platform capabilities, but then seems to criticize Vertica’s capabilities in user-defined functions — notwithstanding that Vertica’s analytic platform capabilities are implemented via UDFs.
Gartner seems to criticize Vertica’s “volume credentials”, even though Vertica’s number of petabyte-scale analytic RDBMS customers may be second only to Teradata’s.

That said, I defer to Gartner’s opinion that HP Vertica’s sales momentum has disappointed, even if against higher expectations than one might have for vendors with 1/10 of Vertica’s installed base.

Two years ago, I simply said “What Gartner said in connection with Ingres is too inaccurate to deserve detailed attention.” This year’s Gartner Magic Quadrant for Data Warehouse Database Management isn’t that bad on the subject of Actian,* but it’s not great either. Writing mainly about Actian’s VectorWise, Gartner dings it for both features and bugginess, and correctly notes that VectorWise is only suitable for fairly small data warehouses. But Gartner gives VectorWise higher marks than Exasol even so. Gartner also writes that VectorWise has a “long tradition of having loyal supporters”, notwithstanding that VectorWise’s initial release was less than 3 years ago.

*Ingres’ new name, in honor of a 2011 pivot that seems to already have been deprecated

What the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management says about ParAccel isn’t too different on the facts from what I wrote in December, but Gartner is more enthused than I am. Basically:

ParAccel is playing catch-up in features and company stability, and Gartner somehow sees that as a strength.
Gartner dings ParAccel for a variety of product weaknesses.
Gartner is breathless about ParAccel being used on a MicroStrategy SaaS site. (Hint: Deals like that go to vendors willing to accept very low prices.)
Gartner is also impressed with ParAccel’s Amazon involvement. (Deals like that aren’t lucrative either, except insofar as Amazon bought some ParAccel stock.)

That Gartner ranks ParAccel ahead of HP Vertica baffles me. Perhaps Gartner views benchmarks as more significant than I do, or is otherwise judging ParAccel to have an important advantage in performance.

I also am in substantial agreement with the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management about Infobright facts, but this time I’m the one with the more favorable interpretation. I agree that Infobright is a bit limited in features, in the areas Gartner cites and in analytic platform capabilities as well. Even so, Infobright is far ahead of VectorWise in — also low-priced — sales, and ahead in product stability and features too. Yet Gartner gives Infobright vastly lower marks than Actian. I suspect that the essence of our disagreement is that Gartner sees Infobright’s focus on machine-generated data as something that “limits market expansion”, while I see machine-generated data as something that is by every measure* growing to be a majority of the whole.

*By raw volume that’s been true for a while. But if we adjust for value and so on, the crossover is arguably still a way off.

I’ll run through the other vendors cited in the 2012 Gartner Magic Quadrant for Data Warehouse Database Management more quickly, in approximate declining order of Gartner’s rankings.

Gartner loves Teradata, but has some concerns over TCO (Total Cost of Ownership). Makes sense to me.
Gartner is more impressed with Oracle’s technology than I am. I find it hard to take seriously a data warehouse RDBMS vendor that can’t deliver a true columnar storage option.
Gartner’s write-up of IBM gets lost in IBM’s vast sea of products. I sympathize.
Gartner’s discussion of SAP/Sybase IQ was overly brief. But given my own difficulties staying up to speed on my clients over there, I sympathize. Please stay tuned.
Gartner’s discussion of Microsoft gets overly caught up in “logical data warehouse” foofaraw, but basically it makes sense. Pending some promised briefings, I’m more optimistic about Microsoft’s analytic DBMS offerings than I’ve been for a long time. Please stay tuned.
Gartner notes difficulties validating EMC Greenplum’s customer claims. I sympathize. Gartner also notes a bunch of product issues that make me wonder why EMC Greenplum’s overall rating isn’t even lower.
Gartner’s view of Exasol seems similar to mine.
I’d evaluate 1010data on the basis of its spreadsheet-like analytic tools, not its DBMS technology.
Gartner seems to have difficulty finding non-trivial “Strengths” for Calpont. I sympathize.
Gartner notes difficulties contacting SAND. I sympathize, since SAND’s senior management resigned en masse during the July, 2012 quarter. (Page 30 of that link.)

Notes on some basic database terminology

Curt Monash — Tue, 07 Aug 2012 10:25:42 +0000

In a call Monday with a prominent company, I was told:

Teradata, Netezza, Greenplum and Vertica aren’t relational.
Teradata, Netezza, Greenplum and Vertica are all data warehouse appliances.

That, to put it mildly, is not accurate. So I shall try, yet again, to set the record straight.

In an industry where people often call a DBMS just a “database” — so that a database is something that manages a database! — one may wonder why I bother. Anyhow …

1. The products commonly known as Oracle, Exadata, DB2, Sybase, SQL Server, Teradata, Sybase IQ, Netezza, Vertica, Greenplum, Aster, Infobright, SAND, ParAccel, Exasol, Kognitio et al. all either are or incorporate relational database management systems, aka RDBMS or relational DBMS.

2. In principle, there can be difficulties in judging whether or not a DBMS is “relational”. In practice, those difficulties don’t arise — yet. Every significant DBMS still falls into one of two categories:

Relational:
- Was designed to do relational stuff* from the get-go, even if it now does other things too.
- Supports a lot of SQL.
Non-relational:
- Was designed primarily to do non-relational things.*
- Doesn’t support all that much SQL.

*I expect the distinction to get more confusing soon, at which point I’ll adopt terms more precise than “relational things” and “relational stuff”.

3. There are two chief kinds of relational DBMS:

RDBMS that are designed for, among other things, online transaction processing (OLTP). Examples include Oracle, DB2, SQL Server, Sybase ASE, PostgreSQL, and MySQL. It is reasonable to refer to these as general-purpose or OLTP RDBMS.*
RDBMS that are designed strictly for analytic uses. Examples include Sybase IQ, Vertica, Greenplum, Aster, Infobright, SAND, ParAccel, Exasol, Kognitio and the DBMS software inside systems from Teradata and Netezza. It is most accurate to refer to these as analytic RDBMS or just analytic DBMS (sometimes abbreviated ADBMS).

* “General-purpose” is usually a better term than “OLTP”; most OLTP DBMS can handle at least basic reporting, and the leading ones go well beyond that.

4. Some analytic RDBMS were designed to be columnar. Some were designed to be row-based. Multiple systems from both groups now offer both column- and row-based storage options. But they’re all equally relational.

And once again, I remind you that columnar storage and columnar compression are not the same thing.
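To make point 4 concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption (the table, the column names, the helper functions); it is not any vendor's storage format. It shows the same relation held row-wise and column-wise, the same relational query answered from both, and compression applied as a separate step on top of one stored column.

```python
# Illustrative only: the table, column names and values are made up.
rows = [
    ("2011-06-01", "US", 120),
    ("2011-06-01", "US", 80),
    ("2011-06-02", "CA", 95),
    ("2011-06-02", "US", 60),
]
COLUMNS = ("day", "country", "amount")


def to_columnar(rows, columns):
    """Pivot a row-based layout into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def run_length_encode(values):
    """Compress consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


columnar = to_columnar(rows, COLUMNS)

# The same relational question, answered from either layout:
#   SELECT SUM(amount) WHERE country = 'US'
total_from_rows = sum(r[2] for r in rows if r[1] == "US")
total_from_columns = sum(
    amount
    for country, amount in zip(columnar["country"], columnar["amount"])
    if country == "US"
)

# Compression is a separate, optional step applied to one stored column.
compressed_day = run_length_encode(columnar["day"])
```

The layout changes which bytes sit next to each other, not what the table means, which is why both kinds of system are equally relational. The run-length step is optional and independent of the layout choice.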

5. An appliance can include a DBMS, and indeed exist for no purpose other than to run a DBMS; but a DBMS is not an appliance. At a minimum, a data warehouse appliance is a computing system (hardware, storage, operating system, etc.) with an analytic RDBMS preinstalled.

Occasionally somebody suggests that a “virtual appliance” doesn’t have to have hardware included, but they usually draw little attention.

However, reasonable people can disagree about pickier questions, such as:

Does appliance hardware have to be in any way purpose-built? I lean to a No, but I prefer those “appliance” stories that include an actual hardware advantage.
Does appliance hardware have to have custom silicon, or at least FPGAs (Field-Programmable Gate Arrays)? My answer is an emphatic No.
Does an appliance have to be super-easy to install and administer? I lean to a No — but two of the top appliance benefits are ease of deployment and administration.

For example, I think:

All hardware systems Teradata makes are appliances, even the ones it thinks aren’t.
Similarly, Oracle Exadata systems are appliances.
IBM Netezza is the classic line of data warehouse appliances.
IBM’s “Smart Analytic Systems” can justifiably be called appliances if IBM wishes — but IBM would be wise to save that word for its Netezza line.

Again, reasonable people can disagree — just so long as they don’t slap the label “appliance” onto software-only analytic RDBMS.

Our clients, and where they are located

Curt Monash — Sat, 31 Mar 2012 20:36:32 +0000

From time to time, I disclose our vendor client lists. Another iteration is below, the first since a little over a year ago. To be clear:

This is a list of Monash Advantage members.
All our vendor clients are Monash Advantage members, unless …
… we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
We do not usually disclose our user clients.
We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
Excluded from this round of disclosure is one vendor I have never written about.
Included in this round of disclosure is one client paying for services partly in stock. All our other clients are cash-only.

For reasons explained below, I’ll group the clients geographically. Obviously, companies often have multiple locations, but this is approximately how it works from the standpoint of their interactions with me.

City of San Francisco

KXEN
Metamarkets
PivotLink
salesforce.com
Splunk
WibiData

Other San Francisco area

10gen
ClearStory
Cloudera
Couchbase
DataStax
Hortonworks
MarketShare
MarkLogic
Schooner
Sybase, an SAP company
VMware
Yarcdata, a division of Cray

Boston and Cambridge

Akiban
Cloudant
Hadapt
Vertica, an HP company

Other Boston area

Netezza, an IBM company
StreamBase

Everywhere else

CodeFutures
Infobright
SAND
Syncsort
Tableau
Teradata

For most of the companies listed above, you can find coverage here, and specifically a blog category in the list on the right. The exceptions, for now, are:

Cloudant
MarketShare
Metamarkets
PivotLink
VMware
Yarcdata

The main reason I threw in the geographical notes is to support the idea that there’s a real suburb-to-urban shift in the startup tech industry. Mike Arrington made the point recently about the San Francisco area, primarily with respect to the mass/consumer tech areas he focuses on, and of course it’s echoed by the rise of the New York City tech sector. My point is to add that it’s also true for system and enterprise technology, at least in the areas I cover.

In particular, the re-urbanization of the Boston-area software industry is striking:

Akiban and Cloudant are in the same office complex in the city of Boston. I was surprised to find even one tech startup in the city of Boston proper, but it seems there is a two-building complex packed with them.
Hadapt moved from Connecticut to the Kendall Square area of Cambridge. Edit: Actually, see the comments below.
Vertica moved from Burlington to the Alewife area of Cambridge. That’s (evidently deliberately) at the boundary of what might be regarded as the “urban” and “suburban” parts of metro Boston.

The cluster in the city of San Francisco — which also half-includes Cloudera — is relatively new as well.

Clarifying SAND’s customer metrics, positioning and technical story

Curt Monash — Sun, 13 Nov 2011 02:45:36 +0000

Talking with my clients at SAND can be confusing. That said:

I need to revise my figures for SAND’s customer count way downward.
SAND finally has a reasonably clear positioning.
SAND’s product actually seems to have a lot of features.

A few months ago, I wrote:

SAND Technology reported >600 total customers, including >100 direct.

Upon talking with the company, I need to revise that figure downward, from > 600 to 15.

One embarrassing point: SAND is a client, and I view it as part of my job to save clients from that kind of inadvertent misstatement.

It turns out that SAND has a very impressive customer — Dunnhumby, a data mart outsourcer with 200 terabytes of data in SAND, 30 or so incoming data streams, 400 or so nodes … and 600 or so end customers, all of which SAND was counting as OEM end customers for its DBMS. But I, other industry observers, and other vendors generally don’t count that way.

Besides Dunnhumby, SAND has 14 other customers on maintenance, with < 1 terabyte of data each. Until recently, SAND had a couple dozen more customers than that, but it sold its SAP-oriented archiving/near-line storage product line to Informatica.

I still don’t know where the “> 100 direct” part came from.

After the sale of its other product line, SAND is squarely in the market for analytic DBMS. SAND’s sales efforts seem to be focused on investigative analytics, although some of its existing users seem to be more focused on operational analytics. Most specifically, SAND is trying to focus on “people data” (customer loyalty, health care, etc.) rather than purely machine-generated data, with the paradigmatic target application being personalized marketing.

SAND technical highlights include:

SAND sells a columnar analytic DBMS.
The SAND DBMS operates on bitmaps, with heavy use of run-length encoding on the bitmaps. Bitmaps are used for everything except BLOBs (Binary Large OBjects).
Actual data compression also comes into play, e.g. as result sets are being assembled. This is based on a true global dictionary — multiple columns are tokenized together.
Indeed, SAND can decompose columns and tokenize their parts (e.g. time stamps).
SAND’s workload management sees RAM and CPU, but not explicitly I/O.
SAND lets you pin certain tables or even table segments in RAM if you want to.
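For readers who haven't worked with bitmap-based engines, here is a rough Python sketch of the general idea behind "bitmaps plus run-length encoding". It is a textbook-style illustration under my own assumptions (made-up columns, one bitmap per distinct value, naive list-based bitmaps), not SAND's actual data structures.

```python
def build_bitmaps(column):
    """Build one bitmap (a list of 0/1) per distinct value in the column."""
    bitmaps = {}
    for position, value in enumerate(column):
        bitmap = bitmaps.setdefault(value, [0] * len(column))
        bitmap[position] = 1
    return bitmaps


def rle_encode(bits):
    """Run-length encode a bitmap as (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)
        else:
            runs.append((bit, 1))
    return runs


def rle_decode(runs):
    """Expand (bit, run_length) pairs back into a flat bitmap."""
    bits = []
    for bit, length in runs:
        bits.extend([bit] * length)
    return bits


def bitmap_and(a, b):
    """Row-wise AND of two equal-length bitmaps."""
    return [x & y for x, y in zip(a, b)]


# Hypothetical columns of a six-row table.
country = ["US", "US", "CA", "US", "CA", "CA"]
segment = ["gold", "basic", "gold", "gold", "basic", "gold"]

country_bitmaps = build_bitmaps(country)
segment_bitmaps = build_bitmaps(segment)

# Store the country bitmaps run-length encoded.
stored = {value: rle_encode(bits) for value, bits in country_bitmaps.items()}

# WHERE country = 'US' AND segment = 'gold'
matches = bitmap_and(rle_decode(stored["US"]), segment_bitmaps["gold"])
matching_rows = [i for i, bit in enumerate(matches) if bit]
```

A real engine would operate on the encoded runs directly rather than decoding first, and would pack bits into machine words. The sketch only shows why predicates turn into cheap bitwise operations and why long runs compress well.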

SAND’s update story is straightforward — when data comes in, all the columns and bitmaps are updated as needed. Still, since SAND is columnar, you wouldn’t expect true updates in place, and you’d be right. Rather, there’s a story with MVCC (MultiVersion Concurrency Control) and garbage collection, lock-free. The MVCC is also exploited for a kind of time travel, and further for some kind of virtual data mart capability.
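As a generic illustration of how MVCC can give you both no-update-in-place behavior and a kind of time travel, here is a small Python sketch. The class, its method names and the sample key are all hypothetical, and it is single-threaded, so it says nothing about how a lock-free implementation would work. It is not a description of SAND's internals.

```python
class VersionedStore:
    """Append-only key/value store with snapshot ("as of") reads."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_version, value), oldest first
        self._current = 0    # latest committed version number

    def write(self, key, value):
        """Append a new version instead of overwriting the old one."""
        self._current += 1
        self._versions.setdefault(key, []).append((self._current, value))
        return self._current

    def read(self, key, as_of=None):
        """Return the newest value visible at snapshot `as_of` (default: latest)."""
        snapshot = self._current if as_of is None else as_of
        visible = None
        for version, value in self._versions.get(key, []):
            if version > snapshot:
                break
            visible = value
        return visible

    def collect_garbage(self, oldest_needed):
        """Drop versions that no snapshot at or after `oldest_needed` can see."""
        for key, history in list(self._versions.items()):
            keep_from = 0
            for i, (version, _) in enumerate(history):
                if version <= oldest_needed:
                    keep_from = i
            self._versions[key] = history[keep_from:]


store = VersionedStore()
v1 = store.write("cust:42", "silver")
v2 = store.write("cust:42", "gold")

latest = store.read("cust:42")               # current view
time_travel = store.read("cust:42", as_of=v1)  # view as of the first commit

store.collect_garbage(oldest_needed=v2)
after_gc = store.read("cust:42", as_of=v1)   # old version has been reclaimed
```

Writers only append, readers pick a version number and ignore anything newer, and garbage collection reclaims versions once no reader can need them. A "virtual data mart" could plausibly be a named snapshot over such a store, though that is my guess rather than anything SAND told me.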

SAND’s parallelization story is a bit complicated.

SAND has, or at least has the potential for, node specialization, with database and storage nodes being different.
In principle, disks are specific to storage nodes, and it’s a configuration option as to whether a database node sees one, some, or all storage nodes.
In practice, only Dunnhumby among SAND’s customers operates on other than a shared-disk basis. Dunnhumby’s configuration is mixed/matched among various SAND sharing options.

SAND is proud of its PMML (Predictive Modeling Markup Language) scoring capabilities, but otherwise hasn’t shipped much in the way of analytic platform capabilities. That said, work is underway on a user-defined table function capability that can also query external tables, fire off MapReduce jobs, and so on, under the code name UQL.

Workload management and RAM

Curt Monash — Sun, 25 Sep 2011 05:04:35 +0000

Closing out my recent round of Teradata-related posts, here’s a little anomaly:

Teradata is proud that Teradata 14’s workload management now explicitly manages I/O, to go with Teradata’s long-standing management of CPU. Teradata’s WLM still does not explicitly manage RAM.
Aster is proud that Aster 5’s workload management now explicitly manages RAM, to go along with the WLM capabilities Aster has had for a while managing CPU and I/O. Aster’s Tasso Argyros believes this is an important capability, at least in some edge cases.
Mike Pilcher of SAND emailed me that SAND’s WLM capabilities to explicitly manage CPU, I/O, and RAM are very well-received by the marketplace.

One would think that Teradata’s workload management is more sophisticated and powerful than Aster Data’s.* So I asked Scott Gnau what gives (he was pretty much the ideal guy to comment, since he runs development for Teradata and oversees Teradata’s Aster acquisition as well).

*Except, of course, that Aster was a pioneer in having workload management cover all kinds of analytic processes, rather than just traditional database requests.

Scott’s main response was that Aster’s system was much more consumptive of RAM than Teradata’s; indeed, he reminded me that in the very old days, Teradata could make do with as little as 4 megabytes. Scott also did not argue when I suggested that Aster’s not-just-database analytic processes might require large amounts of RAM as well.

Eight kinds of analytic database (Part 2)

Curt Monash — Tue, 05 Jul 2011 08:18:18 +0000

In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I’ll cover four more kinds of analytic database — even newer, for the most part, with a use case/product short list match that is even less clear.

Bit bucket

Kinds of data likely to be included: Logs, other technical/external
Likely use styles: Staging/ETL, investigative
Canonical example: Log files in a Hadoop cluster
Stresses: TCO, scale-out, transform/big-query performance, ETL functionality

With the explosion of machine-generated data has come the need for a place to put it all, sometimes called the big bit bucket. This is like the investigative data mart for big databases, but more poly-structured. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.

The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.

Archival data store

Kinds of data likely to be included: Operational, CDR (call detail record), security log
Likely use styles: Archival, reporting (for compliance), possibly also investigative
Examples: Any long-term detailed historical store
Stresses: TCO, compression, scale-out, performance (if multi-use)

Analytic DBMS vendors have been insulting each other with the claim “that’s just an archival data store,” dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only Rainstor truly embraces the archival positioning, and I’ve become pretty dubious about their technical claims and their company alike.

Still, there’s a legitimate need for data stores, especially relational analytic DBMS, that:

Store data cheaply, with high rates of compression.
Have decent performance if you do want to query the data.
May have archiving/compliance-specific features as well.

Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to have an archive-oriented product version in their lineups.

Outsourced data mart

Kinds of data likely to be included: All
Likely use styles: Traditional BI, investigative analytics, staging/ETL
Examples: Advertising tracking, SaaS CRM
Stresses: Performance, TCO, reliability, concurrency

Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I’ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and others yet who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that’s an analytic data management challenge. The possibilities expand from there.

Data outsourcers are in the IT business, and so their IT development is — hopefully! — more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.* Multitenancy is commonly an issue, as is running in the cloud.

*Even so, there’s often That Guy who doesn’t want to migrate away from Oracle, no matter what.

Vertica gets the nod in a number of these cases; it’s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you’re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.

Operational analytic(s) server

Kinds of data likely to be included: Customer-centric, log, financial trade
Likely use styles: Advanced operational analytics
Examples:
- Lower latency: Web or call-center personalization, anti-fraud
- Higher latency: Customer profiling, Basel 3 risk analysis
Stresses: Performance, reliability, analytic functionality, perhaps concurrency

Even with eight different choices, I need a “catch-all” category; this is it.

Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in integrating short-request and analytic processing. There are multiple ways to tackle it, embodying different trade-offs in cost, convenience, or analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you’re set. Otherwise, you may want to pipe derived data into a more “industrial-strength” DBMS, ideally the one that runs your operational apps anyway.

Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you’re starting out with the data in a convenient bit bucket.

Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.

So did I get them all? Or are there yet other analytic data management use cases that don’t fit into my eight categories?

Eight kinds of analytic database (Part 1)

Curt Monash — Tue, 05 Jul 2011 08:17:44 +0000

Analytic data management technology has blossomed, leading to many questions along the lines of “So which products should I use for which category of problem?” The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for “big data” is little help.

Let’s try eight categories instead. While no categorization is ever perfect, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need — and in most cases you’ll need several — is a great early step in your analytic technology planning.

Enterprise data warehouse (Full or partial)

Kinds of data likely to be included: All, but especially operational
Likely use styles: All
Canonical example: Central EDW for a big enterprise
Stresses: Concurrency, reliability, workload management

The enterprise data warehouse (EDW) ideal says that you copy all your data into one place, and drive all decision-making from there. Full EDWs are pipedreams. Still, a partial EDW makes sense for most large enterprises, and many indeed already have one. The first product lines to consider for classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server, especially if you’re going to stress concurrency and/or operational use cases.

Traditional data mart

Kinds of data likely to be included: All
Likely use styles: Business intelligence, budgeting/consolidation, investigative
Examples: Reporting servers, planning/consolidation servers, anything MOLAP, etc.
Stresses: Performance, concurrency, TCO

Whether or not you have something like an enterprise data warehouse, it’s common to have lighter-weight data marts as well. A traditional data mart might drive reports and dashboards. Or it might be specialized for budgeting, planning, and/or consolidation. Some investigative analytics may be in the mix as well.

Any DBMS that can support an EDW can also support a data mart, but it may not be the most cost-effective way to do so. Columnar DBMS might have more attractive performance and TCO (Total Cost of Ownership); the same goes for Netezza. Some of them — e.g. Sybase IQ and Vertica — have excellent track records in concurrent usage as well. Ted Codd pushed what amounts to MOLAP (Multidimensional OnLine Analytic Processing) systems for these use cases. But relational DBMS commonly do a better job, which is one reason most major MOLAP products have wound up at RDBMS companies.

Investigative data mart — agile

Kinds of data likely to be included: All, especially customer-centric
Likely use styles: Investigative
Canonical example: A few analysts getting a few TB to examine
Stresses: Ease of setup/load, ease of admin, price/performance

Besides the traditional data mart, there are at least two other kinds. Both are focused on investigative analytics, but they’re differentiated by database size.

If you have just a few analysts,* looking at no more than a few terabytes of data (perhaps even just some gigabytes), and if that data is “single-subject” and fairly homogeneous, your watchwords should be “cheap”, “easy”, and “fast”. You don’t need to invest in much hardware, in expensive software, in much administrative effort (the analysts can be their own DBAs), nor should you endure much set-up time. Just grab a product, grab some data, and start running queries (or extracts into the statistical tool of your choice).

*If you have dozens or even hundreds of analysts hitting the same database, you’re probably back to the more concurrency-oriented scenarios outlined above.

Infobright is often cost-effective among columnar analytic DBMS. Other vendors might cut you a price break as well. If you have multiple terabytes of data, don’t rule out Netezza’s lowest-end products (even if they’d really rather sell you something bigger). Or, if you’re in the sub-terabyte range, maybe you can get by with an in-memory BI tool such as QlikView, and not do anything special on the DBMS side at all.

Investigative data mart — big

Kinds of data likely to be included: All, especially customer-centric, logs, financial trade, scientific
Likely use styles: Investigative
Canonical example: Single-subject 20 TB – 20 PB relational database
Stresses: Performance, scale-out, analytic functionality

But if you’re looking at tens of terabytes of relational data, or even more, you really do have a “big data” problem. Performance and scalability are major challenges, usually best addressed by MPP (Massively Parallel Processing) systems, such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum. Performance POCs (Proofs Of Concept) are a big part of the buying process. Vendor price negotiations are crucial too.

Actually, in the low tens of terabytes you might be able to get away with a shared-disk system that has excellent compression — e.g., columnar products like Sybase IQ, Infobright, or SAND, rather than just Vertica and ParAccel.

Assuming you have affordable, scalable query performance, the competitive differentiator can switch to additional analytic functionality. Aster, Netezza, ParAccel, Vertica, and Greenplum either offer full analytic platforms, or seem to be on the path to doing so. Teradata, which now owns Aster Data, offers substantial built-in analytic capability in its traditional products as well, and the same goes for Sybase IQ.

Continued in Part 2, where we cover some of the more difficult use cases.

Columnar DBMS vendor customer metrics

Curt Monash — Mon, 20 Jun 2011 05:41:54 +0000

Last April, I asked some columnar DBMS vendors to share customer metrics. They answered, but it took until now to iron out a couple of details. Overall, the answers are pretty impressive.

Sybase said that Sybase IQ had >2000 direct customers and >500 indirect customers (i.e., end customers of OEMs). That’s counting by customers; I know from prior discussions that Sybase IQ is running at close to two installations per customer. I also believe that Sybase counts different divisions of the same large enterprise as separate customers.

Vertica cited a figure of 500 customers as of April (end Q1?), which is close to 600 now, about 40% or a little more direct. The difference between this and a 2010 year-end figure of 328 is not only new sales, but also slow reporting by OEMs. One cool figure — a single OEM reported 82 end sales in a single (quarterly?) report. And a number of those direct customers are substantial; Vertica’s customer logo page features lots of telcos, lots of internet companies, and the national operation of Blue Cross/Blue Shield.

Pay no attention to small inconsistencies in the number of Vertica direct customers (250 at year-end, no more than that now); Colin Mahony just estimates these numbers for me from memory, and minor inaccuracies are quite excusable.

Even cooler — Vertica reports 7 customers with a petabyte or more of user data each. About 5 of the 7 are obvious-suspect big-name firms; but unsurprisingly, those big names are NDA. I did secure permission to say that there are 2 telecom companies, a mobile gaming vendor, another internet company, and 3 financial services outfits of various kinds.

SAND Technology reported >600 total customers, including >100 direct. Since SAND has been around since the 1990s, those aren’t great average annual figures, but they’re probably more than many people (including me) thought.

Infobright reported around 200 total paying customers, 130 direct. There are surely a lot more users of open source Infobright, but precise numbers are of course hard to come by.

If I asked ParAccel in the April go-round, I’ve misplaced their answer, but back in October the figure was >30 customers, 2 of them over 100 terabytes. I’ve seen published figures of 40+ for ParAccel since.

Updating our vendor client disclosures

Curt Monash — Mon, 28 Feb 2011 08:03:39 +0000

Edit: This disclosure has been superseded by a March, 2012 version.

From time to time, I disclose our vendor client lists. Another iteration is below. To be clear:

This is a list of Monash Advantage members.
All our vendor clients are Monash Advantage members, unless …
… we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
We do not usually disclose our user clients.
We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
Included in the list below are two expired Monash Advantage members who haven’t said they will renew, as mentioned in my recent post on analyst bias. (You can probably imagine a couple of reasons for that obfuscation.)

With that said, our vendor client disclosures at this time are:

Aster Data
Cloudera
CodeFutures/dbShards
Couchbase
EMC/Greenplum
Endeca
IBM/Netezza
Infobright
Intel
MarkLogic
ParAccel
QlikTech
salesforce.com/database.com
SAND Technology
SAP/Sybase
Schooner Information Technology
Skytide
Splunk
Teradata
Vertica

That list includes the two I’m obfuscating, plus one more who just emailed to say a signed renewal contract is arriving this week. It does not include others who, less concretely, have said they will sign up soon.

Also, I guess there’s a bit of a gray area for Tableau. As far as I’m concerned, I’m doing an upcoming co-sponsored webinar just for Monash Advantage member Aster Data. Indeed, I declined to contract with or bill Tableau directly for its share, because I had no good way to do that paperwork. But even so, Tableau is a co-sponsor, was involved in the planning discussions and, behind the scenes, is surely footing part of the bill.

Comments on the Gartner 2010/2011 Data Warehouse Database Management Systems Magic Quadrant

Curt Monash — Sat, 05 Feb 2011 15:49:39 +0000

Edit: Comments on the February, 2012 Gartner Magic Quadrant for Data Warehouse Database Management Systems — and on the companies reviewed in it — are now up.

The Gartner 2010 Data Warehouse Database Management Systems Magic Quadrant is out. I shall now comment, just as I did to varying degrees on the 2009, 2008, 2007, and 2006 Gartner Data Warehouse Database Management System Magic Quadrants.

Note: Links to Gartner Magic Quadrants tend to be unstable. Please alert me if any problems arise; I’ll edit accordingly.

In my comments on the 2008 Gartner Data Warehouse Database Management Systems Magic Quadrant, I observed that Gartner’s “completeness of vision” scores were generally pretty reasonable, but their “ability to execute” rankings were somewhat bizarre; the same remains true this year. For example, Gartner ranks Ingres higher by that metric than Vertica, Aster Data, ParAccel, or Infobright. Yet each of those companies is growing nicely and delivering products that meet serious cutting-edge analytic DBMS needs, neither of which has been true of Ingres since about 1987.

The general list of “market forces, end-user expectations and vendors’ resulting solution approaches” at the top of the 2010 Gartner Data Warehouse Database Management System Magic Quadrant article is a mixed bag. Following Gartner’s order, I’ll address those first, and particular companies cited afterwards. Specific items and comments include:

“Increased demand for optimization techniques and performance enhancement.” Gartner seems to be saying that data warehouse DBMS buyers want lists of specific, esoteric performance features. Well, buyers always want their DBMS to run fast, and they’d like the products to be mature enough to have been through a few rounds of Bottleneck Whack-A-Mole, but otherwise I’m not sure I’d put that at the top of my list.
“The argument made by purchasing departments that buying power increases when dealing with a single, incumbent vendor.” I agree that vendor consolidation and account control are a huge part of the Oracle, Microsoft, IBM and even Teradata stories. (Vertica can prove it’s 10X more price-performant than Oracle and still not get the business.) But it’s not just about price negotiations; once annual maintenance is included, one has to squint pretty hard to see Oracle as a low-cost alternative. Also important is reducing the number of total product-specific skill-sets needed on the IT staff.
“Prepackaged, prebalanced warehouse environments delivered using data warehouse appliances.” Yep. To varying extents, Oracle, Microsoft, Teradata, and IBM are all committed to designed-hardware strategies.
“Expectations for the delivery of on-site POCs.” Honestly, not as many buyers insist on on-site Proofs of Concept as should. Still, Oracle is shameful in its reluctance to do them. (Teradata tries to avoid them too, for obvious reasons of expense, but is much more gracious about capitulating when the buyer insists.)
“Cost controls and data warehouse performance management.” See next comment.
“Demands for delivering a fully mixed workload.” I’d have phrased the workload management and administrative tools points rather differently than this, but so be it.
“Demands for departmental analytics delivered quickly via data marts.” Agreed. Data-mart-only installations are a huge part of the analytic DBMS market. Data mart spin-out is also important.
“Wider indexing and fast performance within clusters of data, delivered via column-based solutions.” This bizarrely seems to conflate column stores and parallel processing (both of which are of course highly important).
“A wave of new data warehouse implementers seeking fast-track, low-risk delivery.” Well, yes. Netezza noticed that quite some years ago. And by now the long-gestation EDW (Enterprise Data Warehouse) is widely disliked.
“Global organizations seeking distributed solutions as potential architecture.” If this is the MPP point, it’s oddly phrased. If this is a suggestion that data warehouses should be partitioned across wide-area networks, it’s just plain odd. If it’s a reiteration that departments like to control their own data marts, I agree. And if it’s a comment on keep-data-in-the-country privacy laws, it could be the most prescient thing Donald Feinberg has said in many years.

Long though it is, that list of general items and issues for the 2010 Gartner Data Warehouse Database Management System Magic Quadrant has some gaps. Most glaringly, I don’t see any references to advanced analytics in general, or even to the specific case of integrated predictive analytics. There’s also nothing about solid-state memory or other storage-technology considerations, although in fairness it’s still early days for much of what vendors conceive of as competitive differentiation in those respects.

Here are some vendor-specific comments on the 2010 Gartner Data Warehouse Database Management System Magic Quadrant:

It’s pretty bizarre to compare 1010data to database.com or Microsoft Azure. Kognitio would be a better choice. So would cloud-hosted instances of Vertica, Aster Data nCluster, or others.
Gartner’s comments on Aster Data and nCluster are actually pretty reasonable.
Gartner’s comments on EMC/Greenplum are a bit Kool-Aid-drinky, and don’t account for the inevitable flailing that occurs right after an acquisition. But otherwise they’re pretty reasonable.
I don’t take IBM’s super-comprehensive-all-inclusive architectural stories as seriously as Gartner does.
I don’t take Netezza’s small stable of OEM partners as seriously as Gartner does. I also don’t share Gartner’s optimism for the continuation of Netezza’s NEC partnership in the face of IBM’s Netezza ownership.
I’m even more skeptical about illuminate than Gartner is.
I’m delighted that Gartner has adopted my phrase machine-generated data (Infobright is one of several firms pushing that one).
“Only open-source column-store DBMS” is a bit exaggerated, but Infobright is indeed the only one with serious traction, or offered by a serious analytic DBMS vendor.
What Gartner said in connection with Ingres is too inaccurate to deserve detailed attention.
While Gartner’s write-up of Kognitio is a bit confused, that’s excusable. Kognitio’s strategy changes often.
I’m not persuaded by the claim of low Microsoft TCO. The days when Microsoft’s tools were vastly better than the competition’s are long gone. And using an OLTP DBMS for data warehousing generally takes more people effort than using something more purpose-built.
Gartner is right to ding Oracle for high prices, high people costs, and unwillingness to do onsite POCs.
Gartner is right that Exadata is a huge improvement over non-Exadata Oracle data warehousing.
Gartner is right to suggest that Exadata can easily handle data warehouses over 20 terabytes in size, but wrong to suggest that software-only Oracle also can. Just because the pain is less than it was with earlier releases of Oracle doesn’t mean it isn’t still bad.
Gartner’s comments on ParAccel are pretty reasonable.
Gartner’s comments on compression in connection with SAND make no technical sense (tokenization is a key form of columnar compression, not an alternative to it). Also, SAP’s acquisition of Sybase is a business challenge for SAND, not a technical one.
Unless I’m forgetting something, Sybase IQ has no more in-database data mining than any other Fuzzy Logix partner does.
Gartner failed to note that, like other DBMS dating back to the 1990s and before, Sybase IQ is more complex to administer than some newer products are.
Gartner’s take on Teradata is pretty reasonable.
Gartner’s take on Vertica, while sloppy, is basically sensible. However, Gartner failed to note that Vertica is a laggard in non-query analytics. (I am sure those deficiencies are being addressed, but Vertica’s competitors are moving ahead as well.)
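
To make the SAND point above concrete: tokenization (also called dictionary encoding) is itself a way of compressing a column, which is why it makes no sense to contrast it with columnar compression. What follows is only a minimal sketch in Python of the general technique. The function names and sample data are made up for illustration, and nothing here reflects how SAND or any other vendor actually implements it.

```python
def tokenize_column(values):
    """Dictionary-encode one column.

    Returns (dictionary, tokens): `dictionary` lists each distinct value
    once, in first-seen order; `tokens` replaces every value with the
    integer position of that value in `dictionary`.
    """
    dictionary = []   # token id -> distinct value
    token_of = {}     # distinct value -> token id
    tokens = []
    for value in values:
        if value not in token_of:
            token_of[value] = len(dictionary)
            dictionary.append(value)
        tokens.append(token_of[value])
    return dictionary, tokens


def detokenize_column(dictionary, tokens):
    """Rebuild the original column from its dictionary and token list."""
    return [dictionary[token] for token in tokens]


# Hypothetical low-cardinality column, e.g. a US state code per row.
column = ["CA", "NY", "CA", "CA", "TX", "NY"]
dictionary, tokens = tokenize_column(column)
restored = detokenize_column(dictionary, tokens)
```

Each distinct value is stored once, and the column itself becomes a list of small integers. A real column store would typically go on to pack those integers into as few bits as the dictionary size allows, and might layer other encodings on top.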