Analytic technologies

Discussion of technologies related to information query and analysis. Related subjects include:

June 22, 2009

The TPC-H benchmark is a blight upon the industry

ParAccel has released a 30,000-gigabtye TPC-H benchmark, and no less a sage than Merv Adrian paid attention. Now, the TPCs may have had some use in the 1990s. Indeed, Merv was my analyst relations contact for a visit to my clients at Sybase around the time — 1996 or so — I was advising Sybase on how to market against its poor benchmark results. But TPCs are worthless today.

It’s not just that TPCs are highly tuned (ParAccel’s claim of “load-and-go” is laughable Edit: Looking at Appendix A of the full disclosure report, maybe it’s more justified than I thought.). It’s also not just that different analytic database management products perform very differently on different workloads, making the TPC-H not much of an indicator of anything real-life.  The biggest problem is: Most TPC benchmarks are run on absurdly unrealistic hardware configurations.

For example, if you look at some details, the ParAccel 30-terabyte benchmark ran on 43 nodes, each with 64 gigabytes of RAM and 24 terabytes of disk. That’s 961,124.9 gigabytes of disk, officially, for a 32:1 disk/data ratio. By way of contrast, real-life analytic DBMS with good compression often have disk/data ratios of well under 1:1.

Meanwhile, the RAM:data ratio is around 1:11  It’s clear that ParAccel’s early TPC-H benchmarks ran entirely in RAM; indeed, ParAccel even admits that.  And so I conjecture that ParAccel’s latest TPC-H benchmark ran (almost) entirely in RAM as well. Once again, this would illustrate that the TPC-H is irrelevant to judging an analytic DBMS’ real world performance.

More generally — I would not advise anybody to consider ParAccel’s product, for any use, except after a proof-of-concept in which ParAccel was not given the time and opportunity to perform extensive off-site tuning. I tend to feel that way about all analytic DBMS, but it’s a particular concern in the case of ParAccel.

June 16, 2009

Aster Data on parallelism

Aster Data’s core claim boils down to “We do parallelism better.” Aster has shied away from saying that for marketing purposes, for fear of the response “Yeah, right, everybody says that.” But when I talked with Mayank Bawa, Steve Wooledge, et al. yesterday, I focused discussions on just that point. Based on that chat and others before, here are some highlights (as I understand them) of what Aster claims, believes, or believes to be differentiated about its nCluster technology: Read more

June 15, 2009

An example of what’s wrong with big vendors’ approaches to BI (SAP in this case)

I just found Chris Kanaracus’ article about SAP’s rollout last month of its “clear enterprises” strategy. The money quote comes from Sara Lee, the user SAP seems to have trotted out:

But Sara Lee has not yet decided to purchase the software, and there are substantial underlying tasks to perform as well, he added.

“This is giving us the horsepower [to analyze data] but we need to have harmonized and structured data underneath it.”

This is from the leading test user of the product?

Business intelligence and the associated data management processes need to be reimagined, and I’m increasingly coming to suspect that the big BI conglomerates aren’t up to the task.

June 15, 2009

Google Fusion Tables

Google has announced an experimental cloud-based data management system called Fusion Tables. A press article and Slashdot thread ensued, based on some bizarre-sounding analyst quotes that I will not attempt to parse.

What Fusion Tables really seems to be is a spreadsheet without the formulae. That is, it’s a place to dump data in a grid of cells, comment on it, version it, and do elementary data manipulation.  This could, I guess, be useful as an alternative to traditional RDBMS — assuming, of course, that you want to have a row-by-row debate about 100 megs of data.

Seriously, while Google Fusion Tables bears some vague resemblance to what I’m thinking about for the future of both business intelligence and data marts, it sounds as if it has a long way to go before it’s something most enterprises should spend time looking at.

June 10, 2009

Dataupia’s troubles are now confirmed

Todd Fin pointed me yesterday to an article by Wade Roush that confirmed in detail layoffs and other troubles at Dataupia.  The article quotes Dataupia marketing VP Samantha Stone as saying Dataupia is down to 23 employees, and that some of the layoffs were in engineering.  This is consistent with what I’d been hearing for a while, namely that other analytic DBMS vendors were seeing a flood of Dataupia resumes, especially technical ones.

The article goes on to discuss difficulties Dataupia has had in raising another round of financing.  During Dataupia’s very long CEO search — which I kept hearing about from people who’d been approached for the job — it was obvious money wouldn’t come in until a CEO was found. But it seems that even with a new CEO, existing investors are reluctant to re-up without a new investor as well, and that new investment is slow in happening.

On the plus side, the article quotes Samantha as saying founder Foster Hinshaw is recovering well from his heart surgery.

June 10, 2009

Netezza Q1 earning call transcript

I finally read the Netezza Q1 earnings call transcript, put out by Seeking Alpha.  Highlights included:

One tip for the Netezza folks, by the way, from this former stock analyst — you should never use the word “certainly” about a deal you haven’t closed yet. “Almost surely” could be OK, but “certainly” — well, it certainly was not the thing to say.

June 9, 2009

Aster Data sticks by its SQL/MapReduce guns

Aster Data continues to think that MapReduce, integrated with SQL, is an important technology. For example:

I was a big fan of SQL/MapReduce when it was first announced last August. Notwithstanding persuasive examples favoring pure DBMS or pure MapReduce over DBMS/MapReduce integration, I continue to think the SQL/MapReduce idea has great potential.  But I do wish more successful production examples would become visible …

June 8, 2009

Per-terabyte pricing

Software-only DBMS vendors sometimes price per terabyte of user data.  Vertica’s list price is $100K/TB. Greenplum’s list price is $70K/TB. In practice, both offer substantial discounts, especially at higher volumes.  In both cases, this means raw data, uncompressed, without counting indexes or temp space.

Client experience teaches me that this definition is easy to forget, so let me reemphasize the key point:

Per-terabyte pricing is based on a calculated figure.  Per-terabyte pricing is not based on the current disk space used by your database when managed by the DBMS you are replacing.

There’s at least one important difference in how Vertica and Greenplum calculate database size.  No matter how many times you copy the data, Vertica only charges you for it once.* But if you spin out data marts and recopy data into it — as Greenplum rightly encourages you to do — Greenplum wants to be paid for each copy.  Similarly, Vertica charges only for deployment, and not for test or development; I didn’t remember to ask what Greenplum’s policies are in those regards. (Edit: Greenplum says in a comment below that it doesn’t charge for test or development data either.)

*That policy is a great fit with Vertica’s performance recommendation that you should store columns in different sort orders, perhaps an average of two copies per column.

June 8, 2009

Greenplum blogs about some customers

I’ve written some about Greenplum’s customers at eBay and Fox Interactive Media.  But as I recently grumped, I’m not in the mood right now to write much about other Greenplum customers.  Fortunately, Greenplum has filled the gap itself.  Marketing chief Paul Salazar just blogged about a number of other big Greenplum customers. And last month Paul blogged in considerable detail about what he characterizes as an enterprise data warehouse (EDW) conversion — Oracle replacement — at a large pharmaceutical company.

June 8, 2009

The future of data marts

Greenplum is announcing today a long-term vision, under the name Enterprise Data Cloud (EDC). Key observations around the concept — mixing mine and Greenplum’s together — include:

In essence, Greenplum is pitching the story:

When put that starkly, it’s overstated, not least because

Specialized Analytic DBMS != Data Warehouse Appliance

But basically it makes sense, for two main reasons:

Read more

← Previous PageNext Page →

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.