SANs vs. DAS in MPP data warehousing
Generally speaking:
- SANs (Storage Area Networks) are pulling ahead of DAS (Direct Attached Storage).
- Much of the growth in storage is due to data warehousing.
- MPP (Massively Parallel Processing) is pulling ahead of SMP (Symmetric MultiProcessing) for high-end data warehousing.
- MPP architectures are commonly shared-nothing.
- Shared-nothing entails DAS.
But if you think about it, those facts don’t exactly add up. Read more
| Categories: Calpont, Parallelization, Storage, Vertica Systems | 24 Comments |
Dividing the data warehousing work among MPP nodes
I talk with lots of vendors of MPP data warehouse DBMS. I’ve now heard enough different approaches to MPP architecture that I think it might be interesting to contrast some of the alternatives.
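Before contrasting the alternatives, it may help to see the idea they all start from. Here is a minimal Python sketch (node count, table, and column names are all invented for illustration) of the hash-distribution scheme most shared-nothing MPP systems build on: each row is routed to a node by hashing its distribution key, so aggregations and joins on that key can run in parallel with no cross-node data movement.

```python
import hashlib

NUM_NODES = 4  # hypothetical cluster size

def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Route a row to a node by hashing its distribution key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Distribute some illustrative fact-table rows across the nodes.
rows = [("cust_17", 120.00), ("cust_42", 35.50), ("cust_17", 9.99)]
partitions = {n: [] for n in range(NUM_NODES)}
for customer_id, amount in rows:
    partitions[node_for(customer_id)].append((customer_id, amount))

# Every row for a given key lands on exactly one node, so a
# GROUP BY on customer_id needs no cross-node shuffle.
for node, local_rows in partitions.items():
    print(node, local_rows)
```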
| Categories: Aster Data, Calpont, Exasol, Greenplum, Parallelization, Theory and architecture, Vertica Systems | 22 Comments |
More on known MapReduce application areas
In surveying MapReduce applications to date, I said that they fell mainly into three overlapping categories:
- Text tokenization, indexing, and search
- Creation of other kinds of data structures (e.g., graphs)
- Data mining and machine learning
and really should have included a fourth:
- Data transformation
Nokia just released another MapReduce implementation, Disco, and its list of applications to date fits right into that template. The relevant quote is:
Thus far Disco has been successfully used, for instance, in parsing and reformatting data, data clustering, probabilistic modelling, data mining, full-text indexing, and log analysis with hundreds of gigabytes of real-world data.
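For readers new to the paradigm, here is a minimal plain-Python sketch of the canonical map/reduce word count — the shape of the "text tokenization and indexing" category above. This is an illustration of the programming model, not Disco's (or Hadoop's) actual API.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every token in the input."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(key, values):
    """Reduce: sum the counts for each word."""
    return (key, sum(values))

docs = ["the quick brown fox", "the lazy dog"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs))
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, ...}
```

In a real MapReduce system the map and reduce functions look just like these; the framework supplies the shuffle, the parallelism, and the fault tolerance.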
| Categories: MapReduce | Leave a Comment |
Three different implementations of MapReduce
So far as I can see, there are three implementations of MapReduce that matter for enterprise analytic use – Hadoop, Greenplum’s, and Aster Data’s.* Hadoop has of course been available for a while, and used for a number of different things, while Greenplum’s and Aster Data’s versions of MapReduce – both in late-stage beta – have far fewer users.
*Perhaps Nokia’s Disco or another implementation will at some point join the list.
Earlier this evening I posted some Mike Stonebraker criticisms of MapReduce. It turns out that they aren’t all accurate across all MapReduce implementations. So this seems like a good time for me to stop stalling and put up a few notes about specific features of different MapReduce implementations. Here goes. Read more
| Categories: Aster Data, Greenplum, MapReduce | 3 Comments |
Mike Stonebraker’s counterarguments to MapReduce’s popularity
In response to recent posts I’ve done about MapReduce, Mike Stonebraker just got on the phone to give me his views. His core claim, more or less, is that anything you can do in MapReduce you could already do in a parallel database that complies with SQL-92 and/or has PostgreSQL underpinnings. In particular, Mike says: Read more
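To make the equivalence argument concrete: the map/reduce word count sketched earlier collapses to a single GROUP BY in SQL. Here is a runnable illustration using Python’s bundled sqlite3 — single-node, of course; Stonebraker’s point is that a parallel SQL DBMS parallelizes the same declarative statement automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (word TEXT)")

# Tokenize the same illustrative documents as before.
docs = ["the quick brown fox", "the lazy dog"]
conn.executemany(
    "INSERT INTO tokens VALUES (?)",
    [(w,) for d in docs for w in d.lower().split()],
)

# The whole map/shuffle/reduce pipeline is one statement.
for word, n in conn.execute(
    "SELECT word, COUNT(*) FROM tokens GROUP BY word ORDER BY word"
):
    print(word, n)
```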
| Categories: Data warehousing, MapReduce, Michael Stonebraker, PostgreSQL | 5 Comments |
More data on data warehouse sizes and issues
I spoke today with Paul Barth and Randy Bean of consultancy NewVantage Partners. The core of NewVantage’s business seems to be helping large enterprises (especially financial services) with their data warehouse strategies. Takeaways — none of which should shock regular readers of DBMS2 — included:
- Administrative cost and difficulty are often the single biggest issue in selecting analytic DBMS products.
- Oracle hits a wall around 10 terabytes of user data. The one customer NewVantage can think of with an Oracle data warehouse over 10 terabytes is fleeing Oracle for Netezza.
- NewVantage says that very specialized data warehouses on Oracle could conceivably be larger than that.
- NewVantage does have a customer on DB2/UDB in the 30-40 terabyte range. That customer does a lot of careful tuning to make it work.
- About 15% of NewVantage’s customers use Netezza. Few if any use newer analytic DBMS (but I got the sense more will soon). The rest rely on “traditional” DBMS, a group that includes Teradata.
| Categories: Data warehousing, IBM and DB2, Netezza, Oracle | 1 Comment |
Head to head blog debate between EMC, NetApp, and HP
Chuck Hollis of EMC started a fierce debate with a blog post on how to measure effective storage capacity. Competitors from NetApp and HP responded in often sarcastic detail in the comment thread, Hollis shot back, and the volleying continued for quite a while.
I’m not a storage maven, and I don’t understand all the details of that stuff. If you’re like me in that regard, you may find the post worth skimming just to see what some of the choices, trade-offs, and complications are in designing and measuring storage systems. Stephen Foskett’s related post is also worth a look in that regard.
My recent foray into measuring disk storage pales by comparison.
| Categories: Storage, Theory and architecture | 3 Comments |
Introduction to Aster Data and nCluster
I’ve been writing a lot about Greenplum since a recent visit. But on the same trip I met with Aster Data, and have talked with them further since. Let me now redress the balance and outline some highlights of the Aster Data story.
| Categories: Analytic technologies, Aster Data, Data warehousing, Parallelization, Specific users | 4 Comments |
Estimating user data vs. spinning disk
There’s a lot of confusion about how to measure data warehouse database size. Major complicating factors include:
- Indexes and temporary working space. That’s what I emphasized a couple of years ago in my post about Expansion Ratios.
- Compression. I write about database compression a lot.
- Disk redundancy. I usually gloss over that one, but I’ll try to make amends in this post.
- Replication other than that which is primarily designed for redundancy. I usually gloss over that one too, and I think it’s safe to continue doing so. That’s because data warehouse replication – at least in most of the system architectures I know of – generally divides into three categories:
  - a lot like redundancy
  - a lot like an index
  - only a minor issue (e.g., when small fact tables are replicated across each node of an MPP cluster)
Greenplum’s CTO Luke Lonergan recently walked me through the general disk usage arithmetic for Greenplum’s most common configuration (Sun Thors*, configured as RAID 10). I found it pretty interesting, and a good guide to factors that also affect other systems from other vendors.
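As a back-of-envelope illustration of that arithmetic (the specific numbers below are assumptions of mine, not Greenplum’s): start from raw spinning disk, halve it for RAID 10 mirroring, reserve a fraction for indexes and temporary working space, then multiply by whatever compression buys you to get user-data capacity.

```python
def user_data_capacity_tb(
    raw_disk_tb: float,
    raid_factor: float = 0.5,        # RAID 10 mirroring halves usable space
    overhead_fraction: float = 0.3,  # indexes + temp space (assumed)
    compression_ratio: float = 3.0,  # assumed; varies widely by data
) -> float:
    """Rough user-data capacity implied by a given amount of raw disk."""
    usable = raw_disk_tb * raid_factor
    after_overhead = usable * (1 - overhead_fraction)
    return after_overhead * compression_ratio

# E.g., 48 TB of raw disk on a hypothetical node:
print(user_data_capacity_tb(48))  # ~50.4 TB of user data
```

Note how the factors pull in opposite directions: redundancy and working space shrink effective capacity, while compression expands it, which is exactly why "database size" claims are so hard to compare across vendors.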
Yes, but what are the Very Biggest benefits of MapReduce?
On behalf of On-Demand Enterprise, née Grid Today, Dennis Barker asked me to clarify the most important benefits, features, etc. of the Greenplum and Aster Data MapReduce announcements for various constituencies (business users, programmers, DBAs, etc.). Questions like that are hard to answer simply. Here’s why.
The core benefit of MapReduce is price/performance (because it allows the cost benefits of parallelization to be applied to analyses that are hard to parallelize otherwise). Large price/performance gains commonly mix together three kinds of benefits.
1. They let you do what you did before, for less money.
2. They let you do a better version of what you did before, for similar money.
3. They let you do new things that didn’t make economic sense before, but now do.
Read more
| Categories: Analytic technologies, Data warehousing, MapReduce | Leave a Comment |
