SANs vs. DAS in MPP data warehousing
Generally speaking:
- SANs (Storage Area Networks) are pulling ahead of DAS (Direct Attached Storage).
- Much of the growth in storage is due to data warehousing.
- MPP (Massively Parallel Processing) is pulling ahead of SMP (Symmetric MultiProcessing) for high-end data warehousing.
- MPP architectures are commonly shared-nothing.
- Shared-nothing entails DAS.
But if you think about it, those facts don’t exactly add up. Read more
| Categories: Calpont, Parallelization, Storage, Vertica Systems | 24 Comments |
Dividing the data warehousing work among MPP nodes
I talk with lots of vendors of MPP data warehouse DBMS. I’ve now heard enough different approaches to MPP architecture that I think it might be interesting to contrast some of the alternatives.
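Before contrasting the alternatives, it may help to see the idea they all start from. Here is a minimal Python sketch (node count, table, and column names are all invented for illustration) of the hash-distribution scheme most shared-nothing MPP systems build on: each row is routed to a node by hashing its distribution key, so aggregations and joins on that key can run in parallel with no cross-node data movement.

```python
import hashlib

NUM_NODES = 4  # hypothetical cluster size

def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Route a row to a node by hashing its distribution key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Distribute some illustrative fact-table rows across the nodes.
rows = [("cust_17", 120.00), ("cust_42", 35.50), ("cust_17", 9.99)]
partitions = {n: [] for n in range(NUM_NODES)}
for customer_id, amount in rows:
    partitions[node_for(customer_id)].append((customer_id, amount))

# Every row for a given key lands on exactly one node, so a
# GROUP BY on customer_id needs no cross-node shuffle.
for node, local_rows in partitions.items():
    print(node, local_rows)
```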
| Categories: Aster Data, Calpont, Exasol, Greenplum, Parallelization, Theory and architecture, Vertica Systems | 22 Comments |
More on known MapReduce application areas
In surveying MapReduce applications to date, I said that they fell mainly into three overlapping categories:
- Text tokenization, indexing, and search
- Creation of other kinds of data structures (e.g., graphs)
- Data mining and machine learning
and really should have included a fourth:
- Data transformation
Nokia just released another MapReduce implementation, Disco, and its list of applications to date fits right into that template. The relevant quote is:
Thus far Disco has been successfully used, for instance, in parsing and reformatting data, data clustering, probabilistic modelling, data mining, full-text indexing, and log analysis with hundreds of gigabytes of real-world data.
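For readers new to the paradigm, here is a minimal plain-Python sketch of the canonical map/reduce word count — the shape of the "text tokenization and indexing" category above. This is an illustration of the programming model, not Disco's (or Hadoop's) actual API.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every token in the input."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(key, values):
    """Reduce: sum the counts for each word."""
    return (key, sum(values))

docs = ["the quick brown fox", "the lazy dog"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs))
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, ...}
```

In a real MapReduce system the map and reduce functions look just like these; the framework supplies the shuffle, the parallelism, and the fault tolerance.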
| Categories: MapReduce | Leave a Comment |
Three different implementations of MapReduce
So far as I can see, there are three implementations of MapReduce that matter for enterprise analytic use – Hadoop, Greenplum’s, and Aster Data’s.* Hadoop has of course been available for a while, and used for a number of different things, while Greenplum’s and Aster Data’s versions of MapReduce – both in late-stage beta – have far fewer users.
*Perhaps Nokia’s Disco or another implementation will at some point join the list.
Earlier this evening I posted some Mike Stonebraker criticisms of MapReduce. It turns out that they aren’t all accurate across all MapReduce implementations. So this seems like a good time for me to stop stalling and put up a few notes about specific features of different MapReduce implementations. Here goes. Read more
| Categories: Aster Data, Greenplum, MapReduce | 3 Comments |
Mike Stonebraker’s counterarguments to MapReduce’s popularity
In response to recent posts I’ve done about MapReduce, Mike Stonebraker just got on the phone to give me his views. His core claim, more or less, is that anything you can do in MapReduce you could already do in a parallel database that complies with SQL-92 and/or has PostgreSQL underpinnings. In particular, Mike says: Read more
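To make the equivalence argument concrete: the map/reduce word count sketched earlier collapses to a single GROUP BY in SQL. Here is a runnable illustration using Python’s bundled sqlite3 — single-node, of course; Stonebraker’s point is that a parallel SQL DBMS parallelizes the same declarative statement automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (word TEXT)")

# Tokenize the same illustrative documents as before.
docs = ["the quick brown fox", "the lazy dog"]
conn.executemany(
    "INSERT INTO tokens VALUES (?)",
    [(w,) for d in docs for w in d.lower().split()],
)

# The whole map/shuffle/reduce pipeline is one statement.
for word, n in conn.execute(
    "SELECT word, COUNT(*) FROM tokens GROUP BY word ORDER BY word"
):
    print(word, n)
```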
| Categories: Data warehousing, MapReduce, Michael Stonebraker, PostgreSQL | 5 Comments |
More data on data warehouse sizes and issues
I spoke today with Paul Barth and Randy Bean of consultancy NewVantage Partners. The core of NewVantage’s business seems to be helping large enterprises (especially financial services) with their data warehouse strategies. Takeaways — none of which should shock regular readers of DBMS2 — included:
- Administrative cost and difficulty are often the single biggest issue in selecting analytic DBMS products.
- Oracle hits a wall around 10 terabytes of user data. The one customer NewVantage can think of with an Oracle data warehouse over 10 terabytes is fleeing Oracle for Netezza.
- NewVantage says that very specialized data warehouses on Oracle could conceivably be larger than that.
- NewVantage does have a customer on DB2/UDB in the 30-40 terabyte range. That customer does a lot of careful tuning to make it work.
- About 15% of NewVantage’s customers use Netezza. Few if any use newer analytic DBMS (but I got the sense more will soon). The rest rely on “traditional” DBMS, a group that includes Teradata.
| Categories: Data warehousing, IBM and DB2, Netezza, Oracle | 1 Comment |
Head to head blog debate between EMC, NetApp, and HP
Chuck Hollis of EMC started a fierce debate with a blog post on how to measure effective storage capacity. Competitors from NetApp and HP responded in often sarcastic detail in the comment thread, Hollis shot back, and the volleying continued for quite a while.
I’m not a storage maven, and I don’t understand all the details of that stuff. If you’re like me in that regard, you may find the post worth skimming just to see what some of the choices, trade-offs, and complications are in designing and measuring storage systems. Stephen Foskett’s related post is also worth a look in that regard.
My recent foray into measuring disk storage pales by comparison.
| Categories: Storage, Theory and architecture | 3 Comments |
Introduction to Aster Data and nCluster
I’ve been writing a lot about Greenplum since a recent visit. But on the same trip I met with Aster Data, and have talked with them further since. Let me now redress the balance and outline some highlights of the Aster Data story.
| Categories: Analytic technologies, Aster Data, Data warehousing, Parallelization, Specific users | 4 Comments |
Estimating user data vs. spinning disk
There’s a lot of confusion about how to measure data warehouse database size. Major complicating factors include:
- Indexes and temporary working space. That’s what I emphasized a couple of years ago in my post about Expansion Ratios.
- Compression. I write about database compression a lot.
- Disk redundancy. I usually gloss over that one, but I’ll try to make amends in this post.
- Replication other than that which is primarily designed for redundancy. I usually gloss over that one too, and I think it’s safe to continue doing so. That’s because data warehouse replication – at least in most of the system architectures I know of – generally divides into three categories:
  - a lot like redundancy
  - a lot like an index
  - only a minor issue (e.g., when small fact tables are replicated across each node of an MPP cluster)
Greenplum’s CTO Luke Lonergan recently walked me through the general disk usage arithmetic for Greenplum’s most common configuration (Sun Thors*, configured as RAID 10). I found it pretty interesting, and a good guide to factors that also affect other systems from other vendors.
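As a back-of-envelope illustration of that arithmetic (the specific numbers below are assumptions of mine, not Greenplum’s): start from raw spinning disk, halve it for RAID 10 mirroring, reserve a fraction for indexes and temporary working space, then multiply by whatever compression buys you to get user-data capacity.

```python
def user_data_capacity_tb(
    raw_disk_tb: float,
    raid_factor: float = 0.5,        # RAID 10 mirroring halves usable space
    overhead_fraction: float = 0.3,  # indexes + temp space (assumed)
    compression_ratio: float = 3.0,  # assumed; varies widely by data
) -> float:
    """Rough user-data capacity implied by a given amount of raw disk."""
    usable = raw_disk_tb * raid_factor
    after_overhead = usable * (1 - overhead_fraction)
    return after_overhead * compression_ratio

# E.g., 48 TB of raw disk on a hypothetical node:
print(user_data_capacity_tb(48))  # ~50.4 TB of user data
```

Note how the factors pull in opposite directions: redundancy and working space shrink effective capacity, while compression expands it, which is exactly why "database size" claims are so hard to compare across vendors.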
Yes, but what are the Very Biggest benefits of MapReduce?
On behalf of On-Demand Enterprise, née Grid Today, Dennis Barker asked me to clarify the most important benefits, features, etc. of the Greenplum and Aster Data MapReduce announcements for various constituencies (business users, programmers, DBAs, etc.). Questions like that are hard to answer simply. Here’s why.
The core benefit of MapReduce is price/performance (because it allows the cost benefits of parallelization to be applied to analyses that are hard to parallelize otherwise). Large price/performance gains commonly mix together three kinds of benefits.
1. They let you do what you did before, for less money.
2. They let you do a better version of what you did before, for similar money.
3. They let you do new things that didn’t make economic sense before, but now do.
Read more
| Categories: Analytic technologies, Data warehousing, MapReduce | Leave a Comment |
