Parallelization
Analysis of issues in parallel computing, especially parallelized database management. Related subjects include:
The Great MapReduce Debate
Google’s highly parallel file manipulator MapReduce has gotten great attention recently, after a research paper revealed:
- MapReduce is running the core Google search engine, plus much of Google Analytics and other applications.
- MapReduce is processing 400+ petabytes of data per month.
(Niall Kennedy popularized the paper and surveyed its results.)
David DeWitt and Mike Stonebraker then launched a blistering attack on MapReduce, accusing it of disregarding almost all the lessons of database management system theory and practice. A vigorous comment thread has ensued, pointing out that MapReduce is not a DBMS and asserting it therefore shouldn’t be judged as one.
While correct, that defense raises the question – what is MapReduce good for? Proponents of MapReduce highlight two advantages:
- MapReduce makes it very easy to program data transformations, including ones to which relational structures are of little relevance.
- MapReduce runs in massively parallel mode “for free,” without extra programming.
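To make those two claims concrete, here is a minimal sketch of the programming model in Python. It is not Google's implementation or API: `run_mapreduce`, `map_words`, and `reduce_counts` are hypothetical names, and a local thread pool stands in for a cluster. The point is only that the programmer writes a mapper and a reducer, while the framework owns the parallel fan-out and the grouping by key.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reducer: collapse all values emitted for one key into a single result."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer, workers=4):
    """Hypothetical toy framework: parallel map, group by key, then reduce."""
    # Map phase: the framework, not the caller, decides how work is spread out.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(mapper, records))

    # Shuffle phase: collect every value emitted for the same key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)

    # Reduce phase: one reducer call per distinct key.
    return dict(reducer(key, values) for key, values in groups.items())


lines = ["the quick brown fox", "the lazy dog", "The fox"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
```

Nothing in `map_words` or `reduce_counts` mentions threads, nodes, or tables, which is the sense in which the parallelism comes "for free" and relational structure is optional.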
Based on those advantages, MapReduce would indeed seem to have significant uses, including: Read more
Categories: Cloud computing, MapReduce, Michael Stonebraker | 10 Comments |
Why Oracle and Microsoft will lose in VLDB data warehousing
I haven’t been as clear as I could have been in explaining why I think MPP/shared-nothing beats SMP/shared-everything. The answer is in a short white paper, currently bottlenecked at the sponsor’s end of the process. Here’s an excerpt from the latest draft:
There are two ways to make more powerful computers:
1. Use more powerful parts – processors, disk drives, etc.
2. Just use more parts of the same power.
Of the two, the more-parts strategy is much more cost-effective. Smaller* parts are much more economical, since the bigger the part, the harder and more costly it is to avoid defects, in manufacturing and initial design alike. Consequently, all high-end computers rely on some kind of parallel processing.
*As measured in terms of capacity, transistor count, etc., not physical size. Read more
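As a rough illustration of the more-parts strategy applied to a database, the sketch below shows a shared-nothing aggregate in Python. It is an assumption-laden toy, not any vendor's design: `partition`, `local_totals`, and `total_by_key` are hypothetical names, and the "nodes" are just in-memory lists. Each row lives on exactly one node, each node totals only its own rows, and only the small partial results are merged.

```python
from collections import Counter


def partition(rows, node_count):
    """Assign each (key, amount) row to exactly one node by hashing its key."""
    nodes = [[] for _ in range(node_count)]
    for key, amount in rows:
        nodes[hash(key) % node_count].append((key, amount))
    return nodes


def local_totals(node_rows):
    """Work done independently on one node, touching only that node's rows."""
    totals = Counter()
    for key, amount in node_rows:
        totals[key] += amount
    return totals


def total_by_key(rows, node_count=4):
    """Merge the per-node partial totals into the final answer."""
    merged = Counter()
    for node_rows in partition(rows, node_count):
        merged.update(local_totals(node_rows))  # Counter.update adds counts
    return dict(merged)


sales = [("east", 10), ("west", 5), ("east", 7), ("north", 1)]
totals = total_by_key(sales, node_count=3)
```

In this sketch, adding capacity means raising `node_count` rather than buying a bigger single machine, since no node ever needs to see another node's rows.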
Categories: Data warehouse appliances, Data warehousing, DATAllegro, Microsoft and SQL*Server, Netezza, Oracle, Parallelization, Teradata, Theory and architecture, Vertica Systems | 7 Comments |
Vendor segmentation for data warehouse DBMS
February, 2011 edit: I’ve now commented on Gartner’s 2010 Data Warehouse Database Management System Magic Quadrant as well.
Several vendors are offering links to Gartner’s new Magic Quadrant report on data warehouse DBMS. (Edit: This is now a much better link to the 2006 MQ.) Somewhat atypically for Gartner, there’s a strict hierarchy among most of the vendors, with Teradata > IBM > Oracle > Microsoft > Sybase > Kognitio > MySQL > Sand, in each case on both axes of the matrix. The only two exceptions are Netezza and DATAllegro, which are depicted as outvisioning Microsoft somewhat even as they trail both Microsoft and Sybase in execution.
Gartner Magic Quadrants tend to annoy me, and I’m not going to critique the rankings in detail. But I do think this particular MQ is helpful in framing a vendor segmentation, namely:
- Big full-spectrum MPP/shared-nothing vendors: Teradata and IBM.
- MPP/shared-nothing appliance upstarts: Netezza and DATAllegro
- Big SMP/shared-everything vendors who also are apt to be your OLTP incumbent, and who want to integrate your software stack soup-to-nuts: Oracle and Microsoft
- Niche vendors: Pretty much everybody else
Categories: Data warehouse appliances, Data warehousing, DATAllegro, IBM and DB2, Microsoft and SQL*Server, Netezza, Oracle, Parallelization, Teradata | 6 Comments |
Oracle and Microsoft in data warehousing
Most of my recent data warehouse engine research has been with the specialists. But over the past couple of days I caught up with Oracle and Microsoft (IBM is scheduled for Friday). In at least three ways, it makes sense to lump those vendors together, and contrast them with the newer data warehouse appliance startups:
- Shared-everything architecture
- End-to-end solution story
- OLTP industrial-strengthness carried over to data warehousing
In other ways, of course, their positions are greatly different. Oracle may have a full order-of-magnitude lead on Microsoft in warehouse sizes, for example, and has a broad range of advanced features that Microsoft either hasn’t matched yet, or else just released in SQL Server 2005. Microsoft was earlier in pushing DBA ease as a major product design emphasis, although Oracle has played vigorous catch-up in Oracle10g.