Parallelization
Analysis of issues in parallel computing, especially parallelized database management. Related subjects include:
High-performance analytics
For the past few months, I’ve collected a lot of data points to the effect that high-performance analytics (i.e., beyond straightforward query) is becoming increasingly important. And I’ve written about some of them at length. For example:
- MapReduce – controversial or in some cases even disappointing though it may be – has a lot of use cases.
- It’s early days, but Netezza and Teradata (and others) are beefing up their geospatial analytic capabilities.
- Memory-centric analytics is in the spotlight.
Ack. I can’t decide whether “analytics” should be a singular or plural noun. Thoughts?
Another area that’s come up, which I haven’t blogged about as much, is data mining in the database. Data mining accounts for a large part of data warehouse use. The traditional way to do data mining is to extract data from the database and dump it into SAS. But there are problems with this scenario, including:
| Categories: Analytic technologies, Aster Data, Data warehousing, EAI, EII, ETL, ELT, ETLT, Greenplum, MapReduce, Netezza, Oracle, Parallelization, SAS Institute, Teradata | 1 Comment |
Big scientific databases need to be stored somehow
A year ago, Mike Stonebraker observed that conventional DBMS don’t necessarily do a great job on scientific data, and further pointed out that different kinds of science might call for different data access methods. Even so, some of the largest databases around are scientific ones, and they have to be managed somehow. For example:
- Microsoft just put out an overwrought press release. The substance seems to be that Pan-STARRS — a Jim Gray legacy also discussed in an August, 2008 Computerworld article — is adding 1.4 terabytes of image data per night, and one not so new database adds 15 terabytes per year of some kind of computer simulation output used to analyze protein folding. Both run on SQL Server, of course.
- Kognitio has an astronomical database too, at Cambridge University, adding half a terabyte of data per night.
- Oracle is used for a McGill University proteomics database called CellMapBase. A figure of 50 terabytes of “mass storage” is cited, which doesn’t include tape backup and so on.
- The Large Hadron Collider, once it actually starts functioning, is projected to generate 15 petabytes of data annually, which will be initially stored on tape and then distributed to various computing centers around the world.
- Netezza is proud of its ability to serve images and the like quickly, although off the top of my head I’m not thinking of a major customer it has in that area. (But then, if you just sell software, your academic discount can approach 100%; but if like Netezza you have an actual cost of goods sold, that’s not as appealing an option.)
Long-term, I imagine that the most suitable DBMS for these purposes will be MPP systems with strong datatype extensibility — e.g., DB2, PostgreSQL-based Greenplum, PostgreSQL-based Aster nCluster, or maybe Oracle.
| Categories: Aster Data, Data types, Greenplum, IBM and DB2, Kognitio, Microsoft and SQL*Server, Netezza, Oracle, Parallelization, PostgreSQL, Scientific research | 1 Comment |
Update on Aster Data Systems and nCluster
On my West Coast swing last week, I spent a few hours at Aster Data, which has now officially put out Version 3 of nCluster. Highlights included:
| Categories: Application areas, Aster Data, Data warehousing, Database compression, MapReduce, Market share, Parallelization, Specific users, Theory and architecture, Web analytics | 1 Comment |
Oracle notes
I spent about six hours at Oracle today (talking with Andy Mendelsohn, Ray Roccaforte, Juan Loaiza, Cetin Ozbutun, et al.) and plan to write more later. For now, let me pass along a few quick comments.
| Categories: Data warehousing, Exadata, Oracle, Parallelization, Pricing, Storage, Theory and architecture | 6 Comments |
eBay doesn’t love MapReduce
The first time I ever heard from Oliver Ratzesberger of eBay, the subject line of his email mentioned MapReduce. That was early this year. Subsequently, however, eBay seems to have become a MapReduce non-fan. The reason is simple: eBay’s parallel efficiency tests show that MapReduce leaves most processors idle most of the time. The specific figure they mentioned was parallel efficiency of 18%.
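For readers who want the arithmetic behind a figure like that, here is a minimal sketch of the usual textbook definition of parallel efficiency: speedup divided by the number of processors. The post does not say how eBay defined or measured its number, so the function name and the timings below are purely hypothetical, chosen only so the result comes out at 18%.

```python
def parallel_efficiency(serial_seconds: float, parallel_seconds: float, workers: int) -> float:
    """Textbook parallel efficiency: speedup / workers = T1 / (N * TN)."""
    if workers <= 0 or parallel_seconds <= 0:
        raise ValueError("workers and parallel_seconds must be positive")
    speedup = serial_seconds / parallel_seconds
    return speedup / workers


# Hypothetical timings (not eBay's): a job that takes 900 s on one processor
# and 50 s on 100 processors has a speedup of 18x.
efficiency = parallel_efficiency(900.0, 50.0, 100)
print(f"parallel efficiency: {efficiency:.0%}")
```

Under this definition, 18% means that 100 processors deliver the useful work of about 18, which is one way of saying that most of them are idle most of the time.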
| Categories: MapReduce, Parallelization, eBay | 5 Comments |
Exadata and Oracle Database Machine parallelization clarified
Some kind Oracle development managers have reached out and helped me better understand where Oracle does or doesn’t stand in query and analytic parallelization. This post supersedes prior discussions of the subject over the past week.
| Categories: Data warehouse appliances, Data warehousing, Exadata, Oracle, Parallelization | 10 Comments |
So what’s Oracle’s MPP-aware optimizer and query execution plan story?
Edit: Answers to the title question have now shown up, and so the post below is now superseded by this one.
In most respects — including most data warehousing respects — Oracle’s query optimizer is the most sophisticated on the planet (even ahead of IBM’s, I’d say). But in all the Exadata discussion — and also in a good, comprehensive review of Oracle’s data warehouse technology — I haven’t seen any claims that Oracle has tackled the hard problems of parallel analytics.
Yes, Oracle is now getting data off of multiple disks onto multiple processors at once, without SAN bottlenecks, and doing some local filtering. That’s the heart of the Exadata storage story, and it’s indeed a huge advance over Oracle’s prior technology. But what happens to the data after that? It’s sent over to a RAC cluster. And unless I’m terribly mistaken, any further processing will be done on just a single node in that cluster.
| Categories: Data warehousing, Oracle, Parallelization | 8 Comments |
Exadata: Oracle finally answers the data warehouse challengers
Oracle, in partnership with HP, has announced a new data warehouse appliance product line, cleverly branded “Exadata.” The basic idea seems to be that database processing is split among two sets of servers:
- (The new stuff) A set of back-end servers — the Oracle Exadata Storage Servers — that gets data off of disk and does some preliminary query processing.
- (The old stuff) A conventional Oracle RAC cluster on the front-end.
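The two-tier split above can be pictured with a toy model: back-end servers scan their own data and apply a filter locally, and the front end only sees the rows that survive. This is a sketch of the general idea, not Oracle's implementation. The data, the function names, and the use of threads to stand in for storage servers are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands for the rows held by one storage server.
STORAGE_SERVERS = [
    [{"region": "EU", "amount": 10}, {"region": "US", "amount": 5}],
    [{"region": "US", "amount": 7}, {"region": "EU", "amount": 3}],
    [{"region": "EU", "amount": 8}],
]


def scan_and_filter(rows, predicate):
    """Back-end step: read local rows and return only those that pass the filter."""
    return [row for row in rows if predicate(row)]


def front_end_query(servers, predicate):
    """Front-end step: gather pre-filtered rows from every server, then aggregate."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        partials = list(pool.map(lambda rows: scan_and_filter(rows, predicate), servers))
    shipped = [row for part in partials for row in part]
    return sum(row["amount"] for row in shipped), len(shipped)


total, rows_shipped = front_end_query(STORAGE_SERVERS, lambda r: r["region"] == "EU")
print(total, rows_shipped)
```

The point of the model is the second return value: only the rows that pass the filter cross from the back end to the front end, which is where the savings in data movement would come from.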
Numbers are being thrown around suggesting that, unlike prior Oracle offerings, the Exadata-based appliance at least has scalability and price/performance worth comparing to Teradata — hey, Exa is bigger than Tera! — Netezza, et al.
Kevin Closson, who evidently worked on the project, offers the most useful and detailed description of Exadata I’ve seen so far. In particular, he and Oracle seem to claim:
| Categories: Data warehousing, Exadata, Oracle, Parallelization | 17 Comments |
SANs vs. DAS in MPP data warehousing
Generally speaking:
- SANs (Storage Area Networks) are pulling ahead of DAS (Direct Attached Storage).
- Much of the growth in storage is due to data warehousing.
- MPP (Massively Parallel Processing) is pulling ahead of SMP (Symmetric MultiProcessing) for high-end data warehousing.
- MPP architectures are commonly shared-nothing.
- Shared-nothing entails DAS.
But if you think about it, those facts don’t exactly add up.
| Categories: Calpont, Parallelization, Storage, Vertica Systems | 17 Comments |
Dividing the data warehousing work among MPP nodes
I talk with lots of vendors of MPP data warehouse DBMS. I’ve now heard enough different approaches to MPP architecture that I think it might be interesting to contrast some of the alternatives.
| Categories: Aster Data, Calpont, Exasol, Greenplum, Parallelization, Theory and architecture, Vertica Systems | 21 Comments |
More on known MapReduce application areas
In surveying MapReduce applications to date, I said that they fell mainly into three overlapping categories:
- Text tokenization, indexing, and search
- Creation of other kinds of data structures (e.g., graphs)
- Data mining and machine learning
and really should have included a fourth:
- Data transformation
Nokia just released another MapReduce implementation, Disco, and its list of applications to date fits right into that template. The relevant quote is:
“This far Disco has been succesfully used, for instance, in parsing and reformatting data, data clustering, probabilistic modelling, data mining, full-text indexing, and log analysis with hundreds of gigabytes of real-world data.”
| Categories: MapReduce | Leave a Comment |
Three different implementations of MapReduce
So far as I can see, there are three implementations of MapReduce that matter for enterprise analytic use – Hadoop, Greenplum’s, and Aster Data’s.* Hadoop has of course been available for a while, and used for a number of different things, while Greenplum’s and Aster Data’s versions of MapReduce – both in late-stage beta – have far fewer users.
*Perhaps Nokia’s Disco or another implementation will at some point join the list.
Earlier this evening I posted some Mike Stonebraker criticisms of MapReduce. It turns out that they aren’t all accurate across all MapReduce implementations. So this seems like a good time for me to stop stalling and put up a few notes about specific features of different MapReduce implementations. Here goes.
| Categories: Aster Data, Greenplum, MapReduce | 1 Comment |
Mike Stonebraker’s counterarguments to MapReduce’s popularity
In response to my recent posts about MapReduce, Mike Stonebraker just got on the phone to give me his views. His core claim, more or less, is that anything you can do in MapReduce you could already do in a parallel database that complies with SQL-92 and/or has PostgreSQL underpinnings. In particular, Mike says:
| Categories: Data warehousing, MapReduce, Michael Stonebraker, PostgreSQL | 4 Comments |
Introduction to Aster Data and nCluster
I’ve been writing a lot about Greenplum since a recent visit. But on the same trip I met with Aster Data, and have talked with them further since. Let me now redress the balance and outline some highlights of the Aster Data story.
| Categories: Analytic technologies, Aster Data, Data warehousing, Parallelization, Specific users | 3 Comments |
Yes, but what are the Very Biggest benefits of MapReduce?
On behalf of On-Demand Enterprise, née Grid Today, Dennis Barker asked me to clarify the most important benefits, features, etc. to various constituencies (business users, programmers, DBAs, etc.) of the Greenplum and Aster Data MapReduce announcements. Questions like that are hard to answer simply. Here’s why.
The core benefit of MapReduce is price/performance (because it allows the cost benefits of parallelization to be applied to analyses that are hard to parallelize otherwise). Large price/performance gains commonly mix together three kinds of benefits.
1. They let you do what you did before, for less money.
2. They let you do a better version of what you did before, for similar money.
3. They let you do new things that didn’t make economic sense before, but now do.
| Categories: Analytic technologies, Data warehousing, MapReduce | Leave a Comment |
Donut holes converted to code
And with impressively linear scalability.
| Categories: Humor, Parallelization | Leave a Comment |
Are analytic DBMS vendors overcomplicating their interconnect architectures?
I don’t usually spend a lot of time researching Ethernet switches. But I do think a lot about high-end data warehousing, and as I noted back in July, networking performance is a big challenge there. Among the very-large-scale MPP data warehouse software vendors, Greenplum is unusual in that its interconnect of choice is (sufficiently many) cheap 1 gigabit Ethernet switches.
A recent Network World story suggested that Greenplum wasn’t alone in this preference; other people also feel that clusters of commodity 1 gigabit Ethernet switches can be superior to higher-performing ones. So I pinged CTO Luke Lonergan of Greenplum for more comment. His response, which I got permission to publish, was:
| Categories: Data warehousing, Greenplum, Parallelization | 4 Comments |
Three approaches to parallelizing data transformation
Many MPP data warehousing vendors have told me their products are used for ELT (Extract/Load/Transform) instead of ETL (Extract/Transform/Load). I.e., needed data transformations are done on the MPP system, rather than on the — probably SMP — system the data comes from.* If the data transformation is being applied on a record-by-record basis, then it’s automatically fully parallelized. Even if the transforms are more complex, considerable parallel processing may still be going on.
*Or it’s some of each, at which point it’s called ETLT — I bet you can work out what that stands for.
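To see why record-by-record transforms parallelize so easily, consider this minimal sketch. Because each output record depends only on its own input record, every partition can be transformed independently and the results simply kept side by side. The record layout, the `transform` function, and the partitions are hypothetical; threads stand in for MPP nodes here and would not give real CPU parallelism in Python.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(record):
    """Record-by-record transform: the output depends only on this one record."""
    return {
        "name": record["name"].strip().title(),
        "cents": round(record["dollars"] * 100),
    }


def transform_partition(partition):
    """Work done by one (simulated) node on the records it already holds."""
    return [transform(record) for record in partition]


def parallel_transform(partitions):
    """Run every partition's transform concurrently; no data moves between nodes."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(transform_partition, partitions))


# Hypothetical partitions, as if the rows were already distributed across two nodes.
partitions = [
    [{"name": " alice ", "dollars": 1.5}],
    [{"name": "BOB", "dollars": 2.0}, {"name": "carol", "dollars": 0.25}],
]
result = parallel_transform(partitions)
print(result)
```

Transforms that need to see several records at once (deduplication, say, or lookups against data on other nodes) break this independence, which is where the more complex cases come in.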
| Categories: Aster Data, Data integration and middleware, Data warehousing, EAI, EII, ETL, ELT, ETLT, MapReduce, Parallelization, Pervasive Software | 6 Comments |
Why MapReduce matters to SQL data warehousing
Greenplum and Aster Data have both just announced the integration of MapReduce into their SQL MPP data warehouse products. So why do I think this could be a big deal? The short answer is “Because MapReduce offers dramatic performance gains in analytic application areas that still need great performance speed-up.” The long answer goes something like this.
The core ideas of MapReduce are:
| Categories: Analytic technologies, Data warehousing, MapReduce, Parallelization | 20 Comments |
Known applications of MapReduce
Most of the actual MapReduce applications I’ve heard of fall into a few areas:
- Text tokenization, indexing, and search
- Creation of other kinds of data structures (e.g., graphs)
- Data mining and machine learning
That covers all MapReduce apps I recall hearing about via commercial companies and users, and also includes most of what’s in the two big sources I found online.
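As an illustration of the first category, here is a minimal single-process sketch of building an inverted index in the map/shuffle/reduce style. It is not Hadoop's, Greenplum's, or Aster Data's API. The function names, the naive whitespace tokenizer, and the three sample documents are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (token, doc_id) pair per token, using naive whitespace splitting."""
    for token in text.lower().split():
        yield token, doc_id


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(token, doc_ids):
    """Reduce: collapse each token's values into a sorted list of distinct documents."""
    return token, sorted(set(doc_ids))


# Hypothetical documents keyed by document ID.
docs = {1: "parallel database", 2: "parallel search index", 3: "database index"}

pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
index = dict(reduce_phase(token, ids) for token, ids in shuffle(pairs).items())
print(index)
```

In a real deployment the map calls would run on whichever nodes hold the documents and the reduce calls on whichever nodes own each key; the sketch only shows the shape of the computation.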
| Categories: MapReduce, RDF and graphs, Text | 11 Comments |
MapReduce links
For whatever reason, I seem to be making the peripheral posts about MapReduce tonight before getting to the meat of the issues. So be it. There’s a rich set of links out there about MapReduce, and here are some of the best of them:
- Aster Data introduced MapReduce integrated into its SQL data warehouse DBMS tonight. Aster’s site features an excellent white paper.
- Exactly the same is true of Greenplum.
- Google Labs offers the seminal MapReduce research paper. It also has a broken link to an associated slide presentation, which fortunately is available here.
- One can get a good sense of MapReduce by reading up on the open source implementation Hadoop.
- In particular, this list of Hadoop applications is the longest list of MapReduce applications I know of (ahead even of Google’s long internal list).
- Joel Spolsky explained the core MapReduce concept a couple of years ago.
| Categories: MapReduce, Parallelization | 7 Comments |
MapReduce sound bites
Last Thursday, Greenplum and Aster Data (the two most recent of my numerous data warehouse specialist customers) both told me of the same major innovation. Both were rushing to announce it first, before anybody else did. This led to considerable tap dancing, with the upshot being that both are releasing the information tonight or tomorrow morning.
What’s going on is that Aster Data and Greenplum have both integrated MapReduce into their respective MPP shared-nothing data warehouse DBMS.
| Categories: Analytic technologies, Aster Data, Greenplum, MapReduce, Parallelization | 11 Comments |
Kevin Closson doesn’t like MPP
Kevin Closson of Oracle offers a long criticism of the popularity of MPP. Key takeaways include:
- TPC-H benchmarks that show Oracle as somewhat superior to DB2 are highly significant.
- TPC-H benchmarks in which MPP vendors destroy Oracle are too unimportant to even mention.
- SMP did better than MPP the last time he was in a position to judge (which evidently was some time during the Clinton Administration), so it surely must still be superior for all purposes today.
| Categories: Data warehousing, Oracle, Parallelization | 18 Comments |
Response to Rita Sallam of Oracle
In a comment thread on Seth Grimes’ blog, Rita Sallam of Oracle engaged in a passionate defense of her data warehousing software. I’d like to take it upon myself to respond to a few of her points here.
| Categories: Data warehousing, Oracle, Parallelization | 8 Comments |
ParAccel unveils its EMC-related appliance strategy
Embargoes are getting ever more stupid these days, wasting analysts’ and bloggers’ time in doomed attempts to micromanage the news flow. ParAccel is no exception to the rule. An announcement that’s actually been public knowledge for a couple of months was finally made official a few minutes ago. It’s an appliance, or at least an attempt to gain customers for an appliance. The core ideas include:
- ParAccel’s usual shared-nothing configuration is hooked up to SAN-based EMC storage at the back end.
- Around half of the total data is on internal (i.e., node-specific) disks, mirrored on the storage device. The rest of the data lives only on the EMC device. Logically, all this data is integrated. So hopefully you’ll be able to process more data per unit of time than you could on a standard ParAccel configuration.
- Also, different parts of the EMC device are dedicated to different ParAccel nodes. So, while this isn’t a shared-nothing architecture, at least it’s shared-not-very-much. (DATAllegro does something similar, although without the mirroring on direct-attached storage.)
- Backup, snapshotting, and so on are inherited from EMC. Administration will increasingly be integrated with EMC’s.
