Parallelization
Analysis of issues in parallel computing, especially parallelized database management. Related subjects include:
Big scientific databases need to be stored somehow
A year ago, Mike Stonebraker observed that conventional DBMS don’t necessarily do a great job on scientific data, and further pointed out that different kinds of science might call for different data access methods. Even so, some of the largest databases around are scientific ones, and they have to be managed somehow. For example:
- Microsoft just put out an overwrought press release. The substance seems to be that Pan-STARRS — a Jim Gray legacy also discussed in an August, 2008 Computerworld article — is adding 1.4 terabytes of image data per night, and one not so new database adds 15 terabytes per year of some kind of computer simulation output used to analyze protein folding. Both run on SQL Server, of course.
- Kognitio has an astronomical database too, at Cambridge University, adding 1/2 a terabyte of data per night.
- Oracle is used for a McGill University proteonomics database called CellMapBase. A figure of 50 terabytes of “mass storage” is included, which doesn’t include tape backup and so on.
- The Large Hadron Collider, once it actually starts functioning, is projected to generate 15 petabytes of data annually, which will be initially stored on tape and then distributed to various computing centers around the world.
- Netezza is proud of its ability to serve images and the like quickly, although off the top of my head I’m not thinking of a major customer it has in that area. (But then, if you just sell software, your academic discount can approach 100%; but if like Netezza you have an actual cost of goods sold, that’s not as appealing an option.)
Long-term, I imagine that the most suitable DBMS for these purposes will be MPP systems with strong datatype extensibility — e.g., DB2, PostgreSQL-based Greenplum, PostgreSQL-based Aster nCluster, or maybe Oracle.
Categories: Aster Data, Data types, Greenplum, IBM and DB2, Kognitio, Microsoft and SQL*Server, Netezza, Oracle, Parallelization, PostgreSQL, Scientific research | 1 Comment |
Update on Aster Data Systems and nCluster
I spent a few hours at Aster Data on my West Coast swing last week, which has now officially put out Version 3 of nCluster. Highlights included: Read more
Oracle notes
I spent about six hours at Oracle today — talking with Andy Mendelsohn, Ray Roccaforte, Juan Loaiza, Cetin Ozbutun, et al. — and plan to write more later. For now, let me pass along a few quick comments. Read more
Categories: Data warehousing, Exadata, Oracle, Parallelization, Pricing, Storage, Theory and architecture | 10 Comments |
eBay doesn’t love MapReduce
The first time I ever heard from Oliver Ratzesberger of eBay, the subject line of his email mentioned MapReduce. That was early this year. Subsequently, however, eBay seems to have become a MapReduce non-fan. The reason is simple: eBay’s parallel efficiency tests show that MapReduce leaves most processors idle most of the time. The specific figure they mentioned was parallel efficiency of 18%.
Categories: eBay, MapReduce, Parallelization | 7 Comments |
Exadata and Oracle Database Machine parallelization clarified
Some kind Oracle development managers have reached out and helped me better understand where Oracle does or doesn’t stand in query and analytic parallelization. This post supersedes prior discussions of the subject over the past week. Read more
Categories: Clustering, Data warehouse appliances, Data warehousing, Exadata, Oracle, Parallelization | 10 Comments |
So what’s Oracle’s MPP-aware optimizer and query execution plan story?
Edit: Answers to the title question have now shown up, and so the post below is now superseded by this one.
In most respects — including most data warehousing respects — Oracle’s query optimizer is the most sophisticated on the planet (even ahead of IBM’s, I’d say). But in all the Exadata discussion — and also in a good, comprehensive review of Oracle’s data warehouse technology — I haven’t seen any claims that Oracle has tackled the hard problems of parallel analytics.
Yes, Oracle is now getting data off of multiple disks onto multiple processors at once, without SAN bottlenecks, and doing some local filtering. That’s the heart of the Exadata storage story, and it’s indeed a huge advance over Oracle’s prior technology. But what happens to the data after that? It’s sent over to a RAC cluster. And unless I’m terribly mistaken, any further processing will be done on just a single node in that cluster.
Categories: Data warehousing, Oracle, Parallelization | 9 Comments |
Exadata: Oracle finally answers the data warehouse challengers
Oracle, in partnership with HP, has announced a new data warehouse appliance product line, cleverly branded “Exadata.” The basic idea seems to be that database processing is split among two sets of servers:
- (The new stuff) A set of back-end servers — the Oracle Exadata Storage Servers — that gets data off of disk and does some preliminary query processing.
- (The old stuff) A conventional Oracle RAC cluster on the front-end.
Numbers are being thrown around suggesting that, unlike prior Oracle offerings, the Oracle Exadata-based appliance at least has scalability and price/performance worth comparing to Teradata — hey, Exa is bigger than Tera! — Netezza, et al.
Kevin Closson, who evidently worked on the project, offers the most useful and detailed description of Oracle Exadata I’ve seen so far. In particular, he and Oracle seem to claim: Read more
Categories: Data warehousing, Exadata, Oracle, Parallelization | 18 Comments |
SANs vs. DAS in MPP data warehousing
Generally speaking:
- SANs (Storage Area Networks) are pulling ahead of DAS (Direct Attached Storage).
- Much of the growth in storage is due to data warehousing.
- MPP (Massively Parallel Processing) is pulling ahead of SMP (Symmetric MultiProcessing) for high-end data warehousing.
- MPP architectures are commonly shared-nothing.
- Shared-nothing entails DAS.
But if you think about it, those facts don’t exactly add up. Read more
Categories: Calpont, Parallelization, Storage, Vertica Systems | 24 Comments |
Dividing the data warehousing work among MPP nodes
I talk with lots of vendors of MPP data warehouse DBMS. I’ve now heard enough different approaches to MPP architecture that I think it might be interesting to contrast some of the alternatives.
Categories: Aster Data, Calpont, Exasol, Greenplum, Parallelization, Theory and architecture, Vertica Systems | 22 Comments |
More on known MapReduce application areas
In surveying MapReduce applications to date, I said that they fell mainly into three overlapping categories:
- Text tokenization, indexing, and search
- Creation of other kinds of data structures (e.g., graphs)
- Data mining and machine learning
and really should have included a fourth:
- Data transformation
Nokia just released another MapReduce implementation, Disco, and its list of applications to date fits right into that template. The relevant quote is:
This far Disco has been succesfully used, for instance, in parsing and reformatting data, data clustering, probabilistic modelling, data mining, full-text indexing, and log analysis with hundreds of gigabytes of real-world data.
Categories: MapReduce | Leave a Comment |