Parallelization

Analysis of issues in parallel computing, especially parallelized database management.

November 7, 2008

Big scientific databases need to be stored somehow

A year ago, Mike Stonebraker observed that conventional DBMS don’t necessarily do a great job on scientific data, and further pointed out that different kinds of science might call for different data access methods. Even so, some of the largest databases around are scientific ones, and they have to be managed somehow. For example:

Microsoft just put out an overwrought press release. The substance seems to be that Pan-STARRS (a Jim Gray legacy also discussed in an August 2008 Computerworld article) is adding 1.4 terabytes of image data per night, and that one not-so-new database adds 15 terabytes per year of some kind of computer simulation output used to analyze protein folding. Both run on SQL Server, of course.
Kognitio has an astronomical database too, at Cambridge University, adding half a terabyte of data per night.
Oracle is used for a McGill University proteomics database called CellMapBase. A figure of 50 terabytes of “mass storage” is cited, which excludes tape backup and so on.
The Large Hadron Collider, once it actually starts functioning, is projected to generate 15 petabytes of data annually, which will be initially stored on tape and then distributed to various computing centers around the world.
Netezza is proud of its ability to serve images and the like quickly, although off the top of my head I can’t think of a major customer it has in that area. (Then again, if you just sell software, your academic discount can approach 100%; if, like Netezza, you have an actual cost of goods sold, that’s not as appealing an option.)
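To put the per-night and per-year figures above on one scale, here is a rough back-of-the-envelope sketch. It assumes (and this is my assumption, not anything the vendors said) that a "per night" rate holds for all 365 nights of the year, which is surely an upper bound for a telescope, and that a petabyte is 1,000 terabytes.

```python
# Rough annualized ingest rates for the scientific databases listed above.
# ASSUMPTION (not from the sources): "per night" applies 365 nights a year,
# so the astronomy numbers are upper bounds.
NIGHTS_PER_YEAR = 365
TB_PER_PB = 1000  # decimal units, assumed

rates_tb_per_year = {
    "Pan-STARRS (1.4 TB/night)": 1.4 * NIGHTS_PER_YEAR,
    "Cambridge astronomy on Kognitio (0.5 TB/night)": 0.5 * NIGHTS_PER_YEAR,
    "Protein-folding simulation output (15 TB/year)": 15.0,
    "Large Hadron Collider (15 PB/year)": 15 * TB_PER_PB,
}

# Print smallest to largest so the spread is easy to see.
for name, tb in sorted(rates_tb_per_year.items(), key=lambda kv: kv[1]):
    print(f"{name}: ~{tb:,.0f} TB/year")
```

Even under that generous assumption, the projected Large Hadron Collider volume is well over an order of magnitude beyond the others.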

Long-term, I imagine that the most suitable DBMS for these purposes will be MPP systems with strong datatype extensibility — e.g., DB2, PostgreSQL-based Greenplum, PostgreSQL-based Aster nCluster, or maybe Oracle.

Categories: Aster Data, Data types, Greenplum, IBM and DB2, Kognitio, Microsoft and SQL*Server, Netezza, Oracle, Parallelization, PostgreSQL, Scientific research

1 Comment

October 22, 2008

Update on Aster Data Systems and nCluster

On my West Coast swing last week, I spent a few hours at Aster Data, which has now officially put out Version 3 of nCluster. Highlights included: Read more

Categories: Application areas, Aster Data, Data warehousing, Database compression, MapReduce, Market share and customer counts, Parallelization, Specific users, Theory and architecture, Web analytics

3 Comments

October 17, 2008

Oracle notes

I spent about six hours at Oracle today — talking with Andy Mendelsohn, Ray Roccaforte, Juan Loaiza, Cetin Ozbutun, et al. — and plan to write more later. For now, let me pass along a few quick comments. Read more

Categories: Data warehousing, Exadata, Oracle, Parallelization, Pricing, Storage, Theory and architecture

10 Comments

October 15, 2008

eBay doesn’t love MapReduce

The first time I ever heard from Oliver Ratzesberger of eBay, the subject line of his email mentioned MapReduce. That was early this year. Subsequently, however, eBay seems to have become a MapReduce non-fan. The reason is simple: eBay’s parallel efficiency tests show that MapReduce leaves most processors idle most of the time. The specific figure they mentioned was parallel efficiency of 18%.
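For readers who want the arithmetic behind a figure like that, here is a minimal sketch using the textbook definition of parallel efficiency (speedup divided by processor count). I don't know how eBay actually measured it, so the definition is an assumption, and the timings and processor count below are hypothetical, chosen only to reproduce an 18% result.

```python
def parallel_efficiency(serial_seconds, parallel_seconds, processors):
    """Textbook definition: speedup divided by the number of processors."""
    if serial_seconds <= 0 or parallel_seconds <= 0 or processors <= 0:
        raise ValueError("timings and processor count must be positive")
    speedup = serial_seconds / parallel_seconds
    return speedup / processors


def effective_processors(efficiency, processors):
    """How many fully busy processors the cluster is equivalent to."""
    return efficiency * processors


# HYPOTHETICAL numbers: a job that takes 1,800 s on one processor
# and 100 s on a 100-processor cluster.
eff = parallel_efficiency(serial_seconds=1800.0, parallel_seconds=100.0, processors=100)
busy = effective_processors(eff, 100)
print(f"efficiency = {eff:.0%}, equivalent to about {busy:.0f} fully busy processors")
```

Under that definition, 18% efficiency on 100 processors means you are getting roughly the work of 18 of them.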

Categories: eBay, MapReduce, Parallelization

7 Comments

September 28, 2008

Exadata and Oracle Database Machine parallelization clarified

Some kind Oracle development managers have reached out and helped me better understand where Oracle does or doesn’t stand in query and analytic parallelization. This post supersedes prior discussions of the subject over the past week. Read more

Categories: Clustering, Data warehouse appliances, Data warehousing, Exadata, Oracle, Parallelization

10 Comments

September 25, 2008

So what’s Oracle’s MPP-aware optimizer and query execution plan story?

Edit: Answers to the title question have now shown up, and so the post below is now superseded by this one.

In most respects — including most data warehousing respects — Oracle’s query optimizer is the most sophisticated on the planet (even ahead of IBM’s, I’d say). But in all the Exadata discussion — and also in a good, comprehensive review of Oracle’s data warehouse technology — I haven’t seen any claims that Oracle has tackled the hard problems of parallel analytics.

Yes, Oracle is now getting data off of multiple disks onto multiple processors at once, without SAN bottlenecks, and doing some local filtering. That’s the heart of the Exadata storage story, and it’s indeed a huge advance over Oracle’s prior technology. But what happens to the data after that? It’s sent over to a RAC cluster. And unless I’m terribly mistaken, any further processing will be done on just a single node in that cluster.

Categories: Data warehousing, Oracle, Parallelization

9 Comments

September 24, 2008

Exadata: Oracle finally answers the data warehouse challengers

Oracle, in partnership with HP, has announced a new data warehouse appliance product line, cleverly branded “Exadata.” The basic idea seems to be that database processing is split among two sets of servers:

(The new stuff) A set of back-end servers — the Oracle Exadata Storage Servers — that gets data off of disk and does some preliminary query processing.
(The old stuff) A conventional Oracle RAC cluster on the front-end.
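As a toy illustration of that split, and emphatically not Oracle's implementation, the sketch below has each back-end "storage server" filter its own rows and return only the needed column, while a single front end combines the results. All data, names, and functions are hypothetical.

```python
# HYPOTHETICAL data: each inner list is the set of rows held by one storage server.
storage_servers = [
    [{"region": "EU", "amount": 10}, {"region": "US", "amount": 25}],
    [{"region": "US", "amount": 5}, {"region": "EU", "amount": 40}],
    [{"region": "APAC", "amount": 7}],
]


def storage_scan(rows, predicate, columns):
    """Back-end step: read local rows, keep matches, return only the needed columns."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


def front_end_sum(partials, column):
    """Front-end step: combine the pre-filtered rows from every storage server."""
    return sum(row[column] for part in partials for row in part)


def is_eu(row):
    return row["region"] == "EU"


# Filtering happens per storage server; only surviving rows travel to the front end.
filtered = [storage_scan(rows, is_eu, ["amount"]) for rows in storage_servers]
total = front_end_sum(filtered, "amount")
rows_stored = sum(len(rows) for rows in storage_servers)
rows_shipped = sum(len(part) for part in filtered)
print(f"total={total}, shipped {rows_shipped} of {rows_stored} rows")
```

The point of the design is the last line: the less data that has to leave the storage tier, the less the front end and the interconnect have to handle.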

Numbers are being thrown around suggesting that, unlike prior Oracle offerings, the Oracle Exadata-based appliance at least has scalability and price/performance worth comparing to Teradata — hey, Exa is bigger than Tera! — Netezza, et al.

Kevin Closson, who evidently worked on the project, offers the most useful and detailed description of Oracle Exadata I’ve seen so far. In particular, he and Oracle seem to claim: Read more

Categories: Data warehousing, Exadata, Oracle, Parallelization

18 Comments

September 6, 2008

SANs vs. DAS in MPP data warehousing

Generally speaking:

SANs (Storage Area Networks) are pulling ahead of DAS (Direct Attached Storage).
Much of the growth in storage is due to data warehousing.
MPP (Massively Parallel Processing) is pulling ahead of SMP (Symmetric MultiProcessing) for high-end data warehousing.
MPP architectures are commonly shared-nothing.
Shared-nothing entails DAS.

But if you think about it, those facts don’t exactly add up. Read more

Categories: Calpont, Parallelization, Storage, Vertica Systems

24 Comments

September 5, 2008

Dividing the data warehousing work among MPP nodes

I talk with lots of vendors of MPP data warehouse DBMS. I’ve now heard enough different approaches to MPP architecture that I think it might be interesting to contrast some of the alternatives.

Categories: Aster Data, Calpont, Exasol, Greenplum, Parallelization, Theory and architecture, Vertica Systems

22 Comments

September 5, 2008

More on known MapReduce application areas

In surveying MapReduce applications to date, I said that they fell mainly into three overlapping categories:

Text tokenization, indexing, and search
Creation of other kinds of data structures (e.g., graphs)
Data mining and machine learning

and really should have included a fourth:

Data transformation

Nokia just released another MapReduce implementation, Disco, and its list of applications to date fits right into that template. The relevant quote is:

“This far Disco has been succesfully used, for instance, in parsing and reformatting data, data clustering, probabilistic modelling, data mining, full-text indexing, and log analysis with hundreds of gigabytes of real-world data.”
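For anybody who hasn't seen the pattern, here is a minimal single-process sketch of MapReduce applied to the first category, text tokenization. It does not use Disco's API or any other real framework; the function names are my own, and a real implementation would run the map and reduce steps across many nodes.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (token, 1) pair for every whitespace-separated token."""
    for token in text.lower().split():
        yield token, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    mapped = (
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# HYPOTHETICAL two-document corpus.
docs = {
    "a": "log analysis and data mining",
    "b": "data clustering and log parsing",
}
counts = word_count(docs)
print(counts)
```

Swapping in a different map function (say, one that reformats a log line) and a different reduce function gives you the data transformation case.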

Categories: MapReduce
