Theory and architecture

Analysis of design choices in databases and database management systems. Related subjects include:

Any subcategory
Database diversity
Explicit support for specific data types
(in Text Technologies) Text search

March 21, 2007

Compression in columnar data stores

We have lively discussions going on columnar data stores vs. vertically partitioned row stores. Part is visible in the comment thread to a recent post. Other parts come in private comments from Stuart Frost of DATAllegro and Mike Stonebraker of Vertica et al.

To me, the most interesting part of what the Vertica guys are saying is twofold. One is that data compression just works better in column stores than row stores, perhaps by a factor of 3, because “the next thing in storage is the same data type, rather than a different one.” Frankly, although Mike has said this a couple of times, I haven’t understood yet why row stores can’t be smart enough to compress just as well. Yes, it’s a little harder than it would be in a columnar system; but I don’t see why the challenge would be insuperable.

The second part is even cooler, namely the claim that column stores allow the processors to operate directly on compressed data. But once again, I don’t see why row stores can’t do that too. For example, when you join via bitmapped indices, exactly what you’re doing is operating on highly-compressed data.

Categories: Columnar database management, Data warehouse appliances, Data warehousing, Database compression, DATAllegro, Vertica Systems

2 Comments

March 19, 2007

DATAllegro vs. Vertica and other columnar systems

Stuart Frost of DATAllegro offered an interesting counter today to columnar DBMS architectures — vertical partitioning. In particular, he told me of a 120 terabyte (growing soon to 250 terabytes) call data record database, in which a few key columns were separated out. Read more

Categories: Columnar database management, Data warehouse appliances, Data warehousing, DATAllegro, Kognitio, Vertica Systems

13 Comments

March 16, 2007

Word of the day: “Compression”

IBM sent over a bunch of success stories recently, with DB2’s new aggressive compression prominently mentioned. Mike Stonebraker made a big point of Vertica’s compression when last we talked; other column-oriented data warehouse/mart software vendors (e.g. ~~Kognitio,~~ SAP, Sybase) get strong compression benefits as well. Other data warehouse/mart specialists are doing a lot with compression too, although some of that is governed by please-don’t-say-anything-good-about-us NDA agreements.

Compression is important for at least three reasons:

It saves disk space, which is a major cost issue in data warehousing.
It saves I/O, which is the major performance issue in data warehousing.
In well-designed systems, it can actually make on-chip execution faster, because the gains in memory speed and movement can exceed the cost of actually packing/unpacking the data. (Or so I’m told; I haven’t aggressively investigated that claim.)

When evaluating data warehouse/mart software, take a look at the vendor’s compression story. It’s important stuff.

EDIT: DATAllegro claims in a note to me that they get 3-4x storage savings via compression. They also make the observation that fewer disks ==> fewer disk failures, and spin that — as it were 🙂 — into a claim of greater reliability.

Categories: Data warehouse appliances, Data warehousing, Database compression, DATAllegro, IBM and DB2, SAP AG, Vertica Systems

3 Comments

March 6, 2007

DBMS market competitive overview (Part 1)

Monash Advantage members just received an exclusive nine-page Monash Letter with a competitive overview of the DBMS industry. The full analysis is exclusive to them, but I’ll give some highlights here.

1. As per my recent “deck-clearing” posts, there’s a lot more competitive opportunity in the DBMS industry than many observers recognize.

2. One reason is the considerable number of separate niches in the DBMS space.

3. Oracle is a classical Geoffrey Moore “gorilla” only in the market for high-end OLTP and mixed-used DBMS. Everything else is up for grabs.

4. As discussed here extensively, simpler appliance-like architectures are beating the overly complex general-purpose DBMS vendors’ solutions for VLDB data warehousing.

5. MPP/shared-nothing architectures are deservedly beating SMP/shared-everything approaches for VLDB data warehousing.

That’s not the only Monash Letter recently released; another one covered online marketing strategy and tactics.

Categories: Data warehouse appliances, Data warehousing, Database diversity, Oracle, Theory and architecture

Leave a Comment

March 6, 2007

Why Oracle and Microsoft will lose in VLDB data warehousing

I haven’t been as clear as I could have been in explaining why I think MPP/shared-nothing beats SMP/shared-everything. The answer is in a short white paper, currently bottlenecked at the sponsor’s end of the process. Here’s an excerpt from the latest draft:

There are two ways to make more powerful computers:

1. Use more powerful parts – processors, disk drives, etc.

2. Just use more parts of the same power.

Of the two, the more-parts strategy much more cost-effective. Smaller* parts are much more economical, since the bigger the part, the harder and more costly it is to avoid defects, in manufacturing and initial design alike. Consequently, all high-end computers rely on some kind of parallel processing.

*As measured in terms of capacity, transistor count, etc., not physical size. Read more

Categories: Data warehouse appliances, Data warehousing, DATAllegro, Microsoft and SQL*Server, Netezza, Oracle, Parallelization, Teradata, Theory and architecture, Vertica Systems

7 Comments

February 27, 2007

Opportunities for disruption in the OLTP database management market (deck-clearing post #2)

The standard Clayton Christensen “Innovator’s Dilemma” disruption narrative goes something like this:

Market leaders have many advantages, including top technology.
Followers come up with good technology too.
The leaders stay ahead by making their products ever better and more complex.
The followers sell into new or non-mainstream markets, at prices the leaders can’t match. So they dominate new markets.
Old markets turn into low-margin commodity-fests.
Old leaders are screwed.

And it’s really hard for market leaders to avert this sad fate, because the short- and intermediate-term margin hit would be too great.

I think the OLTP DBMS market is ripe for that kind of disruption – riper than commentators generally realize. Here are some key potential drivers:
Read more

Categories: ANTs Software, Data warehousing, EnterpriseDB and Postgres Plus, IBM and DB2, Intersystems and Cache', Microsoft and SQL*Server, Mid-range, MySQL, OLTP, Open source, Oracle, Progress, Apama, and DataDirect, Theory and architecture

7 Comments

February 23, 2007

Really big databases

Business Intelligence Lowdown has a well-dugg post listing what it claims are the 10 largest databases in the world. The accuracy leaves much to be desired, as is illustrated by the fact that #10 on the list is only 20 terabytes, while entirely unmentioned is eBay’s 2-petabyte database (mentioned here, and also here). Read more

Categories: Data warehouse appliances, Data warehousing, DATAllegro, Greenplum, IBM and DB2, Netezza, Oracle, SAS Institute, Teradata, Theory and architecture

4 Comments

February 21, 2007

If you can’t trust the storage vendors …

… isn’t that another reason to go with massively parallel systems?

StorageMojo has a great post on storage myth and reality.

Want to continue getting great research about DBMS, analytics, and other technologies related to data management? Then subscribe to our feed, by RSS/Atom or e-mail! We recommend taking the integrated feed for all our blogs, but blog-specific ones are also easily available.

Categories: Data warehouse appliances, Data warehousing, Theory and architecture

Leave a Comment

February 9, 2007

Do modern databases have too many tables?

Mike Robinson thinks modern databases have too many tables. However, I’m not sure about his argument. He argues that more tables = more code, but is that really true? Or are they just a good framework from which to modularize code? Some of his specifics might be perhaps addressed by updatable views. And other of his complaints were about performance hacks (caches, history tables), that have little to do with database normalization.

Frankly, the kind of application he describes is one I think should be bought from a third-party vendor, who probably should indeed use lots of tables. I agree that relational fundamentalism is way overblown, but perhaps for different reasons than Mike does.

Categories: Theory and architecture

Leave a Comment

January 31, 2007