Theory and architecture

Analysis of design choices in databases and database management systems.

June 15, 2007

RDF “definitely has legs”

Thus spake Mike Stonebraker to me, on a call we’d scheduled to talk about several other things altogether. This was one day after I was told at the Text Analytics Summit that the US government is going nuts for RDF. And I continue to get confirmation of something I first noted last year — Oracle is pushing RDF heavily, especially in the life sciences market.

Evidently, the RDF data model is for real … unless, of course, you’re the kind of purist who cares to dispute whether RDF is a true “data model” at all.

June 14, 2007

Bracing for Vertica

The word from Vertica is that the product will go GA in the fall, and that they’ll have blow-out benchmarks to exhibit.

I find this very credible. Indeed, the above may even be something of an understatement.

Vertica’s product surely has some drawbacks, which will become more apparent when the product is more available for examination. So I don’t expect row-based appliance innovators Netezza and DATAllegro to just dry up and blow away. On the other hand, not every data warehousing product is going to live long and prosper, and I’d rate Vertica’s chances higher than those of several competitors that are actually already in GA.

June 8, 2007

Transparent scalability

I’ve been a DBMS industry analyst, in one guise or another, since 1981. So by now I’ve witnessed a whole lot of claims and debates about scalability. And there’s one observation I’d like to call out.

What matters most isn't what kind of capacity or throughput you can get with heroic efforts. Rather, what matters most is the capacity and throughput you get without any kind of special programming or database administration.

Of course, when taken to extremes, that point could become silly. DBMSs are used by professionals, and requiring a bit of care and tuning is par for the course. But if you have a choice between two systems that can get the job done for you, one of which requires you to perform unnatural acts and the other doesn't – go for the one that works straightforwardly. Your overall costs will wind up being much lower, and you'll probably get a lot more useful work done. A system that has to strain even to meet known requirements will probably fail altogether at meeting the as-yet-unknown ones that are sure to arise down the road.


May 29, 2007

The petabyte machine

EMC has announced a machine, a virtual tape library, that supposedly stores 1.8 petabytes of data. Even though that works out to only 584 terabytes uncompressed (the 1.8 petabyte figure presumably assumes roughly 3X compression), it shows that the 1 petabyte barrier will soon be broken even by unhyped measurements.

I just recently encountered some old notes in which Sybase proudly announced a “1 gigabyte challenge.” The idea was that 1 gig was a breakthrough size for business databases.

Time flies.

April 6, 2007

Lessons from EnterpriseDB

I had a nice conversation yesterday with Jim Mlodgenski of EnterpriseDB, covering both generalities and EnterpriseDB-specific stuff. Many of the generalities were predictable, and none were terribly shocking. Even so, at the risk of playing Captain Obvious, I'll repeat a few of the ones I found interesting below:

Read more

March 26, 2007

White paper — Index-Light MPP Data Warehousing

Many of my thoughts on data warehouse DBMS and appliances have been collected in a white paper, sponsored by DATAllegro. As in a couple of other white papers — collected here — I coined a phrase to describe the core concept: Index-light. MPP row-oriented data warehouse DBMSs certainly have indices, which are occasionally even used. But the approaches to database design that are supported or make sense to use are simply different for DATAllegro, Netezza (the most extreme example of all) or Teradata than for Oracle or Microsoft. And the differences are all in the direction of less indexing.
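
To make the index-light tradeoff concrete, here is a back-of-the-envelope sketch of my own. The numbers are purely illustrative assumptions (not from the paper or any vendor): for typical warehouse queries, an MPP system's aggregate sequential scan bandwidth beats per-row index lookups except at needle-in-a-haystack selectivity.

```python
# Back-of-the-envelope comparison: full parallel scan vs. index-driven access.
# All figures are illustrative assumptions, not vendor specifications.

def scan_seconds(table_gb, nodes, mb_per_sec_per_node=100):
    """Time for a full parallel table scan spread across all nodes."""
    return (table_gb * 1024) / (nodes * mb_per_sec_per_node)

def index_seconds(rows_touched, ms_per_random_read=5):
    """Time for index-driven access, dominated by one random read per row."""
    return rows_touched * ms_per_random_read / 1000

table_gb = 1000                 # a hypothetical 1 TB fact table
total_rows = 5_000_000_000
nodes = 32

for selectivity in (0.000001, 0.0001, 0.01):
    rows = int(total_rows * selectivity)
    print(f"selectivity {selectivity:>8}: "
          f"scan {scan_seconds(table_gb, nodes):9.0f}s  vs  "
          f"index {index_seconds(rows):11.0f}s")

# Only the needle-in-a-haystack query favors the index; anything that touches
# even a small fraction of the table favors the scan, which is the crux of
# the index-light argument.
```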

Here’s an excerpt from the paper. Please pardon the formatting; it reads better in the actual .PDF. Read more

March 25, 2007

Oracle, Tangosol, objects, caching, and disruption

Oracle made a slick move in picking up Tangosol, a leader in object/data caching for all sorts of major OLTP apps. They do financial trading, telecom operations, big web sites (Fedex, Geico), and other good stuff. This is a reminder that the list of important memory-centric data handling technologies is getting fairly long, including:

And that’s just for OLTP; there’s a whole other set of memory-centric technologies for analytics as well.
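
For readers who haven't used products in Tangosol's category: the common pattern is a read-through object cache that sits between the application and the DBMS. Here is a deliberately minimal sketch of that pattern, with all names and numbers hypothetical; real products add partitioning, cluster-wide coherence, write-behind to the database, and much more.

```python
# Minimal read-through object cache -- illustrative only, not any vendor's API.
from collections import OrderedDict

class ReadThroughCache:
    def __init__(self, loader, capacity=10_000):
        self.loader = loader          # function that fetches a miss from the DBMS
        self.capacity = capacity
        self._data = OrderedDict()    # insertion order doubles as LRU order

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)       # mark as most recently used
            return self._data[key]
        value = self.loader(key)              # cache miss: go to the database
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)    # evict the least recently used
        return value

# Hypothetical usage: wrap whatever call actually queries the database.
cache = ReadThroughCache(loader=lambda key: {"id": key, "status": "shipped"})
cache.get("order:42")   # first call loads from the "database"
cache.get("order:42")   # second call is served from memory
```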

When one connects the dots, I think three major points jump out:

  1. There’s a lot more to high-end OLTP than relational database management.
  2. Oracle is determined to be the leader in as many of those areas as possible.
  3. This all fits the market disruption narrative.

I write about Point #1 all the time. So this time around let me expand a little more on #2 and #3.
Read more

March 24, 2007

Will database compression change the hardware game?

I’ve recently made a lot of posts about database compression. 3X or more compression is rapidly becoming standard; 5X+ is coming soon as processor power increases; 10X or more is not unrealistic. True, this applies mainly to data warehouses, but that’s where the big database growth is happening. And new kinds of data — geospatial, telemetry, document, video, whatever — are highly compressible as well.

This trend suggests a few interesting possibilities for hardware, semiconductors, and storage.

  1. The growth in demand for storage might actually slow. That said, I frankly think it’s more likely that Parkinson’s Law of Data will continue to hold: Data expands to fill the space available. E.g., video and other media have near-infinite potential to consume storage; it’s just a question of resolution and fidelity.
  2. Solid-state (aka semiconductor or flash) persistent storage might become practical sooner than we think. If you really can fit a terabyte of data onto 100 gigs of flash, that’s a pretty affordable alternative. And by the way — if that happens, a lot of what I’ve been saying about random vs. sequential reads might be irrelevant.
  3. Similarly, memory-centric data management is more affordable when compression is aggressive. That’s a key point of schemes such as SAP’s or QlikTech’s. Who needs flash? Just put it in RAM, persisting it to disk just for backup.
  4. There’s a use for faster processors. Compression isn’t free. What you save on disk space and I/O you pay for at the CPU level. Those 5X+ compression levels do depend on faster processors, at least for the row store vendors.
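
A rough way to think about point #4, with purely illustrative figures rather than measurements: compressed storage multiplies effective disk bandwidth, but only up to the rate at which the CPU can decompress.

```python
# Rough arithmetic for the disk-vs-CPU tradeoff in point 4 above.
# All figures are illustrative assumptions, not measurements.

def effective_scan_mb_per_sec(disk_mb_s, compression_ratio, decompress_mb_s):
    """Logical scan rate when the table is stored compressed.

    Disk delivers physical MB that expand by the compression ratio, but the
    CPU has to decompress them; whichever side saturates first sets the pace.
    """
    io_limited = disk_mb_s * compression_ratio
    cpu_limited = decompress_mb_s
    return min(io_limited, cpu_limited)

disk = 100  # MB/s of sequential bandwidth from one drive
print("uncompressed scan:", disk, "MB/s")
for ratio, cpu_mb_s in [(3, 600), (5, 400), (10, 300)]:
    rate = effective_scan_mb_per_sec(disk, ratio, cpu_mb_s)
    print(f"{ratio}X compression, {cpu_mb_s} MB/s decompression -> {rate} MB/s effective")

# With these (made-up) numbers, 5X beats 3X, but 10X is throttled by the CPU --
# which is why the higher compression ratios wait on faster processors.
```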

March 24, 2007

Mike Stonebraker on database compression — comments

In my opinion, the key part of Mike Stonebraker’s fascinating note on data compression was (emphasis mine):

The standard wisdom in most row stores is to use block compression. Hence, a storage block is compressed using a single technique (say Lempel-Ziv or dictionary). The technique chosen then compresses all the attributes in all the columns which occur on the block. In contrast, Vertica compresses a storage block that only contains one attribute. Hence, it can use a different compression scheme for each attribute. Obviously a compression scheme that is type-specific will beat an implementation that is “one size fits all”.

It is possible for a row store to use a type-specific compression scheme. However, if there are 50 attributes in a record, then it must remember the state for 50 type-specific implementations, and complexity increases significantly.

In addition, all row stores we are familiar with decompress each storage block on access, so that the query executor processes uncompressed tuples. In contrast, the Vertica executor processes compressed tuples. This results in better L2 cache locality, less main memory copying and generally much better performance.

Of course, any row store implementation can rewrite their executor to run on compressed data. However, this is a rewrite – and a lot of work.
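
To illustrate the distinction in miniature, here is a toy sketch of my own (not Vertica's implementation): a single sorted, low-cardinality column collapses to a handful of run-length tuples that a query can operate on directly, whereas a generically compressed block has to be decompressed before the executor sees anything.

```python
# Toy contrast between type-specific, per-column compression and generic
# block compression. Not Vertica's code; just the shape of the idea.
import zlib

# A column-store block holds a single attribute, here sorted and low-cardinality.
state_column = ["MA"] * 40_000 + ["NY"] * 35_000 + ["TX"] * 25_000

def run_length_encode(values):
    """Collapse the column into (value, run_length) pairs."""
    runs = []
    prev, count = values[0], 0
    for v in values:
        if v == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = v, 1
    runs.append((prev, count))
    return runs

runs = run_length_encode(state_column)      # [('MA', 40000), ('NY', 35000), ('TX', 25000)]
raw = "".join(state_column).encode()
block_compressed = zlib.compress(raw)       # one-size-fits-all block compression

print("raw column bytes: ", len(raw))
print("zlib block bytes: ", len(block_compressed))
print("run-length tuples:", len(runs))

# The "executes on compressed data" point, in miniature: a predicate such as
# state = 'NY' can be answered straight from the runs (35,000 matching rows,
# known from one tuple), while the zlib block must be decompressed first.
```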

Read more

March 24, 2007

Mike Stonebraker explains column-store data compression

The following is by Mike Stonebraker, CTO of Vertica Systems, copyright 2007, as part of our ongoing discussion of data compression. My comments are in a separate post.

Row Store Compression versus Column Store Compression

I. Introduction

There are three aspects of space requirements, which we discuss in this short note, namely:

  1. structural space requirements
  2. index space requirements
  3. attribute space requirements

Read more
