March 26, 2007

White paper — Index-Light MPP Data Warehousing

Many of my thoughts on data warehouse DBMS and appliances have been collected in a white paper, sponsored by DATAllegro. As in a couple of other white papers — collected here — I coined a phrase to describe the core concept: Index-light. MPP row-oriented data warehouse DBMSs certainly have indices, which are occasionally even used. But the approaches to database design that are supported or make sense to use are simply different for DATAllegro, Netezza (the most extreme example of all) or Teradata than for Oracle or Microsoft. And the differences are all in the direction of less indexing.

Here’s an excerpt from the paper. Please pardon the formatting; it reads better in the actual .PDF Read more

Categories: Data warehouse appliances, Data warehousing, DATAllegro, EMC, Theory and architecture

4 Comments

March 25, 2007

Oracle, Tangosol, objects, caching, and disruption

Oracle made a slick move in picking up Tangosol, a leader in object/data caching for all sorts of major OLTP apps. They do financial trading, telecom operations, big web sites (Fedex, Geico), and other good stuff. This is a reminder that the list of important memory-centric data handling technologies is getting fairly long, including:

Object caching (e.g., Tangosol, Progress ObjectStore)
In-memory RDBMS (e.g., Oracle TimesTen, Solid BoostEngine, McObject eXtremeDB)
Stream processing (e.g., Progress Apama, Streambase)

And that’s just for OLTP; there’s a whole other set of memory-centric technologies for analytics as well.

When one connects the dots, I think three major points jump out:

There’s a lot more to high-end OLTP than relational database management.
Oracle is determined to be the leader in as many of those areas as possible.
This all fits the market disruption narrative.

I write about Point #1 all the time. So this time around let me expand a little more on #2 and #3.
Read more

Categories: Cache, Database diversity, Memory-centric data management, MySQL, Netezza, OLTP, Oracle, Oracle TimesTen, Progress, Apama, and DataDirect, solidDB, StreamBase, Streaming and complex event processing (CEP), Theory and architecture

4 Comments

March 24, 2007

Will database compression change the hardware game?

I’ve recently made a lot of posts about database compression. 3X or more compression is rapidly becoming standard; 5X+ is coming soon as processor power increases; 10X or more is not unrealistic. True, this applies mainly to data warehouses, but that’s where the big database growth is happening. And new kinds of data — geospatial, telemetry, document, video, whatever — are highly compressible as well.

This trend suggests a few interesting possibilities for hardware, semiconductors, and storage.

The growth in demand for storage might actually slow. That said, I frankly think it’s more likely that Parkinson’s Law of Data will continue to hold: Data expands to fill the space available. E.g., video and other media have near-infinite potential to consume storage; it’s just a question of resolution and fidelity.
Solid-state (aka semiconductor or flash) persistent storage might become practical sooner than we think. If you really can fit a terabyte of data onto 100 gigs of flash, that’s a pretty affordable alternative. And by the way — if that happens, a lot of what I’ve been saying about random vs. sequential reads might be irrelevant.
Similarly, memory-centric data management is more affordable when compression is aggressive. That’s a key point of schemes such as SAP’s or QlikTech’s. Who needs flash? Just put it in RAM, persisting it to disk just for backup.
There’s a use for faster processors. Compression isn’t free. What you save on disk space and I/O you pay for at the CPU level. Those 5X+ compression levels do depend on faster processors, at least for the row store vendors.

Categories: Data warehousing, Database compression, Memory-centric data management, QlikTech and QlikView, SAP AG

6 Comments

March 24, 2007

Mike Stonebraker on database compression — comments

In my opinion, the key part of Mike Stonebraker’s fascinating note on data compression was (emphasis mine):

The standard wisdom in most row stores is to use block compression. Hence, a storage block is compressed using a single technique (say Lempel-Ziv or dictionary). The technique chosen then compresses all the attributes in all the columns which occur on the block. In contrast, Vertica compresses a storage block that only contains one attribute. Hence, it can use a different compression scheme for each attribute. Obviously a compression scheme that is type-specific will beat an implementation that is “one size fits all”.

It is possible for a row store to use a type-specific compression scheme. However, if there are 50 attributes in a record, then it must remember the state for 50 type-specific implementations, and complexity increases significantly.

In addition, all row stores we are familiar with decompress each storage block on access, so that the query executor processes uncompressed tuples. In contrast, the Vertica executor processes compressed tuples. This results in better L2 cache locality, less main memory copying and generally much better performance.

Of course, any row store implementation can rewrite their executor to run on compressed data. However, this is a rewrite – and a lot of work.

Categories: Columnar database management, Data warehousing, Database compression, Sybase, Vertica Systems

8 Comments

March 24, 2007

Mike Stonebraker explains column-store data compression

Row Store Compression versus Column Store Compression

I Introduction

There are three aspects of space requirements, which we discuss in this short note, namely:

structural space requirements

index space requirements

attribute space requirements.

Categories: Data warehousing, Database compression, Michael Stonebraker, Theory and architecture, Vertica Systems

7 Comments

March 21, 2007

Compression in columnar data stores

We have lively discussions going on columnar data stores vs. vertically partitioned row stores. Part is visible in the comment thread to a recent post. Other parts come in private comments from Stuart Frost of DATAllegro and Mike Stonebraker of Vertica et al.

To me, the most interesting part of what the Vertica guys are saying is twofold. One is that data compression just works better in column stores than row stores, perhaps by a factor of 3, because “the next thing in storage is the same data type, rather than a different one.” Frankly, although Mike has said this a couple of times, I haven’t understood yet why row stores can’t be smart enough to compress just as well. Yes, it’s a little harder than it would be in a columnar system; but I don’t see why the challenge would be insuperable.

The second part is even cooler, namely the claim that column stores allow the processors to operate directly on compressed data. But once again, I don’t see why row stores can’t do that too. For example, when you join via bitmapped indices, exactly what you’re doing is operating on highly-compressed data.

Categories: Columnar database management, Data warehouse appliances, Data warehousing, Database compression, DATAllegro, Vertica Systems

2 Comments

March 19, 2007

DATAllegro vs. Vertica and other columnar systems

Stuart Frost of DATAllegro offered an interesting counter today to columnar DBMS architectures — vertical partitioning. In particular, he told me of a 120 terabyte (growing soon to 250 terabytes) call data record database, in which a few key columns were separated out. Read more

Categories: Columnar database management, Data warehouse appliances, Data warehousing, DATAllegro, Kognitio, Vertica Systems

13 Comments

March 17, 2007

The boom in Salesforce.com integration

SaaS integration is in the air.

I recently talked with Pervasive Software about their data integration line. A large part of Pervasive’s new business is Salesforce.com integration, including at some big-name software vendors as customer/partner switch-hitters.
I just rechecked my notes from my January talk with Cast Iron Systems. A large part of Cast Iron’s new business is also integration with Salesforce.com, Netsuite, and other SaaS vendors.
Informatica keeps putting out press releases about Salesforce.com integration, most recently by offering replication in SaaS form itself.

But of course this makes sense. Without good data integration, SaaS applications would be pretty useless, at least at large and medium-sized enterprises.

Categories: Cast Iron Systems, EAI, EII, ETL, ELT, ETLT, Informatica, Pervasive Software, Software as a Service (SaaS)

Netezza under fire

I talk to a lot of data warehouse software and/or appliance start-ups. Naturally, they’re all gunning for Netezza, and regale me with stories about competitive replacements, competitive wins, benchmark wins, and the like. And there have been a couple of personnel departures too, notably development chief Bill Blake. Netezza insists this is because he got a CEO offer he couldn’t refuse, he’s still friendly with the company, development plans are entirely on track, and news of some sort is coming out in a few weeks. Also, Greenplum brags that its Asia/Pacific manager was snagged from Netezza.

On the other hand, Netezza claims lots of sales momentum, and that’s certainly consistent with what I hear from its competitors. Read more

Categories: Business Objects, Data warehouse appliances, Data warehousing, Greenplum, Netezza

Word of the day: “Compression”

IBM sent over a bunch of success stories recently, with DB2’s new aggressive compression prominently mentioned. Mike Stonebraker made a big point of Vertica’s compression when last we talked; other column-oriented data warehouse/mart software vendors (e.g. ~~Kognitio,~~ SAP, Sybase) get strong compression benefits as well. Other data warehouse/mart specialists are doing a lot with compression too, although some of that is governed by please-don’t-say-anything-good-about-us NDA agreements.

Compression is important for at least three reasons:

It saves disk space, which is a major cost issue in data warehousing.
It saves I/O, which is the major performance issue in data warehousing.
In well-designed systems, it can actually make on-chip execution faster, because the gains in memory speed and movement can exceed the cost of actually packing/unpacking the data. (Or so I’m told; I haven’t aggressively investigated that claim.)

When evaluating data warehouse/mart software, take a look at the vendor’s compression story. It’s important stuff.

EDIT: DATAllegro claims in a note to me that they get 3-4x storage savings via compression. They also make the observation that fewer disks ==> fewer disk failures, and spin that — as it were 🙂 — into a claim of greater reliability.

Categories: Data warehouse appliances, Data warehousing, Database compression, DATAllegro, IBM and DB2, SAP AG, Vertica Systems

3 Comments

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

White paper — Index-Light MPP Data Warehousing

Oracle, Tangosol, objects, caching, and disruption

Will database compression change the hardware game?

Mike Stonebraker on database compression — comments

Mike Stonebraker explains column-store data compression

Compression in columnar data stores

DATAllegro vs. Vertica and other columnar systems

The boom in Salesforce.com integration

Netezza under fire

Word of the day: “Compression”

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin