Data warehousing
Analysis of issues in data warehousing, with extensive coverage of database management systems and data warehouse appliances that are optimized to query large volumes of data. Related subjects include:
Deal prospects for data warehouse DBMS vendors
The fourth Monash Letter is now posted for Monash Advantage members (just 3 pages this time). It’s about forthcoming M&A in data warehouse DBMS, something that seems likely just because of the large number of current players. Some of the observations are:
- Oracle needs to buy somebody, because of its rather dire product problems at the data warehouse high end. And an acquisition would be very much in keeping with its recent behavior.
- Teradata could be acquired sooner than people think. While there are tax considerations preventing an outright sale, these should be obviated if all of the current NCR is taken private. What’s more, NCR minus Teradata is exactly the kind of healthy, slow-growth, niche company that private equity loves.
- DATAllegro is a natural merger partner for somebody. Their technical differentiation is almost DBMS-independent, so it could be easy to roll them into a larger overall product strategy. And they have enough market traction to have proved some non-trivial value.
- Kognitio seems desperate these days, with several odd or even underhanded marketing tactics. But they do have MPP bitmap software, something Sybase sorely lacks. So there’s an obvious potential combination between those two.
Categories: Data warehouse appliances, Data warehousing, DATAllegro, Kognitio, Oracle, Sybase, Teradata | 3 Comments |
What’s going on at Calpont?
It’s been quite a while since anything substantive-sounding emerged from Calpont. They now have an odd one-page web site, with essentially no substance other than a tagline suggesting they’re shipping product (not bloody likely) and the names, titles, and email addresses of the president and seven vice-presidents. Only two of those officers were listed on the May, 2006 version of the site. Does anybody have an idea what may or may not be going on?
(Quick refresher: Calpont was developing a SQL processing chip, and designing an appliance around it. Whether this appliance would have disks or be all in-memory changed from time to time, a flexibility that was made possible by the apparent fact that none of these boxes actually shipped.)
Categories: Calpont, Data warehouse appliances, Data warehousing | 2 Comments |
HP Neoview — smoke or fire?
The consistently outstanding blog Serious About Consulting has a detailed article about HP Neoview. (Edit: As of September, 2008, that’s a dead link, and the blog has been replaced by spam junk.) I must admit, however, to some skepticism about the Neoview project. Part of this is just the fact that a data warehouse appliance outfit that’s never gotten around to briefing me — ever — clearly doesn’t have its marketing act together. 😉 Also, I’ve never heard much about them competitively from anybody except Greenplum.
That said — as Jerry Held reminded me in a recent Vertica-related call — there’s no cosmic architectural reason why they couldn’t make it work. And if anybody’s going to see HP first competitively, it’s going to be Sun/Greenplum and maybe Teradata, and I’ll confess to not having chatted with Teradata for approximately six months.
Where the next query performance crunch may come from
For close to a decade, I’ve been pointing out that true enterprise business intelligence will require a lot of custom KPIs. Basically, each decision-maker needs her own private dashboard and set of alerts, with a bunch of custom metrics that she can tweak to support the way her own personal brain best operates.
To date the BI vendors still haven’t gotten the message … but suppose they did. Depending on the frequency of refresh, the result could be one hell of an analytic processing load.
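To put rough numbers on that load, here is a back-of-the-envelope sketch. Every figure in it is a hypothetical assumption of mine, purely for illustration; no vendor or customer supplied them.

```python
# Back-of-the-envelope sketch of the analytic load from per-user custom KPIs.
# Every input below is a hypothetical assumption, purely for illustration.

decision_makers = 2_000          # assumed people with private dashboards
kpis_per_person = 25             # assumed custom metrics/alerts per dashboard
refresh_minutes = 5              # assumed refresh interval during business hours
business_hours_per_day = 10

refreshes_per_day = (business_hours_per_day * 60) // refresh_minutes
queries_per_day = decision_makers * kpis_per_person * refreshes_per_day
queries_per_second = queries_per_day / (business_hours_per_day * 3600)

print(f"Analytic queries per day: {queries_per_day:,}")                  # 6,000,000
print(f"Average analytic queries per second: {queries_per_second:.0f}")  # ~167
```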
Technorati Tags: Business activity monitoring, BAM, key performance indicators, KPIs, dashboards, business intelligence, data warehouses
Categories: Data warehousing | Leave a Comment |
White paper — Index-Light MPP Data Warehousing
Many of my thoughts on data warehouse DBMS and appliances have been collected in a white paper, sponsored by DATAllegro. As in a couple of other white papers — collected here — I coined a phrase to describe the core concept: Index-light. MPP row-oriented data warehouse DBMSs certainly have indices, which are occasionally even used. But the approaches to database design that are supported or make sense to use are simply different for DATAllegro, Netezza (the most extreme example of all) or Teradata than for Oracle or Microsoft. And the differences are all in the direction of less indexing.
Here’s an excerpt from the paper. Please pardon the formatting; it reads better in the actual .PDF. Read more
Categories: Data warehouse appliances, Data warehousing, DATAllegro, EMC, Theory and architecture | 4 Comments |
Will database compression change the hardware game?
I’ve recently made a lot of posts about database compression. 3X or more compression is rapidly becoming standard; 5X+ is coming soon as processor power increases; 10X or more is not unrealistic. True, this applies mainly to data warehouses, but that’s where the big database growth is happening. And new kinds of data — geospatial, telemetry, document, video, whatever — are highly compressible as well.
This trend suggests a few interesting possibilities for hardware, semiconductors, and storage.
- The growth in demand for storage might actually slow. That said, I frankly think it’s more likely that Parkinson’s Law of Data will continue to hold: Data expands to fill the space available. E.g., video and other media have near-infinite potential to consume storage; it’s just a question of resolution and fidelity.
- Solid-state (aka semiconductor or flash) persistent storage might become practical sooner than we think. If you really can fit a terabyte of data onto 100 gigs of flash, that’s a pretty affordable alternative. And by the way — if that happens, a lot of what I’ve been saying about random vs. sequential reads might be irrelevant.
- Similarly, memory-centric data management is more affordable when compression is aggressive. That’s a key point of schemes such as SAP’s or QlikTech’s. Who needs flash? Just put it in RAM, persisting it to disk only for backup.
- There’s a use for faster processors. Compression isn’t free. What you save on disk space and I/O you pay for at the CPU level. Those 5X+ compression levels do depend on faster processors, at least for the row store vendors.
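As a toy illustration of that last bullet (the general CPU/storage tradeoff, not any particular vendor’s implementation), the following sketch compresses the same block of synthetic data at increasing zlib effort levels and reports compression ratio versus CPU time.

```python
# Toy illustration of the compression CPU/storage tradeoff, using plain zlib.
# Synthetic data only; real DBMS compression schemes differ considerably.
import random
import time
import zlib

random.seed(42)
# Build a block of semi-repetitive, warehouse-ish delimited rows.
rows = [
    f"{i},{random.choice(['NY', 'CA', 'TX', 'FL'])},"
    f"{random.randint(1, 500)},2007-03-{random.randint(1, 28):02d}"
    for i in range(200_000)
]
block = "\n".join(rows).encode()

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(block, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    ratio = len(block) / len(compressed)
    print(f"zlib level {level}: {ratio:4.1f}x compression, {elapsed_ms:6.1f} ms of CPU")
```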
Categories: Data warehousing, Database compression, Memory-centric data management, QlikTech and QlikView, SAP AG | 6 Comments |
Mike Stonebraker on database compression — comments
In my opinion, the key part of Mike Stonebraker’s fascinating note on data compression was (emphasis mine):
The standard wisdom in most row stores is to use block compression. Hence, a storage block is compressed using a single technique (say Lempel-Ziv or dictionary). The technique chosen then compresses all the attributes in all the columns which occur on the block. In contrast, Vertica compresses a storage block that only contains one attribute. Hence, it can use a different compression scheme for each attribute. Obviously a compression scheme that is type-specific will beat an implementation that is “one size fits all”.
It is possible for a row store to use a type-specific compression scheme. However, if there are 50 attributes in a record, then it must remember the state for 50 type-specific implementations, and complexity increases significantly.
In addition, all row stores we are familiar with decompress each storage block on access, so that the query executor processes uncompressed tuples. In contrast, the Vertica executor processes compressed tuples. This results in better L2 cache locality, less main memory copying and generally much better performance.
Of course, any row store implementation can rewrite their executor to run on compressed data. However, this is a rewrite – and a lot of work.
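To make the contrast concrete, here is a toy sketch of the idea (mine, not Vertica’s implementation or anything beyond what the excerpt above describes): computing an aggregate directly on a run-length-encoded column versus decompressing the column into individual values first.

```python
# Toy contrast: aggregating a run-length-encoded (RLE) column directly
# versus decompressing it into individual values first.
# Purely illustrative; real column-store executors are far more sophisticated.

# An RLE-compressed column of 1,000,000 values, stored as (value, run_length) pairs.
rle_column = [(1, 400_000), (2, 250_000), (3, 350_000)]

def sum_decompress_first(rle):
    """Decompress-then-process: expand every run into individual values, then add them up."""
    values = [v for v, run in rle for _ in range(run)]  # materializes 1,000,000 values
    return sum(values)

def sum_on_compressed(rle):
    """Process the compressed form: never expand the runs at all."""
    return sum(v * run for v, run in rle)

assert sum_decompress_first(rle_column) == sum_on_compressed(rle_column)
print(sum_on_compressed(rle_column))  # 1950000, computed from just three pairs
```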
Categories: Columnar database management, Data warehousing, Database compression, Sybase, Vertica Systems | 8 Comments |
Mike Stonebraker explains column-store data compression
The following is by Mike Stonebraker, CTO of Vertica Systems, copyright 2007, as part of our ongoing discussion of data compression. My comments are in a separate post.
Row Store Compression versus Column Store Compression
I. Introduction
There are three aspects of space requirements, which we discuss in this short note, namely:
- structural space requirements
- index space requirements
- attribute space requirements
Categories: Data warehousing, Database compression, Michael Stonebraker, Theory and architecture, Vertica Systems | 7 Comments |
Compression in columnar data stores
We have lively discussions going on about columnar data stores vs. vertically partitioned row stores. Part is visible in the comment thread to a recent post. Other parts come in private comments from Stuart Frost of DATAllegro and Mike Stonebraker of Vertica et al.
To me, the most interesting part of what the Vertica guys are saying is twofold. One is that data compression just works better in column stores than row stores, perhaps by a factor of 3, because “the next thing in storage is the same data type, rather than a different one.” Frankly, although Mike has said this a couple of times, I haven’t understood yet why row stores can’t be smart enough to compress just as well. Yes, it’s a little harder than it would be in a columnar system; but I don’t see why the challenge would be insuperable.
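For what it’s worth, the “same data type next to each other” effect can be checked even with a completely generic compressor. Here is a minimal sketch, using synthetic data and plain zlib, that compares how the same rows compress when laid out row-wise versus column-wise; nothing about it is specific to Vertica or any other vendor.

```python
# Minimal sketch: compress the same synthetic data laid out row-wise
# (heterogeneous fields interleaved) versus column-wise (each column contiguous).
import random
import zlib

random.seed(7)
n = 100_000
customers = [random.choice(["acme", "globex", "initech", "umbrella"]) for _ in range(n)]
regions = [random.choice(["NE", "SE", "MW", "W"]) for _ in range(n)]
amounts = [f"{random.randint(1, 999)}.{random.randint(0, 99):02d}" for _ in range(n)]

# Row-wise layout: customer|region|amount per line.
row_layout = "\n".join(f"{c}|{r}|{a}" for c, r, a in zip(customers, regions, amounts)).encode()
# Column-wise layout: all customers, then all regions, then all amounts.
col_layout = "\n".join(customers + regions + amounts).encode()

for name, data in (("row-wise layout", row_layout), ("column-wise layout", col_layout)):
    ratio = len(data) / len(zlib.compress(data))
    print(f"{name}: {ratio:.1f}x with plain zlib")
```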
The second part is even cooler, namely the claim that column stores allow the processors to operate directly on compressed data. But once again, I don’t see why row stores can’t do that too. For example, when you join via bitmapped indices, exactly what you’re doing is operating on highly-compressed data.
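To illustrate that last point, here is a toy sketch of the general idea (mine, not how any particular DBMS actually implements bitmapped indices): ANDing two bitmaps answers a conjunctive predicate while touching only the bit-per-row index representation, never the underlying rows.

```python
# Toy sketch: answer "state = 'CA' AND product = 'widget'" by ANDing two
# bitmap indexes, i.e. by operating on the compact index representation
# rather than on the underlying rows. Illustrative only.

rows = [
    ("CA", "widget"), ("NY", "widget"), ("CA", "gadget"),
    ("CA", "widget"), ("TX", "widget"), ("NY", "gadget"),
]

def build_bitmap(rows, col, value):
    """One bit per row: bit i is set iff rows[i][col] == value."""
    bits = 0
    for i, row in enumerate(rows):
        if row[col] == value:
            bits |= 1 << i
    return bits

ca_bitmap = build_bitmap(rows, 0, "CA")          # 0b001101
widget_bitmap = build_bitmap(rows, 1, "widget")  # 0b011011

matches = ca_bitmap & widget_bitmap  # a single AND over the two bitmaps
matching_row_ids = [i for i in range(len(rows)) if (matches >> i) & 1]
print(matching_row_ids)  # [0, 3]
```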
Categories: Columnar database management, Data warehouse appliances, Data warehousing, Database compression, DATAllegro, Vertica Systems | 2 Comments |
DATAllegro vs. Vertica and other columnar systems
Stuart Frost of DATAllegro offered an interesting counter today to columnar DBMS architectures — vertical partitioning. In particular, he told me of a 120 terabyte (growing soon to 250 terabytes) call data record database, in which a few key columns were separated out. Read more