Theory and architecture
Analysis of design choices in databases and database management systems. Related subjects include:
- Any subcategory
- Database diversity
- Explicit support for specific data types
- (in Text Technologies) Text search
Are row-oriented RDBMS obsolete?
If Mike Stonebraker is to be believed, the era of columnar data stores is upon us.
Whether or not you buy completely into Mike’s claims, there certainly are cool ideas in his latest columnar offering, from startup Vertica Systems. The Vertica corporate site offers little detail, but Mike tells me that the product’s architecture closely resembles that of C-Store, which is described in this November, 2005 paper.
The core ideas behind Vertica’s product are as follows. Read more
Mike Stonebraker Blasts “One Size Fits All”
When it comes to DBMS inventors, Mike Stonebraker is the next closest thing to Codd. And he’s become a huge non-believer in the idea that one DBMS architecture meets all needs.
Frankly, there isn’t much in that paper that hasn’t already been said in this blog, except for the part that is specifically relevant to one of his startups, StreamBase. Still, it’s nice to have the high-powered agreement.
More recently, the argument in that paper has been extended with a benchmark-filled follow-up based on another Stonebraker startup, Vertica.
Categories: Columnar database management, Database compression, StreamBase, Theory and architecture, Vertica Systems | Leave a Comment |
Federation in the MySQL empire
Marten Micklos, CEO of MySQL, gave a recent speech speculating about a big federated “database in the sky,” providing all sorts of Web 2.0 benefits. Apparently, the idea isn’t at all fleshed out yet. Even so, I have a nagging suspicion he’s pointing in somewhat the wrong direction.
That’s because I think federating relational databases is a generically bad idea. You can federate sets of services, and you can generate services from relational databases – and that’s where DBMS2 (DataBase Management System Services) got its name. This is a superior approach to direct database federation, for two main reasons. (By “direct federation,” I mean some sort of structure in which there’s a giant virtual database whose schema more or less directly incorporates much of the schema of each individual database.)
Categories: MySQL, Open source, Theory and architecture | 8 Comments |
Relational data warehouse Expansion (or Explosion) Ratios
One of the least understood aspects of data warehouse technology is what may be called the
Expansion Ratio = (Total disk space used, except for mirroring) / (Size of the base database).
This is similar to the explosion ratio discussed in the OLAP Report’s justly famous discussion of database explosion, but I’m going with my own terminology because I don’t want to be tied to their precise terminology, nor to their technical focus. Expansion Ratios are hotly debated, with some figures being:
- Teradata claims an Expansion Ratio of 8-9X for Oracle, 6X for DB2 (open system version), and 2.5X for Teradata. The underlying source is data warehouses they’ve replaced, so there may be a bias toward out-of-control warehouses on the part of their competitors.
- An anonymous appliance vendor exec said to me off the top of his head that Oracle has 6-8X Expansion Ratios.
- Oracle’s TPC-H submissions in the largest size range (10 terabytes) have 9.7-10.5X Expansion Ratios, if I’m reading the TPCs correctly.
- Oracle cites a survey of 8 customers with 10-60 Tb database size in which the Expansion Ratio works out to 1.6X. (More on this anomalous result below.)
I don’t have actual figures from Netezza and DATallegro, but I imagine they’d come out lower than 2X, possibly well below.
Categories: Data warehouse appliances, Data warehousing, Database compression, DATAllegro, IBM and DB2, Netezza, Oracle, Teradata | 9 Comments |
More on data warehouse architecture choices
The very name of this blog comes from the kind of “horses for courses” data store strategy implied by my recent post on different kinds of data warehouse uses. A number of other commentators have recently made similar points, although they may not agree with every detail. For example, William McKnight pretty much makes the pure DBMS2 argument, pointing out that a partially virtual warehouse is often superior to a fully centralized physical one. And Andy Hayler of Kalido says pretty much the same thing, although he strongly calls out his difference in emphasis from William’s view.
A tip of the hat to Mark Rittman for pointing me to those two and others.
Categories: Data warehouse appliances, EAI, EII, ETL, ELT, ETLT, Theory and architecture | Leave a Comment |
SAP’s BI Accelerator
I wrote about SAP’s BI Accelerator quite a bit in my white paper on memory-centric data management, but otherwise I seem not to have posted much about it here. In essence, it’s a product that’s all RAM-based, and generally geared for multi-hundred-gigabyte data marts. The basic design is a compression-heavy column-based architecture, evolved from SAP’s text-indexing technology TREX. Like data warehouse appliances, it eschews indexing, relying instead on blazingly fast table scans.
I asked Lothar Schubert of SAP how BIA was doing in the market in its early going. This was his response:
Categories: Analytic technologies, Business intelligence, Data warehouse appliances, Data warehousing, Database compression, Memory-centric data management, SAP AG | 8 Comments |
I say “sequential”, you say …
I talked with Teradata today, and they called me on my use of the term “sequential.” Basically, if there’s any head movement for disk seeks, some computer science researchers wouldn’t call it “sequential.” I didn’t know that; I was just familiar with the less precise usage of the term in some vendors’ marketing and discussions.* OK, I’ll make up a new, more precise term instead. How about “coarse-grained”?
*And so we have another instance of Monash’s First Law of Commercial Semantics: Bad jargon drives out good.
Categories: Teradata, Theory and architecture | 8 Comments |
No locks, no logs — no problem?
There’s another cool-sounding part to the Netezza story, which straddles their chips and their software: The FPGA takes over the work of assuring database consistency. If the system attempts to read and write a record at the same time, the FPGA keeps thing straight. This eliminates the need for locks — at least if you don’t care about transactional integrity — and some of the reason for logs. (I guess that in lieu of any kind of rollback/rollforward they just rely on failover to mirrored disks.)
This isn’t exactly the way one would want to do OLTP, and in general my head is shaking as I write this — but it sure seems to suffice for some rather demanding data warehouse users.
Categories: Data warehouse appliances, Netezza, Theory and architecture | 2 Comments |
Is data warehousing now all about sequential access?
A lot of evidence is pointing to a major paradigm shift in data warehouse RDBMS, along the lines of:
Old way: Assume I/O is random; lower total execution time by improving selectivity and thus lowering the amount of I/O.
New way: Drive the amount of random I/O to near zero, and do as much sequential I/O as necessary to achieve this goal.
Examples include:
- Data warehouse appliances (see especially this discussion of DATallegro’s architecture)
- Columnar systems (see Nathan Myer’s first comment in this discussion of the much-hyped Required Technologies prototype)
- Memory-centric systems, notably SAP’s BI Accelerator
Categories: Data warehouse appliances, DATAllegro, Memory-centric data management, SAP AG, Theory and architecture, TransRelational | 4 Comments |
Business Objects on EIM, ETL, etc.
I chatted with some Business Objects ETL/EIM (Enterprise Information Management) folks today, in a call that was a direct response to what I heard from and posted about Informatica. The core of the Business Objects story can be summarized (albeit brutally!) like this: