Theory and architecture
Analysis of design choices in databases and database management systems. Related subjects include:
- Database diversity
- Explicit support for specific data types
- (in Text Technologies) Text search
Oracle Database Machine performance and compression
Greg Rahn was kind enough to recount in his blog what Oracle has disclosed about the first Exadata testers. I don’t track hardware model details, so I don’t know how the testers’ respective current hardware environments compare to that of the Oracle Database Machine.
Each of the customers cited below received “half” an Oracle Database Machine. As I previously noted, an Oracle Database Machine holds either 14.0 or 46.2 terabytes of uncompressed data. This suggests the 220 TB customer listed below — LGR Telecommunications — got compression of a little under 10:1 for a CDR (Call Detail Record) database. By comparison, Vertica claims 8:1 compression on CDRs.
Greg also writes of POS (Point Of Sale) data being used for the demo. If you do the arithmetic on the throughput figures (13.5 vs. a little over 3), compression was a little under 4.5:1. I don’t know what other vendors claim for POS compression.
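Checking that arithmetic is easy enough. Here is a minimal sketch, assuming the 220 TB customer had the larger configuration (46.2 TB per full machine, so 23.1 TB per half) and treating “a little over 3” as roughly 3.1; both are my assumptions, not figures Oracle disclosed:

```python
# Back-of-the-envelope check of the compression figures quoted above.
# Assumed: half a Database Machine in its larger configuration holds
# 46.2 / 2 = 23.1 TB of uncompressed data, and "a little over 3" is ~3.1.

half_machine_tb = 46.2 / 2        # 23.1 TB of uncompressed capacity
cdr_database_tb = 220             # LGR Telecommunications' CDR database

cdr_ratio = cdr_database_tb / half_machine_tb
print(f"CDR compression: roughly {cdr_ratio:.1f}:1")   # ~9.5:1, a little under 10:1

pos_before_tb = 13.5              # POS demo figures
pos_after_tb = 3.1                # stand-in for "a little over 3"
print(f"POS compression: roughly {pos_before_tb / pos_after_tb:.1f}:1")  # ~4.4:1
```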
Here are the details Greg posted about the four most openly discussed Oracle Database Machine tests: Read more
Categories: Data warehouse appliances, Data warehousing, Database compression, Exadata, Oracle, Telecommunications | 9 Comments |
Vertica finally spells out its compression claims
Omer Trajman of Vertica put up a must-read blog post spelling out detailed compression numbers, based on actual field experience (which I’d guess is from a combination of production systems and POCs):
- CDR – 8:1 (87%)
- Consumer Data – 30:1 (96%)
- Marketing Analytics – 20:1 (95%)
- Network logging – 60:1 (98%)
- Switch Level SNMP – 20:1 (95%)
- Trade and Quote Exchange – 5:1 (80%)
- Trade Execution Auditing Trails – 10:1 (90%)
- Weblog and Click-stream – 10:1 (90%)
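The percentages in parentheses are simply the compression ratios restated as space savings. Here is a minimal sketch of the conversion (my arithmetic, not anything from Omer’s post):

```python
# A compression ratio of r:1 shrinks data to 1/r of its original size,
# i.e. space savings of 1 - 1/r. The post rounds these down to whole
# percentages (e.g. 8:1 -> 87.5% -> 87%).
def space_savings(ratio: float) -> float:
    return (1 - 1 / ratio) * 100

for name, ratio in [("CDR", 8), ("Consumer Data", 30),
                    ("Network logging", 60), ("Trade and Quote Exchange", 5)]:
    print(f"{name}: {ratio}:1 -> {space_savings(ratio):.1f}% smaller")
```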
It’s clear what Omer means by most of those categories from reading the post, but I’m a little fuzzy on what “Consumer Data” or “Marketing Analytics” comprise in his taxonomy. Anyhow, Omer’s post is a huge improvement over my recent one — based on a conversation with Omer 🙂 — which featured some far less accurate or complete compression numbers.
Omer goes on to claim that trickle-feed data is harder for rival systems to compress than it is for Vertica, and generally to claim that Vertica’s compression is typically severalfold better than that of competitive row-based systems.
Categories: Database compression, Vertica Systems, Web analytics | 5 Comments |
Database compression is heavily affected by the kind of data
I’ve written often of how different kinds or brands of data warehouse DBMS get very different compression figures. But I haven’t focused enough on how much compression figures can vary among different kinds of data. This was really brought home to me when Vertica told me that web analytics/clickstream data can often be compressed 60X in Vertica, while at the other extreme — some kind of floating point data, whose details I forget at the moment — they could only do 2.5X. Edit: Vertica has now posted much more accurate versions of those numbers.

Infobright’s 30X compression reference at TradeDoubler seems to be for a clickstream-type app. Greenplum’s customer getting 7.5X — high for a row-based system — is managing clickstream data and related stuff.

Bottom line:
When evaluating compression ratios — especially large ones — it is wise to inquire about the nature of the data.
Categories: Data warehousing, Database compression, Greenplum, Infobright, Vertica Systems, Web analytics | 4 Comments |
Infobright’s open source move has a lot of potential
Infobright announced today that it’s going full-bore into open source – specifically in the MySQL ecosystem — with the licensing approach, pricing, distribution strategy, and VC money from Sun that such a move naturally entails. I think this is a great idea, for a number of reasons: Read more
Categories: Columnar database management, Data warehousing, Infobright, MySQL, Open source | 4 Comments |
Dividing the data warehousing work among MPP nodes
I talk with lots of vendors of MPP data warehouse DBMS. I’ve now heard enough different approaches to MPP architecture that I think it might be interesting to contrast some of the alternatives.
Categories: Aster Data, Calpont, Exasol, Greenplum, Parallelization, Theory and architecture, Vertica Systems | 22 Comments |
Head to head blog debate between EMC, NetApp, and HP
Chuck Hollis of EMC started a fierce debate with a blog post on how to measure effective storage capacity. Competitors from NetApp and HP responded in often sarcastic detail in the comment thread, Hollis shot back, and the volleying continued for quite a while.
I’m not a storage maven, and I don’t understand all the details of that stuff. If you’re like me in that regard, you may find the post worth skimming just to see what some of the choices, trade-offs, and complications are in designing and measuring storage systems. Stephen Foskett’s related post is also worth a look in that regard.
My recent foray into measuring disk storage pales by comparison.
Categories: Storage, Theory and architecture | 3 Comments |
Estimating user data vs. spinning disk
There’s a lot of confusion about how to measure data warehouse database size. Major complicating factors include:
- Indexes and temporary working space. That’s what I emphasized a couple of years ago in my post about Expansion Ratios.
- Compression. I write about database compression a lot.
- Disk redundancy. I usually gloss over that one, but I’ll try to make amends in this post.
- Replication other than that which is primarily designed for redundancy. I usually gloss over that one too, and I think it’s safe to continue doing so. That’s because data warehouse replication – at least in most of the system architectures I know of – generally divides into three categories:
  - a lot like redundancy
  - a lot like an index
  - only a minor issue (e.g., when small dimension tables are replicated across each node of an MPP cluster)
Greenplum’s CTO Luke Lonergan recently walked me through the general disk usage arithmetic for Greenplum’s most common configuration (Sun Thors*, configured to RAID 10). I found it pretty interesting, and a good guide to factors that also affect systems from other vendors.
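Without reproducing Greenplum’s exact numbers, the general shape of that arithmetic looks something like the sketch below. Every specific factor in it (raw disk, temp/work reservation, expansion ratio, compression ratio) is an illustrative assumption of mine, not a Greenplum configuration figure:

```python
# Illustrative spinning-disk-to-user-data arithmetic. All specific factors
# below are assumptions for the sake of example, not Greenplum's figures.

raw_disk_tb = 48.0          # e.g. one storage node's worth of raw disk
raid10_factor = 0.5         # RAID 10 mirrors everything, halving usable space
temp_work_fraction = 0.3    # assumed reservation for temp/working space
expansion_ratio = 1.5       # assumed indexes + other overhead vs. base data
compression_ratio = 3.0     # assumed compression on the base data

# Space actually available to the database after mirroring and reservations:
available_tb = raw_disk_tb * raid10_factor * (1 - temp_work_fraction)

# Only 1/expansion_ratio of that holds compressed base data; multiplying by
# the compression ratio recovers the uncompressed "user data" figure:
user_data_tb = available_tb / expansion_ratio * compression_ratio
print(f"{raw_disk_tb:.0f} TB of spinning disk -> ~{user_data_tb:.1f} TB of user data")
```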
Greenplum’s single biggest customer
Greenplum offered a bit of clarification regarding the usage figures I posted last night. Everything on the list is in production, except that:
- One Greenplum customer is at 400 terabytes now, and upgrading to >1 petabyte “as we speak.”
- Greenplum’s other soon-to-be >1 petabyte customer isn’t in production yet. (Greenplum previously told me that customer was in the process of loading data.)
Categories: Data warehousing, Fox and MySpace, Greenplum, Petabyte-scale data management, Specific users | 3 Comments |
Greenplum is in the big leagues
After a March, 2007 call, I didn’t talk with Greenplum again until earlier this month. That changed fast: I flew out to see Greenplum last week and spent over a day with president/co-founder Scott Yara, CTO/co-founder Luke Lonergan, marketing VP Paul Salazar, and product management/marketing director Ben Werther. Highlights – besides some really great sushi at Sakae in Burlingame – start with an eye-opening set of customer proof points, such as: Read more
Categories: Analytic technologies, Data warehouse appliances, Data warehousing, Greenplum, Petabyte-scale data management, PostgreSQL | 19 Comments |
The Explosion in DBMS Choice
If there’s one central theme to DBMS2, it’s that modern DBMS alternatives should in many cases be used instead of the traditional market leaders. So it was only a matter of time before somebody sponsored a white paper on that subject. The paper, sponsored by EnterpriseDB, is now posted along with my other recent white papers. Its conclusion — summarizing what kinds of database management system you should use in which circumstances — is reproduced below.
Many new applications are built on existing databases, adding new features to already-operating systems. But others are built in connection with truly new databases. And in the latter cases, it’s rare that a market-leading product is the best choice. Mid-range DBMS (for OLTP) or specialty data warehousing systems (for analytics) are usually just as capable, and much more cost-effective. Exceptions arise mainly in three kinds of cases:
- Small enterprises with very limited staff.
- Large enterprises that have negotiated heavily-discounted deals for a market-leading product.
- Super-high-end OLTP apps that need absolute top throughput (or security certifications, etc.).
Otherwise, the less costly products are typically the wiser choice. Read more
Categories: Database diversity | 7 Comments |