February 15, 2008

Database management system choices – relational data warehouse

This is the third of a five-part series on database management system choices. For the first post in the series, please click here.

High-end OLTP relational database management system vendors try to offer one-stop shopping for almost all data management needs. But as I noted in my prior post, their product category is facing two major competitive threats. One comes from specialty data warehouse database management system products. I’ve covered those extensively in this blog, with key takeaways including:

Let me expand on that last point. Different features may or may not be important to you, depending on whether your precise application needs include:

One more complication: There also are a number of products that accelerate data analysis outside the DBMS, usually in a memory-centric way. Ones I’ve written about in the past include QlikView, SAP BI Accelerator, Applix, and the whole category of complex event/stream processing.

The complete series

Comments

19 Responses to “Database management system choices – relational data warehouse”

  1. Daniel Weinreb on February 16th, 2008 10:08 am

    Do you have much idea how much benefit Netezza gets from its hardware assist? Do they explain it technically anywhere that the general public can access?

    I wonder about this because over the years, I often hear about products with specialized hardware assists, and in the long run it turns out that they aren’t really valuable enough to justify their cost and the fact that you can’t use stock hardware. (Please omit comments about the irony of a Symbolics co-founder saying something like this; I am referring to more recent history than that, anyway.) For example, one of my friends co-founded a company to make network file servers that would use hardware to speed up the “inner loops” (performance critical parts). After a lot of careful design work, they finally concluded that there wasn’t actually any opportunity for specialized hardware that made engineering and business sense (and they had to lay off a lot of hardware guys, sadly, but the company eventually did well). So I just have a bit of a tendency to be skeptical about specialized hardware.

    On the other hand, where it does work, it’s often quite cool…

  2. Curt Monash on February 16th, 2008 3:23 pm

    Dan,

    So far, they make a compelling case.

    The really short version is that they offload a lot of work from the CPU to a cheaper and cooler FPGA. But beyond that, the FPGA handles records as they stream, rather than taking them into cache and processing them afterwards, so the FPGA does a lot less work itself than it’s saving the CPU.
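    By way of software analogy (a hedged sketch, not a description of how the Netezza FPGA actually works): a streaming filter touches each record once as it flows past and never materializes the full set, whereas a cache-then-process approach buffers everything before scanning it.

```python
def stream_filter(records, predicate):
    # Process each record as it "streams" past; nothing is buffered.
    for rec in records:
        if predicate(rec):
            yield rec

def cache_then_filter(records, predicate):
    # Pull everything into a buffer first, then scan the buffer.
    buffered = list(records)
    return [rec for rec in buffered if predicate(rec)]

# The streaming version can stop as soon as it has enough results;
# the caching version must materialize all records first.
rows = iter(range(1_000_000))
first_hits = []
for rec in stream_filter(rows, lambda r: r % 97 == 0):
    first_hits.append(rec)
    if len(first_hits) == 3:
        break
print(first_hits)  # [0, 97, 194]

all_hits = cache_then_filter(range(300), lambda r: r % 97 == 0)
print(all_hits)  # [0, 97, 194, 291]
```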

    CAM

  3. Greg Rahn on February 17th, 2008 1:56 am

    Curt,

    For what technical reasons do you feel it is “almost inconceivable to use Oracle for a 100+ terabyte data warehouse”?

    It has been almost 2.5 years, but in the 2005 WinterCorp TopTen Survey, Yahoo! had a 100TB Oracle data warehouse [1], and I’m sure it has grown since then. Also, as of Q2 2007, AT&T has two separate 224TB Oracle data warehouses [2]. So why do you not mention Oracle as part of your “Absolute scalability” bullet point given that there are several multi-hundred terabyte Oracle data warehouses?


    [1] http://www.wintercorp.com/VLDB/2005_TopTen_Survey/2005TopTenWinners.pdf
    [2] http://www.lgr.biz/index.php?option=com_content&view=article&id=67&Itemid=54

  4. Curt Monash on February 17th, 2008 3:37 am

    Every time I check out an Oracle warehouse in the tens of terabytes range or higher, it turns out to involve very unnatural acts.

    As for your examples, the second one refers to AT&T data warehouses supposedly running on Oracle — but the details are on a broken link. Your first link (correctly, IMO) lists only AT&T warehouses running on Daytona, not Oracle.

    Leaving AT&T out, your links suggest one Oracle data warehouse over 25 terabytes (I actually know that there are a few others). And it’s at Yahoo, which thinks nothing of — well, of unnatural acts to get database management systems to run in loosely coupled clusters via custom code.

    No doubt there are a few genuine cases of Oracle data warehouses with tens of terabytes of user data running reasonably smoothly. But I do mean a “few”, except perhaps in the low end of that range.

    And don’t be at all sure that databases mentioned a couple of years ago have grown *on their original platforms*. For example, Yahoo has bought data warehouse appliances.

    Finally, please note that it is hard to find a Winter survey later than 2005. And please further note that the world’s largest data warehouses are definitely not in the survey.

    CAM

  5. Greg Rahn on February 17th, 2008 4:18 am

    You didn’t really answer my first question. I’m interested in the technicals. What Oracle warehouses have you checked out and can you define “unnatural acts”?

    The second link works (but the case study link is broken), and the brief clearly states: “The CDRlive system at AT&T currently comprises two separate 224TB Oracle databases”, so where does *supposedly* come into play?

    I do recognize that the survey is from 2005 (and there isn’t a newer one) and not every customer or vendor participated in it, but that shouldn’t discount the facts (there was equal opportunity to participate). Just to note, there is another Oracle database at 222TB in that report, the one at Max-Planck-Institute for Meteorology. Also to clarify, the AT&T Daytona DB and the 224TB AT&T Oracle DB are different. The latter is from Cingular.

    Just because you say Yahoo bought a data warehouse appliance doesn’t mean that the database (from 2005) doesn’t run on Oracle today.

    Just to be clear, I’m not claiming any db is better or worse, I’d just like to see supporting evidence of some of your claims, but perhaps this info is secret and we’ll just have to take your word.

  6. Curt Monash on February 17th, 2008 12:07 pm

    Greg,

    The fact that something appears in a Winter Corp survey is only very weak evidence of its accuracy.

    Dick Winter et al. are good guys. But they publish that survey even though they know it’s largely false. The big limitation is that, for whatever reasons of methodology, they mainly find sites that vendors tell them about and get customer consent to mention. Hence no Teradata sites, for instance.

    And I frankly find the Yahoo claim not-so-credible, until somebody proves to me that Yahoo actually disclosed it, and it’s really a data warehouse as we understand the term.

    By the way — do you know whether those Winter Corp claims are about user data? Or total disk? Or what? I imagine that aspect of their methodology IS disclosed, but I haven’t checked.

    CAM

  7. Greg Rahn on February 17th, 2008 4:09 pm

    The Winter Corp number is user data; you can see how that is calculated from the provided scripts. I don’t really have any interest in debating the credibility of the Winter Corp survey. It is what it is.
    The survey details are documented here:
    http://www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTen_Survey_FAQ.htm
    http://www.wintercorp.com/VLDB/2005_TopTen_Survey/2005_TTSurvey_Validation_Preview.htm

    Even if you have your doubts about the Yahoo claim, it has been made public. Can you provide the public references for your Teradata, DATAllegro, and SAS claims of 100+ TB warehouses? I’d be very interested in knowing more about these.

  8. Curt Monash on February 17th, 2008 7:45 pm

    Telling me — unless under NDA — constitutes making a claim public.

    As for “unnatural acts” — can you write the SQL one would think you can write, or do you have to union over the result sets from a partitioned set of tables to get your results?
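    A minimal sketch of the pattern Curt alludes to, using SQLite and hypothetical monthly call-detail tables: the query one would like to write targets a single logical table, but with manually partitioned tables the application has to stitch the pieces back together with an explicit UNION ALL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical manual partitioning: one physical table per month,
# instead of one logical "calls" table the DBMS manages itself.
for month in ("2008_01", "2008_02"):
    cur.execute(f"CREATE TABLE calls_{month} (caller TEXT, minutes INT)")

cur.execute("INSERT INTO calls_2008_01 VALUES ('alice', 10), ('bob', 5)")
cur.execute("INSERT INTO calls_2008_02 VALUES ('alice', 7)")

# The "unnatural act": the application, not the optimizer, unions
# over the partitioned set of tables to get one result set.
total = cur.execute("""
    SELECT caller, SUM(minutes) FROM (
        SELECT * FROM calls_2008_01
        UNION ALL
        SELECT * FROM calls_2008_02
    ) GROUP BY caller ORDER BY caller
""").fetchall()
print(total)  # [('alice', 17), ('bob', 5)]
```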

  9. Curt Monash on February 17th, 2008 7:51 pm

    And looking at your latest Winter Corp links — what I and most other people call “user data” seems to be what they’re calling “normalized data”. In Oracle data warehouses, that’s commonly 1/5 – 1/10 as much as what Winter Corp calls “user data”.

    CAM

  10. Greg Rahn on February 17th, 2008 11:23 pm

    When there are documented references to your 100+ TB db claims, please share them. I’m interested in the big data arena as I’m sure other readers are.

    WRT “unnatural acts”: I’ve never had to do this, nor have I heard of anyone who has, and I don’t see why it would need to be done. Seems to me like a case of poor SQL writing skills rather than an Oracle product limitation. If you have a test case, I’d gladly look at it offline.

    Can you give the definition that you use for “user data” and explain the difference between the Winter Corp definitions?
    Why is this number 1/5 – 1/10 as much in Oracle data warehouses? Is this ratio also applied to other data warehouse platforms?

  11. Curt Monash on February 18th, 2008 1:37 am

    User data = data. Not work space. Not indexes. Not aggregate tables. Not disks used for mirroring. Not the disk space savings from compression. Just raw data.

    http://www.dbms2.com/2006/09/28/expansion-ratio-data-warehouse-explosion/
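    The 1/5 – 1/10 ratio can be illustrated with back-of-envelope arithmetic. The multipliers below are purely hypothetical assumptions for the sketch, not figures from any real Oracle site: each layer of indexes, aggregates, work space, and mirroring inflates raw user data into a much larger disk footprint.

```python
# Hypothetical expansion of 10 TB of raw user data. All multipliers
# are illustrative assumptions, not measured values.
user_data_tb = 10.0
indexes = 1.5      # index space, often comparable to the data itself
aggregates = 1.0   # summary/aggregate tables
work_space = 1.0   # sort/temp space
mirroring = 2.0    # RAID-1 style mirroring doubles the whole footprint

total_disk_tb = user_data_tb * (1 + indexes + aggregates + work_space) * mirroring
ratio = user_data_tb / total_disk_tb
print(total_disk_tb, round(ratio, 3))  # 90.0 0.111 -- i.e. about 1/9
```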

    As for naming the big databases — I’m sorry. Most of the info I get about particular users is under possible NDA, and I often don’t bother checking what has or hasn’t made it into press releases and the like. However, I do know DATAllegro has a press release about a multi-hundred-terabyte site. Dataupia put out a release about a big user very recently. (If memory serves, they called it 150 terabytes, but upon query from me said it was 120 TB of user data. I haven’t posted about it yet.) eBay, which I just posted about, is not — I’m pretty sure — running its main warehouses on Oracle or SQL Server. If I get clarity for public consumption, I’ll update accordingly. Walmart has huge data warehouses, and is a long-time Teradata customer. Ditto K-Mart pre-Sears-merger. I posted about a 35 TB Sun/Greenplum site where Oracle had been thrown out. And so on.

    If you really want to learn and not just argue, there are a lot of pointers out there. If nothing else, just take the obvious suspects for the largest warehouses, leave out whoever you think will be bewildering due to decentralized IT decision making or secrecy (hint: start that list with Amazon and the intelligence community), and start seeing who’s known to be their vendor.

    Have you even read my prior posts on these subjects? There’s a reasonably decent search capability on my sites.

    CAM

  12. Greg Rahn on February 18th, 2008 3:01 am

    Thanks for the info. I apologize if you think I want to argue — that’s not the case at all. The engineer in me just likes to see supporting evidence and some more details surrounding some of your statements.

    Cheers.

  13. Curt Monash on February 18th, 2008 3:26 am

    Greg,

    I’ve been an analyst since 1981, and I’ve always dealt with huge amounts of semi-confidential and only partly-verified info. (My stock analyst years were at the beginning of that period, long before Reg FD.) A big part of an analyst’s skill lies in figuring out what’s actually for real.

    I’m usually right. Not always, to be sure. But usually. At least about reading the current state of the market and technology. Predictions are sometimes a different matter …

    Best,

    CAM

  14. Tom Briggs on February 19th, 2008 11:13 am

    Any reason you didn’t mention Dataupia in the transparency/compatibility bullet? That is their whole play, isn’t it?

  15. Curt Monash on February 19th, 2008 1:18 pm

    Tom,

    I haven’t talked enough with Dataupia to make many en passant comments about them.

    E.g., I just caught an error in the bullet you commented on — I had implied ParAccel’s Oracle compatibility was already available, when in fact that’s a “roadmap” item, with a vague delivery timeframe target that they freely admit isn’t set in stone. I knew enough to edit that, without hesitation. For Dataupia I wouldn’t be as sure about what’s real and what isn’t.

    CAM

  16. john mc tighe on February 21st, 2008 1:44 pm

    Hi Curt,

    One nice feature that Teradata now offer is the ability to mix “hot” (frequently accessed) and “cold” (rarely accessed) data on the same disc spindles.
    This gets around the limitation of disk read speeds not keeping pace with increasing disc density.
    If say only 20% of the data on a spindle is hot then it stands to reason that the data can be read from many spindles that bit quicker than trying to read all the hot data from fewer spindles.
    Maybe other products have similar functionality…
    BTW, I currently work with SQL Server, so I am not trying to promote TD! Too expensive, IMHO.
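    The spindle reasoning above can be checked with a back-of-envelope calculation. The numbers below are assumed round figures (spindle count, per-disk throughput, hot-data size), not from any real Teradata configuration: when only a fraction of the data is hot, striping it across all spindles lets every disk arm contribute to the scan.

```python
# Illustrative numbers only: 10 spindles at 100 MB/s sequential read
# each, with 1,000 MB of hot data (20% of each spindle's contents).
spindles = 10
mb_per_sec = 100.0
hot_mb = 1000.0

# Hot data concentrated on 2 dedicated "fast" spindles:
concentrated_sec = hot_mb / (2 * mb_per_sec)

# Hot data striped across all 10 spindles, mixed with cold data:
striped_sec = hot_mb / (spindles * mb_per_sec)

print(concentrated_sec, striped_sec)  # 5.0 1.0
```

This ignores seek overhead from interleaving hot and cold blocks, which is presumably what the Teradata feature is engineered to manage.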

    John

  17. Curt Monash on February 21st, 2008 4:26 pm

    John,

    I’d missed that Teradata feature. Thanks!

    DATAllegro does “multi-temperature” in another way. In essence, they let you put two different brands of disk in the same installation, one of which is cheaper/bigger/slower than the other. Then they manage it all as a fairly seamless logical whole, as part of their grid strategy.

    CAM

  18. Database management system choices — overview | DBMS2 -- DataBase Management System Services on August 25th, 2008 3:42 am

    […] Part 3: Database management system choices – relational data warehouse […]

  19. Mike Stonebraker’s DBMS taxonomy | DBMS 2 : DataBase Management System Services on July 8th, 2012 4:05 am

    […] Part 3: Database management system choices – relational data warehouse […]
