April 28, 2009

Data warehouse storage options — cheap, expensive, or solid-state disk drives

This is a long post, so I’m going to recap the highlights up front. In the opinion of somebody I have high regard for, namely Carson Schmidt of Teradata:

In other news, Carson likes 10 Gigabit Ethernet, dislikes InfiniBand, and is “ecstatic” about Intel’s Nehalem, which will be the basis for Teradata’s next generation of servers.

Here’s the longer version.

Oliver Ratzesberger of eBay made the interesting comment to me that 15K RPM disk drives can have 10X or more the performance of 7200 RPM ones, a difference that clearly is not explained by rotational speed alone. He said this was due to the large number of retries required by the cheaper drives, which eBay had measured as being in the 5-8X range on its particular equipment, for an overall 10X+ difference in effective scan rates. When I continued to probe, Oliver suggested that the guy I really should talk with is Carson Schmidt of Teradata, advice I took eagerly based on past experience.
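To see how retry counts alone can swamp rotational speed, here is a back-of-envelope sketch. The raw throughput numbers are my own illustrative assumptions; only the retry range comes from eBay’s reported testing:

```python
def effective_scan_rate(raw_mb_per_s, retries_per_read):
    """Effective throughput when reads may need retries.
    Toy model: every retry costs roughly one extra read's worth
    of time, so throughput divides by (1 + retries_per_read)."""
    return raw_mb_per_s / (1 + retries_per_read)

# Hypothetical raw sequential rates; 6 retries/read sits in the
# 5-8X retry range eBay reportedly measured on its cheap drives.
cheap = effective_scan_rate(raw_mb_per_s=70.0, retries_per_read=6)   # 7200 RPM SATA
fast = effective_scan_rate(raw_mb_per_s=120.0, retries_per_read=0)   # 15K RPM enterprise

print(f"cheap: {cheap:.0f} MB/s, fast: {fast:.0f} MB/s, ratio: {fast / cheap:.1f}X")
# -> cheap: 10 MB/s, fast: 120 MB/s, ratio: 12.0X
```

Under this toy model the modest raw-rate gap between the drives is dwarfed by the retry penalty, which is the shape of the effect Oliver describes.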

Yesterday, Carson — who was unsurprised at Oliver’s figures* — patiently explained to me his views of the current differences between cheap and expensive disk drives. (Carson uses the terms “near-line” and “enterprise-class”.) Besides price, cheap drives optimize for power consumption, while expensive drives optimize for performance and reliability. Currently, for Teradata, “cheap” equates to SATA, “expensive” equates to Fibre Channel, and SAS 1.0 isn’t used. But SAS 2.0, coming soon, will supersede both of those interfaces, as discussed below.

*Carson did note that the performance differential varied significantly by the kind of workload. The more mixed and oriented to random reads the workload is, the bigger the difference. If you’re just doing sequential scans, it’s smaller. Oliver’s order-of-magnitude figure seemed to be based on scan-heavy tasks.

As I understand Carson’s view, mechanical features sported only by expensive drives include:

Electronic features of expensive storage include:

Finally, there is firmware, in which expensive disk drives seem to have two major kinds of advantages:

Apparently, this isn’t even possible for SATA and SAS 1.0 disk drives, but is common for drives that use the Fibre Channel interface, and will also be possible in the forthcoming SAS 2.0 standard. (As you may have guessed, I’m a little fuzzy on the details of this firmware stuff.)

In Carson’s view, the disk drive industry has consolidated to the point that there are two credible vendors of expensive/enterprise-class disk drives: Seagate and Hitachi. What Teradata actually uses in its own systems right now is:

All this is of course subject to change. In the short term that mainly means the possible use of alternate suppliers. As the Teradata product line is repeatedly refreshed, however, greater changes will occur. Some of the biggest are:

I got the impression that at least the first three of these developments are expected soon, perhaps within a year.

And in a few years all of this will be pretty moot, because solid-state drives (SSDs) will be taking over. Carson thinks SSDs will have a 100X performance benefit versus disk drives, a figure that took me aback. However, he’s not yet sure how fast SSDs will mature. Also complicating things is a possible transition some years down the road from SLC (Single-Level Cell) to MLC (Multi-Level Cell) SSDs. MLC SSDs, which store multiple bits of information per cell, are surely denser than SLC SSDs. I don’t know whether they’re more power-efficient as well.
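Carson’s 100X figure is at least arithmetically plausible for random reads, if you compare per-request access times (the latencies below are generic figures I’m assuming, not his):

```python
# Rough per-request latencies: a fast disk averages ~5 ms per random
# read (seek plus half a rotation); a flash SSD reads a page in tens
# of microseconds. Random-read IOPS is just 1 / latency.
disk_latency_s = 0.005    # ~5 ms, 15K RPM enterprise disk
ssd_latency_s = 0.00005   # ~50 microseconds, flash SSD

disk_iops = 1 / disk_latency_s   # ~200 IOPS
ssd_iops = 1 / ssd_latency_s     # ~20,000 IOPS

print(f"SSD advantage on random reads: {ssd_iops / disk_iops:.0f}X")
# -> SSD advantage on random reads: 100X
```

Sequential scans are a different story, since disks stream respectably once the head is positioned; presumably that is part of why the maturity questions matter.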

The main weirdnesses Carson sees in SSDs are those I’ve highlighted in the following quote from Wikipedia:

One limitation of flash memory is that although it can be read or programmed a byte or a word at a time in a random access fashion, it must be erased a “block” at a time. …

Another limitation is that flash memory has a finite number of erase-write cycles. … This effect is partially offset in some chip firmware or file system drivers by counting the writes and dynamically remapping blocks in order to spread write operations between sectors; this technique is called wear leveling. Another approach is to perform write verification and remapping to spare sectors in case of write failure, a technique called bad block management (BBM).
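The wear-leveling idea in that quote can be sketched in a few lines. This is a toy model of my own, nothing like a production flash translation layer, which also has to handle erase-block granularity, garbage collection, and bad-block maps:

```python
class ToyWearLeveler:
    """Toy wear leveling: each logical write is redirected to the
    least-erased free physical block, so no single block wears out
    first even if one logical block is rewritten constantly.
    (Data payloads and erase-block granularity are ignored.)"""

    def __init__(self, num_blocks):
        self.erase_counts = [0] * num_blocks
        self.mapping = {}                    # logical -> physical block
        self.free = set(range(num_blocks))

    def write(self, logical_block):
        if logical_block in self.mapping:
            # The old physical copy gets erased and returned to the pool.
            old = self.mapping[logical_block]
            self.erase_counts[old] += 1
            self.free.add(old)
        # Redirect the write to the least-worn free block.
        target = min(self.free, key=lambda b: self.erase_counts[b])
        self.free.remove(target)
        self.mapping[logical_block] = target

ftl = ToyWearLeveler(num_blocks=4)
for _ in range(100):
    ftl.write(0)                 # hammer one "hot" logical block
print(ftl.erase_counts)          # wear is spread almost evenly
```

Without the remapping, all 99 erases would land on one physical block; with it, no block’s erase count differs from any other’s by more than one.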

And finally, I unearthed a couple of non-storage tidbits, since I was talking with Carson anyway:

Comments

16 Responses to “Data warehouse storage options — cheap, expensive, or solid-state disk drives”

  1. Mark Callaghan on April 28th, 2009 3:18 pm

    Thanks for the details. Have any independent studies been published to validate the claims about possible 10x performance differences?

  2. Curt Monash on April 28th, 2009 3:27 pm

    Mark,

    I don’t have serious data beyond what I posted. I’m hoping other folks with information will jump into the discussion.

  3. Mark Callaghan on April 29th, 2009 9:33 am

    I am sure they are right as long as their claim includes ‘could have’. I get plenty of mail and email telling me I could be a lottery winner.

    I get 100+ IOPs and 50 MB/second from consumer grade 7200 RPM SATA disks at home, so a 15k disk needs to do 1000 IOPs and 500 MB/second to be 10X better or my cheap disk needs to do retries on almost every request. I don’t think that is typical. Maybe they had a bad batch of disks or very old disks.

    There is SMART monitoring on disks that counts retries and other stats and there have been a few large scale studies based on this data. So the data is there, but it isn’t easy to accumulate at a large scale.

    This paper is a good start and has references to other good papers.

    http://labs.google.com/papers/disk_failures.pdf

  4. cruppstahl’s blog » Why cheap hard disks are slower than expensive disks on April 30th, 2009 9:58 am

    [...] This post is about harddisks and why cheap (SATA) harddisks are much slower than expensive ones (Fibre channel/SAS). [...]

  5. Curt Monash on April 30th, 2009 10:53 am

    Mark,

    I don’t understand why you’re extrapolating from your home system to eBay’s data warehouse. Are the workloads similar?

    In particular, Oliver tells me the problem usually doesn’t arise when there’s only one query running, especially if the query can be satisfied by quasi-sequential scans. How many simultaneous queries did you run your test with?

    Thanks,

    CAM

  6. Michael McIntire on April 30th, 2009 1:44 pm

    All: the 10x numbers have to be colored by the application behavior. As an application Teradata is a hash distributed architecture, meaning that all IO (ALL) is random. Teradata also happens to exploit massive amounts of IO – essentially highly optimized software specifically designed to exploit best in class brute force hardware.

    When high session concurrency is factored into how the database operates, this results in a very large number of different IO paths in addition to the random placement of the data. There are certainly efforts by the entire tech stack to geographically colocate like data, but it is this random IO with concurrency environment which causes much, much greater head movement.

    When you compute seek time and rotational time together in a 100% random block read environment – 10x is simple to see. At 7200 RPM, it takes 2x longer to read a track of data on a SATA drive – which also is likely to have several times more data per track, most of which is not needed for random IO.

    With SATA Seek Latency of ~4x the FC drives, the two combine for something larger than 10x… this does not even count the compute and algorithmic issues or where in the tech stack the computation occurs. A highly simplistic and not entirely realistic example, but it should illustrate the point.

    SATA disk systems compete very well in low concurrency large sequential block environments, which is entirely opposite the Teradata environment. So, the 10x number being quoted here is not a surprise.
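For what it’s worth, the seek-plus-rotation arithmetic can be made concrete with typical spec-sheet numbers (illustrative assumptions, not Michael’s figures). Per request it yields roughly 3X; the rest of the gap he describes would have to come from concurrency-driven head movement, larger SATA tracks, and retries:

```python
def avg_random_read_ms(seek_ms, rpm):
    """Average random access time: average seek plus half a rotation."""
    half_rotation_ms = (60_000 / rpm) / 2
    return seek_ms + half_rotation_ms

# Assumed spec-sheet values for illustration.
sata = avg_random_read_ms(seek_ms=12.0, rpm=7200)   # ~16.2 ms
fc = avg_random_read_ms(seek_ms=3.5, rpm=15000)     # ~5.5 ms

print(f"SATA {sata:.1f} ms, FC {fc:.1f} ms, per-request ratio {sata / fc:.1f}X")
# -> SATA 16.2 ms, FC 5.5 ms, per-request ratio 2.9X
```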

  7. Mark Callaghan on May 2nd, 2009 9:46 am

    @Michael – you are the first person to ever claim that a 15k enterprise-grade SAS disk can do 10x more IOPs than a 7200 RPM consumer-grade SATA disk. Congratulations.

    @Curt – workload has nothing to do with it. Oliver has made a controversial claim with no substantiation. That is marketing and nothing else.

  8. Curt Monash on May 2nd, 2009 6:42 pm

    Mark,

    Maybe the eBay guys diagnosed their situation correctly and maybe they didn’t, but I can’t begin to fathom your basis for saying that workload has nothing to do with it.

    CAM

  9. Mark Callaghan on May 7th, 2009 1:09 am

    Curt,

    I agree with you that workload has something to do with performance. Ignore the poor wording. I mean that you won’t get 10X more MB/s or IOPs from 15k SAS versus 7200 RPM SATA. Teradata has done clever things with track aligned reads to optimize disk performance. I would much rather read about that.

  10. Robert Young on May 12th, 2009 8:48 am

    Check AnandTech for reviews of SSDs. The latest is from 20 March 2009, and deals explicitly with some of the issues here. An earlier review dealt with “block” writes versus reads.

    The value of SSDs is not going to be in highly redundant, flat-file (call them whatever you want) style datastores; the price will be too high. The value will be in high-NF relational databases. Now, in my opinion (which you can read, and I am not alone), SSDs will be the motivator that merges OLTP back together with its various replicants. SSDs, of which the flash versions (both MLC and SLC) are only the latest low-end implementations (check Texas Memory Systems for one example of industrial-strength SSD), remove the join penalty from 3/4/5NF databases.

    The bottleneck will be in finding folks with enough smarts to embrace (again) Dr. Codd’s vision. The XML folk are not those kind of folk. My candidate is Larry Ellison. The reason is that the Oracle architecture, MVCC, is superior for OLTP (IBM finally just capitulated with EnterpriseDB). With SSDs, he can use the Oracle database, appropriately normalized, to support both without stars and snowflakes. A true one-stop solution.

  11. Curt Monash on May 12th, 2009 10:22 am

    Robert,

    I understand the appeal of saying something like “The reason we need to be aware of physical design is largely complex query performance. Complex query performance is an issue mainly because of I/O. If we have better storage technology, that problem goes away, and we can start ignoring physical design the way the theorists have always wanted us to.”

    But I think we’re a long way from reaching that ideal, at best. Data warehouses are BIG, and getting bigger. They’ll push the limits of hardware technology for a long time to come.

  12. Revisiting disk vibration as a data warehouse performance problem | DBMS2 -- DataBase Management System Services on May 8th, 2010 12:06 am

    [...] April, I wrote about the problems disk vibration can cause for data warehouse performance. Possible performance hits exceeded 10X, wild as that [...]

  13. Curt Sampson on May 8th, 2010 2:33 am

    Or someone can just go write a sensible DBMS that doesn’t force you to link the logical format with the physical. There’s no reason that several normalized relations can’t be stored as a single denormalized table on disk, if that happens to be best for the query load. Column-oriented systems are an example of a different storage method under a relational front-end, though they suffer just as badly from not being able to store things in a row-oriented manner when that makes more sense.

  14. eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more | DBMS 2 : DataBase Management System Services on October 9th, 2010 10:39 am

    [...] Teradata and Greenplum, Oliver previously indicated he was inclined to attribute this more to specific Sun Thumper hardware/storage choices than to [...]

  15. sai on April 24th, 2011 11:04 am

    Dear all,

    Does anybody know what the cost is of adding 1 TB of storage to an existing warehouse? My client company has a Sybase IQ data warehouse, and I’m just curious to know what the incremental cost of 1 TB would be, because they might add up to 3.

    Regards,

    Sai.

  16. Curt Monash on April 24th, 2011 11:11 am

    I think you’d do best to check with Sybase on that. Prices change too often for me to have that memorized.

    On the plus side, they often have fairly clear web pages with their list pricing.
