This is a long post, so I’m going to recap the highlights up front. In the opinion of somebody I have high regard for, namely Carson Schmidt of Teradata:
- There’s currently a huge — one order of magnitude — performance difference between cheap and expensive disks for data warehousing workloads.
- New disk generations coming soon will have best-of-both-worlds aspects, combining high-end performance with lower-end cost and power consumption.
- Solid-state drives will likely add one or two orders of magnitude to performance a few years down the road. Echoing the most famous logjam in VC history — namely the 60+ hard disk companies that got venture funding in the 1980s — 20+ companies are vying to cash in.
In other news, Carson likes 10 Gigabit Ethernet, dislikes Infiniband, and is “ecstatic” about Intel’s Nehalem, which will be the basis for Teradata’s next generation of servers.
Here’s the longer version.
Oliver Ratzesberger of eBay made the interesting comment to me that 15K RPM disk drives could have 10X or more the performance of 7200 RPM ones, a difference that clearly is not explained just by rotational speed. He said this was due to the large number of retries required by the cheaper drives, which eBay had tested as being in the 5-8X range on its particular equipment, for an overall 10X+ difference in effective scan rates. When I continued to probe, Oliver suggested that the guy I really should talk with is Carson Schmidt of Teradata, advice I took eagerly based on past experience.
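As a back-of-the-envelope check on those figures, here is a small sketch. It assumes that the rotational-speed ratio and the retry penalty simply multiply; that is my simplification for illustration, not eBay's stated methodology.

```python
# Back-of-the-envelope check. ASSUMPTION: the rotational-speed ratio and the
# retry penalty are independent and simply multiply.
FAST_RPM = 15_000
SLOW_RPM = 7_200

rotational_ratio = FAST_RPM / SLOW_RPM


def effective_gap(retry_penalty: float) -> float:
    """Combined slowdown of the cheap drive under the multiply assumption."""
    return rotational_ratio * retry_penalty


low, high = effective_gap(5), effective_gap(8)
print(f"rotational speed alone: {rotational_ratio:.2f}X")
print(f"with a 5-8X retry penalty: {low:.1f}X to {high:.1f}X")
```

Under that assumption, rotation alone accounts for only about 2X, and the low end of the retry range is already enough to push the combined gap past 10X.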
Yesterday, Carson — who was unsurprised at Oliver’s figures* — patiently explained to me his views of the current differences between cheap and expensive disk drives. (Carson uses the terms “near-line” and “enterprise-class”.) Besides price, cheap drives optimize for power consumption, while expensive drives optimize for performance and reliability. Currently, for Teradata, “cheap” equates to SATA, “expensive” equates to Fibre Channel, and SAS 1.0 isn’t used. But SAS 2.0, coming soon, will supersede both of those interfaces, as discussed below.
*Carson did note that the performance differential varied significantly by the kind of workload. The more mixed and oriented to random reads the workload is, the bigger the difference. If you’re just doing sequential scans, it’s smaller. Oliver’s order-of-magnitude figure seemed to be based on scan-heavy tasks.
As I understand Carson’s view, mechanical features sported only by expensive drives include:
- Smaller media, more platters, and more disk heads
- Faster rotational speeds
- Enclosures that do a better job of damping vibration from disk rotation or fans.
Electronic features of expensive storage include:
- More CPU (at least 2X)
- More RAM (also at least 2X), which is useful for caching.
- Dual ports for networking. Teradata doesn’t just use dual storage ports for reliability; it load balances across them and sometimes gets significantly enhanced performance.
Finally, there is firmware, in which expensive disk drives seem to have two major kinds of advantages:
- Command scheduling/queuing, which Carson believes provides a benefit at least comparable to the 2X derived from different rotational speeds.
- Better data integrity checking, in line with the T10 DIF standard. Not only does this seem to give much higher reliability, but it can be done closer to the platter, yielding a performance advantage.
Apparently, this isn’t even possible for SATA and SAS 1.0 disk drives, but is common for drives that use the Fibre Channel interface, and will also be possible in the forthcoming SAS 2.0 standard. (As you may have guessed, I’m a little fuzzy on the details of this firmware stuff.)
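To make the command scheduling/queuing point concrete, here is a toy sketch of why reordering queued requests helps. It models only head movement across tracks and uses a simple one-direction sweep; the track numbers and function names are made up for illustration, and real drive firmware considers much more (rotational position, for one), so this is not a description of any vendor's actual algorithm.

```python
# Toy model: cost is the total number of tracks the head travels.
# ASSUMPTION: track numbers, start position, and function names are
# hypothetical; rotational position is ignored entirely.

def head_travel(start: int, order: list[int]) -> int:
    """Total distance the head moves when servicing requests in this order."""
    travel, pos = 0, start
    for track in order:
        travel += abs(track - pos)
        pos = track
    return travel


def elevator_order(start: int, requests: list[int]) -> list[int]:
    """Sweep upward from the head position, then come back down for the rest."""
    upward = sorted(t for t in requests if t >= start)
    downward = sorted((t for t in requests if t < start), reverse=True)
    return upward + downward


queue = [98, 183, 37, 122, 14, 124, 65, 67]  # illustrative pending requests
start = 53                                   # illustrative head position

fifo = head_travel(start, queue)
reordered = head_travel(start, elevator_order(start, queue))
print(f"FIFO: {fifo} tracks, reordered: {reordered} tracks")
# prints "FIFO: 640 tracks, reordered: 299 tracks"
```

In this particular example, reordering cuts head travel by a bit more than half. The exact ratio depends entirely on the request pattern chosen, so treat it as an illustration of the mechanism rather than a measurement.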
In Carson’s view, the disk drive industry has consolidated to the point that there are two credible vendors of expensive/enterprise-class disk drives: Seagate and Hitachi. What Teradata actually uses in its own systems right now is:
- In Teradata’s high-end 5550 line: Seagate Fibre Channel 3 1/2″ drives.
- In Teradata’s mid-range 2550 line: SAS drives from Seagate and perhaps also Hitachi. I get the impression these have some of the electromechanical features of expensive drives, but not the firmware.
- In Teradata’s low-end 1550 line: Hitachi 1-TB cheap drives.
All this is of course subject to change. In the short term that mainly means the possible use of alternate suppliers. As the Teradata product line is repeatedly refreshed, however, greater changes will occur. Some of the biggest are:
- A new SAS 2.0 standard will allow enterprise-class firmware for cheaper disks.
- The form factor for high-end disk drives will shrink from 3 1/2″ to 2 1/2″, with the smaller drives having half the volume.
- The rotation speed sweet spot may actually decrease, to 10K RPM, with offsetting improvements to seek and latency so as not to cut performance. Power consumption benefits will ensue.
- There probably will be multi-TB SAS drives — “fat SAS.” SATA may be enhanced to compete with those. And by the way, SAS and SATA are electrically compatible, and hence could be combined in the same system.
I got the impression that at least the first three of these developments are expected soon, perhaps within a year.
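To put the rotation-speed point in numbers: average rotational latency is the time for half a revolution. The sketch below applies that standard formula to the three speeds mentioned in this post; it says nothing about seek time or transfer rate.

```python
def avg_rotational_latency_ms(rpm: int) -> float:
    """Average wait for a sector to come under the head: half a revolution."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


latencies = {rpm: avg_rotational_latency_ms(rpm) for rpm in (7_200, 10_000, 15_000)}
for rpm, ms in latencies.items():
    print(f"{rpm:>6} RPM: {ms:.2f} ms")
```

Dropping from 15K to 10K RPM adds about 1 ms of average rotational latency (2 ms versus 3 ms), which is the gap the offsetting seek and latency improvements would have to close.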
And in a few years all of this will be pretty moot, because solid-state drives (SSDs) will be taking over. Carson thinks SSDs will have a 100X performance benefit versus disk drives, a figure that took me aback. However, he’s not yet sure about how fast SSDs will mature. Also complicating things is a possible transition some years down the road from SLC (Single-Level Cell) to MLC (Multi-Level Cell) SSDs. MLC SSDs, which store multiple bits of information per cell, are surely denser than SLC SSDs. I don’t know whether they’re more power efficient as well.
The main weirdnesses Carson sees in SSDs are those I’ve highlighted in the following quote from Wikipedia:
One limitation of flash memory is that although it can be read or programmed a byte or a word at a time in a random access fashion, it must be erased a “block” at a time. …
Another limitation is that flash memory has a finite number of erase-write cycles. … This effect is partially offset in some chip firmware or file system drivers by counting the writes and dynamically remapping blocks in order to spread write operations between sectors; this technique is called wear leveling. Another approach is to perform write verification and remapping to spare sectors in case of write failure, a technique called bad block management (BBM).
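Here is a minimal sketch of the wear-leveling idea described in that quote. It is a toy model, not how any real SSD controller works: the class and method names are hypothetical, and it assumes there are always more physical blocks than logical blocks in use. Each write is redirected to the least-worn free physical block, and the stale copy is erased as a whole block.

```python
# Toy wear-leveling model. ASSUMPTION: all names are hypothetical, and there
# are always spare physical blocks available for the next write.

class WearLeveledFlash:
    def __init__(self, physical_blocks: int):
        self.erase_counts = [0] * physical_blocks  # erase cycles per physical block
        self.mapping = {}                          # logical block -> physical block
        self.free = set(range(physical_blocks))    # physical blocks with no live data
        self.data = {}                             # physical block -> contents

    def write(self, logical: int, payload: bytes) -> None:
        # Pick the free block with the fewest erases so wear spreads out.
        target = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(target)
        self.data[target] = payload
        old = self.mapping.get(logical)
        self.mapping[logical] = target
        if old is not None:
            # The stale copy is erased a whole block at a time before reuse.
            del self.data[old]
            self.erase_counts[old] += 1
            self.free.add(old)

    def read(self, logical: int) -> bytes:
        return self.data[self.mapping[logical]]


flash = WearLeveledFlash(physical_blocks=4)
for i in range(100):
    flash.write(0, f"version {i}".encode())  # hammer a single logical block

print(flash.read(0), flash.erase_counts)
```

Even though the same logical block is rewritten 100 times, the 99 resulting erases end up spread almost evenly across the four physical blocks. Without the remapping, one physical block would absorb all of them.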
And finally, I unearthed a couple of non-storage tidbits, since I was talking with Carson anyway:
- Carson has become a 10 GigE “bigot”, and Teradata will soon certify 10 Gigabit Ethernet cards for connectivity to external systems. Carson’s interest in Infiniband, never high, went entirely away after Cisco decommitted from it. Obviously, this stands in contrast to the endorsements of Infiniband for data warehousing by Oracle and Microsoft.
- Intel’s Nehalem will be the basis for Teradata’s next server product. Carson is “ecstatic” with Intel at the moment, which is different from his stance at other times.