I love talking with Carson Schmidt, chief of Teradata’s hardware engineering (among other things), even if I don’t always understand the details of what he’s talking about. It had been way too long since our last chat, so I requested another one. We were joined by Keith Muller, who I presume is pictured here. Takeaways included:
- Teradata performance growth was slow in the early 2000s, but has accelerated since then; Intel gets a lot of the credit (and blame) for that.
- Carson hopes for a performance “discontinuity” with Intel Ivy Bridge.
- Teradata is not afraid to use niche special-purpose chips.
- Teradata’s views can be taken as well-informed endorsements of InfiniBand and SAS 2.0.
Teradata graciously allowed me to post a few of the slides we used in our chat. Other details will have to remain fuzzier.
For context, I’ll start with a high-level view of Teradata’s hardware choices.
- As illustrated in a different, year-old slide deck, Teradata has four main rack-based product lines.
- The nodes in the latest models of all four Teradata product lines have:
- Very different kinds and quantities of storage attached to them, but …
- … identical CPUs (2 sockets, 6 cores per socket).
- “Slightly” different amounts of RAM.
- Different numbers of I/O cards in line with the differences in storage.
- Teradata systems currently use Intel Westmere CPUs. Intel Sandy Bridge is coming soon.
- Teradata uses the SAS 2.0 interface for all storage, even solid-state.
- Teradata uses four different kinds of networking gear: 1 Gigabit Ethernet, 10 Gigabit Ethernet, InfiniBand, and proprietary (BYNET 4.0).
- Teradata’s SSD (Solid-State Drive) supplier is still Pliant, which has since been acquired by SanDisk.
- Teradata likes 2.5″ 10,000 RPM HDD (Hard Disk Drives) better than alternatives.
Compared with what I heard in prior conversations with Carson, the SAS 2.0 and 2.5″ 10,000 RPM choices were expected, but his fondness for InfiniBand was a big surprise. Also, I hadn’t previously realized:
- The new 2.5″ drives use 40% less power than 3.5″ 15,000 RPM drives.
- Disk drives consume 80% of all the power used by a Teradata system.
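Putting those two bullets together implies a sizable system-level power win. A quick back-of-the-envelope sketch (the combination is my arithmetic, not a figure Carson quoted):

```python
# If disks draw 80% of a Teradata system's power, and the new 2.5" drives
# cut disk power by 40%, total system power falls by roughly a third.
disk_share_of_power = 0.80
disk_power_savings = 0.40
system_savings = disk_share_of_power * disk_power_savings
print(f"{system_savings:.0%}")  # 32%
```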
A slide Teradata didn’t send over for posting indicated per-node storage choices as follows:
- Teradata 1650 (Extreme Data Appliance) — 72 × 2 TB 7.2K RPM HDDs
- Teradata 2690 (Data Warehouse Appliance) — 24 × 600 GB 10K RPM HDDs
- Teradata 4600 (Extreme Performance Appliance) — 8 × 300 GB SSDs
- Teradata 6690* (Active Enterprise Data Warehouse) — 160 × 600 GB HDDs plus 15 × 400 GB SSDs
That’s a range in raw storage from 2.4 TB at the low end up to 144 TB at the high end. Slightly different versions of the information can be found in a Teradata brochure.
*The Teradata 6690 was just announced Thursday.
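The per-node raw figures fall straight out of the drive counts and sizes above; a quick sketch of the arithmetic (decimal gigabytes and terabytes):

```python
# Raw per-node storage for each Teradata product line, per the
# drive counts and sizes from the slide described above (in GB).
raw_gb_per_node = {
    "1650": 72 * 2000,             # 72 x 2 TB HDDs
    "2690": 24 * 600,              # 24 x 600 GB HDDs
    "4600": 8 * 300,               # 8 x 300 GB SSDs
    "6690": 160 * 600 + 15 * 400,  # 160 x 600 GB HDDs + 15 x 400 GB SSDs
}
for model, gb in raw_gb_per_node.items():
    print(f"Teradata {model}: {gb / 1000:.1f} TB raw")
```

The 4600 and 1650 supply the 2.4 TB and 144 TB endpoints of the range.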
Teradata showed me a slide with some interesting performance numbers, but unfortunately withheld permission to actually use them. That said, performance factors of significance related to Intel CPUs include at least:
- Cores per chip.
- Performance per core.
- Memory capacity and bandwidth.
For example, Carson said that Sandy Bridge has (vs. Westmere):
- 4 channels/socket, up from 3.
- Somewhat faster channels.
- 50% increase in DIMM (Dual In-line Memory Module) sites.
So the increase in RAM capacity is 1.5 × (whatever memory gains come from transistor density improvements in RAM in the usual Moore’s Law way).
Total throughput from RAM isn’t growing as fast, but it’s also up significantly. The upshot is that Teradata is headed pretty quickly toward 1 TB of RAM/node.
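To make that concrete, here’s the capacity arithmetic with illustrative numbers. The per-node DIMM counts and DIMM sizes are my assumptions, not figures Teradata gave me; only the 50%-more-sites factor and the channel counts come from the conversation:

```python
# Sketch of the RAM-capacity arithmetic. DIMM counts and sizes are
# illustrative assumptions, not Teradata's actual configurations.
westmere_dimm_sites = 12       # assumed: 2 sockets x 3 channels x 2 DIMMs each
sandy_bridge_dimm_sites = int(westmere_dimm_sites * 1.5)  # "50% increase in DIMM sites"
dimm_gb_old, dimm_gb_new = 16, 32  # assume one density doubling, Moore's-Law style

print(westmere_dimm_sites * dimm_gb_old)       # 192 GB/node
print(sandy_bridge_dimm_sites * dimm_gb_new)   # 576 GB/node -- heading toward 1 TB
```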
The part of all this that’s weirdest to me is using storage bus technology (SAS) rather than PCIe for solid-state I/O, but the way Carson explains it makes sense. Teradata gets 400-450 megabytes/SSD/sec — which evidently is close to the SSD’s rated capacity — using its favorite storage bus technology (SAS), which it deems Plenty Good Enough. Teradata’s trick for achieving this throughput is to have many “units of parallelism”* talk to the same SSD simultaneously, keeping all 48 of its disk-head-equivalents busy at once.** By way of contrast, Teradata handles disks in a more strictly shared-nothing manner. The other part of the not-PCIe decision is that with PCIe, Teradata has trouble finding enough slots per node to do everything it needs. I suspect Carson suspects that’s a solvable problem, but Teradata is content to go with technology it’s familiar with.
*Specifically, what Teradata calls “AMPs”, of which there might be 24-48 on a 12-core node; 10-12 AMPs might talk to one SSD at once, and some could even be from different nodes.
**On the SSDs Teradata uses, each “head” is a little processor. Carson also said something about 64-deep queues on the SSD vs. 8-deep on disk, but that part sailed — as it were — over my head.
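The queue-depth remark makes more sense if you sketch how the outstanding I/Os add up. The outstanding-requests-per-AMP figure below is my assumption for illustration; the 48 “heads”, the 10-12 AMPs per SSD, and the 64-deep queue are from the conversation:

```python
# Why many AMPs per SSD: each AMP keeps a few I/Os outstanding, and their
# sum needs to cover the SSD's 48 internal "heads" to saturate it.
ssd_heads = 48
amps_per_ssd = 12        # "10-12 AMPs might talk to one SSD at once"
outstanding_per_amp = 4  # assumed; fits easily in the SSD's 64-deep queue (vs. 8 on disk)

in_flight = amps_per_ssd * outstanding_per_amp
print(in_flight, in_flight >= ssd_heads)  # 48 True -- every "head" can stay busy
```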
Other notes on SAS and alternatives include:
- SAS doesn’t cost that much more than SATA, is faster, and may be more reliable as well.
- While there are use cases in which SAS doesn’t support sufficiently long or diverse cables, that’s not an issue for Teradata, whose cables are copper, and never get over 3 meters long.
- I was briefly concerned that SAS speeds only double every 4 years or so, but Carson assured me that this was speed/”lane”, and the number of lanes “increases” over time too.
- Carson was so dismissive of Fibre Channel that I didn’t press for details.
The whole thing for now seems to work out to sustainable speeds of about 6200 megabytes/second/server for full table scans and the like, with further increases coming soon.
While Teradata tries to communicate with storage at close to rated capacity, it wants to keep its node-to-node interconnect underutilized, for reasons of latency and predictable performance. InfiniBand offers 8000ish megabyte/second capability, of which Teradata envisions only using 500-1000 MB/sec. But that’s enough to alleviate bottlenecks for data load, replication, and so on.
Reasons Carson has flipped from “hating” multiple aspects of InfiniBand to being “excited” about it include:
- Intel’s involvement.
- What he perceives as greater standardization on InfiniBand by the HPC (High-Performance Computing) community.
- The resolution of various problems with cables and connectors.
However, Teradata also in various systems uses 10 Gigabit Ethernet, 1 Gigabit Ethernet, and its legacy custom networking silicon.
Finally, my comment about Teradata not fearing custom silicon is based on three things:
- The legacy BYNET chips.
- The firmware, special controllers, special processors and so on that get mentioned when we talk about HDDs or SSDs.
- A new effort in compression.
That compression is via Hifn,* a division of Exar, which seems to specialize in compression chips. They’ve come up with something that does single-pass block-level compression, plugs into the motherboard via PCIe, and is shipping in the Teradata 2690 today.
*When Hifn was an independent company, it was run by Al Sisto, Ingres’ original VP of sales (or close to it). Small world.
Apparently the Hifn chip compresses things about 2X, vs. 3X for the software-based block-level compression in the 6600 series. Of course, Teradata has other compression as well. But it’s a subject I’m foggy on, not least because the idea of doing block-level rather than token/dictionary or some other columnar kind of compression strikes me as limited and wrong. (If block-level compression happened on top of columnar compression, then we’d be talking …) Anyhow, Teradata’s figures for converting raw disk to user data numbers are that you lop off 50% for RAID, take 70% of what remains (to leave room for temp space and whatever), and multiply the resulting 35% by your compression figure.
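That raw-to-user-data rule of thumb is easy to sketch. The 2690 raw figure below is derived from the 24 × 600 GB drive count above, and the 2X compression is the Hifn figure just mentioned; treat the result as illustrative only:

```python
def user_data_tb(raw_tb, compression):
    """Teradata's rule of thumb: halve raw capacity for RAID, keep 70% of
    what remains (the rest is temp space and such), then apply compression."""
    return raw_tb * 0.5 * 0.7 * compression

# e.g. a 2690 node with 14.4 TB raw and the Hifn chip's ~2X compression:
print(user_data_tb(14.4, compression=2.0))  # ~10.1 TB of user data
```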