March 9, 2012

Hardware and components — lessons from Teradata

I love talking with Carson Schmidt, chief of Teradata’s hardware engineering (among other things), even if I don’t always understand the details of what he’s talking about. It had been way too long since our last chat, so I requested another one. We were joined by Keith Muller, who I presume is pictured here. Takeaways included:

Teradata graciously allowed me to post a few of the slides we used in our chat. Other details will have to remain fuzzier.

For context, I’ll start with a high-level view of Teradata’s hardware choices.

When compared to what I heard in prior conversations with Carson, the SAS 2.0 and 2.5″ 10,000 RPM choices were expected, but his fondness for InfiniBand is a big surprise. But I hadn’t previously realized:

A slide Teradata didn’t send over for posting indicated per-node storage choices as follows:

That’s a range in raw storage from 2.4 TB at the low end up to 144 TB at the high. Slightly different versions of the information can be found in a Teradata brochure.

*The Teradata 6690 was just announced Thursday.

Teradata showed me a slide with some interesting performance numbers, but unfortunately withheld permission to actually use them. That said, performance factors of significance related to Intel CPUs include at least:

For example, Carson said that Sandy Bridge has (vs. Westmere):

So the increase in RAM capacity is

1.5 x (whatever memory gains come from transistor density improvements in RAM in the usual Moore’s Law way).

Total throughput from RAM isn’t growing as fast, but it’s also up significantly. The upshot is that Teradata is headed pretty quickly toward 1 TB of RAM/node.

The part of all this that’s weirdest to me is using storage bus technology (SAS) rather than PCIe for solid-state I/O, but the way Carson explains it makes sense. Teradata gets 400-450 megabytes/SSD/sec, which evidently is close to the SSD’s rated capacity, using SAS, which it deems Plenty Good Enough. Teradata’s trick for achieving this throughput is to have many “units of parallelism”* talk to the same SSD at once, keeping all 48 of its disk-head-equivalents busy.** By way of contrast, Teradata handles disks in a more strictly shared-nothing manner. The other part of the not-PCIe decision is that with PCIe, Teradata has trouble finding enough slots per node to do everything it needs. I suspect Carson suspects that’s a solvable problem, but Teradata is content to go with technology it’s familiar with.

*Specifically, what Teradata calls “AMPs”, of which there might be 24-48 on a 12-core node; 10-12 AMPs might talk to one SSD at once, and some could even be from different nodes.

**On the SSDs Teradata uses, each “head” is a little processor. Carson also said something about 64-deep queues on the SSD vs. 8-deep on disk, but that part sailed — as it were — over my head.

Other notes on SAS and alternatives include:

The whole thing for now seems to work out to sustainable speeds of about 6200 megabytes/second/server for full table scans and the like, with further increases coming soon.

While Teradata tries to communicate with storage at close to rated capacity, it wants to keep its node-to-node interconnect underutilized, for reasons of latency and predictable performance. InfiniBand offers 8000ish megabytes/second of capability, of which Teradata envisions using only 500-1000 MB/sec. But that’s enough to alleviate bottlenecks for data load, replication, and so on.
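To put the interconnect headroom in perspective, here is a minimal Python sketch of the utilization implied by those figures. The 8000 MB/sec capacity is the approximate number quoted above, and the function and constant names are mine, purely for illustration.

```python
# Approximate InfiniBand capability per the post ("8000ish" MB/sec).
INFINIBAND_MB_PER_SEC = 8000


def utilization(used_mb_per_sec, capacity_mb_per_sec=INFINIBAND_MB_PER_SEC):
    """Fraction of interconnect capacity consumed (hypothetical helper)."""
    return used_mb_per_sec / capacity_mb_per_sec


# Teradata's envisioned usage range, 500-1000 MB/sec.
low = utilization(500)
high = utilization(1000)
print(f"planned interconnect utilization: {low:.2%} to {high:.2%}")
```

In other words, the plan is to use somewhere between a sixteenth and an eighth of the link, leaving the great majority of the bandwidth idle on purpose.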

Reasons Carson has flipped from “hating” multiple aspects of InfiniBand to being “excited” about it include:

However, Teradata also uses 10 Gb Ethernet, 1 Gb Ethernet, and its legacy custom networking silicon in various systems.

Finally, my comment about Teradata not fearing custom silicon is based on three things:

That compression is via Hifn,* a division of Exar, which seems to specialize in compression chips. They’ve come up with something that does single-pass block-level compression, plugs into the motherboard via PCIe, and is shipping in the Teradata 2690 today.

*When Hifn was an independent company, it was run by Al Sisto, Ingres’ original VP of sales (or close to it). Small world.

Apparently the Hifn chip compresses things about 2X, vs. 3X for the software-based block-level compression in the 6600 series. Of course, Teradata has other compression as well. But it’s a subject I’m foggy on, not least because the idea of doing block-level rather than token/dictionary or some other columnar kind of compression strikes me as limited and wrong. (If block-level compression happened on top of columnar compression, then we’d be talking …) Anyhow, Teradata’s figures for converting raw disk to user data numbers are that you lop off 50% for RAID, take 70% of what remains (to leave room for temp space and whatever), and multiply the resulting 35% by your compression figure.
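The raw-to-user-data rule of thumb is easy to get wrong by a factor, so here is a minimal Python sketch of it. The 50% and 70% factors and the 2X/3X compression figures come from the paragraph above; the function names and the 100 TB example are my own assumptions, not anything Teradata supplied.

```python
def user_data_fraction(compression_ratio, raid_keep=0.5, usable_share=0.7):
    """Fraction of raw disk that ends up holding user data.

    raid_keep:    share left after RAID (the post says lop off 50%).
    usable_share: share of the remainder not reserved for temp space etc.
    """
    return raid_keep * usable_share * compression_ratio


def user_data_tb(raw_tb, compression_ratio):
    """User data capacity in TB for a given raw capacity (hypothetical helper)."""
    return raw_tb * user_data_fraction(compression_ratio)


# Compression figures quoted in the post; the labels are mine.
for label, ratio in [("uncompressed", 1.0),
                     ("hardware block-level, ~2X", 2.0),
                     ("software block-level, ~3X", 3.0)]:
    print(f"{label}: {user_data_fraction(ratio):.0%} of raw disk")

# Illustrative only: 100 TB raw is an assumed figure, not a Teradata spec.
print(f"100 TB raw at 3X: about {user_data_tb(100, 3.0):.0f} TB of user data")
```

So at 2X compression user data comes to about 70% of raw disk, and at 3X it slightly exceeds raw capacity (about 105%).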

Comments

13 Responses to “Hardware and components — lessons from Teradata”

  1. Paul Johnson on March 9th, 2012 5:03 am

    “sustainable speeds of about 6200 gigabytes/second/server for full table scans and the like” – over 6TB/sec/node scan speed using 6Gb Quad SAS adapters sounds like a *lot* of adapters ;-)

  2. unholyguy on March 9th, 2012 11:44 am

    Pretty sure the block level compression happens in addition to the token based MVC compression. MVC is still pretty manual though.

  3. Curt Monash on March 9th, 2012 3:44 pm

    Whoops. Looks like I was off by a few orders of magnitude. :) Fixed!

  4. Joe on March 9th, 2012 9:35 pm

    This is a great article. Very informative and lots to digest in here.

    I don’t mean to nitpick but following on the first comment… should that read megaByte or megaBit per second in I/O bandwidth?

    It’s interesting to hear that 80% of the appliance’s power draw is on account of disk.

    Again great article and thanks for writing/sharing.

  5. Curt Monash on March 9th, 2012 9:59 pm

    Joe,

    I won’t swear I heard right on bytes/bits. Perhaps somebody from Teradata can confirm.

  6. Paul Johnson on March 15th, 2012 1:42 pm

    @unholyguy – MVC is still the place to start for Teradata compression. The newer choices such as block level compression should be used in addition to MVC.

    The cost of MVC is manual and upfront/ongoing with the analysis and DDL changes, whereas the other compression options use CPU cycles – easier but potentially costlier.

    There is an excellent orange book covering Teradata’s compression choices and how they should be used together.

    @curt – 6GB/sec/node sounds much more like it, although a tad disappointing :-(

    @joe – it’s GBytes…the amended speed of around 6GB/s is available on the previous generation of fibre channel based systems running quad port 4Gb/s HBAs. So long as you have enough PCI slots and disks attached it soon adds up to decent bandwidth.

  7. Kevin Closson on March 18th, 2012 1:35 pm

    Good article.

    Another thing to point out about Sandy Bridge is the high-end SKUs have dual QPI links between the sockets. That’s a significant performance feature for data flow between processes that don’t happen to execute on the same socket.

    There are a lot of Sandy Bridge SKUs. Did Teradata happen to mention which specific part they intend to use in the up-coming refresh?

  8. Curt Monash on March 18th, 2012 8:58 pm

    Thanks, Kevin.

    Nope, they weren’t that specific.

  9. VLDB Solutions » Teradata 6690 Announced on April 2nd, 2012 8:48 am
  10. The Teradata Aster Big Analytics Aster/Hadoop appliance | DBMS 2 : DataBase Management System Services on October 17th, 2012 10:32 pm

    [...] want to compare the hardware specs for the Teradata Aster Big Analytics Appliance to those for four different Teradata systems (March, [...]

  11. Notes on Teradata systems | DBMS 2 : DataBase Management System Services on April 15th, 2013 2:53 am

    [...] previously told me that Ivy Bridge — the next one after Sandy Bridge — could offer a performance “discontinuity”. So, while this is just a guess, I expect that next year’s Teradata performance improvement [...]

  13. prashanth on September 7th, 2014 7:38 am

    Hi

    can i have brief explanation of the DIMM in teradata

    thanks
    prashanth
