July 27, 2009

XtremeData announces its DBx data warehouse appliance

XtremeData is announcing its DBx data warehouse appliance today. Highlights include:

XtremeData has kindly permitted me to post its DBx launch slide deck. Three specific POC/prospect price/performance comments may be found on Slide 9.

XtremeData says that the clock speed on the FPGAs it uses is 200 megahertz, clearly much less than an Intel-compatible CPU’s. However, XtremeData also says 100s or 1000s of steps can be done at once on an FPGA. The reason for this seems to be “pipelining” much more than on-chip parallel streams. XtremeData’s explanation seemed to focus on the point that many rows of data could be processed independently of each other, and hence at once. I’m not wholly convinced that this is a standard use of the word “pipelining”. The point may be moot anyway, in that XtremeData’s reported performance advantages are nowhere near what one would get by naively assuming DBx can do ~1000 times as many steps per clock cycle at 1/12th–1/16th of a normal clock speed.
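
For concreteness, here is the naive arithmetic, as a rough sketch only; the CPU clock figures are my assumption, not XtremeData’s:

    # Naive -- and, per the above, unrealistic -- speedup estimate:
    # ~1000 concurrent steps per FPGA clock, at 1/12th-1/16th of a CPU clock rate.
    fpga_clock_mhz = 200
    for cpu_clock_ghz in (2.4, 3.2):                        # assumed CPU clock speeds
        clock_ratio = fpga_clock_mhz / (cpu_clock_ghz * 1000.0)
        naive_speedup = 1000 * clock_ratio                  # works out to roughly 60-80x
        print("vs %.1f GHz CPU: naive speedup ~%.0fx" % (cpu_clock_ghz, naive_speedup))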


Comments

34 Responses to “XtremeData announces its DBx data warehouse appliance”

  1. Hans on July 27th, 2009 9:58 am

    Sorry, you lost me after “no compression”. Honestly, I would be pretty skeptical about their price/performance claims if they are operating over uncompressed data.

  2. Curt Monash on July 27th, 2009 11:37 am

    The price is what it is, but you can go to their site to look at specs to check whether you find it at all plausible.

    The performance claims on the slide I called out aren’t really that dramatic.

  3. Jos van Dongen on July 27th, 2009 1:57 pm

    Dear Curt,

    Good overview; interesting product at a compelling price point. Re. the pricing: you talk about “user data” where the slide deck talks about “usable data”. In the case of the latter, the price/TB would indeed drop with compression; otherwise it won’t, I think. Competitors like Vertica price per TB of user (input) data, not stored/usable data. Maybe good to clarify the difference (if there is any).

    regards, Jos

  4. Geno Valente on July 27th, 2009 2:28 pm

    The price is “per TB of user data” – input. The dbX 1008 has 96 disk drives of 1TB each, or 96TB of raw disk. 30TB of that is for “user data”. Priced at $600K, this is exactly $20K/TB of user data. I hope that helps.
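
    (For anyone checking the arithmetic, a trivial worked sketch:)

        # dbX 1008 pricing arithmetic, as stated above
        list_price_usd = 600000
        user_data_tb = 30
        print(list_price_usd / user_data_tb)  # -> $20K per TB of user data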

  5. Curt Monash on July 27th, 2009 2:44 pm

    Thanks, Geno.

    Jos — XtremeData and I were talking about user data, so I knew that was what they meant. And “usable data” isn’t a common phrase; perhaps you were mixing it up with “usable space”.

  6. Justin Swanhart on July 27th, 2009 4:05 pm

    An FPGA can achieve pipeline parallelism that is much more effective than the pipeline parallelism that you see on traditional stored program CPUs.

    All traditional CPUs are subject to the von Neumann bottleneck because they are instruction flow processors. To limit the effect of the bottleneck, smart instruction flow processors set up a short pipeline, usually at most a few processor instructions long. The CPU does “branch prediction” to figure out the “most likely” pipeline, but things get really expensive when this guess is wrong.

    A dataflow processor on the other hand doesn’t have registers and it doesn’t do branch prediction. FPGAs are hardware reconfigurable at run time. The chip may be reprogrammed on the fly to handle the data flow requirements.

    Because the pipeline is fixed, there is no branch prediction and no copying of data in and out of registers. One low-megahertz FPGA can do the work of dozens or even hundreds of commodity CPUs, because the work is done in a pipeline, in parallel, like an assembly line.

    The analogy to an assembly line is very good actually.

    In a normal CPU, one worker repeatedly leaves his workbench to fetch the next part, returns, attaches it, and repeats until he has a finished product. In an FPGA, the worker sits in place and waits for the work-in-progress and his part-to-add to both arrive. He adds his part, and the work-in-progress then flows to the next worker, who does the same. In the same amount of time it took to build one item in the normal CPU, dozens or even hundreds of items are built on the assembly line.
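
    (A toy way to quantify the assembly-line argument, purely illustrative and not tied to XtremeData’s actual design: with S pipeline stages and N items, the one-worker model takes S * N steps, while a full assembly line takes only S + N - 1, since one finished item emerges per step once the line is full.)

        # Toy step-count model of the assembly-line analogy (illustrative only)
        def one_worker_steps(n_items, n_stages):
            # a single worker performs every stage on each item before starting the next
            return n_items * n_stages

        def assembly_line_steps(n_items, n_stages):
            # once the pipeline is full, one finished item comes out per step
            return n_stages + n_items - 1

        n_items, n_stages = 10000, 100
        print(one_worker_steps(n_items, n_stages))     # 1000000
        print(assembly_line_steps(n_items, n_stages))  # 10099, i.e. roughly 99x fewer steps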

  7. Paul Johnson on July 27th, 2009 4:42 pm

    Not another MPP offering!

    Looks like it has the same low per TB pricing as Dataupia (~$20K/TB), but starts at 8 nodes and 30TB = $600k entry price.

    Postgres and FPGAs like Netezza…could be interesting.

    Does this happen even when not logically required, e.g. for colocated fact:fact joins?

    “XtremeData’s DBx of course does complete parallel redistribution of data after every intermediate result set.”

  8. Jerome Pineau on July 27th, 2009 7:10 pm

    @Paul: Not another MPP offering!

    I’m sorry I had to LOL at that one 🙂 Tell you what Paul, you want an SMP offering for 25K up to 1TB (real data, no indexing, uncompressed)? Pull it off our website! (We’re not in Bangalore either btw — sorry but that doesn’t give me a warm feeling if I’m a US customer!)

    There is no need for all this h/w acceleration – it’s a software problem, not a silicon one! That’s why no one has been able to come up with anything more interesting than columns, grids or chips in the past decades. These are all one-trick ponies, to rehash my latest posting on http://jeromepineau.blogspot.com

  9. Geno Valente on July 27th, 2009 8:34 pm

    We believe it is both a software and a hardware problem. In fact, an FPGA can process “real SQL” at 10X the rate and 1/3rd the power of a current CPU. CPUs are still needed for many things, but when used properly the FPGA “In-Socket Accelerator” is never the bottleneck in the system. dbX is a “balanced system” and can process SQL, move, and load-balance data at 1GB/s per node (up to 1024 nodes). This is unheard of in the industry.

    Re: Paul J. – if data is already co-located, of course dbX doesn’t move it around needlessly. However, when required, we load balance and redistribute without user intervention.

  10. Jerome Pineau on July 28th, 2009 10:11 am

    “load balance data at 1GB/s per node…This is unheard of in the industry.”

    So, what’s the difference with InfiniBand or 1GbE connecting any other fabric?
    Thanks.

  11. Jerome Pineau on July 28th, 2009 10:14 am

    Or is it comparable to 10GbE (you use big-B)?

  12. Jerome Pineau on July 28th, 2009 10:22 am

    @Geno: “dbX 1008 has 96 disk drives, 1TB each or 96TB of raw disk. 30TB of that is for “user data”

    Hang on, you’re saying 66TB of storage is used for non-user data? What is non-user data? Thanks.

  13. Geno Valente on July 28th, 2009 11:49 am

    I’m going to suggest people watch the “ChalkTalks” on our website for the full technical details: http://www.xtremedata.com. Our CTO, Faisal Shah, will be addressing a lot of the above questions in detail. We have one up there now, “SQL in Silicon” (FPGAs vs CPUs), and the others will be posted in the next 24-48 hours.

    In short: we use IB. Our software and hardware know how to utilize this connection to its full extent. Our experience is that CPUs cannot build statistics, move data, and perform “real SQL” at the same time at 1GB/s. Thus, we feel the CPU becomes the bottleneck in every solution. This is likely why everyone focuses on compression, or on algorithms that “restrict data movement” (for them it is too time-expensive a task).

    We have the “SQL in Silicon” In-Socket Accelerator that does this process at full rate. In short, data redistribution is basically “free” in our system. We are not afraid of moving data, and we do it under the hood so you don’t have to.

    IB = 2 GBytes/sec in and out. Very low overhead.

    1GbE = 1 Gbit/sec, with lots of overhead; I think around 80 MB/sec is about the real bandwidth.

    Also: 66TB is for dbX to use. Temp space, RAID copies, other execution engine “database housekeeping”.
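
    (Rough arithmetic on those link figures, as a sketch; the overheads are approximations:)

        # Link bandwidth sanity check (approximate figures, not measurements)
        gige_raw_mb_per_s = 1000 / 8.0   # 1 Gbit/s = 125 MB/s before protocol overhead
        gige_effective_mb_per_s = 80     # ballpark after TCP/IP and framing overhead, per above
        ib_effective_gb_per_s = 2        # "2 GBytes/sec in and out", per above
        print(gige_raw_mb_per_s, gige_effective_mb_per_s, ib_effective_gb_per_s)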

  14. Justin Swanhart on July 28th, 2009 1:56 pm

    back of the napkin:

    96 1TB SATA disks =
    48 TB RAID 10
    30 TB user data
    16 TB swap space
    8 TB OS volume

  15. Justin Swanhart on July 28th, 2009 1:57 pm

    I say “RAID 10” meaning all data is mirrored to at least one other copy in the cluster.

  16. Geno Valente on July 28th, 2009 3:51 pm

    The drives are SAS, and we’d be happy to discuss the detailed breakdown, exact RAID used, why, what features, and even the roadmap with customers if they want/need to know.

  17. Jerome Pineau on July 28th, 2009 7:58 pm

    Ok so, are those puppies yelping in the background during the ChalkTalk lecture?!?

  18. Jerome Pineau on July 28th, 2009 8:11 pm

    OK, so I watched the ChalkTalk lecture. I’m even more convinced now that hardware is not needed, so clearly I totally missed the boat here; it seems to me everything you do on your chip, we do in software (how many ops do you burn in silicon?). It’s rare that I feel so retarded 🙂

  19. John Czarnowicz on July 29th, 2009 3:54 am

    This raises an interesting question. Are MPP databases from Greenplum, Netezza, et al. generally I/O bound or CPU bound?

    It seems that the most dramatic improvements in this field have resulted from increased data throughput.

  20. Curt Monash on July 29th, 2009 3:57 am

    The point of MPP is to relax an I/O bottleneck that otherwise would be present. Some systems are still I/O bound. Other vendors claim to have gone so far the other way that they’re now CPU-bound. Bandwidth-bound is also a possibility.

    The ideal, of course, is to have everything crumble at once, like the one-horse shay of the poem.

  21. Geno Valente on July 29th, 2009 8:53 am

    My take: Netezza would be I/O bound. It uses 1GbE links, built from a “tree” of Ethernet switches, with the head node involved in data redistribution across SPUs. If data is not co-located for a given query, data needs to be moved, and the system gets choked by the Ethernet switches and probably by the head.

    GP: the head is involved in all data redistribution, and is thus CPU bound (the head is the choke point) if you build it with IB.

    XtremeData: data nodes are peers in the execution. The head is not involved in redistribution. The system has no choke point and is “balanced,” as Curt describes as the ideal above. At runtime, we do this balancing and redistribution at every step of a query, at 1GByte/s/node.

    Give us 9 JOINs with 9 different keys. The dbX system will do all the data movement work for you, with almost no impact on query performance.

    Again, we are going after the ad hoc analytic market, where this is important: the hardest queries (ad hoc) on the largest data (5TB to 5PB).

    (JP: Those aren’t puppies, that is the marker squeaking on the whiteboard)

  22. Curt Monash on July 29th, 2009 10:50 am

    Geno,

    I’m pretty sure that the amount of head-node involvement in data distribution at both Netezza and Greenplum — in currently shipping releases — is either 0, or else so small as not to matter significantly.

    Meanwhile, Netezza repeatedly stresses that the FPGA — which handles I/O — is if anything underutilized.

    So I’d guess they’re CPU bound.

    There certainly are stories of Netezza systems being implemented with lots of underutilized disks, in the more spindles = better performance equation. So in some cases they are indeed I/O-bound. But in others I’d expect CPU is the limit.

    I don’t really believe in the system that is perfectly balanced in all parts for all queries. Different queries stress different parts of the system.

  23. Shawn Fox on July 29th, 2009 11:59 am

    Netezza does not use the head node for data redistribution. Data is moved directly between SPUs via the Ethernet switch. In most cases the head node is hardly used at all and is only rarely the bottleneck for any operation.

    As far as Netezza being I/O, CPU, or network bound: as a user I cannot directly tell what the limiting factor is, but from what I have been told by engineers within Netezza, the most common bottleneck is actual disk I/O. With compression in version 4.5, the increased effective I/O rate has more often moved the limiting factor to the FPGA itself or, in some cases, the CPU, especially when the data compresses really well.

    I would say that if one of the components of the system is always the bottleneck the system is not well designed. The resource requirements for any given query can vary greatly depending on what the query is doing.

    In extreme cases, one query might utilize I/O 100% but barely use the CPU at all, whereas another query hardly uses I/O but 100% utilizes the CPU. Queries which require that a huge fact table be redistributed would be limited by the network. It all depends on what the query is doing. I would expect to see that behavior on any database, whether it is Teradata, Netezza, Greenplum, Oracle, or whoever.

    The goal should be to design a system so that it achieves the greatest average throughput for a large variety of workloads. The ‘average’ workload however is going to vary greatly from one customer to the next which means that the system will most likely be more often limited by a different component for different customers (or different workloads for the same customer).

  24. Jerome Pineau on July 29th, 2009 7:25 pm

    @Shawn: “The goal should be to design a system so that it achieves the greatest average throughput for a large variety of workloads. The ‘average’ workload however is going to vary greatly from one customer to the next which means that the system will most likely be more often limited by a different component for different customers (or different workloads for the same customer).”

    Sorry for reposting a long quote but this is absolutely key and central to what XSPRADA has been engineering/preaching for years! This is why it’s imperative to have a technology that can support all modalities no matter what the workload/query patterns are! One-trick ponies handling specific workloads for specific use cases are not adequate. You need to have the ability to apply the right technique for any question for any data at any time. Anything less than that will not long be tolerated in the BI world IMHO.

  25. Geno Valente on July 29th, 2009 7:30 pm

    We posted the next “ChalkTalk” today, titled “Unconstrained Data Exploration”. It is fitting, and answers some of the open questions related to data redistribution, how we do it, etc.

    http://www.xtremedata.com/unconstrained.php

    More ChalkTalks coming soon.

    If anyone has ideas for future chalk talks.. send me an email: gvalente (at) xtremedata (dot) com

  26. Jerome Pineau on July 30th, 2009 3:09 pm

    @Geno: So the thing to take away from this interesting virtual WB session is
    1. all data movement is at 1GB/sec
    2. you’re nodal partition scheme agnostic
    3. you’re “model agnostic”

    Right?

  27. “The Netezza price point” | DBMS2 -- DataBase Management System Services on July 31st, 2009 12:03 am

    […] XtremeData just launched in the new Netezza price range. […]

  28. Geno Valente on August 4th, 2009 10:15 am

    @Jerome – Yes, I think you got it. Add to #1: not just data movement, but also “real SQL processing” while the data is being moved at that rate.

    In summary: this gives 16 nodes (one tower) the ability to hold 60TB of user data and process SQL at 1TB/min regardless of data partitioning / data keys.

  29. Curt Monash on August 4th, 2009 12:06 pm

    I note that 1 terabyte/minute on 16 cores is a lot like the 1 gigabyte/second/core VectorWise talks about, e.g. http://www.dbms2.com/2009/08/04/vectorwise-ingres-and-monetdb/#comment-133810
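
    (Back-of-the-envelope check, treating the 16 units as nodes:)

        # 1 TB/minute across 16 nodes vs ~1 GB/second per node
        tb_per_minute = 1.0
        nodes = 16
        gb_per_sec_per_node = tb_per_minute * 1000 / 60.0 / nodes
        print(round(gb_per_sec_per_node, 2))  # ~1.04 GB/s per node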

    They don’t think FPGAs are needed in addition to the cores, however. 😉

    Best,

    CAM

  30. Geno Valente on August 5th, 2009 2:00 pm

    Having been in the FPGA industry almost since its infancy, I can say that people who need the “BEST” size, weight, and power (SWaP) use FPGAs, especially when it comes to bit, byte, character, and string manipulations, or packet processing. Cisco, Nortel, EMC, Moto, Alcatel, GE, and about 20K other customers publicly say that they rely on them for many things that CPUs just can’t do fast enough or inside the SWaP envelope. If you pick “one thing”, then yes, CPUs can approach being the same, but let’s take a full data pipeline example.

    CPUs can do JOINs, they can do compression, they can do encryption, and they can do decryption, but how is their performance at all four on streaming data, where the cache is basically useless? Not very good, to say the least.

    An FPGA pipeline could do DECOMPRESS (GZIP-9), DECRYPT (AES-XTS 256-bit, by the way, or elliptic curve), then JOIN, then RE-ENCRYPT, then RECOMPRESS at the same throughput, because of the power of pipelining and the parallelization of custom reprogrammable silicon. Oh yeah, AND it gives a new result every clock cycle. A large FPGA in this case is a men-vs-boys discussion.

    We have been doing a webinar series called “Acceleration Academy” for all of 2009. I can say that Intel, SGI, HP, Altera, and many others have been on it to support these claims. Large Tier 1 companies have “accelerator strategies” (namely XDI FPGA accelerators and nVidia GPGPUs). In addition, Intel and AMD both created accelerator programs, called QuickAssist and Torrenza, that included our patented accelerator technology as one of, if not THE, first members. People who doubt the power of FPGAs should watch this series: http://www.xtremedata.com/accelerationacademy (click on past webinars) or read the hundreds of peer-reviewed papers on the power of FPGA technology.

    Additionally, we have over 200 customers that use the same hardware platform as dbX to make their own accelerated appliances (financial, genomics, military radar, etc.), and they all say “x86 + FPGAs” is the way to go… and yesterday TwinFin made the same move that we’ve been advocating since 2004. We have the only Tier 1-approved in-socket FPGA solution in the world (which is patented), which tightly couples these two technologies together better than any other approach. XtremeData accelerators are approved as an official HP accelerator (see www.hp.com/go/accelerators for our name, white papers, and the HP/XtremeData joint podcast). This status makes our platform mainstream, available with Tier 1 support, and customer-proven time and time again to do its job faster, with less power, and in less space than CPU + software alone.

    The power of FPGAs: This is one thing that the folks at Netezza and I agree on.

  31. What does Netezza do in the FPGAs anyway, and other questions | DBMS2 -- DataBase Management System Services on August 9th, 2009 8:52 am

    […] A recent discussion of the use of FPGAs for SQL operations in a post and comment thread around XtremeData’s product launch […]

  32. Kickfire’s FPGA-based technical strategy | DBMS2 -- DataBase Management System Services on August 21st, 2009 2:35 am

    […] to imply. But in fact Kickfire just relies on standard chips, even if — like Netezza and XtremeData — Kickfire does rely on less programmer-friendly FPGAs to do some of what most rival vendors […]

  33. OhioRob on December 11th, 2009 5:37 pm

    Geno:
    Do you have any TPC-H benchmarks to substantiate your throughput claims? It’s always good to see a reputable independent source that performs consistent tests across multiple competing platforms. Without it, anybody can make any claim (e.g. Oracle).

  34. Curt Monash on December 11th, 2009 8:04 pm

    Huh?

    TPC-H is a joke, and Oracle is one of the chief perpetrators of same.

    CAM
