August 12, 2006

Introduction to Greenplum and some compare/contrast

Netezza relies on FPGAs. DATAllegro essentially uses standard components, although those include Infiniband cards (and there’s a little FPGA action when they do encryption). Greenplum, however, claims to offer a highly competitive data warehouse solution that’s so software-only you can download it from their web site. That said, their main sales mode nonetheless seems to be appliances, specifically ones branded and sold by Sun that combine Greenplum and open source software on a “Thumper” box. And the whole thing supposedly scales even higher than DATAllegro and Netezza, because you can manage over a petabyte if you chain together a dozen of the 100 terabyte racks.

As often happens in introductory calls, I came away from my first Greenplum conversation with a much better sense of what they’re claiming than of why it’s believable, how they’ve achieved what they appear to have, or where the “gotchas” are. Anyhow, here are some highlights of their story so far:

Comments

4 Responses to “Introduction to Greenplum and some compare/contrast”

  1. Text Technologies»Blog Archive » Text mining into big data warehouses on August 12th, 2006 6:54 am

    [...] I previously noted that Attensity seemed to be putting a lot of emphasis on a partnership with Business Objects and Teradata, although due to vacations I’ve still failed to get anybody from Business Objects to give me their view of the relationship’s importance. Now Greenplum tells me that O’Reilly is using their system to support text mining (apparently via homegrown technology), although I wasn’t too clear on the details. I also got the sense Greenplum is doing more in text mining, but the details of that completely escaped me. [...]

  2. Luke Lonergan on August 12th, 2006 1:55 pm

    Thanks Curt, this is a good introduction. It’s nice to have someone dig into the technology and find the differences.

    WRT “query shipping”, I was actually referring to a simpler approach used by others, not ours. My admittedly subtle but, I think, important point was that to support arbitrary DBMS work you have to get inside the database engine and implement optimization at the “execution plan” level, not the “query plan” level. Rather than “repartitioning on the fly”, we pipeline rows through the interconnect among execution plan fragments in real time. We do this because of the performance, generality, and ease of adding future capabilities that come with a DBMS-internal architecture. I think this is critical, and you should expect us to continue to have advantages — such as supporting a rich assortment of native indexing and highly optimized aggregations — that can’t be matched easily without our architecture.
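    The pipelined, row-streaming idea Luke describes can be loosely sketched with generators: each plan fragment consumes rows from the fragment below it and passes them upward as soon as they are ready, rather than materializing and shipping a whole intermediate result between steps. (This is purely illustrative; the fragment names and toy table are mine, not Greenplum internals.)

    ```python
    # Sketch: execution plan fragments as a generator pipeline.
    # Rows flow through one at a time instead of being batched into
    # a fully materialized intermediate result between operators.

    def scan(table):
        # Leaf fragment: stream rows out of a (toy) table.
        for row in table:
            yield row

    def filter_fragment(rows, predicate):
        # Middle fragment: rows pass through as they arrive.
        for row in rows:
            if predicate(row):
                yield row

    def aggregate(rows):
        # Root fragment: consume the stream, produce a total.
        total = 0
        for row in rows:
            total += row["amount"]
        return total

    table = [{"amount": 10}, {"amount": 25}, {"amount": 7}]
    plan = filter_fragment(scan(table), lambda r: r["amount"] > 8)
    print(aggregate(plan))  # 35
    ```

    In a real MPP engine the hand-off between fragments crosses the interconnect between nodes, but the control flow is the same: downstream operators pull rows as upstream operators produce them.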

    Let’s see if this sparks some comments!

  3. Stuart Frost on September 21st, 2006 12:10 pm

    Since we haven’t yet seen Greenplum in any competitive situations, it’s hard for me to comment in any detail. However, there are a few things I don’t understand:

    1. Shipping rows around between execution plan fragments sounds OK for small amounts of data, but with large volumes, it’s far more efficient to move data around in large blocks to avoid the overhead of many small movements (especially on GigE). We’ve been able to handle all queries put before us, so I don’t see any inherent advantages in terms of functionality.

    2. The last time I looked, Postgres was multi-process and not multi-threaded and I don’t think that’s changed. My guess is that there’s a lot of time spent waiting for rows with this kind of approach.

    3. I don’t understand why the approach described above leads to the conclusion that ‘native indexing and highly optimized aggregations’ can’t be easily done with different architectures. We certainly manage it.

    4. In any real-world DW, what use are bit-mapped indexes with cardinality of up to 10,000? We generally deal with tables of billions of rows and cardinality in the tens or hundreds of millions.

    5. The network throughput numbers quoted by Sun don’t make any sense. How do I get 1GBps through four GigE links? Each link will max out at 80MBps, so that gives only 320MBps. Also, even with a TOE, you’d see a lot of CPU load with that kind of data movement. With such light CPU power relative to the number of disks, how does the system scale under concurrency?

    6. There’s also a claim of one TB per minute table scans from ten servers floating around. I presume that’s calculated by just multiplying Sun’s claim of 2GBps disk throughput x 10 x 60. Seems unlikely that Postgres could get anywhere near that in practice with just two dual core CPUs. Even if a simple table scan could run at that speed (which I doubt), our experience with Postgres is that it’s MUCH slower than Ingres when running actual queries.

    Stuart
    DATAllegro
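    Stuart’s back-of-the-envelope numbers in points 5 and 6 can be reproduced directly from the figures he quotes. (This is only the arithmetic as stated in the comment; the ~80 MB/s-per-link and 2 GB/s-per-server figures are his and Sun’s, respectively.)

    ```python
    # Point 5: four GigE links at a practical ~80 MB/s each fall
    # well short of the quoted 1 GB/s.
    links = 4
    per_link_mb_s = 80                       # MB/s per GigE link (Stuart's figure)
    total_gige = links * per_link_mb_s
    print(total_gige)                        # 320 MB/s, vs. ~1000 MB/s claimed

    # Point 6: 2 GB/s of disk throughput per server x 10 servers
    # x 60 seconds yields the "1 TB per minute" scan claim
    # (actually 1.2 TB/minute).
    per_server_gb_s = 2                      # GB/s per server (Sun's figure)
    servers, seconds = 10, 60
    scan_gb_per_minute = per_server_gb_s * servers * seconds
    print(scan_gb_per_minute)                # 1200 GB/min, i.e. ~1.2 TB/min
    ```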

  4. Infology.Ru » Blog Archive » Hardware strategies for data warehouse systems on September 4th, 2008 3:29 pm

    [...] Greenplum has been Type 2 all along. No doubt they would happily sell you just a software license, but I don’t know of anyone who has wanted to buy one. [...]
