June 5, 2009

Greenplum update — Release 3.3 and so on

I visited Greenplum in early April, and talked with them again last night. As I noted in a separate post, there are a couple of subjects I won’t write about today. But that still leaves me free to cover a number of other points about Greenplum, including:

Comments

11 Responses to “Greenplum update — Release 3.3 and so on”

  1. Greenplum will be announcing some stuff | DBMS2 -- DataBase Management System Services on June 5th, 2009 9:18 am

    [...] excepted — and so I’ll do a general, if slightly incomplete, Greenplum update in a separate post. Categories: Data warehousing, Greenplum, Specific users  Subscribe to our complete [...]

  2. Andy E on June 5th, 2009 11:18 am

    Greenplum’s and Aster’s compression is “fairly close” to columnar DBs? Someone should probably attempt to quantify “fairly close” or specify which columnar DB they’re describing (not all columnar DBs compress equivalently).

    One point of comparison…A couple months ago, a now-Vertica customer benchmarked Vertica and one of the aforementioned DBs, and a deciding factor was the relative amount of storage hardware required. 1TB of web app event data compressed to 200GB in Vertica (80% reduction). Same data “compressed” to greater than 1TB in the other. I think in the end, the competitor’s DB was 8x larger than the Vertica physical DB size. 8x less storage = faster performance (less IO) and, more obviously, lots less hardware when you’re managing dozens of TBs (uncompressed) of data.

    That’s just one (pretty typical) data point. Compression results will vary based on DBMSs compared and the type of data as Curt has mentioned (see: http://www.dbms2.com/2008/09/24/vertica-finally-spells-out-its-compression-claims/ )
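The arithmetic in Andy E's data point can be checked in a few lines. This is only a sketch: the function names are made up for illustration, and it assumes decimal units (1 TB = 1,000 GB), which the comment does not specify.

```python
def compression_ratio(raw_gb, stored_gb):
    """How many times smaller the stored form is than the raw data."""
    return raw_gb / stored_gb


def reduction_percent(raw_gb, stored_gb):
    """Percentage of the raw size that compression removed."""
    return (raw_gb - stored_gb) * 100 / raw_gb


raw_gb = 1000               # 1 TB of web app event data (assumes 1 TB = 1,000 GB)
vertica_gb = 200            # size reported for Vertica in the comment
other_gb = 8 * vertica_gb   # "8x larger than the Vertica physical DB size"

print(compression_ratio(raw_gb, vertica_gb))   # 5.0
print(reduction_percent(raw_gb, vertica_gb))   # 80.0
print(other_gb)                                # 1600
```

An 80% reduction is the same thing as 5x compression, and 8 times 200 GB is 1.6 TB, which is consistent with the "greater than 1TB" figure quoted for the other system.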

  3. Ben Werther on June 5th, 2009 4:30 pm

    There’s an information-theoretical bound to how much any data can be compressed — i.e. the entropy of the data. http://en.wikipedia.org/wiki/Entropy_(Information_theory)

    In our lab testing we’ve seen fast block-compression schemes achieve up to approx 2/3rds of the theoretical maximum compression rate for typical datasets. (i.e. if the theoretical max is 6x compression, the best fast compression schemes will achieve approx 4x compression). We see roughly the same compression (give or take 10-20%) if the data is laid out in rows vs an idealized columnar representation.

    In other words, the storage layout of the data makes far less difference than people appreciate, and columnar storage doesn’t provide any magic loophole to defeat entropy.
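Ben Werther's rows-versus-columns comparison is easy to try on a small scale. The sketch below is not Greenplum's lab test: it uses Python's general-purpose `zlib` as a stand-in for the unnamed "fast block-compression schemes", and the three-column schema and helper names are invented for illustration. Results will vary with the data, so no particular outcome is claimed here.

```python
import random
import zlib


def make_rows(n, seed=0):
    """Build n synthetic (user_id, event, status) rows; the schema is invented."""
    rng = random.Random(seed)
    events = ["click", "view", "purchase", "logout"]
    statuses = ["ok", "error"]
    return [
        (str(rng.randrange(1000)), rng.choice(events), rng.choice(statuses))
        for _ in range(n)
    ]


def row_layout(rows):
    """Serialize one record per line: all fields of a row sit together."""
    return "\n".join(",".join(row) for row in rows).encode()


def column_layout(rows):
    """Serialize one column per line: all values of a field sit together."""
    return "\n".join(",".join(col) for col in zip(*rows)).encode()


def compressed_size(data):
    """Size in bytes after zlib compression at a middling level."""
    return len(zlib.compress(data, 6))


rows = make_rows(10_000)
raw_rows, raw_cols = row_layout(rows), column_layout(rows)

# Both layouts hold the same fields and the same number of separators,
# so any difference after compression comes from layout alone.
print("raw bytes:        ", len(raw_rows), len(raw_cols))
print("compressed (rows):", compressed_size(raw_rows))
print("compressed (cols):", compressed_size(raw_cols))
```

Note that this only exercises a generic block compressor on both layouts. It does not model column-specific encodings such as run-length or dictionary encoding, which is where the disagreement in this thread lies.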

  4. Curt Monash on June 5th, 2009 10:51 pm

    When you use “entropy” in the context of “compression”, do you basically mean “Kolmogorov complexity”? Anyhow, how is it calculated in PRACTICE? I.e., how do you know what the theoretical maximum is for a given dataset?

    Thanks,

    CAM

  5. Michael McIntire on June 6th, 2009 12:59 pm

    A few points: we’d like to see the master node concept go away altogether. The problem is that on a bigger system, the bottleneck becomes the number of threads/connections which can reasonably be maintained on the head node (inbound and internally), and the cost of the failover which results. Most customers with this problem will be using a PG session pooler, but this comes with its own problems. This problem is not unique to GP, and it is a very tough architectural and implementation problem; I think that of the majors, only Teradata has a solution.

    On the compression front, two factors ultimately influence how well the compression works. The more structured the data is, the more effective an auto-codification scheme like Vertica’s will be. The more random and unknown the data, the more likely the standard block/dictionary schemes will work.

  6. John Sichi on June 8th, 2009 1:07 am

    I have written up some commentary on compressing rows vs columns here:

    http://thinkwaitfast.blogspot.com/2009/06/compressing-rows-vs-columns.html

  7. Per-terabyte pricing | DBMS2 -- DataBase Management System Services on June 9th, 2009 4:29 am

    [...] DBMS vendors sometimes price per terabyte of user data.  Vertica’s list price is $100K/TB. Greenplum’s list price is $70K/TB. In practice, both offer substantial discounts, especially at higher volumes.  In both cases, this [...]

  8. Glenn Davis on June 11th, 2009 4:38 am

    Two points.

    Ben Werther’s comment shows a widespread misunderstanding about data and entropy. Bodies of data do not have entropy. Only models of data have entropy. Bear with me. Suppose you compress English text by Huffman-coding individual characters. You would get, say, 4 bits per character. Then compress the same data using Huffman-coded digrams; you’ll get something lower, like 3 bits. So which is the entropy of the data? Neither! You have two figures but the data didn’t change. What changed was the model of the data. That is a very important and not-always-recognized distinction because information theory does not deal with modeling — only with encoding modeled data.

    Now Kolmogorov complexity. That concept has little practical value in database compression because you can’t use it to quantify anything. It’s really philosophical more than anything else.
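Glenn Davis's first point, that the same data yields different entropy figures under different models, can be illustrated with a short sketch. It computes the Shannon entropy of the empirical symbol distribution rather than building actual Huffman codes (the entropy is a lower bound on the average Huffman code length), and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of the empirical distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())


def bits_per_char_unigram(text):
    """Model 1: every character is an independent symbol."""
    return entropy_bits(text)


def bits_per_char_digram(text):
    """Model 2: non-overlapping character pairs are the symbols.

    A trailing odd character is dropped; the result is divided by 2
    to express it per character rather than per pair.
    """
    pairs = [text[i:i + 2] for i in range(0, len(text) - 1, 2)]
    return entropy_bits(pairs) / 2


tiny = "abab"
print(bits_per_char_unigram(tiny))   # 1.0
print(bits_per_char_digram(tiny))    # 0.0

sample = "the cat sat on the mat and the rat sat on the hat"
print(bits_per_char_unigram(sample))
print(bits_per_char_digram(sample))
```

For `"abab"` the character model says 1 bit per character while the digram model says 0, yet the string is the same in both cases. On short samples like these the digram figure is biased low, because there are few pairs to count, so the numbers illustrate the modeling point rather than measuring English.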

  9. Paul Johnson on June 16th, 2009 6:05 am

    FYI, the later release of Dataupia also had no master node.

  10. Analytics Team » Blog Archive » Greenplum offers single node edition for free on October 25th, 2009 5:10 pm

    [...] finally, here’s some information about Greenplum’s pricing: either subscription or [...]

  11. Data into results » EMC put Hadoop on new levels on February 25th, 2013 3:47 pm

    [...] be cheaper than Greenplum DB (it’s basically Greenplum DB + Greenplum HD in the same box). Last price tag for Greenplum is old but let’s say we can expect at least 10-20k$/TB of user data. It’s [...]
