Comments on: Greenplum update — Release 3.3 and so on

By: Data into results » EMC put Hadoop on new levels

Data into results » EMC put Hadoop on new levels — Mon, 25 Feb 2013 20:47:15 +0000

[…] be cheaper than Greenplum DB (it’s basically Greenplum DB + Greenplum HD in the same box). Last price tag for Greenplum is old but let’s say we can expect at least $10-20K/TB of user data. It’s […]

By: Analytics Team » Blog Archive » Greenplum offers single node edition for free

Sun, 25 Oct 2009 21:10:10 +0000

[…] finally, here’s some information about Greenplum’s pricing: either subscription or […]

By: Paul Johnson

Paul Johnson — Tue, 16 Jun 2009 10:05:01 +0000

FYI, the later release of Dataupia also had no master node.

By: Glenn Davis

Glenn Davis — Thu, 11 Jun 2009 08:38:28 +0000

Two points.

Ben Werther’s comment shows a widespread misunderstanding about data and entropy. Bodies of data do not have entropy. Only models of data have entropy. Bear with me. Suppose you compress English text by Huffman-coding individual characters. You would get, say, 4 bits per character. Then compress the same data using Huffman-coded digrams; you’ll get something lower, like 3 bits. So which is the entropy of the data? Neither! You have two figures but the data didn’t change. What changed was the model of the data. That is a very important and not-always-recognized distinction because information theory does not deal with modeling — only with encoding modeled data.
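To make the model-dependence concrete, here is a minimal sketch, not part of the original comment. It computes the Shannon entropy of the empirical symbol distribution, which is the lower bound a Huffman code approaches under that model, rather than building actual Huffman codes. The sample text and the function names are made up for illustration, and on a sample this small the digram figure is biased low.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of the empirical distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def bits_per_char(text):
    """Return (single-character model, digram model) cost in bits per character."""
    # Trim to an even length so both models cover exactly the same characters.
    text = text[: len(text) - len(text) % 2]
    chars = list(text)
    # Non-overlapping two-character symbols.
    digrams = [text[i:i + 2] for i in range(0, len(text), 2)]
    # A digram spans two characters, so divide by 2 to get bits per character.
    return entropy_bits(chars), entropy_bits(digrams) / 2

# Hypothetical sample; any English text would do.
sample = "the quick brown fox jumps over the lazy dog and then the dog sleeps " * 20
order0, digram = bits_per_char(sample)
print(f"single characters: {order0:.2f} bits/char")
print(f"digrams:           {digram:.2f} bits/char")
```

The data is identical in both calls; only the choice of symbol alphabet, i.e. the model, changes the figure.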

Now Kolmogorov complexity. That concept has little practical value in database compression because you can’t use it to quantify anything. It’s really philosophical more than anything else.

By: Per-terabyte pricing | DBMS2 -- DataBase Management System Services

Per-terabyte pricing | DBMS2 -- DataBase Management System Services — Tue, 09 Jun 2009 08:29:49 +0000

[…] DBMS vendors sometimes price per terabyte of user data. Vertica’s list price is $100K/TB. Greenplum’s list price is $70K/TB. In practice, both offer substantial discounts, especially at higher volumes. In both cases, this […]

By: John Sichi

John Sichi — Mon, 08 Jun 2009 05:07:49 +0000

I have written up some commentary on compressing rows vs columns here:

http://thinkwaitfast.blogspot.com/2009/06/compressing-rows-vs-columns.html

By: Michael McIntire

Michael McIntire — Sat, 06 Jun 2009 16:59:15 +0000

A few points: we’d like to see the master node concept go away altogether. The problem is that on a bigger system, the bottleneck becomes the number of threads/connections that can reasonably be maintained on the head node (inbound and internally), and the cost of the failover that results. Most customers with this problem will be using a PG session pooler, but this comes with its own problems. This problem is not unique to GP, and it is a very tough architectural and implementation problem. I think that of the majors, only Teradata has a solution.

On the compression front, two factors ultimately influence how well the compression works. The more structured the data is, the more effective an auto-codification scheme like Vertica’s will be. The more random and unknown the data, the more likely the standard block/dictionary schemes will work.

By: Curt Monash

Curt Monash — Sat, 06 Jun 2009 02:51:54 +0000

When you use “entropy” in the context of “compression”, do you basically mean “Kolmogorov complexity”? Anyhow, how is it calculated in PRACTICE? I.e., how do you know what the theoretical maximum is for a given dataset?

Thanks,

CAM

By: Ben Werther

Ben Werther — Fri, 05 Jun 2009 20:30:15 +0000

There’s an information-theoretical bound to how much any data can be compressed — i.e. the entropy of the data. http://en.wikipedia.org/wiki/Entropy_(Information_theory)

In our lab testing we’ve seen fast block-compression schemes achieve up to approx 2/3rds of the theoretical maximum compression rate for typical datasets. (i.e. if the theoretical max is 6x compression, the best fast compression schemes will achieve approx 4x compression). We see roughly the same compression (give or take 10-20%) if the data is laid out in rows vs an idealized columnar representation.

In other words, the storage layout of the data makes far less difference than people appreciate, and columnar storage doesn’t provide any magic loophole to defeat entropy.
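For anyone who wants to run this kind of row-versus-column comparison on their own data, a rough sketch follows. It is not Greenplum's test harness: the table is synthetic and made up, zlib stands in for "a fast block-compression scheme", and the column names are hypothetical. Results on real data may look quite different.

```python
import random
import zlib

random.seed(0)  # fixed seed so the synthetic table is reproducible

# Hypothetical event table: (event_type, user_id, country).
rows = [
    (random.choice(["click", "view", "purchase"]),
     random.randint(1, 500),
     random.choice(["US", "DE", "JP", "BR"]))
    for _ in range(10_000)
]

# Row layout: one record per line.
row_layout = "\n".join(f"{e},{u},{c}" for e, u, c in rows).encode()

# Idealized columnar layout: all values of one column, then the next column.
col_layout = "\n".join(
    ",".join(str(v) for v in col) for col in zip(*rows)
).encode()

def ratio(raw: bytes) -> float:
    """Uncompressed size divided by zlib-compressed size (level 6)."""
    return len(raw) / len(zlib.compress(raw, 6))

print(f"row layout:      {ratio(row_layout):.2f}x")
print(f"columnar layout: {ratio(col_layout):.2f}x")
```

Both layouts hold the same values in the same number of bytes, so any difference in the two ratios comes from ordering alone.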

By: Andy E

Andy E — Fri, 05 Jun 2009 15:18:51 +0000

Greenplum’s and Aster’s compression is “fairly close” to columnar DBs? Someone should probably attempt to quantify “fairly close” or specify which columnar DB they’re describing (not all columnar DBs compress equivalently).

One point of comparison: a couple of months ago, a now-Vertica customer benchmarked Vertica and one of the aforementioned DBs, and a deciding factor was the relative amount of storage hardware required. 1TB of web app event data compressed to 200GB in Vertica (80% reduction). The same data “compressed” to greater than 1TB in the other. I think in the end, the competitor’s DB was 8x larger than the Vertica physical DB size. 8x less storage = faster performance (less IO) and, more obviously, far less hardware when you’re managing dozens of TBs (uncompressed) of data.
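The figures above mix percent reduction with "Nx" ratios, so a quick arithmetic check may help. This sketch assumes decimal units (1 TB = 1000 GB) and takes the 8x figure at face value. The variable names are illustrative only.

```python
raw_gb = 1000          # 1 TB of raw event data, assuming 1 TB = 1000 GB
vertica_gb = 200       # compressed size reported for Vertica

reduction = 1 - vertica_gb / raw_gb   # fraction of space saved
ratio = raw_gb / vertica_gb           # compression ratio ("Nx")
competitor_gb = 8 * vertica_gb        # "8x larger than the Vertica physical DB"

print(f"reduction:  {reduction:.0%}")
print(f"ratio:      {ratio:.0f}x")
print(f"competitor: {competitor_gb} GB")
```

An 80% reduction is the same thing as 5x compression, and 8 times 200GB is 1.6TB, which is consistent with the "greater than 1TB" figure quoted for the other system.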

That’s just one (pretty typical) data point. Compression results will vary based on the DBMSs compared and the type of data, as Curt has mentioned (see: http://www.dbms2.com/2008/09/24/vertica-finally-spells-out-its-compression-claims/ )