Comments on: DATAllegro vs. Vertica and other columnar systems

By: Kirk

Kirk — Fri, 13 Feb 2009 05:37:13 +0000

Vectornova is completely COLUMNS, it’s amazing!

By: Curt Monash

Curt Monash — Tue, 20 Jan 2009 10:31:46 +0000

Never heard of VectorStar.

Looking at VectorNova’s site, however, it sounds like VectorStar is based on arrays rather than columns, perhaps like http://www.dbms2.com/2006/10/04/sas-intelligence-storage/

By: Hernan

Hernan — Mon, 19 Jan 2009 17:04:33 +0000

Have you had a chance to examine VectorStar? I believe these Mexican developers have made great progress with their columnar data engine, which is almost unknown in the United States. http://www.vectornova.com

By: Column stores vs. vertically-partitioned row stores | DBMS2 -- DataBase Management System Services

Fri, 12 Sep 2008 04:36:56 +0000

[…] DBMS for efficient data warehousing, it isn’t necessarily dispositive for a comparison of columnar systems to data-warehouse-specialist row-based systems. The three reasons suggested for the poor performance of vertically-partitioned row stores […]

By: DBMS2 — DataBase Management System Services » Blog Archive » Three bold assertions by Mike Stonebraker

Fri, 25 Apr 2008 04:07:17 +0000

[…] *For example, were Vertica’s competitors set up with vertical partitioning? […]

By: David Kanter

David Kanter — Tue, 27 Mar 2007 02:30:56 +0000

Chuck,

Thanks for the clarification – that makes much more sense now.

I believe that in the scenario you describe (loading), the graph wouldn’t have much value, whereas for updates, where you must enforce consistency and coherency, it would be interesting.

By: Chuck

Chuck — Fri, 23 Mar 2007 06:30:12 +0000

David,

I don’t think I follow the analogy. There’s a big distinction between loads and updates. In inserts/loads, there’s no “coherency” needed beyond knowing which data was part of what transaction (timestamping, as you say). From the perspective of a node processing a load stream, each row loaded “belongs” on one of the nodes in the cluster, so it can be sent to that node directly and nobody else needs to know about it. (If a query requests that row, the node processing the query knows where to look.)

So from the perspective of a node doing a load in a cluster of N nodes, it processes a stream of data and sends it to N-1 places (keeping 1/Nth for itself). The node will also receive data from N-1 other nodes if they are processing loads. So if all nodes are loading data at the same time, each node would expect to receive as much data as it sends. So load speed scales linearly with the number of nodes as long as:
1. A node can handle N incoming load streams. (This translates to a memory requirement, mostly.)
2. The interconnect can distribute all the data, full duplex, with all nodes talking to all others at once on the backplane. (Currently this is true of relatively cheap GigE switches with 32 or 64 ports.)
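The balance argument above (each node sends about as much as it receives) can be checked with a small simulation. This is only a sketch under stated assumptions: the function name `simulate_load` is hypothetical, a simple modulo on a row key stands in for whatever segmentation function a real system uses, and every node is assumed to load a stream of the same size.

```python
from collections import Counter


def simulate_load(num_nodes, rows_per_node):
    """Count rows sent, received, and kept locally by each node.

    Every node reads its own load stream and routes each row to the node
    that owns it. Ownership is decided by `key % num_nodes`, an assumed
    stand-in for a real segmentation (hash) function.
    """
    sent = Counter()
    received = Counter()
    kept = Counter()
    for loader in range(num_nodes):
        for i in range(rows_per_node):
            key = loader * rows_per_node + i  # unique key for this row
            owner = key % num_nodes           # node the row "belongs" on
            if owner == loader:
                kept[loader] += 1             # row stays on the loading node
            else:
                sent[loader] += 1             # row goes out over the interconnect
                received[owner] += 1          # and arrives at its owner
    return sent, received, kept


if __name__ == "__main__":
    sent, received, kept = simulate_load(num_nodes=4, rows_per_node=1000)
    for node in range(4):
        print(node, sent[node], received[node], kept[node])
```

With 4 nodes and 1000 rows per stream, each node keeps a quarter of its own rows and sends the rest, and it receives the same number of rows from the other three nodes. The simulation says nothing about the two conditions listed above (memory for N incoming streams, switch capacity); it only illustrates why traffic is symmetric when segmentation is even.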

I’m sorry, but I don’t understand the concept behind the graph. It seems to me that for a given load size and frequency, either the system will keep up or it won’t. Wouldn’t it be more interesting to know the maximum load rate (GB/min) for combinations of parameters like the number of nodes in the cluster and the number of GB in each load?

By: David Kanter

David Kanter — Wed, 21 Mar 2007 19:02:19 +0000

Chuck,

Thanks for the elaboration. Here’s why I ask…

When I think about the problem from an architectural standpoint (computer architecture that is), it’s exactly analogous to an update cache coherency policy.

The E/T steps are local, and those probably scale reasonably well. What isn’t going to scale is the 1:N communication of distributing the data out to every node, and the resulting acknowledgements. You have to update all N nodes, and you can’t really get around that.

IIRC, you guys use timestamping, so you don’t have to redo any in-flight transactions because of an update.

As I said, what would be interesting would be a 3D graph of:

Data load size (X GB), data load frequency (Y loads/hour), and performance (Z seconds)

By: Chuck

Chuck — Wed, 21 Mar 2007 13:42:05 +0000

Agreed that 4 nodes is pretty small, but considering that Vertica is shared nothing and all load steps are local to the machine except data segmentation, load speed scales with number of nodes assisting in the load as long as each node has a data stream and you have a good switch.

We did see around 4x load speed compared to 1 node in this test, which is far more than we can say for a competing row store that uses a shared disk architecture and saw 2.5x. Likewise, a competing shared-nothing row store without a parallel load feature didn’t get anywhere near 4x on 4 nodes, as load saw 1x while index and MV build saw 4x.
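One way to compare the speedups quoted above is parallel efficiency, meaning speedup divided by node count. The snippet below is a sketch using only the approximate figures from this comment; the helper name `parallel_efficiency` is hypothetical, and "around 4x" is treated as exactly 4.0 for illustration.

```python
def parallel_efficiency(speedup, nodes):
    """Return speedup per node; 1.0 means perfectly linear scaling."""
    return speedup / nodes


if __name__ == "__main__":
    nodes = 4
    # Approximate figures from the comment, not measured here.
    print("shared-nothing column store:", parallel_efficiency(4.0, nodes))
    print("shared-disk row store:", parallel_efficiency(2.5, nodes))
```

On these rounded numbers, 4x on 4 nodes is an efficiency of 1.0, while 2.5x on 4 nodes is 0.625.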

Of course other products out there do it the same way we do and don’t suffer these limitations, but I repeat my claim that there’s no theoretical reason why a column store would be beaten on load performance. Nor do we ever get beaten (on apples-to-apples hardware of course).

Stay tuned for more complete presentations on our numbers in an upcoming paper, as well as bigger cluster and data sizes.

By: DBMS2 — DataBase Management System Services » Blog Archive » Compression in columnar data stores

Wed, 21 Mar 2007 08:13:26 +0000

[…] We have lively discussions going on columnar data stores vs. vertically partitioned row stores. Part is visible in the comment thread to a recent post. Other parts come in private comments from Stuart Frost of DATAllegro and Mike Stonebraker of Vertica et al. […]