Comments on: Column stores vs. vertically-partitioned row stores

By: More grist for the column vs. row mill | DBMS2 -- DataBase Management System Services

Sun, 21 Dec 2008 02:55:59 +0000

[…] Abadi and Sam Madden are at it again, following up on their blog posts of six months arguing for the general superiority of column stores over row stores (for analytic query processing). The gist is to recite a number of bases for superiority, beyond […]

By: Curt Monash

Curt Monash — Sat, 09 Aug 2008 03:46:12 +0000

I wouldn’t say that at all about columnar databases.

What I would say is that, Sybase IQ excepted, I can’t think of a columnar product mature enough to do a great job on concurrency or mixed workloads YET.

As for impugning your intentions — I think that’s entirely appropriate, given that you’re being a thorough-going jerk.

CAM

By: Bill Walters

Bill Walters — Sat, 09 Aug 2008 01:04:07 +0000

Yet Curt you refuse to directly address my questions instead you attempt to dismiss them and then when that doesn’t work you impugn my intentions, hardly the behavior expected from a blog of open ideas. So are you saying that the columnar approach is not intended to support mixed workloads and high levels of concurrency? It’s just a question.

By: Curt Monash

Curt Monash — Fri, 08 Aug 2008 03:06:47 +0000

Bill,

It used to be that you trolled regarding some patent lawsuit among columnar DBMS vendors, which as I pointed out (after deleting the other 11 or so copies of your spam) is quite irrelevant to software purchase decisions.

Now you seem to be trolling about columnar DBMS’ lack of suitability for uses that they’re neither being sold nor bought for. Not very interesting either.

CAM

By: Curt Monash

Curt Monash — Thu, 07 Aug 2008 16:58:32 +0000

Balaji,

Well, all systems are neater on paper than in real life. 🙂

As for your ETL question:

Everybody has a true batch ETL option. But they also need to deal with cases where customers require sub-day latency. As a practical matter, 5-10 minute latency is a common design target. If a user truly needs latency much lower than that, they may have to pay up for a solution with the strengths — and weaknesses! — of an OLTP relational DBMS.

So how do you add records every few minutes to a column store? Adding them one record at a time is NOT a good idea, since each time you update anything you have to touch every column. Hence, the alternative is some kind of “microbatch” — gather all the records that come in in RAM, and write them to disk once every few minutes. The column store has to know to, on any query, look for data in two places — RAM and disk.

Vertica and ParAccel differ in how they do this. ParAccel says “Hmm. That sounds a lot like a DBMS’ ordinary RAM cache. So we’ll accept records straight into cache, and make sure our cache/disk synchronization is robust enough.”

Vertica, by way of contrast, has a separate in-memory data store, and indeed keeps researching whether the in-memory part should be a row or column store. Not coincidental to Vertica’s decision to have a separate store for ingestion is the fact that the ParAccel solution wouldn’t work for them anyway. Vertica takes compressed data from disk and does query operations on it in its compressed form. Hence, you can’t just bang an uncompressed new record into the RAM part of the system and expect it to fit well.

Best,

CAM

By: Balaji

Balaji — Thu, 07 Aug 2008 16:10:00 +0000

On Paper (and following the web blogs) Column store is neat and clean compared to row stores like Oracle or Teradata ( that i have worked extensively) that requires manual tunning w.r.t index (bit map as in the case of Oracle and careful selection of primary index in the case of Teradata (Primary Key and Primary index in teradata are slightly different)) and preparation of Materialized views or Join index in case or teradata. Although similar to materialized view column stores do have “Projections”.. the implementation of Projections on an MPP environment again on Paper seems neat !. What confuses me is the tuple moving insert concept that i am not used to when compared with regular batch movement of data (ETL jobs that start at 2 am and finish by 5 am daily). Otherwise i love column stores and combined with mondrian presentation of BI info is a potent combination. My 2cents ! 😉

By: Curt Monash

Curt Monash — Thu, 07 Aug 2008 06:42:29 +0000

Bill,

You’re being overharsh. Even very large companies often have huge data marts with fairly low-concurrency analytic workloads.

CAM

By: Curt Monash

Curt Monash — Thu, 07 Aug 2008 06:40:27 +0000

DW Consultant,

I didn’t bring up the subject of range partitioning. Abadi and Madden — the extremely smart column store proponents — did.

CAM

By: Bill Walters

Bill Walters — Thu, 07 Aug 2008 02:44:27 +0000

This strikes me as more theory based discussion that is completely divorced from the realities of corporate computing. Fortune 100 companies that rely on large scale data analytics would be better served by a discussion of real world, verifiable examples of production implementations of the columnar approach that focus on; concurrency in mixed workload situations, the ability to provide real time integration to ETL applications that reside on disparate hardware/operating system platforms, 92 ANSI SQL and 99 Analytical Windows compliance, the ability to load data from transactional systems, and fail over and recovery characteristics. Until the columnar vendors can provide examples of production applications and describe their approach in these areas the entire discussion is entirely intellectual and frankly meaningless.

By: DW Consultant

DW Consultant — Thu, 07 Aug 2008 00:55:19 +0000

Range Partition has been their as long as database have been. DatAllegro or Netezza cannot even compare to a Column store when it comes to scan performance or joining large tables. Range partitions are not valid under adhoc and adhoc is what true analytics is all about.

Netezza might have a chance in that arena but Column stores is the way to go for true adhoc.