September 22, 2011

Teradata Columnar and Teradata 14 compression

Teradata is pre-announcing Teradata 14, for delivery by the end of this year, where by “Teradata 14” I mean the latest version of the DBMS that drives the classic Teradata product line. Teradata 14’s flagship feature is Teradata Columnar, a hybrid-columnar offering that follows in the footsteps of Greenplum (now part of EMC) and Aster Data (now part of Teradata).

The basic idea of Teradata Columnar is:

The “mix” option is like Vertica’s FlexStore, in that different columns (e.g. different components of a street address) can be grouped into a mini-row, even if you otherwise choose to store that table in a columnar way. Teradata does not at this time offer the Greenplum or Aster way of mixing rows and columns, whereby some of the rows in a table can be stored in a column-store way, while other rows are stored in entire-row row-store solidarity.
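The mini-row idea is easy to picture in code. Here is a toy sketch, with an invented layout and invented names (this is not Teradata syntax or its on-disk format): grouped columns are stored together as one mini-row container, while every other column gets its own container.

```python
# Toy sketch of hybrid column/mini-row storage (invented names, not Teradata's
# actual format). Grouped columns travel together as tuples in one container;
# ungrouped columns each get a pure column container.

def partition_columns(rows, groups):
    """rows: list of dicts; groups: tuples of column names to store together."""
    grouped = {c for g in groups for c in g}
    containers = {}
    # One container per mini-row group, holding tuples of the grouped values.
    for g in groups:
        containers[g] = [tuple(r[c] for c in g) for r in rows]
    # One container per remaining column, holding just that column's values.
    for c in rows[0]:
        if c not in grouped:
            containers[(c,)] = [r[c] for r in rows]
    return containers

rows = [
    {"id": 1, "street": "1 Main St", "city": "Boston", "amount": 9.99},
    {"id": 2, "street": "2 Elm Ave", "city": "Dayton", "amount": 4.50},
]
# Store the address components as a mini-row; id and amount stay columnar.
store = partition_columns(rows, groups=[("street", "city")])
```

A query touching only `amount` reads one container; a query touching the address reads the mini-row container and gets both components in one I/O.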

Thus, Teradata Columnar gives you many of the basic I/O and compression benefits of a columnar DBMS, along with all the usual Teradata goodness of concurrency, workload management, system management, SQL support, and so on. By way of comparison:

Also, as I noted above, Teradata mixes rows and columns in a different way than Aster or EMC Greenplum do.

*However, I won’t be surprised if Oracle soon announces true hybrid-columnar as well. I originally heard about Teradata Columnar and Oracle’s efforts to develop true hybrid-columnar storage the same week, 23 months ago.

Going hybrid-columnar is a big deal. Aster Data, for example, told me that a considerable fraction of all its workloads ran faster with columnar than row-based storage.* And it’s of extra importance to a vendor that, like Teradata, needs to play catch-up in the compression derby.

*Anything in which the queries eliminated more than half or so of the columns (60%, if I recall correctly, but it was definitely an approximate figure). That pretty much means any query except full and near-full table scans.

Teradata’s columnar compression story is pretty complicated. To quote from a forthcoming press release:

Teradata automatically chooses from among six types of compression: run length, dictionary, trim, delta on mean, null, and UTF8, based on the column demographics.

The trickiest words in that are “automatic” and “dictionary”. Teradata divides column-store data into “column containers” of, say, 8 KB. (Current thinking is 8 KB default, 65 KB maximum, but that could change by the time of product release.) By default, Teradata software decides separately for each column container which compression algorithm(s) to use. It can even change its mind dynamically over time, as the contents of the container change.
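To make the “automatic” part concrete, here is a sketch of what per-container selection might look like. The thresholds and heuristics below are entirely invented for illustration; Teradata has not published its actual selection rules. The point is just that each container’s demographics (run structure, cardinality, data type) drive the choice.

```python
def choose_compression(values):
    """Pick a compression scheme for one column container from simple
    demographics. Thresholds are invented for illustration; Teradata's
    real selection logic is not public."""
    n = len(values)
    if all(v is None for v in values):
        return "null"
    distinct = len(set(values))
    runs = 1 + sum(1 for a, b in zip(values, values[1:]) if a != b)
    if runs <= n // 4:
        return "run length"          # long runs of repeated values
    if distinct <= n // 8:
        return "dictionary"          # few distinct values
    if all(isinstance(v, int) for v in values):
        return "delta on mean"       # numbers clustered around a mean
    return "trim"                    # fall back to prefix/zero trimming
```

Because the decision is per container, a column can be run-length encoded where it is sorted and dictionary encoded where it is not, and the choice can be revisited as a container’s contents change.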

What I find weird about Teradata’s columnar dictionary compression is that the dictionary is container-specific. One benefit versus having a more global dictionary is that, since you compress fewer items, compression tokens can each be shorter. (The length of a typical token is a lot like the log of the cardinality of the dictionary.) Another benefit is that smaller dictionaries are faster to search. The obvious offsetting drawback is that a larger and more global dictionary has the potential to compress various items that wind up being left uncompressed in this smaller-scale scheme.
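The token-length point can be illustrated with a few lines of code. This is a generic dictionary encoder, not Teradata’s implementation; it just shows that token width grows with the log of dictionary cardinality, which is why a small container-local dictionary yields shorter tokens than a global one would.

```python
import math

def dict_encode(container):
    """Build a container-local dictionary and encode values as integer
    tokens. Token width in bits is roughly log2 of the dictionary's
    cardinality (generic illustration, not Teradata's encoder)."""
    dictionary = {v: i for i, v in enumerate(sorted(set(container)))}
    tokens = [dictionary[v] for v in container]
    bits_per_token = max(1, math.ceil(math.log2(len(dictionary))))
    return dictionary, tokens, bits_per_token

# A container with 2 distinct values needs 1-bit tokens...
_, _, small = dict_encode(["MA", "NY", "MA", "NY"])
# ...while a "global" dictionary over 50 distinct values needs 6-bit tokens.
_, _, large = dict_encode([f"state{i}" for i in range(50)])
```

The flip side, as noted above, is that a value appearing rarely in any single container may never earn a dictionary entry, even though it is common across the table as a whole.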

Other notes about Teradata compression include:

And finally, Teradata 14 extends Teradata Virtual Storage with a feature called Compress on Cold. The idea is that “cold” data can safely get (extra) compression — that block-level stuff — automatically. If the data heats up again (e.g. by becoming relevant for a while to the latest year-over-year comparisons) it can be just as automatically removed from compression. Teradata thinks this is significantly better than the alternative of making manual compression choices based on not-so-granular range partitions.
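A temperature-driven policy of that kind can be sketched in a few lines. The class and threshold below are invented (this is not Teradata Virtual Storage’s interface); the sketch just shows the mechanism of compressing blocks that go cold and transparently decompressing ones that heat back up.

```python
# Toy "compress on cold" policy (invented interface, not Teradata Virtual
# Storage). Blocks with few recent accesses get block-level compression;
# blocks that heat up again are decompressed in place.

import zlib

class Block:
    def __init__(self, data: bytes):
        self.data = data
        self.compressed = False
        self.recent_accesses = 0

def apply_temperature_policy(blocks, cold_threshold=2):
    for b in blocks:
        if not b.compressed and b.recent_accesses < cold_threshold:
            b.data, b.compressed = zlib.compress(b.data), True
        elif b.compressed and b.recent_accesses >= cold_threshold:
            b.data, b.compressed = zlib.decompress(b.data), False
```

The win over manual schemes is granularity: the decision is made block by block from observed access patterns, rather than by a DBA guessing which range partitions have gone cold.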

Unsurprisingly, Teradata lacks some features and benefits found in certain columnar-first analytic DBMS. One biggie is that, absent clever workarounds such as Vertica’s in-memory write-optimized store, columnar DBMS have a single-row-update performance problem, because you are putting the information in many places on disk rather than just one. I generally take it for granted that a columnar-first vendor has such a workaround. Row-based vendors gone columnar, however, are a different story. Teradata et al. are also likely to decompress data and reassemble it into full rows as soon as it hits RAM, which obviates the potential benefit that you have less data per row clogging up cache.* (Edit: As per Todd Walter’s comments below, this is not accurate — and that’s a potentially important feature.)

*Late decompression actually depends on columnar compression, not columnar storage, and hence can also be enjoyed by row-based DBMS such as DB2.

To use Teradata Columnar, you need to be using round-robin data distribution rather than, say, hash. Teradata jargon for this is NoPI, where the “PI” stands for Primary Index.* Drawbacks to that include:

However, that’s a physical append-only; you can still do logical updates.

*PI is not to be confused with PPI, which stands for Partitioned Primary Index, and is Teradata’s name for range (or case-statement-based) partitioning. PPI works just fine with Teradata Columnar. As of Teradata 14, you can do PPI up to 62 levels deep.
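The hash-vs-round-robin tradeoff can be sketched as follows. The function names and AMP model are simplifications invented for illustration, but they capture the essential difference: hash placement lets a single-row lookup go straight to one unit of parallelism, while round-robin placement balances loads but forces an all-AMP scan to find any particular row.

```python
from itertools import cycle

def hash_distribute(rows, key, n_amps):
    """PI-style: each row's AMP is determined by hashing its primary index
    column, so a single-row lookup by key goes straight to one AMP."""
    amps = [[] for _ in range(n_amps)]
    for r in rows:
        amps[hash(r[key]) % n_amps].append(r)
    return amps

def round_robin_distribute(rows, n_amps):
    """NoPI-style: rows are sprayed evenly regardless of content, so loads
    are fast and balanced, but finding one row means scanning every AMP."""
    amps = [[] for _ in range(n_amps)]
    for amp, r in zip(cycle(range(n_amps)), rows):
        amps[amp].append(r)
    return amps
```

This is why, as Todd Walter notes in the comments, single-row updates on a NoPI table carry a scan cost unless you index the table.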

The Teradata folks also sent along a slide deck laying out parts of the Teradata Columnar story. But it’s not one of the better Teradata decks I’ve ever posted.

Comments

7 Responses to “Teradata Columnar and Teradata 14 compression”

  1. Hybrid-columnar soundbites | DBMS 2 : DataBase Management System Services on September 22nd, 2011 1:06 pm

    [...] Busy couple of days talking with reporters. A few notes on hybrid-columnar analytic DBMS, all backed up by yesterday’s post on Teradata columnar: [...]

  2. Vlad Rodionov on September 22nd, 2011 1:33 pm

Teradata’s columnar compression algorithms do not impress. Trim, which is a prefix suppression (frame-of-reference?), and delta on mean are less efficient than PFORDELTA, which is implemented in VectorWise. (I am wondering if there are some patent limitations here?)

  3. Todd Walter on September 26th, 2011 8:50 am

    A few clarifications:

    In a NoPI table in general you can update rows in place and Teradata will do normal updates to the rows. But single row updates have a significant cost because we have to scan to find the row. Thus you probably need an index on the NoPI table if you are going to update single rows (and it will still be more expensive than PI update of course).

    In Teradata Columnar updated rows will be added to the end of the table, the old copy of the row will be logically deleted. Deleted rows will be logically deleted. This means that if you do a lot of updating or deleting, over time there will be cause to “garbage collect” the table (in Teradata via INSERT SELECT). If it is an INSERT only table, then that will not be true. The logical delete and append updated rows model is chosen as a tradeoff vs touching all the columns and making the updates/deletes very expensive from a write perspective – a similar tradeoff to what other columnar vendors do.

    On decompression: Teradata block compression decompresses on read materializing the uncompressed rows in memory. But none of the other Teradata compression forms do so including the compression forms for Teradata Columnar. The optimizer makes a query plan which first will touch only the containers needed by the query and we do not decompress those containers to process the query. Thus we get maximum leverage from the compression in terms of saving IO and memory in cache. The only exception to this is if block level compression is used on top of the automatic columnar compression – in that case, the block decompression is performed but the columnar compression remains on the in-memory copy.

    It is mentioned that Teradata Columnar can store part of the columns in row form and part of them in columnar form. The example that is used are parts of an address as a row. It is also possible to group together less related columns into row form, we expect a common use case to be to group together high frequency of use columns into row form and use columnar for lower frequency of access columns. This allows the user to trade some amount of compression for higher efficiency access to the subset of columns that are accessed most frequently.

    One feature that was missed in the above discussion is that Teradata Columnar can be combined with PPI/MLPPI to have horizontal partitioning and column partitioning simultaneously. This allows Teradata to touch only a small portion of the column containers for a referenced column when an appropriate partitioning expression is available.

  4. Curt Monash on September 26th, 2011 9:16 am

    Todd,

    Thanks for the clarification on when data is decompressed! That’s exactly the opposite of what I heard when I was briefed.

    As for combining range partitioning with columnar — yeah, I think one would kind of take that for granted. ;)

  5. Paul Sinclair on October 1st, 2011 7:38 pm

    A clarification on Todd’s comment: “Teradata Columnar can be combined with PPI/MLPPI to have horizontal partitioning and column partitioning”.

    It is only the first P (for [row] partitioning) of PPI and the MLP (for multilevel [row] partitioning) of MLPPI that can be combined with column partitioning. The PI (primary index) of PPI and MLPPI can’t – a column-partitioned table currently can’t have a primary index. Row partitioning in this context is the same as horizontal partitioning (and column partitioning is the same as vertical partitioning).

    So you will probably hear people say things (in the Teradata world) like “column partitioning with PPI or MLPPI” since we are so used to those acronyms being associated with [row] partitioning but just remember currently the PI part doesn’t actually carry over to a column-partitioned table.

  6. Abstracting Data Science for the Every Day User | SiliconANGLE on October 10th, 2011 11:46 am

    [...] Curt Monash does a better job than I ever could in explaining the technical differences between Teradata Columnar and competitors like EMC Greenplum. I recommend his review. [...]

  7. Thoughts on the next releases of Oracle and Exadata | DBMS 2 : DataBase Management System Services on July 25th, 2012 2:54 am

    [...] wonder whether Oracle will finally introduce a true columnar storage option, a year behind Teradata. That would be the obvious enhancement on the data warehousing side, if they can pull it off. If [...]
