February 17, 2013

Notes and links, February 17, 2013

1. It boggles my mind that some database technology companies still don’t view compression as a major issue. Compression directly affects storage and bandwidth usage alike — for all kinds of storage (potentially including RAM) and for all kinds of bandwidth (network, I/O, and potentially on-server).

Trading off less-than-maximal compression so as to minimize CPU impact can make sense. Having no compression at all, however, is an admission of defeat.
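
To make the tradeoff concrete, here is a minimal sketch using Python's standard zlib module, whose compression levels run from 0 (store only) to 9 (smallest output, most CPU). The sample rows and the `compression_profile` helper are invented for illustration; no particular DBMS works this way, and the timings will vary by machine.

```python
import time
import zlib


def compression_profile(data: bytes, levels=(0, 1, 6, 9)):
    """Return {level: (compressed_size_in_bytes, seconds)} for each zlib level."""
    profile = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        # Round-trip check: compression must be lossless.
        assert zlib.decompress(compressed) == data
        profile[level] = (len(compressed), elapsed)
    return profile


# Hypothetical, highly repetitive rows, made up for this example.
sample = b"".join(
    b"2013-02-17,store_%03d,widget,9.99\n" % (i % 50) for i in range(20000)
)

if __name__ == "__main__":
    print(f"raw: {len(sample)} bytes")
    for level, (size, seconds) in compression_profile(sample).items():
        print(f"level {level}: {size} bytes, {seconds * 1000:.1f} ms")
```

Level 0 is the "no compression at all" case: the output is slightly larger than the input. A low level such as 1 is the kind of less-than-maximal setting that gives up some space savings to keep CPU cost down.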

2. People tend to misjudge Hadoop’s development pace in either of two directions. An overly expansive view is to note that some people working on Hadoop are trying to make it be all things for all people, and to somehow imagine those goals will soon be achieved. An overly narrow view is to note an important missing feature in Hadoop, and think there’s a big business to be made out of offering it alone.

At this point, I’d guess that Cloudera and Hortonworks have 500ish employees combined, many of whom are engineers. That allows for a low double-digit number of 5+ person engineering teams, along with a number of smaller projects. The most urgently needed features are indeed being built. On the other hand, a complete monument to computing will not soon emerge.

3. Schooner’s acquisition by SanDisk has led to the discontinuation of Schooner’s SQL DBMS SchoonerSQL. Schooner’s flash-optimized key-value store Membrain continues. I don’t have details, but the Membrain web page suggests both data store and cache use cases.

4. There’s considerable personnel movement at Boston-area database technology companies right now. Please ping me directly if you care.

5. I talked recently with Ashish Thusoo of Qubole. Qubole’s initial offering is a Hive-in-the-cloud, started by the guys who invented Hive. Qubole’s coolest new technical feature vs. generic Hive seems to be a disk-based columnar cache that lives with the servers, to help “smooth over the jitters” between Amazon EC2 and S3. Qubole company basics include:

6. In my recent When I am a VC Overlord post, I wrote:

4. I will not fund any software whose primary feature is that it is implemented in the “cloud” or via “SaaS”. A me-too product on a different platform is still a me-too product.

5. I will not fund any pitch that emphasizes the word “elastic”. Elastic is an important feature of underwear and pajamas, but even in those domains it does not provide differentiation.

Cloud/SaaS deployments give you a chance at providing superior ease of use/installation/administration, without compromising functionality — but they don’t automatically guarantee it. It’s hard work to make your customers’ lives easier.*

*This is the second consecutive post in which I’ve used a similar line. I’ll try to stop now. What’s really scary is that I was inspired by the old Frank Perdue ad “It takes a tough man to make a tender chicken.” :)

7. Ofir Manor of EMC is skeptical about Oracle’s claims for Hybrid Columnar Compression. But he didn’t really dig up that much dirt, except that he seems to think 10X compression is more of a ceiling than the floor that Oracle marketing suggests it is. The money quote is:

Oracle used to provide 3x compression, now it provides 10x compression, so no wonder the best references customers are seeing about 3.4x savings…

That 3X is from Oracle’s Basic Compression, which seems to be a block-level dictionary scheme.
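
For readers who haven't seen one, a block-level dictionary scheme can be sketched roughly as follows. This is a generic illustration, not Oracle's actual on-disk format; the function names, the block size, and the sample values are all assumptions made for the example.

```python
def encode_block(values):
    """Dictionary-encode one block: return (dictionary, codes)."""
    dictionary = []   # distinct values, in first-seen order
    index = {}        # value -> position in dictionary
    codes = []        # one small integer per original value
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def decode_block(dictionary, codes):
    """Rebuild the original values of one block."""
    return [dictionary[code] for code in codes]


def encode_in_blocks(values, block_size):
    """Split values into fixed-size blocks, each with its own dictionary."""
    return [
        encode_block(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


# Hypothetical column values, invented for illustration.
states = ["MA", "MA", "NY", "MA", "CA", "NY", "NY", "CA"]
blocks = encode_in_blocks(states, block_size=4)
# blocks[0] is (["MA", "NY"], [0, 0, 1, 0])
# blocks[1] is (["CA", "NY"], [0, 1, 1, 0])
```

Each block carries its own dictionary, so a value that appears in many blocks ("NY" here) is stored once per block. The savings therefore depend on how much repetition there is within a block, not across the whole table.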

8. Nong Li of Cloudera wrote in praise of the code generation option in Impala. 3x performance is mentioned. What interested me was a nice observation that goes beyond Impala:

Code generation is most beneficial for queries that execute simple expressions and the interpretation overhead is most pronounced. For example, a query that is doing a regular expression match over each row is not going to benefit from code generation much because the interpretation overhead is low compared to the regex processing time.

Code generation may end up like compression — an architectural feature that DBMS just obviously should have.
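
The interpretation-overhead point can be illustrated with a toy Python analogue. This is only a sketch of the general idea, not Impala's implementation; the tuple-based expression format and the `interpret`, `to_source`, and `generate` helpers are invented for the example.

```python
import operator

# Toy expression tree: ("col", name), ("const", value), or (op, left, right).
OPS = {"+": operator.add, "*": operator.mul, ">": operator.gt}


def interpret(expr, row):
    """Walk the tree for every row; the dispatch cost is paid each time."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "const":
        return expr[1]
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr):
    """Turn the tree into Python source text for a single expression."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "const":
        return repr(expr[1])
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"


def generate(expr):
    """Compile the expression once; the result is called once per row."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


# Hypothetical predicate: price + 1 > 10
expr = (">", ("+", ("col", "price"), ("const", 1)), ("const", 10))
rows = [{"price": 5}, {"price": 12}]

compiled = generate(expr)
interpreted_results = [interpret(expr, row) for row in rows]
compiled_results = [compiled(row) for row in rows]
```

Both paths return the same answers. The generated function skips the per-row tree walk, which matters most when the expression itself is cheap. If the expression were dominated by something expensive, such as a regex match, removing the tree walk would save comparatively little, which is the observation in the quote.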

Comments

5 Responses to “Notes and links, February 17, 2013”

  1. Ofir on February 18th, 2013 6:45 am

    Monash – thanks a lot for the mention and link!
    To give some background – ever since I joined EMC (Greenplum) from Oracle a year and a half ago, I occasionally ran into EMC core people who are afraid to try to compete with EMC storage against Exadata as a DW platform. Specifically, some get lost when Oracle claims 10x storage savings. After explaining for the tenth time how it works, I decided to write a quick post about it, which somehow became a three-part series :)
    http://ofirm.wordpress.com/tag/hcc/

    The TL;DR version is:
    - HCC is awesome. It provides an average of 10x compression of raw data.
    - However, this translates to only 3x storage savings for the best Oracle DW references, for several reasons.
    - That saving is typically negated in the real world, mostly due to lack of snapshots in Exadata storage (that’s the third post in the series).

    All that is not a big secret, just a widely known gap between marketing and reality. I just used Oracle’s own published references to drive this point home.
    In a way, it is similar to how Oracle recently took a great technical achievement with Exadata X3 series (4x the flash in the storage layer) and turned it into a silly claim of “in-memory” database. Good engineering, deceptive marketing.

  2. aaron on February 19th, 2013 4:08 pm

    It’s important to note that Exadata has a bunch of compression strategies, HCC being one of them – mostly used for static archive data. For certain data (a lot of “ideal for columnar” use cases), it leads to very high compression rates.

    This is negated though in many use cases:
    - DW migrations from row stores, including Oracle, tend not to be optimized for compression or even ILM
    - A lot of Exadata use is OLTP or analytical, and really precludes HCC
    - HCC and indices -> complicated. Column searches, storage indices, uncompressed indices… Many row store large tables have index size = (uncompressed) table size. Index compression is a whole other topic.
    On a positive note, there is strong correlation between HCC compressed size and query performance, and good use cases here (scans…) are shockingly fast.

    Net/net:
    - compression is good, and requires design
    - YMWV with compression
    - compression is not discrete true/false states
    - expect all vendors to do a lot more with compression over time

  3. Curt Monash on February 19th, 2013 5:58 pm

    Good points about HCC.

    In particular, since HCC isn’t actually columnar storage, compression isn’t going to give great results unless ALL the columns compress well, or at least a sufficient fraction of them.

    Here by “great results” I mean great performance, not just compression space savings.

  4. aaron on February 19th, 2013 8:32 pm

    The difference I’ve seen in HCC vs column DBMS compression is that HCC is really a bunch of little sharded column stores – each a few MB containing dictionaries and column stores. If the cardinality is large enough to not fit into the compression unit – it gets inefficient. (But don’t give column stores slack – they have problems with alternate indexing and sorts, etc.)

    Changing topic to Impala – do you notice how low a bar they are setting at this point? There have been at least a dozen compile to binary prod releases in DBMS languages, and almost all have near zero adoption (IBM with SQL bind being the exception, but that is an interpreter). The problem is that this will only show big bang with trivial programs, which are implicitly IO bound. Impala, even after it gets to functional adequacy, will still need an optimizer.

  5. One database to rule them all? | DBMS 2 : DataBase Management System Services on February 21st, 2013 12:53 am

    [...] patterns can a single data store support efficiently? Ted Codd was on multiple sides of that issue, first suggesting that relational DBMS could do everything and then averring they could not. Mike Stonebraker too has been on multiple sides, first introducing universal DBMS attempts with [...]
