June 21, 2010

What kinds of data warehouse load latency are practical?

I took advantage of my recent conversations with Netezza and IBM to discuss what kinds of data warehouse load latency were practical. In both cases I got the impression:

There’s generally a throughput/latency tradeoff, so if you want very low latency with good throughput, you may have to throw a lot of hardware at the problem.

I’d expect to hear similar things from any other vendor with reasonably mature analytic DBMS technology. Low-latency load is a problem for columnar systems, but both Vertica and ParAccel designed in workarounds from the get-go. Aster Data probably didn’t meet these criteria until Version 4.0, its old “frontline” positioning notwithstanding, but I think it does now.
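
Purely as an illustration, here is a minimal Python sketch of the general pattern behind those columnar workarounds: new rows land in an in-memory, row-oriented write buffer; queries scan the buffer along with the on-disk column segments; and a periodic step pivots the buffer into columnar form. The names are invented, and this is not a description of how Vertica or ParAccel actually implement it.

    # Minimal sketch of the usual columnar workaround for low-latency load:
    # recent rows sit in an in-memory, row-oriented buffer; queries scan the
    # on-disk column segments plus the buffer; a periodic step pivots the
    # buffer into columnar form. All names are invented for illustration.
    from itertools import chain

    class HybridColumnStore:
        def __init__(self, flush_threshold=100_000):
            self.column_segments = []   # each: dict of column name -> list of values
            self.write_buffer = []      # recent rows, row-oriented, in memory
            self.flush_threshold = flush_threshold

        def insert(self, row):
            """Low-latency path: just append to the in-memory buffer."""
            self.write_buffer.append(row)
            if len(self.write_buffer) >= self.flush_threshold:
                self.flush()

        def flush(self):
            """Batch path: pivot buffered rows into a columnar segment."""
            if not self.write_buffer:
                return
            columns = {k: [r[k] for r in self.write_buffer]
                       for k in self.write_buffer[0]}
            self.column_segments.append(columns)
            self.write_buffer = []

        def scan(self, column):
            """Queries see flushed segments plus the not-yet-flushed buffer."""
            flushed = chain.from_iterable(seg[column] for seg in self.column_segments)
            return chain(flushed, (row[column] for row in self.write_buffer))

    store = HybridColumnStore(flush_threshold=3)
    for i in range(5):
        store.insert({"id": i, "amount": i * 10})
    print(sum(store.scan("amount")))   # 100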


Comments

4 Responses to “What kinds of data warehouse load latency are practical?”

  1. Bradley C. Kuszmaul on June 21st, 2010 4:21 pm

    Although there is a throughput/latency tradeoff, most databases are
    far from the tradeoff curve. There are two ways to look at this: in
    theory and in practice.

    In theory: B-trees achieve O(log N/log B) disk-head movements per
    insertion and O(log N/log B) per point query. There are known data
    structures that achieve O((log N)/sqrt(B)) insertion performance
    while matching B-tree point-query and range-query performance. Note
    that sqrt(B) >> log B for most interesting block sizes, making these
    data structures much better at insertions than B-trees. For example,
    log N/log B is more than one, while (log N)/sqrt(B) can be something
    like 1/1000. There is a tradeoff curve, but B-trees are not on it.

    In practice: Tokutek is selling a storage engine for MySQL that
    maintains indexes with tens of thousands of insertions per second,
    even when the indexes are much larger than main memory, and even for
    the worst-case data sets (such as when keys look essentially random).

    Since Fractal Tree index performance dominates B-tree performance, I
    expect Fractal Tree indexes to displace essentially all uses of B-tree
    indexes over time.

    This means there is no real reason to divide the world into OLTP and
    OLAP. One can have OLAP-like throughput with OLTP-like latency.

  2. Jon Bock on June 22nd, 2010 11:19 am

    A key piece of the technology needed to deliver high-throughput, low-latency loading is offloading and parallelizing the load. Serialized loading through a single chokepoint, and contention between loading and queries, are two of the factors that hinder load throughput and latency. I believe your reference to Aster Data has in mind that parallel, offloaded loading across any number of commodity server nodes is part of Aster’s offering.
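
    As a generic illustration of that idea (not a description of Aster’s actual loader), here is a minimal Python sketch that hash-partitions an incoming batch across several loader nodes so that no single process becomes the chokepoint. The node count and the distribution key are assumptions.

        # Hash-partition incoming rows across loader nodes so they can be
        # loaded concurrently instead of through one serialized path.
        # NUM_LOADER_NODES and the "customer_id" key are illustrative only.
        import hashlib
        from collections import defaultdict

        NUM_LOADER_NODES = 4

        def loader_node_for(key: str) -> int:
            """Pick a loader node by hashing the row's distribution key."""
            return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_LOADER_NODES

        def partition_batch(rows):
            """Split one batch into per-node sub-batches for parallel loading."""
            per_node = defaultdict(list)
            for row in rows:
                per_node[loader_node_for(row["customer_id"])].append(row)
            return per_node

        batch = [{"customer_id": f"c{i}", "amount": i} for i in range(10)]
        for node, rows in sorted(partition_batch(batch).items()):
            print(f"node {node}: {len(rows)} rows")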

  3. Curt Monash on June 22nd, 2010 12:44 pm

    @Jon,

    Actually, I was referring to the fact that single-row inserts were really slow in Aster Data until Version 4. 😉

  4. tzahi on June 23rd, 2010 9:51 am

    AFAIK, Vertica takes care of this by delivering results from its write cache in memory. I think they offer real-time updates.

    Another similar technology is Lucene: in version 2.9 it incorporates near-real-time updates, again by applying queries to the IndexWriter cache. It is not an analytic DBMS, but it serves similar purposes. Attivio even builds SQL access, including aggregate functions, on top of Lucene. This makes it possible to provide DW-like functionality on top of semi-structured data.
