May 27, 2013

Data skipping

Way back in 2006, I wrote about a cool Netezza feature called the zone map, which in essence allows you to do partition elimination even in the absence of strict range partitioning.

Netezza’s substitute for range partitioning is very simple. Netezza features “zone maps,” which note the minimum and maximum value of each column (if such concepts are meaningful) in each extent. This can amount to effective range partitioning over dates; if data is added over time, there’s a good chance that the data in any particular date range is clustered, and a zone map lets you pick out which data falls in the desired date range.

I further wrote

… that seems to be the primary scenario in which zone maps confer a large benefit.

But I now think that part was too pessimistic. For example, in bulk load scenarios, it’s easy to imagine ways in which data can be clustered or skewed. And in such cases, zone maps can let you skip a large fraction of potential I/O.
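The idea is simple enough to sketch in a few lines. Here is a toy illustration of zone-map-style data skipping (the names and structure are mine, purely hypothetical, and not any vendor's actual implementation): record the min and max of each extent, then skip any extent whose range cannot intersect the query's range.

```python
# Toy sketch of zone-map-style data skipping (illustrative only;
# names and structure are hypothetical, not any vendor's implementation).

def build_zone_map(extents):
    """Record the min and max value found in each extent (block of rows)."""
    return [(min(extent), max(extent)) for extent in extents]

def scan_with_skipping(extents, zone_map, lo, hi):
    """Return values in [lo, hi], skipping extents whose [min, max]
    range cannot intersect the query range."""
    results, extents_scanned = [], 0
    for extent, (emin, emax) in zip(extents, zone_map):
        if emax < lo or emin > hi:
            continue  # whole extent skipped -- no I/O needed
        extents_scanned += 1
        results.extend(v for v in extent if lo <= v <= hi)
    return results, extents_scanned

# Time-ordered load: each extent covers a narrow value range,
# so a range query touches only a few extents.
extents = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
zmap = build_zone_map(extents)
matches, scanned = scan_with_skipping(extents, zmap, 5, 7)
print(matches, scanned)  # [5, 6, 7] 2 -- half the extents never touched
```

Note that nothing here requires the data to be strictly partitioned; if values happen to be clustered (as dates tend to be in an append-over-time load), the skipping falls out for free.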

Over the years I’ve said that other things were reminiscent of Netezza zone maps, e.g. features of Infobright, SenSage, InfiniDB and even Microsoft SQL Server. But truth be told, when I actually use the phrase “zone map”, people usually give me a blank look.

In a recent briefing about BLU, IBM introduced me to a better term — data skipping. I like it and, unless somebody comes up with a good reason not to, I plan to start using it myself. :)

Comments

10 Responses to “Data skipping”

  1. J. Andrew Rogers on May 27th, 2013 3:02 am

    Curt,

    The term I’ve always seen used for this type of optimization is “constraint exclusion”. Structures inside the database engine are tagged with constraints that, in the simple case, capture the min and max values associated with that structure, though you can do more with it. Operations on those structures can be skipped in many cases if the constraint can be excluded from (i.e. does not intersect) the constraint implied by the operation.

    Constraint exclusion can be a highly effective optimization for many types of data models and database engines but is not as common as its utility suggests. Narrowly targeted uses around things like timestamp columns are typical, because generalized implementations are difficult to do effectively, particularly when retrofitting a database engine that was not designed for it. This will become an increasingly pervasive and important optimization in the future.

    There is also a less common “constraint inclusion” optimization that uses the same constraint tagging mechanism but works by skipping a large part of record evaluation for cases when the constraint on a structure satisfies a query and therefore every record associated with the structure logically must satisfy the query constraint. You may still have to check things like liveness of the record but that can have a tiny fraction of the cost of a conventional record evaluation for some types of data models and database engine implementations. This has some high value use cases but they are less common than constraint exclusion in practice.

  2. Curt Monash on May 27th, 2013 5:00 am

    “Constraint exclusion” is a new term to me. So I just googled on it, and the top hits seem to relate to PostgreSQL, e.g. http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html

    That sounds like strict partitioning & partition elimination. And you’re right — the low-hanging fruit for partition elimination is a date range.

  3. J. Andrew Rogers on May 27th, 2013 10:23 am

    Most databases name the concept after the details of their implementation; PostgreSQL used the generic term for its relatively narrow implementation on partitions. Constraint exclusion is also a concept used in stream processing systems.

    A challenge of generalized constraint exclusion is that it starts to look like a hyper-rectangle indexing problem. The literature is sparse on fast, adaptive, and scalable methods for representing constraint data types. Simple, narrow cases like date range exclusion or hash partitioning are straightforward, so that is what gets implemented.

    The value of generalized constraint exclusion, at least in theory, is that it could effectively replace traditional indexing structures at scales far beyond where traditional indexes start to perform poorly.

  4. David Aldridge on May 28th, 2013 8:03 am

    Sounds like what Oracle call “storage indexes” on Exadata.

    http://www.oracle.com/technetwork/issue-archive/2011/11-may/o31exadata-354069.html

  5. Curt Monash on May 28th, 2013 4:05 pm

    Well, yes — if you come up with a solution that performs well in all instances of a highly general problem, then that could be valuable. :)

  6. Impala and Parquet | DBMS 2 : DataBase Management System Services on June 23rd, 2013 10:47 pm

    […] Impala roadmap includes workload management, query optimization, data skipping, user-defined functions, hash distribution, two turtledoves, and a partridge in a pear […]

  7. IBM BLU | DBMS 2 : DataBase Management System Services on July 9th, 2013 12:41 pm

    […] Data skipping, to reduce I/O. […]

  8. Confusion about metadata | DBMS 2 : DataBase Management System Services on February 23rd, 2014 1:50 am

    […] Value ranges that inform partition pruning or, more generally, data skipping. […]

  9. Introduction to CitusDB | DBMS 2 : DataBase Management System Services on May 2nd, 2014 4:00 am

    […] Min/max column values (for data skipping). […]

  10. MemSQL update | DBMS 2 : DataBase Management System Services on November 25th, 2014 1:13 am

    […] believes it has an aggressive data skipping […]
