July 12, 2012

Approximate query results

In theory:

And so it would seem that query results always have to be exact. Even so, there are at least four different practical scenarios in which query results can reasonably be regarded as approximate, each associated with query languages that can supersede standard set-theoretic SQL.

Actually, there’s a fifth, and it’s a huge one — some fraction of your data is just plain wrong. But that’s not what this post is about.

First, some queries don’t have binary results, even in principle. Notably, text queries are answered via relevancy rankings, which fit badly into the relational model.

Second — and this can be combined with the first — you might want to generalize the query to look for partial matches. For example, Yarcdata suggested to me a scenario in which:

Similarly, if you’re looking for geographic proximity, it’s common to extend the allowed radius to fish for more results. Or one can walk up the hierarchy in a dimensional model.

Third, sometimes you just don’t have the data for any kind of precise answer at all. One adaptation I’ve mentioned before is to interpolate time series with synthetic data, and send back “precise” results based on that. In the same post I mentioned the Vertica “range join”, wherein users deliberately throw away part of their data — only storing the range it was in — and then join accordingly.

As Donald Rumsfeld might have said — and would have done well to reflect upon — you go into decision-making with the data you have, not the data you wish you had.

Finally, sometimes there’s a precise answer in principle, but for performance reasons you accept an approximate one, at least to start with. Numerous companies have told me stories around this, including:

The latter two categories led me to ask vendors how customers actually make use of their exotic SQL capabilities. Answers boiled down to:

Perhaps the answers will never get much better; it’s tough to get packaged software vendors to support vendor-specific SQL, unless the vendor is Oracle. Even so, we’re seeing ever more ways in which conventional SQL DBMS are being superseded by data management and analytic alternatives.

Comments

2 Responses to “Approximate query results”

  1. Lothar Schubert on July 13th, 2012 4:07 pm

    Bravo! Very good post, very relevant for many scenarios, and certainly not written about enough.

    With regard to #3 and #4, this is ‘somewhat’ similar to the idea of “living sample databases” which had been discussed some 5-10 years ago, but never took off. I think one of the main reasons is the lack of appreciation / skills with regard to statistical theory, when it comes to most DWH/BI implementations. Not an easy change in mindset, and if vendors can lead the way, great.

    Which leads me to another point. Whenever presenting such results (forecasts, interpolations, but also: averages), simply stating the number itself may be not enough, or even plainly misleading. I’d rather want to see numbers with a confidence interval attached to it.

  2. Susan Davis on July 13th, 2012 4:13 pm

    Curt,

    One of the use cases we see for Infobright’s Rough Query is when someone is not sure the specific data exists at all within the constraints they set. An example from some of our telecom customers is when they are searching for an IMSI (specific mobile subscriber) within a specific region against very large volumes of CDR data. Let’s say they want to query within a certain date range – it might take a while if you have to read all the data to see if there is a match, but using Rough Query you can instantaneously get the upper and lower values that do exist. If needed, you can then narrow the filter to determine whether there is a match or not. Doing two sub-second queries is obviously a lot faster than doing an exact query that may take minutes or more.

    As for your comment about (always) requiring a lot of custom coding, Rough query is invoked simply by putting the word “roughly” after the Select statement. Pretty simple.

    I think your post is very timely – given the increasingly large volumes of data people are analyzing, there need to be new ways to query tens or hundreds of TBs of data versus the way they have always been done.

    Susan

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.