July 12, 2012

Approximate query results

In theory:

A database query is a predicate.
A DBMS matches the data it manages against the predicate and send back those records for which the predicate is true.

And so it would seem that query results always have to be exact. Even so, there are at least four different practical scenarios in which query results can reasonably be regarded as approximate, each associated with query languages that can supersede standard set-theoretic SQL.

Actually, there’s a fifth, and it’s a huge one — some fraction of your data is just plain wrong. But that’s not what this post is about.

First, some queries don’t have binary results, even in principle. Notably, text queries are answered via relevancy rankings, which fit badly into the relational model.

Second — and this can be combined with the first — you might want to generalize the query to look for partial matches. For example, Yarcdata suggested to me a scenario in which:

You do a SPARQL query.
You modify the query to accept results higher up in the taxonomy. (Which is likely to be possible, because where there’s SPARQL, there’s apt to be a taxonomy as well.) For example, if you really want to query on two people living in the house, you might extend the query to cover two people connected by any kind of address or building.

Similarly, if you’re looking for geographic proximity, it’s common to extend the allowed radius to fish for more results. Or one can walk up the hierarchy in a dimensional model.

Third, sometimes you just don’t have the data for any kind of precise answer at all. One adaptation I’ve mentioned before is to interpolate time series with synthetic data, and send back “precise” results based on that. In the same post I mentioned the Vertica “range join”, wherein users deliberately throw away part of their data — only storing the range it was in — and then join accordingly.

As Donald Rumsfeld might have said — and would have done well to reflect upon — you go into decision-making with the data you have, not the data you wish you had.

Finally, sometimes there’s a precise answer in principle, but for performance reasons you accept an approximate one, at least to start with. Numerous companies have told me stories around this, including:

Infobright, whose “Rough Query” gives fast approximate results to a broad range of queries.
Metamarkets, which does fast cardinality estimates via HyperLogLog.
Aster Data, which was the first company to point out to me that median, decile, quintile, and so on calculations are a lot faster in a shared-nothing setting if you’re willing to settle for approximate results.

The latter two categories led me to ask vendors how customers actually make use of their exotic SQL capabilities. Answers boiled down to:

(Always) Well, there’s a lot of custom coding.
(Sometimes) We’re working with partner BI vendors to make direct use of the capabilities, but that’s not done yet, so it’s too early to talk about any details.

Perhaps the answers will never get much better; it’s tough to get packaged software vendors to support vendor-specific SQL, unless the vendor is Oracle. Even so, we’re seeing ever more ways in which conventional SQL DBMS are being superseded by data management and analytic alternatives.

Categories: Aster Data, Business intelligence, Data models and architecture, Data warehousing, Database compression, Infobright, Text, Vertica Systems, Yarcdata and Cray

Subscribe to our complete feed!

Comments

3 Responses to “Approximate query results”

Lothar Schubert on July 13th, 2012 4:07 pm

Bravo! Very good post, very relevant for many scenarios, and certainly not written about enough.

With regard to #3 and #4, this is ‘somewhat’ similar to the idea of “living sample databases” which had been discussed some 5-10 years ago, but never took off. I think one of the main reasons is the lack of appreciation / skills with regard to statistical theory, when it comes to most DWH/BI implementations. Not an easy change in mindset, and if vendors can lead the way, great.

Which leads me to another point. Whenever presenting such results (forecasts, interpolations, but also: averages), simply stating the number itself may be not enough, or even plainly misleading. I’d rather want to see numbers with a confidence interval attached to it.
Susan Davis on July 13th, 2012 4:13 pm

Curt,

One of the use cases we see for Infobright’s Rough Query is when someone is not sure the specific data exists at all within the constraints they set. An example from some of our telecom customers is when they are searching for an IMSI (specific mobile subscriber) within a specific region against very large volumes of CDR data. Let’s say they want to query within a certain date range – it might take a while if you have to read all the data to see if there is a match, but using Rough Query you can instantaneously get the upper and lower values that do exist. If needed, you can then narrow the filter to determine whether there is a match or not. Doing two sub-second queries is obviously a lot faster than doing an exact query that may take minutes or more.

As for your comment about (always) requiring a lot of custom coding, Rough query is invoked simply by putting the word “roughly” after the Select statement. Pretty simple.

I think your post is very timely – given the increasingly large volumes of data people are analyzing, there need to be new ways to query tens or hundreds of TBs of data versus the way they have always been done.

Susan
DBMS 2 : DataBase Management System Services – Cloud Data Architect on April 5th, 2017 8:11 am

[…] particular feature that could help with high-speed monitoring is to meet latency constraints via approximate query results. This can be done entirely via your BI tool (e.g. Zoomdata’s “query sharpening”) or more by […]

Leave a Reply

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

Approximate query results

Comments

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin