April 16, 2015

Notes on indexes and index-like structures

Indexes are central to database management.

Perhaps it’s time for a round-up post on indexing. 🙂

1. First, let’s review some basics. Classically:

2. Further: 

3. There are numerous short-request RDBMS indexing strategies, with various advantages and drawbacks. But better indexing, as a general rule, does not a major DBMS product make.

I’ll try to explain this paradox below.

4. The analytic RDBMS vendors who arose in the previous decade were generally index-averse. Netezza famously does not use indexes at all. Neither does Vertica, although its columns themselves play some of the role of indexes, especially given the flexibility in their sort orders. Others got by with much less indexing than was common in, for example, Oracle data warehouses.

Some of the reason was indexes’ drawbacks in terms of storage space and administrative overhead. Also, sequential scans from spinning disk can be much faster than more selective retrieval, so table scans often outperformed index-driven retrieval.

5. It is worth remembering that almost any data access method brings back more data than you really need, at least as an intermediate step. For starters, data is usually retrieved in whole pages, whether you need all their contents or not. But some indexing and index-alternative technologies go well beyond that.
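To make the page-granularity point concrete, here is a minimal Python sketch, with toy 4-row pages and hypothetical data: even a single-row lookup drags a whole page back as an intermediate step.

```python
# Sketch of page-granular retrieval: even a point lookup pulls back a whole
# page of rows, which is then filtered down (hypothetical 4-row pages).

PAGE_SIZE = 4
table = list(range(20))                       # 20 rows, 5 pages
pages = [table[i:i + PAGE_SIZE] for i in range(0, len(table), PAGE_SIZE)]

def fetch_row(rowid):
    page = pages[rowid // PAGE_SIZE]          # I/O happens at page granularity
    return page[rowid % PAGE_SIZE]            # the rest of the page came along

print(fetch_row(9))  # → 9
```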

6. Geospatial indexing is actually one of the examples that gave me the urge to write this post. There are two main geospatial indexing strategies I hear about. One is the R-tree, which basically divides things up into rectangles, rectangles within those rectangles, rectangles within those smaller rectangles, and so on. A query initially brings back the data within a set of rectangles whose union contains the desired region; that intermediate result is then checked row by row for whether it belongs in the final result set.
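The filter-and-refine pattern described above can be sketched in a few lines of Python. This is toy data with flat leaf nodes, not any real R-tree implementation, but it shows the two-step shape: fetch everything inside candidate rectangles, then check row by row.

```python
# Minimal filter-and-refine sketch of R-tree-style retrieval (hypothetical data).
# Leaf "nodes" each cover a minimum bounding rectangle (MBR) over their points.

def intersects(a, b):
    # Rectangles as (xmin, ymin, xmax, ymax).
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def contains(rect, p):
    return rect[0] <= p[0] <= rect[2] and rect[1] <= p[1] <= rect[3]

def mbr(points):
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def query(leaves, rect):
    # Filter step: fetch every leaf whose bounding rectangle overlaps the query.
    candidates = [p for pts in leaves if intersects(mbr(pts), rect) for p in pts]
    # Refine step: check each candidate row individually.
    return [p for p in candidates if contains(rect, p)]

leaves = [[(1, 1), (2, 3)], [(8, 8), (9, 6)]]
print(query(leaves, (0, 0, 3, 3)))  # → [(1, 1), (2, 3)]
```

A real R-tree nests such rectangles several levels deep, but the query-time logic is the same at every level.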

The other main approach to geospatial indexing is the space-filling curve. The idea behind this form of geospatial indexing is roughly:

The idea gets its name because, if you trace a path through the sequence of points, what you get is an approximation to a true space-filling curve.
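For concreteness, here is a toy Z-order (Morton) key in Python, the best-known space-filling-curve mapping; the quantization of coordinates to small integers is assumed to have happened already.

```python
# Sketch of a Z-order (Morton) key: interleave the bits of quantized x and y
# coordinates so that nearby points usually get nearby one-dimensional keys.

def morton_key(x, y, bits=16):
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x supplies the even bits
        key |= ((y >> i) & 1) << (2 * i + 1)    # y supplies the odd bits
    return key

# Points close together in 2-D tend to sort near each other in key order,
# so an ordinary 1-D index (e.g. a b-tree) can serve spatial lookups.
pts = [(5, 9), (5, 10), (200, 3)]
print(sorted(pts, key=lambda p: morton_key(*p)))  # → [(5, 9), (5, 10), (200, 3)]
```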

7. And finally — mature DBMSs use multiple indexing strategies. One of the best examples of a DBMS winning largely on the basis of its indexing approach is Sybase IQ, which popularized bitmap indexing. But when I last asked, some years ago, Sybase IQ actually used 9 different kinds of indexing. Oracle surely has yet more. This illustrates that different kinds of indexes are good in different use cases, which in turn suggests obvious reasons why clever indexing rarely gives a great competitive advantage.
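As a toy illustration of why bitmap indexing suits low-cardinality columns, here is a minimal Python sketch (hypothetical data and column names): each distinct value gets a bit vector, and multi-filter predicates reduce to bitwise operations.

```python
# Toy bitmap index over low-cardinality columns (hypothetical data): one bit
# vector per distinct value, so multi-filter queries become bitwise AND/OR.

def build_bitmaps(column):
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps.setdefault(value, 0)
        bitmaps[value] |= 1 << row          # set the bit for this row
    return bitmaps

region = ["east", "west", "east", "north", "west"]
status = ["open", "open", "closed", "open", "closed"]
r, s = build_bitmaps(region), build_bitmaps(status)

# WHERE region = 'east' AND status = 'open' → AND the two bit vectors.
match = r["east"] & s["open"]
rows = [i for i in range(len(region)) if match >> i & 1]
print(rows)  # → [0]
```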

Comments

18 Responses to “Notes on indexes and index-like structures”

  1. Venkat Krishnamurthy on April 16th, 2015 9:10 pm

    Curt, thanks – great post! One question – as DB storage moves higher in the memory hierarchy, what’s your take on how indexing strategies are changing?

  2. Curt Monash on April 16th, 2015 10:07 pm

    Good question! Frankly, discussions of in-memory data management and of indexing seem to be pretty orthogonal, except insofar as indexes will often be in memory even when data isn’t.

    Perhaps part of the reason is that when data is cached, it’s fluid enough that not a lot of investment is made in indexing it.

    I’d say about in-memory structures in general:

    1. Pointers are back on the table when random lookups are RAM cheap.

    2. Tries have been tried. 🙂 See for example solidDB.

    3. Fixed-length fields, so that addresses can just be calculated — i.e. arrays — are popular.

    4. Compression isn’t as universally highly valued as I’d expect.
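    Point 3 can be illustrated with a minimal Python sketch (hypothetical column data); the standard-library `array` module stands in for a fixed-width in-memory column, where a row’s address is pure arithmetic.

```python
# Sketch of the fixed-length-fields point: with fixed-width values, the address
# of row i is just base + i * width, so lookup is arithmetic, not a search.
from array import array

widths = array("q", [0] * 1_000_000)   # fixed 8-byte signed ints
widths[123456] = 42

def lookup(col, i):
    # Conceptually: offset = i * col.itemsize; no index structure needed.
    return col[i]

print(lookup(widths, 123456))  # → 42
```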

  3. Tony Baer on April 17th, 2015 12:12 pm

    Excellent mini-history lesson on the origin of indexing, who did what, and what’s happening with it now. My take on indexes in analytic databases (vs. transactional ones) is that the range of queries forced a tradeoff between index sprawl and scan performance, one that got further skewed with the emergence of columnar storage, data skipping, and compression. To all that, add in-memory storage. While this is NOT to say that indexes are unnecessary in analytic databases, my take is that they won’t be as vital as in transaction stores, where the range of queries (and data) is typically more limited by purpose & design.

  4. J. Andrew Rogers on April 17th, 2015 12:42 pm

    One of the properties of efficiently distributable indexing structures is that they are all describable with space-filling curves. This leads to significant confusion, since being describable in terms of space-filling curves does not imply that the index implementation linearizes data or access on a space-filling curve. Most large-scale geospatial indexes are a case of the former. Linearizing on space-filling curves is not that useful for indexing; this was extensively studied (and patented by IBM, Oracle, et al) in the 1980s but ultimately abandoned. The resurgence in use of these indexes for geospatial is frequently oblivious to prior research since most of the relevant literature is not readily available on the Internet.

    Geospatial data has two properties that mean it cannot be trivially grafted onto most existing database engines. First, you can show that no partitioning function exists that is both uniform and general while also preserving spatial relationships. Second, geospatial data is not efficiently representable on ordered data structures generally (b-trees, skip lists, etc.), which is the only representation mapping most databases support. This leaves you with indexes like R-trees, which can work adequately if the scale and data model are tightly constrained.

    This blog post captures many of these problems:

    http://www.jandrewrogers.com/2015/03/02/geospatial-databases-are-hard/

    A number of scalable, general geospatial indexing algorithms exist (despite the lack of literature), but they all use embeddings in very high-dimensionality spaces, to get around the first theoretical problem, and are not implemented around order-preserving data structures, to get around the second. These indexing structures work well for more conventional data too, but they fit poorly in database architectures not designed for them.

  5. Nikita Shamgunov on April 18th, 2015 3:46 pm

    Great post Curt. Few comments re lockless indexes.

    1. Lockless indexes are great for concurrent reads and writes and scale up to a high number of concurrent queries on multi-core machines. This is different from saying that they are essential for scale-out. Clustering, high availability, and query planning are essential for scale-out, and skiplists are great for multi-core workloads on one node. One doesn’t exclude the other; basically you need both.

    2. The primary reason to choose skiplists is robustness. They are battle-tested in a variety of systems products that require high throughput, including OSs, high-frequency trading solutions and database internals. There are other structures that claim to be better suited for database workloads: Masstree from MIT and the Bw-tree from Microsoft. Those haven’t proven to be dominant in the industry yet.

    Nikita

  6. Allen on April 18th, 2015 3:51 pm

    Data partitioning can be considered a form of indexing too – at least for data skipping via partition pruning.
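    A minimal Python sketch of that idea (toy partitions, with min/max statistics assumed to be kept per partition):

```python
# Sketch of partition pruning as "indexing": keep min/max stats per partition
# and skip partitions that cannot contain matching rows (hypothetical data).

partitions = [[3, 7, 9], [12, 15, 18], [21, 25, 30]]
stats = [(min(p), max(p)) for p in partitions]

def select_eq(value):
    hits = []
    for part, (lo, hi) in zip(partitions, stats):
        if lo <= value <= hi:                # otherwise: pruned, never scanned
            hits += [v for v in part if v == value]
    return hits

print(select_eq(15))  # → [15]
```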

  7. Boaz Raufman on April 18th, 2015 4:12 pm

    Curt, thanks for igniting this indexes discussion.

    The competitive advantage of indexes eventually depends on 2 factors: use case and cost.

    Use case:
    • Workloads such as ETL or wide-range reporting are not relevant for indexes. On the other hand, for ad-hoc multiple-filter queries there is no replacement for indexes.
    • Most general-purpose RDBMSs are built to handle a large range of scenarios: transactional, ETL, complex full-data-set reporting, BI… In many of those scenarios indexes bring little or no value, so their overall impact is limited, while their cost is high. However, when focusing on the specific use case of interactive BI, which means ad-hoc queries with multiple filters, indexes are a game changer.

    Cost:
    • There was always a trade-off between the value gained from indexes in query performance and their cost in writes/maintenance. The cost is even more significant when it comes to big data.
    • Indexes that rely on locks to maintain integrity, that consume large storage, or that require a high memory footprint cannot serve big data.
    • Making indexes cheap in terms of write performance, storage and maintenance cost re-positions the value of indexes.

    This is JethroData’s formula for gaining competitive value from indexes:
    • Reduce the cost of indexes via lockless, append-only and compressed index structures
    • Integrate low-cost indexes and column storage into a single SQL solution
    • Focus on interactive BI over Big Data: the more selective the queries are, the more advantageous the indexes. As a user drills down into their data, index-based queries become faster and faster.

    Boaz Raufman
    Co-Founder & CTO
    JethroData

  8. Barney Finucane on April 21st, 2015 10:02 am

    At Infor we have an in-memory database that stores column-tokenized data in something very like a patricia trie. Writes (whether updates or inserts) are very fast when the column cardinality doesn’t change, and still good when it does. Data skipping works quite well, and particularly on sparse data. Filtering simply retrieves less data.

  9. Igor on April 22nd, 2015 3:59 pm

    Vertica projections are indexes according to the index definition above.

    Any details as to when/why red-black trees are better than B-trees? It’s definitely not true on HDD (or SSD) and I seriously doubt it’s true in RAM on modern hardware. Any links to articles and/or benchmarks etc. would be appreciated.

  10. J. Andrew Rogers on April 24th, 2015 3:06 am

    To agree with Igor, I can’t think of a case on real computing hardware where red-black trees are a good choice for databases. When performance matters, you treat RAM like a block storage device because that is how it behaves, albeit with lower latency.

  11. Harri Kallio on April 27th, 2015 4:09 pm

    Good topic. I just recently had a discussion about indexes in big data, and would like to point out the nice research work “Space efficient indexes for the Big data era” by Sidirourgos:
    http://dare.uva.nl/document/522766

    E.g., column imprints are something I cannot wait to see in action, especially if they were implemented to also skip disk-resident data blocks, whereas the paper discusses them more as a “cache conscious” index. That would improve Actian Vector over its current min-max block indexes, which are good for naturally ordered data.

    Sybase IQ’s 9 index types reduce to 6, if “date”, “datetime” and “time” index types are seen as datatype specific “high group” implementations. High group is a b-tree index type improved with G arrays (holding lists of row IDs). A unique HG would be only a b-tree (since only one row can have the particular value). At least in version 15, HGs were not local but global if range partitioning was used for a table, which is not good at all. I haven’t checked whether that is the case in version 16.

    IQ’s “fast projection” index type is mandatory, i.e. the columns actually are fast projection indexes. There were optimized flavors of fast projection (FP1, FP2, and in version 15 FP3), but they seem to have been dropped from version 16. They were a storage optimization: 1/2/3-byte lookup tables were used to map the distinct values of a column to surrogate byte values. It would be nice to hear the reason why optimized FPs were dropped from version 16. I could guess that FP3 was not that good, and/or that it caused columns with a big number of distinct values to be turned flat in a much later phase (i.e. a bigger rewrite) than e.g. FP2 did in earlier versions. In general, optimized FPs make data loads unpredictable as to if/when flattening/rewriting of data happens. Most certainly predictability can be improved a lot by choosing properly when to use and not to use optimized FPs (i.e. when the cardinality of a column is known well enough beforehand).

    Vertica’s projections are like secondary clustered indexes with a subset of (carried/included) columns, whereas traditionally (in other products) clustered indexes are primary, since they order the whole base table by the selected ordering column(s).

  12. Harri Kallio on April 27th, 2015 4:21 pm

    That was a poor “secondary clustered indexes” comment from me about Vertica’s projections, since they are still primary. Overlapping/multi-dimensional/subset/decomposed clustered indexes would probably be better descriptions, yet “projection” is a descriptive name altogether. Vertica’s own description seems to refer to clustered indexes as well:
    http://184.106.12.19/2011/09/06/the-power-of-projections-part-3/

  13. Harri Kallio on April 27th, 2015 5:31 pm

    Correction to this:

    “Sybase IQ’s 9 index types reduce to 6, if “date”, “datetime” and “time” index types are seen as datatype specific “high group” implementations.”

    It should be “high non group” (HNG). The rest of the “high group” mentions are meant to be “high group” still.

  14. Notes, links and comments, May 2, 2015 | DBMS 2 : DataBase Management System Services on May 2nd, 2015 10:37 am

    […] rich it’s worth doubling back to see what other folks added. I think the recent geek-out on indexes is one such case. Good stuff was added by multiple smart […]

  15. Kurt Deschler on May 12th, 2015 2:35 pm

    The 1/2/3-byte FP indexes in SAP (Sybase) IQ were replaced in version 16.0 with n-bit FPs that serve the same role and provide higher compression for low-cardinality data.

    The date/time/datetime indexes are hybrid B-Tree/bitmapped indexes that provide optimized access on datepart boundaries.

    In general, the secondary indexes still provide the best performance for highly-selective conditions but are outpaced by IQ’s MPP capabilities in other cases. Secondary indexes remain relevant and supported but the guidelines for using them have evolved substantially.

  16. Harri Kallio on May 14th, 2015 6:23 am

    Thanks, Kurt, for the answer about SAP Sybase IQ’s optimized FP index change. It seems that an NBit FP can also flatten implicitly after the number of unique values has exceeded the IQ UNIQUE setting for a column. But with proper schema design that can be prevented.

    About cost efficient secondary indexes:
    It seems that MonetDB already has the Column Imprint index type, automatically created for fixed-size data type columns:
    https://www.monetdb.org/Documentation/Manuals/MonetDB/Kernel/Modules/Imprints

  17. MemSQL 4.0 | DBMS 2 : DataBase Management System Services on May 20th, 2015 8:37 am

    […] if the two popular alternatives for geospatial indexing are R-trees or space-filling curves, MemSQL favors the latter. (One issue MemSQL sees with R-trees is concurrency.) Notes on […]

  18. Multi-model database managers | DBMS 2 : DataBase Management System Services on August 24th, 2015 4:08 am

    […] April, 2015 post on indexing technology reminds us that one DBMS can do multiple […]
