March 6, 2014

Splunk and inverted-list indexing

Some technical background about Splunk

In an October, 2009 technical introduction to Splunk, I wrote (emphasis added):

Splunk software both reads logs and indexes them. The same code runs both on the nodes that do the indexing and on machines that simply emit logs.

It turns out that the bolded part was changed several years ago. However, I don’t have further details, so let’s move on to Splunk’s DBMS-like aspects.

I also wrote:

The fundamental thing that Splunk looks at is an increment to a log – i.e., whatever has been added to the log since Splunk last looked at it.

That remains true. Confusingly, Splunk refers to these log increments as “rows”, even though they’re really structured and queried more like documents.

I further wrote:

Splunk has a simple ILM (Information Lifecycle management) story based on time. I didn’t probe for details.

Splunk’s ILM story turns out to be simple indeed.

Finally, I wrote:

I get the impression that most Splunk entity extraction is done at search time, not at indexing time. Splunk says that, if a <name, value> pair is clearly marked, its software does a good job of recognizing same. Beyond that, fields seem to be specified by users when they define searches.

and

I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.

The point of what I in October, 2013 called

a high(er)-performance data store into which you can selectively copy columns of data

and which Splunk enthusiastically calls its “High Performance Analytic Store” is to meet that latter need.

Inverted-list indexing

Inverted list technology is confusing for several reasons, which start: 

What’s more, inverted list technology can take several different forms.

Splunk, HPAS, and inverted indexes

With all that background, we can finally summarize Splunk’s “High Performance Analytic Store” story.

While I haven’t probed for full specifics, I did gather:

Comments

One Response to “Splunk and inverted-list indexing”

  1. Eli Singer on March 7th, 2014 12:48 pm

    Curt – great to see the topic of inverted index being discussed. While it’s mostly been used in text retrieval systems in recent times, I believe it can serve a great purpose in modern, big-data scale, analytical DBMSs. At JethroData we took it to the extreme and actually build an inverted index for every column in a table.

    Inverted indexes are suitable for SELECT with narrowing WHERE clause, and column store is necessary for efficient complex aggregation. The combination of both is a great base for building high performance RDBMS that can make accurate optimization choices based on complete columns statistics.

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.