Some technical background about Splunk
In an October, 2009 technical introduction to Splunk, I wrote (emphasis added):
Splunk software both reads logs and indexes them. The same code runs both on the nodes that do the indexing and on machines that simply emit logs.
It turns out that the bolded part was changed several years ago. However, I don’t have further details, so let’s move on to Splunk’s DBMS-like aspects.
I also wrote:
The fundamental thing that Splunk looks at is an increment to a log – i.e., whatever has been added to the log since Splunk last looked at it.
That remains true. Confusingly, Splunk refers to these log increments as “rows”, even though they’re really structured and queried more like documents.
I further wrote:
Splunk has a simple ILM (Information Lifecycle Management) story based on time. I didn’t probe for details.
Splunk’s ILM story turns out to be simple indeed.
- As data streams in, Splunk adds it to the most recent — “hot” — bucket. Once a bucket is full, it becomes immutable — “warm” — and a new hot bucket is opened to receive data.
- Splunk executes queries against whichever of these time-slice buckets make sense, then unions results together as needed.
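To make the bucket idea concrete, here is a minimal Python sketch of time-sliced storage. It is not Splunk's implementation: the `Bucket` and `BucketStore` names are made up for illustration, and "full" is assumed to mean a fixed event count, which is only a stand-in for whatever rule Splunk really uses.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """One time slice of events. Hypothetical structure, for illustration only."""
    events: list = field(default_factory=list)  # (timestamp, text) tuples
    warm: bool = False                          # True once the bucket is sealed

    def overlaps(self, start, end):
        # An empty bucket cannot contribute anything to a query.
        if not self.events:
            return False
        lo = min(t for t, _ in self.events)
        hi = max(t for t, _ in self.events)
        return lo <= end and hi >= start


class BucketStore:
    def __init__(self, capacity=3):
        # Assumption: a bucket is "full" after `capacity` events.
        self.capacity = capacity
        self.buckets = [Bucket()]  # the last bucket is always the hot one

    def append(self, timestamp, text):
        hot = self.buckets[-1]
        hot.events.append((timestamp, text))
        if len(hot.events) >= self.capacity:
            hot.warm = True                # seal the full bucket
            self.buckets.append(Bucket())  # open a new hot bucket

    def search(self, start, end, keyword):
        # Query only the buckets whose time span overlaps the request,
        # then union the per-bucket results.
        results = []
        for bucket in self.buckets:
            if bucket.overlaps(start, end):
                results.extend(
                    (t, text) for t, text in bucket.events
                    if start <= t <= end and keyword in text
                )
        return results
```

The point of the design is that a time-bounded query can skip whole buckets without looking inside them, and that sealed buckets never need to be rewritten.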
Finally, I wrote:
I get the impression that most Splunk entity extraction is done at search time, not at indexing time. Splunk says that, if a <name, value> pair is clearly marked, its software does a good job of recognizing same. Beyond that, fields seem to be specified by users when they define searches.
I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.
The point of what I in October, 2013 called “a high(er)-performance data store into which you can selectively copy columns of data”, and which Splunk enthusiastically calls its “High Performance Analytic Store”, is to meet that latter need.
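For contrast, here is roughly what search-time field extraction looks like, as a hedged Python sketch. The regular expression, the `extract_fields` and `search` helpers, and the sample log line are all assumptions of mine; Splunk's actual extraction rules are not described here.

```python
import re

# Assumption: a "clearly marked" pair looks like name=value or name="quoted value".
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')


def extract_fields(raw_event):
    """Parse name=value pairs out of one raw log line at query time."""
    return {name: value.strip('"') for name, value in PAIR.findall(raw_event)}


def search(raw_events, field_name, wanted):
    """Scan every raw event, extracting fields on the fly."""
    return [e for e in raw_events if extract_fields(e).get(field_name) == wanted]
```

Every query pays the parsing cost again for every event it touches, which is why reporting over specific fields benefits from a store that has already extracted them.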
Inverted list technology is confusing for several reasons, which start:
- It has two names that — rightly or wrongly — are used fairly interchangeably: inverted index and inverted list.
- Inverted indexes have played different roles at different times. In particular:
What’s more, inverted list technology can take several different forms.
- In the simplest case, for each of many keywords, the inverted index lists the documents that contain it. Splunk does a form of this, where the “keyword” is the field — i.e. name — in a (field, value) pair.
- Another option is to store, for each keyword or name, not just document_IDs, but additional information.
  - In the case of (field, value) pairs, the value can be stored. Splunk sometimes does that too.
  - In the case of text documents, the index can store the position(s) in the document at which the word occurs. This is irrelevant to Splunk.
- When you list all the records that have a certain field in them, and the list mentions the values, you’re getting pretty close to having a column-group NoSQL DBMS (e.g. Cassandra or HBase). Indeed, you might even be on your way to a columnar RDBMS; after all, SAP HANA grew out of a text indexing system.
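The two main forms discussed above can be sketched in a few lines of Python. This is a toy illustration under my own assumptions (documents as dicts, hypothetical function names), not a description of how Splunk lays out its indexes.

```python
from collections import defaultdict


def build_field_index(docs):
    """Simplest form: field name -> set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for name in doc:
            index[name].add(doc_id)
    return index


def build_field_value_index(docs):
    """Richer form: field name -> {doc ID: value}, i.e. values are stored too."""
    index = defaultdict(dict)
    for doc_id, doc in docs.items():
        for name, value in doc.items():
            index[name][doc_id] = value
    return index


# Made-up sample documents.
docs = {
    1: {"host": "web1", "status": "200"},
    2: {"host": "web2", "status": "500", "user": "alice"},
    3: {"host": "web1", "user": "bob"},
}
field_index = build_field_index(docs)
field_value_index = build_field_value_index(docs)
```

Note that `field_value_index["host"]` is, in effect, a sparse column: every record that has the field, together with its value. That is the sense in which the richer form starts to resemble a column store.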
Splunk, HPAS, and inverted indexes
With all that background, we can finally summarize Splunk’s “High Performance Analytic Store” story.
- Splunk’s classic data store is an inverted list system that:
  - Tracks (field, value) pairs for a few fields that are always the same, such as Source_System.
  - Otherwise tracks fields only.
- Splunk HPAS is an inverted list system that tracks (field, value) pairs for arbitrary fields. This gives much higher performance for queries that SELECT on or GROUP BY those fields.
- As of Splunk 6, Splunk Classic and Splunk HPAS are tightly and almost transparently integrated.
While I haven’t probed for full specifics, I did gather:
- Queries execute against both data stores at once, without any syntax change. At least, they do if you press some button; that’s the “almost” in the transparency.
- HPAS time-slices the data it stores by the same time intervals that Splunk Classic does. Hence for each time range, integrated Splunk can interrogate the HPAS first and, if it can’t answer, go to the slower traditional Splunk store.
- There are two basic ways to populate the HPAS:
  - As the data streams in.
  - Via the result sets of Splunk queries. Splunk talks as if this is the preferred way, which fits with Splunk’s long-time argument that it’s nice not to have to make any schema choices before you start streaming the data in.
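The "fast store first, slow store as fallback" behavior per time slice can be sketched as follows. Everything here is hypothetical: the slice layout, the `summary` dict standing in for HPAS, and the `summarize` and `count_by_field` helpers are my own illustration, not Splunk's API.

```python
from collections import Counter


def summarize(time_slice, field_name):
    """Populate the fast store for one slice from its raw events
    (loosely analogous to populating it from a query's result set)."""
    time_slice["summary"][field_name] = Counter(
        e[field_name] for e in time_slice["raw_events"] if field_name in e
    )


def count_by_field(slices, field_name):
    """GROUP BY-style count of events per value, across all time slices."""
    totals = Counter()
    for s in slices:
        summary = s["summary"].get(field_name)
        if summary is not None:
            # Fast path: this slice already has value -> count for the field.
            totals.update(summary)
        else:
            # Fallback: scan the slower raw store for this slice only.
            for event in s["raw_events"]:
                if field_name in event:
                    totals[event[field_name]] += 1
    return totals
```

Because both stores are sliced by the same time intervals, the choice between fast path and fallback can be made independently for each slice, and the query itself does not need to change.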