As noted in my other introductory post, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk’s technology, it’s probably OK to assume they’re clickstream/network event logs. In addition, Splunk seems to have aspirations to have its software used for general schema-free analytics, but that effort is in its early days at best.
Splunk’s core technology indexes text and XML files or streams, especially log files. Technical highlights of that part include:
- Splunk software both reads logs and indexes them. The same code runs both on the nodes that do the indexing and on machines that simply emit logs; in the latter case, indexing is turned off. Thus, Splunk does not portray its software as “agentless,” but it asserts that its agent-like software runs without “material” overhead.
- The fundamental thing that Splunk looks at is an increment to a log – i.e., whatever has been added to the log since Splunk last looked at it.
- Splunk tries to figure out what the individual entries are in a section of log it looks at. In particular:
  - Time stamps are a big clue in this “inferencing” process, but they are not the be-all and end-all.
  - Nor are line boundaries, if logs are naturally broken up into lines. (Splunk threw that latter comment in as a shot at SenSage.)
- I get the impression that most Splunk entity extraction is done at search time, not at indexing time. Splunk says that, if a <name, value> pair is clearly marked, its software does a good job of recognizing it. Beyond that, fields seem to be specified by users when they define searches. (A toy sketch of these mechanics follows this list.)
- Splunk has a simple ILM (Information Lifecycle Management) story based on time. I didn’t probe for details.
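
To make those mechanics concrete, here’s a toy Python sketch of three of the ideas above taken together: reading only the increment to a log, using time stamps (not just line boundaries) to infer event boundaries, and extracting clearly marked <name, value> pairs at search time. The log format, regexes, and function names are my own invented illustration, not Splunk internals.

```python
import re

TS = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")  # leading time stamp
KV = re.compile(r"(\w+)=(\S+)")                           # clearly marked name=value pairs

def read_increment(log_text, offset):
    """Look only at whatever was appended since the last read."""
    return log_text[offset:], len(log_text)

def break_events(increment):
    """Infer event boundaries: a new time stamp starts a new event,
    so line breaks alone don't decide where one entry ends."""
    events = []
    for line in increment.splitlines():
        if TS.match(line) or not events:
            events.append(line)
        else:
            events[-1] += " " + line  # glue continuation lines onto the prior event
    return events

def extract_fields(event):
    """Search-time extraction of marked <name, value> pairs."""
    return dict(KV.findall(event))

# Usage: the log grows; we process only the increment.
log = "2009-08-12 14:03:01 action=view user=alice page=/home\n"
offset = len(log)  # pretend this line was handled on an earlier pass
log += ("2009-08-12 14:03:02 action=click user=alice target=buy\n"
        "    java.lang.Exception: continuation line, no time stamp\n"
        "2009-08-12 14:03:05 action=view user=bob page=/pricing\n")
increment, offset = read_increment(log, offset)
for event in break_events(increment):
    print(extract_fields(event))
```

Note that the stack-trace-like continuation line gets folded into the preceding event, which is exactly why line boundaries alone don’t suffice.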
Given its text search engine, Splunk does – well, it does text searches. And it stores searches, so they can be used for alerting or reporting. Indeed, Splunk persists and presumably updates the results of stored searches, in a rough analog to materialized views.
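The materialized-view analogy is loose, but here’s roughly what I mean, in toy Python form. The class and method names are mine, and this models the concept rather than anything Splunk actually exposes: a saved search whose persisted results get refreshed against just the new events, with fresh hits available to drive alerting.

```python
class StoredSearch:
    """Toy analog of a stored search whose results are persisted and
    incrementally refreshed: a rough materialized view."""

    def __init__(self, predicate):
        self.predicate = predicate  # the saved search condition
        self.results = []           # persisted result set
        self.cursor = 0             # how far into the event stream we have read

    def refresh(self, event_log):
        """Re-run the saved search over only the events added since last time."""
        new_events = event_log[self.cursor:]
        self.cursor = len(event_log)
        hits = [e for e in new_events if self.predicate(e)]
        self.results.extend(hits)
        return hits  # fresh hits can drive alerting

# Usage: alert whenever an error event shows up.
log = []
errors = StoredSearch(lambda e: e.get("status") == "error")
log.append({"status": "ok"})
log.append({"status": "error", "msg": "disk full"})
if errors.refresh(log):
    print("ALERT:", errors.results[-1])
```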
Apparently, Splunk’s indexing is typically done via MapReduce jobs. I don’t know whether any actual Splunk searches are also done via MapReduce; surely they aren’t all, given the discussion of a near-real-time alerting engine and so on. Splunk fondly believes its MapReduce is an order of magnitude faster than SQL (I didn’t ask which SQL engines Splunk has in mind when it says this), and 5-10X faster than Hadoop. One efficiency trick is to look ahead and do Reduces in place where possible. This seems to be done automatically in the execution plan, a la Aster’s SQL-MapReduce, rather than having to be hand-coded. Splunk says its software can “easily” index 100-200 gigabytes of data per day on a commodity 8-core server, while maintaining an active search load, and that 300-400 gigabytes are doable.
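The “do Reduces in place” trick sounds like what the Hadoop world calls a combiner: pre-aggregate on the node that ran the Map, so far less data has to move before the final Reduce. A schematic Python sketch, with invented data, of why that helps:

```python
from collections import Counter

# Invented per-node partitions of a log.
partitions = [
    ["GET /home", "GET /home", "POST /buy"],
    ["GET /home", "GET /pricing"],
]

def map_phase(partition):
    """Map: emit a <key, 1> pair per event, here counting hits per request line."""
    return [(line, 1) for line in partition]

def local_reduce(pairs):
    """The reduce-in-place step (a combiner, in Hadoop terms): pre-aggregate
    on the node that ran the Map, shrinking what crosses the network."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts

def final_reduce(partials):
    """Reduce: merge the already-shrunk per-node partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

partials = [local_reduce(map_phase(p)) for p in partitions]
print(final_reduce(partials))  # Counter({'GET /home': 3, 'POST /buy': 1, 'GET /pricing': 1})
```

The point is that each node ships one small table of partial counts rather than one pair per log line.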
Splunk’s capabilities in tabular-style analytics right now seem to be limited to a command-line report builder, plus a GUI wizard that generates the command line. A few users have asked for support of third-party business intelligence tools, but Splunk hasn’t provided it yet. Nor can I find much evidence of ODBC/JDBC drivers for Splunk. But then, I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.