September 29, 2013

ClearStory, Spark, and Storm

ClearStory Data is:

I think I can do an interesting post about ClearStory while tap-dancing around the still-secret stuff, so let’s dive in.

ClearStory:

To a first approximation, ClearStory ingests data in a system built on Storm (code name: Stormy), dumps it into HDFS, and then operates on it in a system built on Spark (code name: Sparky). Along the way there’s a lot of interaction with another big part of the system, a metadata catalog with no code name I know of. Or as I keep it straight:

Storm, Stormy and data mustering

ClearStory’s architecture revolves around the challenges of data mustering. The most novel part of that is automagic recognition of time periods, currencies, geographic regions, etc. That starts with basics such as parsing, other pattern recognition (regex or whatever), lookup tables and traditional data cleaning — wholly automated if possible, with human intervention if necessary. There also is some more intelligent inferencing — e.g., a set of 5-digit numbers that all fit in the California zip code range might be interpreted as matching the California region. Tying this all together is a decision tree/operations graph.
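To make the zip code example concrete, here is a minimal sketch of pattern-based type inference over a column of strings. It is not ClearStory's code. The function name `infer_column_type`, the regexes, and the California ZIP range are all assumptions for illustration, and the range is only approximate; a real system would presumably lean on lookup tables.

```python
import re

# Assumed, approximate range for California ZIP codes; a real system
# would use an authoritative lookup table instead.
CA_ZIP_RANGE = (90001, 96162)

ZIP_RE = re.compile(r"^\d{5}$")
CURRENCY_RE = re.compile(r"^[$€£]\s?\d[\d,]*(\.\d{2})?$")
ISO_DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def infer_column_type(values):
    """Guess a semantic type for a column of strings (hypothetical helper)."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return "unknown"
    if all(ISO_DATE_RE.match(v) for v in cleaned):
        return "date"
    if all(CURRENCY_RE.match(v) for v in cleaned):
        return "currency"
    if all(ZIP_RE.match(v) for v in cleaned):
        lo, hi = CA_ZIP_RANGE
        # Every value falls in the assumed California range, so infer the region.
        if all(lo <= int(v) <= hi for v in cleaned):
            return "zip_code:california"
        return "zip_code"
    return "string"


print(infer_column_type(["94105", "90210", "95814"]))
print(infer_column_type(["10001", "94105"]))
```

The ordered checks are a crude stand-in for the decision tree/operations graph mentioned above: cheap pattern tests come first, and the region inference only runs once the column is known to look like zip codes.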

Other data mustering issues include:

So far as I can tell, most or all of this work is done in Stormy, a Scala-based system built on Storm. So I’ll quickly digress and mention that Storm:

Anyhow, the core of what Stormy does is:

ClearStory currently writes to RCFile, a columnar format (strictly speaking, columnar within row groups), so other data structures are obviously being flattened out. Naturally, ClearStory is also eyeing Parquet and ORCfile, and has particularly warm thoughts about the former.
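As a rough picture of what "flattening out" means, the sketch below turns nested records into flat, named columns. It says nothing about RCFile's actual on-disk layout or about how Stormy does it; the helper names `flatten` and `to_columns` and the dotted column naming are assumptions for illustration.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single level with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


def to_columns(records):
    """Pivot flattened records into a mapping of column name -> list of values."""
    rows = [flatten(r) for r in records]
    names = sorted({name for row in rows for name in row})
    # Missing fields become None so that every column has one entry per row.
    return {name: [row.get(name) for row in rows] for name in names}


records = [
    {"id": 1, "geo": {"state": "CA", "zip": "94105"}},
    {"id": 2, "geo": {"state": "NY"}},
]
print(to_columns(records))
```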

Spark, Sparky and query execution

Much of my interest in Spark was … well, it was sparked by ClearStory. ClearStory was a very early adopter of Spark, at a level of seriousness that includes:

ClearStory still seems to be very pleased with its choice to use Spark.

The fit, in simplest terms, is that ClearStory needs to do analytic data operations on a lot of tables — not necessarily permanent ones — at interactive/memory-centric speeds, and that’s pretty much what Spark is designed for.* Why so many tables? Because for multiple reasons, ClearStory winds up with lots of intermediate query results and derived data sets. In particular:

*There’s no rule that Spark RDDs (Resilient Distributed Datasets) need to look like tables. But in ClearStory’s case, and I imagine in most other Spark users’ as well, that’s what winds up happening in practice.

As a general rule, ClearStory/Sparky doesn’t keep intermediate query results hanging around for long; rather, it stores the instructions needed to reproduce them.* But the cost-based optimizer gets to decide the exceptions to that rule, i.e., which views will persist in materialized form. I think that’s a pretty cool approach.

*Confusingly, this is called “lineage”, and tracked in the metadata store.
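The "store the recipe, materialize selectively" idea can be sketched in a few lines. This is a toy under stated assumptions, not ClearStory's or Spark's implementation: the `LazyView` class, its `est_cost` field, and the threshold rule are all hypothetical, and the rule is only a crude stand-in for a real cost-based optimizer.

```python
class LazyView:
    """A derived data set that remembers how to rebuild itself (hypothetical)."""

    def __init__(self, name, compute, parents=(), est_cost=1.0):
        self.name = name
        self.compute = compute        # function of the parents' rows
        self.parents = list(parents)
        self.est_cost = est_cost      # assumed cost of running compute once
        self.reads = 0
        self.cached = None

    def lineage(self):
        """Return the names of the steps needed to reproduce this view."""
        steps = []
        for parent in self.parents:
            steps.extend(parent.lineage())
        return steps + [self.name]

    def rows(self, materialize_threshold=10.0):
        self.reads += 1
        if self.cached is not None:
            return self.cached
        inputs = [p.rows(materialize_threshold) for p in self.parents]
        result = self.compute(*inputs)
        # Toy cost rule: keep the result once the cumulative recompute
        # cost (cost per run times number of reads) reaches the threshold.
        if self.est_cost * self.reads >= materialize_threshold:
            self.cached = result
        return result


sales = LazyView("sales", lambda: [1, 2, 3, 4], est_cost=1.0)
even_sales = LazyView(
    "even_sales",
    lambda rows: [r for r in rows if r % 2 == 0],
    parents=[sales],
    est_cost=5.0,
)
print(even_sales.lineage())
print(even_sales.rows())
```

In this sketch the expensive view gets materialized on its second read, while the cheap base view keeps being recomputed from its recipe.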

One thing that I haven’t nailed down yet is how well this all scales. If we consider:

we evidently wind up looking at three rather different sets of numbers.

Comments

5 Responses to “ClearStory, Spark, and Storm”

  1. Steve Ardire on September 29th, 2013 8:06 pm

    > I foresee multiple rounds of Bottleneck Whack-A-Mole in ClearStory’s future.

    I agree and so far nice hype marketing with no customers and no revenue but they got a $9M Series A round http://www.crunchbase.com/company/clearstory-data ;-)

    @sardire

  2. Mikhail Bautin on September 30th, 2013 2:54 pm

    The full-stack approach has to happen for fast analysis of large-scale data from structured and unstructured data sources. There may be no well-defined term for this emerging stack yet, but it will soon be clear what it is.

  3. Scott Johnston on October 8th, 2013 12:24 am

    Had the opportunity to use ClearStory and it’s quite slick. Not just the user model, but the types of questions you can ask of the data, as well as the ease with which you can bring in and apply new data to an analysis on the fly.

  4. Spark and Databricks | DBMS 2 : DataBase Management System Services on February 2nd, 2014 3:59 pm

    […] back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like […]

  5. Spark on fire | DBMS 2 : DataBase Management System Services on April 30th, 2014 6:41 am

    […] number of analytic-stack companies are joining ClearStory in using Spark. Most of the specifics are confidential, but I hope some will be announced […]
