ClearStory Data is:
- One of the two start-ups I’m most closely engaged with.
- Run by a CEO for whom I have great regard, but who does get rather annoying about secrecy.
- On the verge, finally, of fully destealthing.
I think I can do an interesting post about ClearStory while tap-dancing around the still-secret stuff, so let’s dive in.
ClearStory:
- Has developed a full-stack business intelligence technology — which will however be given a snazzier name than “BI” — that is focused on incorporating a broad variety of third-party information, usually along with some of the customer’s own data. Thus, ClearStory …
- … pushes Variety and Variability to extremes, more so than it stresses Volume and Velocity. But it does want to be used at interactive/memory-centric speeds.
- Has put a lot of effort into user interface, but in ways that fit my theory that UI is more about navigation than actual display.
- Has much of its technical differentiation in the area of data mustering …
- … and much of the rest in DBMS-like engineering.
- Is a flagship user of Spark.
- Also relies on Storm, HDFS (Hadoop Distributed File System) and various lesser open source projects (e.g. the ubiquitous Zookeeper).
- Is to a large extent written in Scala.
- Is at this time strictly a multi-tenant SaaS (Software as a Service) offering, except insofar as there’s an on-premises agent to help feed customers’ own data into the core ClearStory cloud service.
To a first approximation, ClearStory ingests data in a system built on Storm (code name: Stormy), dumps it into HDFS, and then operates on it in a system built on Spark (code name: Sparky). Along the way there’s a lot of interaction with another big part of the system, a metadata catalog with no code name I know of. Or as I keep it straight:
- ClearStory’s end-user UI talks mainly to Sparky, and also to the metadata store.
- ClearStory’s administrative UI talks mainly to Stormy, and also to the metadata store.
Storm, Stormy and data mustering
ClearStory’s architecture revolves around the challenges of data mustering. The most novel part of that is automagic recognition of time periods, currencies, geographic regions, etc. That starts with basics such as parsing, other pattern recognition (regex or whatever), lookup tables and traditional data cleaning — wholly automated if possible, with human intervention if necessary. There also is some more intelligent inferencing — e.g., a set of 5-digit numbers that all fit in the California zip code range might be interpreted as matching the California region. Tying this all together is a decision tree/operations graph.
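To make the inferencing idea concrete, here is a minimal sketch of column-level type recognition that combines regex pattern matching with a lookup range. It is not ClearStory's code. The function name `infer_column_type`, the patterns, and the approximate California ZIP range are all assumptions made up for illustration.

```python
import re

# Assumed, approximate California ZIP range; a real system would use a proper lookup table.
CA_ZIP_RANGE = (90001, 96162)

ZIP_RE = re.compile(r"^\d{5}$")
ISO_DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
CURRENCY_RE = re.compile(r"^[$€£]\s?\d[\d,]*(\.\d+)?$")


def infer_column_type(values):
    """Return a (semantic_type, detail) guess for a column of strings."""
    # Basic cleaning: drop missing or blank cells and trim whitespace.
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return ("unknown", None)
    if all(ISO_DATE_RE.match(v) for v in cleaned):
        return ("date", "ISO-8601 day")
    if all(CURRENCY_RE.match(v) for v in cleaned):
        # Report the leading currency symbol of the first value.
        return ("currency", cleaned[0][0])
    if all(ZIP_RE.match(v) for v in cleaned):
        lo, hi = CA_ZIP_RANGE
        # Inference step: every value falls inside one region's range.
        if all(lo <= int(v) <= hi for v in cleaned):
            return ("zip_code", "California")
        return ("zip_code", None)
    return ("unknown", None)


print(infer_column_type(["94105", "90210", "95814"]))
print(infer_column_type(["2013-09-29", "2013-10-01"]))
```

The checks run in a fixed order here. A decision tree or operations graph of the kind described above would presumably choose among many more recognizers, and fall back to human intervention when none fits.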
Other data mustering issues include:
- Reconciling — “harmonizing” — the time periods and so on.
- Accepting data in a variety of structures — relational/tabular, JSON, etc.
- Compensating for the fact that in most cases ClearStory only controls one end of the data movement pipeline, and hence:
  - The schema in which the data arrives is whatever the provider says it is.
  - The same goes for any conventions about the data.
  - The same goes for data freshness.
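As one illustration of harmonizing time periods, suppose one provider reports monthly figures and another reports quarterly ones. A minimal sketch, assuming periods arrive as "YYYY-MM" and "YYYY-Qn" strings and that the measure is additive (both are my assumptions, as are all the function names), rolls the finer series up to the coarser grain before joining:

```python
from collections import defaultdict


def month_to_quarter(period):
    """Map a 'YYYY-MM' period to its 'YYYY-Qn' quarter."""
    year, month = period.split("-")
    return f"{year}-Q{(int(month) - 1) // 3 + 1}"


def harmonize_to_quarters(monthly):
    """Sum monthly values into quarterly totals (only valid for additive measures)."""
    totals = defaultdict(float)
    for period, value in monthly.items():
        totals[month_to_quarter(period)] += value
    return dict(totals)


def join_on_period(left, right):
    """Inner-join two period-keyed series on the periods they share."""
    return {p: (left[p], right[p]) for p in sorted(left.keys() & right.keys())}


own_sales = {"2013-01": 10, "2013-02": 20, "2013-04": 5}    # customer's monthly data
market_index = {"2013-Q1": 101.5, "2013-Q2": 99.0}          # third-party quarterly data

print(join_on_period(harmonize_to_quarters(own_sales), market_index))
```

Non-additive measures such as averages or end-of-period balances would need a different roll-up rule, which is part of why harmonization is harder than it first looks.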
So far as I can tell, most or all of this work is done in Stormy, a Scala-based system built on Storm. So I’ll quickly digress and mention that Storm:
- Originated as a Twitter project focused on streaming.
- Picked up the capability for batch ingest along the way.
- Has become an Apache incubator project.
- Is obviously well thought of; I’m hearing it mentioned in lots of places.
Anyhow, the core of what Stormy does is:
- Accept data.
- Write it to columns in HDFS.
- Write column statistics and so on to the metadata store.
ClearStory currently writes to RCFile, a purely columnar format, so other data structures are obviously being flattened out. Naturally, ClearStory is also eyeing Parquet and ORCFile, and has particularly warm thoughts about the former.
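The accept/flatten/write-columns/write-statistics flow can be sketched in a few lines. This is an in-memory stand-in only: it does not use Storm, HDFS, or the RCFile format, and the function names and the particular statistics collected are assumptions of mine, not anything ClearStory has described.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names (scalar leaves assumed)."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_columns(records):
    """Pivot row-oriented records into column lists; absent values become None."""
    rows = [flatten(r) for r in records]
    names = sorted({name for row in rows for name in row})
    return {name: [row.get(name) for row in rows] for name in names}


def column_stats(columns):
    """Compute per-column statistics of the kind a metadata store might hold."""
    stats = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        stats[name] = {
            "count": len(values),
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
            # min/max assume each column holds mutually comparable values.
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return stats


records = [
    {"id": 1, "geo": {"state": "CA", "zip": "94105"}},
    {"id": 2, "geo": {"state": "NY"}},
]
columns = to_columns(records)
print(columns)
print(column_stats(columns))
```

Statistics like these are what a cost-based optimizer would later want to consult, which is presumably why they are written at ingest time rather than computed on demand.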
Spark, Sparky and query execution
Much of my interest in Spark was … well, it was sparked by ClearStory. ClearStory was a very early adopter of Spark, at a level of seriousness that includes:
- Mike Franklin as an advisor.
- A couple of Spark committers on staff.
- A lot of miscellaneous enhancements to what, until recently, was just an academic open source project.
- Comments to the effect of “If Spark hadn’t existed, we would have invented something a lot like it.”
ClearStory still seems to be very pleased with its choice to use Spark.
The fit, in simplest terms, is that ClearStory needs to do analytic data operations on a lot of tables — not necessarily permanent ones — at interactive/memory-centric speeds, and that’s pretty much what Spark is designed for.* Why so many tables? Because for multiple reasons, ClearStory winds up with lots of intermediate query results and derived data sets. In particular:
- Like other leading-edge investigative and collaborative BI technologies, ClearStory encourages users to proceed through long chains of query refinement, drilldown, and similar navigation.
- Harmonization challenges can add some complication to the refinement.
- Certain refinements of public/premium data can be multi-tenant, while others are specialized enough to be tenant-specific.
*There’s no rule that Spark RDDs (Resilient Distributed Datasets) need to look like tables. But in ClearStory’s case, and I imagine in most other Spark users’ cases as well, that’s what winds up happening in practice.
As a general rule, ClearStory/Sparky doesn’t keep intermediate query results hanging around for long; rather, it stores the instructions needed to reproduce them.* But the cost-based optimizer gets to decide the exceptions to that rule, i.e., which views will persist in materialized form. I think that’s a pretty cool approach.
*Confusingly, this is called “lineage”, and tracked in the metadata store.
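Here is a toy sketch of the store-the-recipe idea: each derived data set records its parents and a transform, results are recomputed on demand, and a simple cost rule decides which ones to persist. The `Dataset` class and the `should_materialize` rule are hypothetical. ClearStory's actual optimizer and Spark's own lineage mechanism are surely more involved.

```python
class Dataset:
    """A data set described by its lineage: parent data sets plus a transform."""

    def __init__(self, name, parents=(), transform=None, rows=None):
        self.name = name
        self.parents = tuple(parents)
        self.transform = transform  # None for base data
        self.rows = rows            # base rows, or a materialized result
        self.recomputations = 0

    def evaluate(self):
        """Return stored rows if present; otherwise replay the lineage."""
        if self.rows is not None:
            return self.rows
        self.recomputations += 1
        inputs = [parent.evaluate() for parent in self.parents]
        return self.transform(*inputs)

    def materialize(self):
        """Persist the result so later reads skip recomputation."""
        self.rows = self.evaluate()


def should_materialize(recompute_cost, expected_reads, storage_cost):
    """Toy cost rule: persist when the recomputation avoided outweighs storage."""
    return recompute_cost * expected_reads > storage_cost


sales = Dataset("sales", rows=[("CA", 10), ("NY", 7), ("CA", 5)])
ca_only = Dataset("ca_only", parents=[sales],
                  transform=lambda rows: [r for r in rows if r[0] == "CA"])
ca_total = Dataset("ca_total", parents=[ca_only],
                   transform=lambda rows: sum(value for _, value in rows))

# Cost figures below are made-up inputs to the toy rule.
if should_materialize(recompute_cost=5.0, expected_reads=10, storage_cost=20.0):
    ca_only.materialize()
print(ca_total.evaluate())
```

Once `ca_only` is materialized, later evaluations of `ca_total` read the stored rows instead of re-filtering `sales`, which is the trade-off the optimizer is weighing.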
One thing that I haven’t nailed down yet is how well this all scales. If we consider:
- What’s been proven in early production use.
- What’s been simulated by testing in ClearStory’s labs.
- What is strongly suggested as likely to work based on proven capabilities of various parts of the technology.
we evidently wind up looking at three rather different sets of numbers.