Zoomdata and the Vs
Let’s start with some terminology biases:
- I dislike the term “big data” but like the Vs that define it — Volume, Velocity, Variety and Variability.
- Though I think it’s silly, I understand why BI innovators flee from the term “business intelligence” (they’re afraid of not sounding new).
So when my clients at Zoomdata told me that they’re in the business of providing “the fastest visual analytics for big data”, I understood their choice, but rolled my eyes anyway. And then I immediately started to check how their strategy actually plays against the “big data” Vs.
It turns out that:
- Zoomdata does its processing server-side, which allows for load-balancing and scale-out. Scale-out and claims of great query speed are relevant when data is of high volume.
- Zoomdata depends heavily on Spark.
- Zoomdata’s UI assumes data can be a mix of historical and streaming, and that if you’re looking at streaming data you might also want to check history. This addresses velocity.
- Zoomdata assumes data can be in a variety of data stores, including:
- Relational (operational RDBMS, analytic RDBMS, or SQL-on-Hadoop).
- Files (generic HDFS, i.e. the Hadoop Distributed File System, or S3).*
- NoSQL (MongoDB and HBase were mentioned).
- Search (Elasticsearch was mentioned among others).
- Zoomdata also tries to detect data variability.
- Zoomdata is OEM/embedding-friendly.
*The HDFS/S3 aspect seems to be a major part of Zoomdata’s current story.
Core aspects of Zoomdata’s technical strategy include:
- QlikView/Tableau-style navigation, at least up to a point. (I hope that vendors with a much longer track record have more nuances in their UIs.)
- Suitable UI for wholly or partially “real-time” data. In particular:
- Time is an easy dimension to get along the X-axis.
- You can select current or historical regions from the same graph, aka “data rewind”.
- Federated query with some predicate pushdown, aka “data fusion”.
- Data filtering and some GroupBys are pushed down to the underlying data stores — SQL or NoSQL — when it makes sense.*
- Pushing down joins (assuming that both sides of the join are from the same data store) is a roadmap item.
- Approximate query results, aka “data sharpening”. Zoomdata simulates high-speed query by first serving you approximate query results, ala Datameer.
- Spark to finish up queries. Anything that isn’t pushed down to the underlying data store is probably happening in Spark DataFrames.
- Spark for other kinds of calculations.
*Apparently it doesn’t make sense in some major operational/general-purpose — as opposed to analytic — RDBMS. From those systems, Zoomdata may actually extract and pre-cube data.
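To make the pushdown idea concrete, here’s a minimal sketch of how a federation layer might decide what to push down versus what to finish upstream. All the class names, capability flags, and SQL here are my illustrative assumptions, not Zoomdata’s actual internals:

```python
# Illustrative sketch of filter/GROUP BY pushdown in a federated query layer.
# All class and capability names are hypothetical, not Zoomdata's actual API.

from dataclasses import dataclass, field

@dataclass
class Source:
    name: str
    supports_groupby: bool  # e.g. True for an analytic RDBMS, False for some operational ones

@dataclass
class LogicalQuery:
    table: str
    filter_sql: str          # e.g. "event_date >= '2015-01-01'"
    group_by: list = field(default_factory=list)
    agg: str = "COUNT(*)"

def plan(query: LogicalQuery, source: Source) -> tuple:
    """Return (SQL to send to the source, whether aggregation still
    needs to be finished upstream, e.g. in Spark DataFrames)."""
    if source.supports_groupby and query.group_by:
        cols = ", ".join(query.group_by)
        sql = (f"SELECT {cols}, {query.agg} FROM {query.table} "
               f"WHERE {query.filter_sql} GROUP BY {cols}")
        return sql, False    # filter and aggregation both pushed down
    # Push only the filter; the GROUP BY then happens in the Spark tier.
    sql = f"SELECT * FROM {query.table} WHERE {query.filter_sql}"
    return sql, bool(query.group_by)

q = LogicalQuery("clicks", "event_date >= '2015-01-01'", ["region"])
print(plan(q, Source("impala", supports_groupby=True)))
print(plan(q, Source("mysql_oltp", supports_groupby=False)))
```

Note that in both branches the filter goes to the source; the only question is where the aggregation runs, which matches the footnote’s point that some operational RDBMS aren’t good places to run it.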
The technology story for “data sharpening” starts:
- Zoomdata more-or-less samples the underlying data, and returns a result just for the sample. Since this is a small query, it resolves quickly.
- More precisely, there’s a sequence of approximations, with results based on ever larger samples, until eventually the whole query is answered.
- Zoomdata has a couple of roadmap items for making these approximations more accurate:
- The integration of BlinkDB with Spark will hopefully result in actual error bars for the approximations.
- Zoomdata is working itself on how to avoid sample skew.
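A toy version of the sharpening loop might look like the following. The specific sample fractions and the shuffle-to-avoid-skew step are my assumptions, not Zoomdata’s actual mechanics; the real system samples underlying data stores, not an in-memory list:

```python
# Toy model of "data sharpening": answer an aggregate over ever larger
# samples, yielding an approximation after each pass. Sample sizes and
# the shuffling step are my assumptions, not Zoomdata's actual design.

import random

def sharpened_mean(rows, steps=(0.01, 0.1, 0.5, 1.0), seed=0):
    """Yield (fraction_scanned, estimated_mean) for growing samples."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)   # guard against ordered data (sample skew)
    for frac in steps:
        n = max(1, int(len(shuffled) * frac))
        sample = shuffled[:n]
        yield frac, sum(sample) / n         # final pass (frac=1.0) is exact

data = list(range(1, 10001))                # true mean is 5000.5
for frac, est in sharpened_mean(data):
    print(f"{frac:>5.0%} scanned -> mean ~ {est:.1f}")
```

The first pass touches 1% of the rows and so returns almost immediately; the last pass is just the ordinary exact query.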
The point of data sharpening, besides simply giving immediate gratification, is that hopefully the results for even a small sample will be enough for the user to determine:
- Where in particular she wants to drill down.
- Whether she asked the right query in the first place. 🙂
I like this early drilldown story for a couple of reasons:
- I think it matches the way a lot of people work. First you get to the query of the right general structure; then you refine the parameters.
- It’s good for exact-results performance too. Most of what otherwise might have been a long-running query may not need to happen at all.
Aka “Honey, I shrunk the query!”
Zoomdata’s query execution strategy depends heavily on doing lots of “micro-queries” and unioning their result sets. In particular:
- Data sharpening relies on a bunch of data-subset queries of increasing size.
- Streaming/”real-time” BI is built from a bunch of sub-queries restricted to small time slices each.
Even for not-so-micro queries, Zoomdata may find itself doing a lot of unioning, as data from different time periods may be in different stores.
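The time-slicing idea can be sketched roughly as follows. The one-minute slice width and the in-memory “store” are illustrative assumptions; in practice each slice would be a query against a real data store, possibly a different store per time period:

```python
# Sketch of building one "real-time" result from many per-time-slice
# micro-queries and unioning their result sets. The one-minute slice
# width and the in-memory event list are illustrative assumptions.

from datetime import datetime, timedelta
from collections import Counter

def micro_query(store, start, end):
    """One micro-query: count events per key within [start, end)."""
    return Counter(key for ts, key in store if start <= ts < end)

def sliced_count(store, start, end, slice_width=timedelta(minutes=1)):
    """Union (here: sum) the result sets of per-slice micro-queries."""
    total = Counter()
    t = start
    while t < end:
        total += micro_query(store, t, min(t + slice_width, end))
        t += slice_width
    return total

t0 = datetime(2015, 7, 1, 12, 0)
events = [(t0 + timedelta(seconds=30 * i), "click" if i % 2 else "view")
          for i in range(10)]               # five minutes of sample data
print(sliced_count(events, t0, t0 + timedelta(minutes=5)))
```

Because each slice is independent, new slices can be appended as fresh data streams in, and old slices can come from a different (historical) store than recent ones.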
Architectural choices in support of all this include:
- Zoomdata ships with Spark, but can and probably in most cases should be pointed at an external Spark cluster instead. One point is that Zoomdata itself scales by user count, while the Spark cluster scales by data volume.
- Zoomdata uses MongoDB off to the side as a metadata store. Except for what’s in that store, Zoomdata seems to be able to load balance rather statelessly. And Zoomdata doesn’t think that the MongoDB store is a bottleneck either.
- Zoomdata uses Docker.
- Zoomdata is starting to use Mesos.
When a young company has good ideas, it’s natural to wonder how established or mature this all is. Well:
- Zoomdata has 86 employees.
- Zoomdata has (production) customers, success stories, and so on, but can’t yet talk fluently about many production use cases.
- If we recall that companies don’t always get to do (all) their own positioning, it’s fair to say that Zoomdata started out as “Cloudera’s cheap-option BI buddy”, but I don’t think that’s an accurate characterization at this point.
- Zoomdata, like almost all young companies in the history of BI, favors a “land-and-expand” adoption strategy. Indeed …
- … Zoomdata tells prospects it wants to be an additional BI provider to them, rather than a rip-and-replace one.
As for technological maturity:
- Zoomdata’s view of data seems essentially tabular, notwithstanding its facility with streams and NoSQL. It doesn’t seem to have tackled much in the way of event series analytics yet.
- One of Zoomdata’s success stories is iPad-centric. (Salesperson visits prospect and shows her an informative chart; prospect opens wallet; ka-ching.) So I presume mobile BI is working.
- Zoomdata is comfortable handling 10s of millions of rows of data, may be strained when handling 100s of millions of rows, and has been tested in-house up to 1 billion rows. But that’s data that lands in Spark. The underlying data being filtered can be much larger, and Zoomdata indeed cites one example of a >40 TB Impala database.
- When I asked about concurrency, Zoomdata told me of in-house testing, not actual production users.
- Zoomdata’s list when asked what they don’t do (except through partners, of which they have a bunch) was:
- Data wrangling.
- ETL (Extract/Transform/Load).
- Data transformation. (In a market segment with a lot of Hadoop and Spark, that’s not really redundant with the previous bullet point.)
- Data cataloguing, ala Alation or Tamr.
- Machine learning.
Related link
- I wrote about multiple kinds of approximate query result capabilities, Zoomdata-like or otherwise, back in July 2012.
Comments
It sounds like Zoomdata is developing their own federated query engine over spark, in some sense competing with Spark SQL. Is it true?
Not at all… Zoomdata is leveraging Spark and Spark SQL for the federation-like fusion operations. Where possible, we leverage Spark SQL’s External Data Connectors and prefer to execute the queries to the remote sources directly from Spark.
I thought that efficient federation required rework of the optimizer.
Justin mentioned an optimizer to me as we talked, specifically when I asked about what provided performance. Indeed, that was his only answer other than micro-queries and approximate results, the latter of which is powered by micro-queries.
I didn’t pursue details, because in my experience optimizers in new-ish products are usually rather primitive anyway.
Hi Curt,
You mention here that you hope vendors with a much longer track record have more nuances in their UIs. Would you care to expand on that? Some earlier posts discuss navigation as being more important than visualization (http://www.dbms2.com/2013/09/29/visualization-or-navigation/) – by this do you mean interaction with data through visualization? Or more the ability to easily facet/slice/analyze a multidimensional dataset across various dimensions, and see how subsets in one projection are reflected in another (so called linking/brushing). Thanks.
[…] It’s been part of BI since the introduction of Business Objects’ “semantic layer”. (See, for example, my recent post on Zoomdata.) […]
Justin – Zoomdata is great for real-time viz. I tested basic data ingestion of clickstream data from Kafka into ZD (using the upload API) via Spark Streaming. Pretty easy to set up, and I had to write less than 100 lines of code end-to-end!
Wondering if you guys are looking at Druid. Druid offers decent time series based OLAP and a single query interface for real-time and batch data. Connecting ZD to Druid, I thought would provide real benefits.
Any druid connectors? Thoughts?
[…] via approximate query results. This can be done entirely via your BI tool (e.g. Zoomdata’s “query sharpening”) or more by your DBMS/platform software (the Snappy Data folks pitched me on that approach this […]