I previously dropped a few hints about my clients at Metamarkets, mentioning that they:
- Have built vertical-market analytic platform technology.
- Use a lot of Hadoop.
- Throw good parties. (That’s where the background photo on my Twitter page comes from.)
But while they’re a joy to talk with, writing about Metamarkets has been frustrating, with many hours and pages of wasted effort. Even so, I’m trying again, in a three-post series.
Much like Workday, Inc., Metamarkets is a SaaS (Software as a Service) company, with numerous tiers of servers and an affinity for doing things in RAM. That’s where most of the similarities end, however, as Metamarkets is a much smaller company than Workday, doing very different things.
Metamarkets’ business is SaaS business intelligence on large data sets, with low latency in both senses: fresh data becomes queryable quickly, and the queries themselves run at RAM speed. As you might imagine, Metamarkets is used by digital marketers and other kinds of internet companies, whose data typically wants to be in the cloud anyway. Approximate metrics for Metamarkets (and it may well have exceeded these by now) include 10 customers, 100,000 queries/day, 80 billion 100-byte events/month (before summarization), 20 employees, 1 popular CEO, and a metric ton of venture capital.
To understand how Metamarkets’ technology works, it probably helps to start by realizing:
- Metamarkets has one technology stack for receiving and managing data when it is ingested in batch mode.
- Metamarkets has a different, overlapping technology stack for receiving and managing data when it is ingested in streaming mode.
- Metamarkets is open-sourcing part of the two stacks, called Druid.
- In the Venn diagram for these three things, the intersection of no two of them is strictly contained in the third.
- Metamarkets doesn’t surface all the raw data for analysis or viewing. Rather, there’s some early aggregation, with the raw data preserved off to the side in case you want to create more aggregates later on.
- Metamarkets’ application is a dashboard, supporting drilldown but not, at this time, other forms of analytics. Much of what is measured is time series and/or top lists.
- Druid is in essence an analytic DBMS; indeed, it’s so strictly analytic that it isn’t suited to manage its own metadata. MySQL is used for that.
- Apache ZooKeeper is also assumed to be part of the environment, where it helps manage Druid.
- The batch pipeline relies on Hadoop.
- The streaming pipeline relies on Kafka (a publish-subscribe project out of LinkedIn).
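The "early aggregation" point in the list above can be sketched roughly as follows. This is only an illustration of the general idea, not Metamarkets' or Druid's actual code; the event fields (`timestamp`, `publisher`, `country`, `impressions`), the hourly granularity, and the `rollup` helper are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

def rollup(events, dimensions, metric):
    """Sum `metric` per (hour bucket, dimension values). Hypothetical helper."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0})
    for event in events:
        # Truncate the event time to the hour; the granularity is an assumption.
        ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        hour = ts.replace(minute=0, second=0, microsecond=0)
        key = (hour.isoformat(),) + tuple(event[d] for d in dimensions)
        totals[key]["count"] += 1
        totals[key]["sum"] += event[metric]
    return dict(totals)

# Made-up raw events; in the scheme described above these would also be
# kept "off to the side" so other aggregates could be built later.
raw_events = [
    {"timestamp": 0, "publisher": "a.example", "country": "US", "impressions": 3},
    {"timestamp": 60, "publisher": "a.example", "country": "US", "impressions": 2},
    {"timestamp": 3600, "publisher": "a.example", "country": "US", "impressions": 7},
    {"timestamp": 120, "publisher": "b.example", "country": "DE", "impressions": 1},
]

summary = rollup(raw_events, dimensions=["publisher", "country"], metric="impressions")
for key, agg in sorted(summary.items()):
    print(key, agg)
```

Queries then run against the much smaller `summary` rather than against `raw_events`, which is presumably a large part of why RAM-speed answers are feasible.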
The whole thing is fully multi-tenant, at least by the point at which data is stored and visualized. Metamarkets customers either live in the Amazon cloud (the smaller ones), or else used to live there and don’t mind shipping their data back there for analysis by Metamarkets. Some “not exactly Ted Codd’s tabular DBMS” features are:
- Multi-valued fields (just vectors, not unlimited arrays).
- A couple of fast approximate algorithms (uniques, top lists).
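To make the two features above a bit more concrete, here is a minimal sketch of an approximate unique count applied to a multi-valued field. I don't know which algorithms Metamarkets actually uses; the K-minimum-values estimator below is just one well-known technique swapped in for illustration, and the `tags` field, the `approx_uniques` name, and the choice of `k` are all assumptions.

```python
import hashlib
import heapq

def _unit_hash(value):
    """Map a string to a float in [0, 1) using a stable hash."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def approx_uniques(values, k=256):
    """K-minimum-values estimate of the number of distinct values.

    Keeps only the k smallest distinct hashes, so memory stays bounded
    no matter how many values stream through.
    """
    heap = []        # max-heap (via negation) of the k smallest hashes seen
    members = set()  # same hashes, for duplicate detection
    for value in values:
        h = _unit_hash(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)          # fewer than k distinct values: count is exact
    return int((k - 1) / -heap[0])  # standard KMV estimate from the k-th smallest hash

# A multi-valued field: each row carries a vector of tags (made-up data).
rows = [
    {"user": "u1", "tags": ["sports", "news"]},
    {"user": "u2", "tags": ["news"]},
    {"user": "u3", "tags": ["finance", "sports"]},
]
distinct_tags = approx_uniques(tag for row in rows for tag in row["tags"])
```

With `k=256` the relative error of this kind of estimator is typically in the single-digit percent range, which is the usual trade: a small, fixed amount of memory in exchange for an answer that is close rather than exact.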
One thing Metamarkets does that’s pretty much a best practice these days is roll out new code, mid-day if they like, without ever taking their system down. Why is this possible? Because the data is replicated across nodes, so you can do a rolling deployment one node at a time without making any data unavailable. Notes on that include:
- Performance could be affected, as the read load is generally balanced across all the data replicas.
- Data locking is not an issue — Metamarkets doesn’t have any read locks, as Druid is an MVCC (Multi-Version Concurrency Control) system.
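The reasoning behind the rolling deployment can be checked with a toy simulation. This is a sketch under stated assumptions, not a description of Metamarkets' deployment tooling: the node names, segment ids, placement maps, and the `rolling_restart` function are all hypothetical.

```python
def rolling_restart(placement):
    """Restart nodes one at a time; return True if every segment stayed readable.

    `placement` maps a node name to the set of segment ids that node holds.
    """
    all_segments = set().union(*placement.values())
    for restarting in placement:
        # Segments still served while `restarting` is offline.
        serving = set()
        for node, segments in placement.items():
            if node != restarting:
                serving |= segments
        if serving != all_segments:
            return False
    return True

# Every segment lives on two nodes, so any single node can go down safely.
replicated = {
    "node1": {"s1", "s2"},
    "node2": {"s2", "s3"},
    "node3": {"s3", "s1"},
}

# No replication: restarting any node makes its segment unavailable.
unreplicated = {
    "node1": {"s1"},
    "node2": {"s2"},
    "node3": {"s3"},
}

print(rolling_restart(replicated))
print(rolling_restart(unreplicated))
```

The performance caveat shows up here too: while one of three evenly loaded nodes is restarting, the other two have to absorb its share of the reads.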