This is part of a three-post series.
The canonical Metamarkets batch ingest pipeline is a bit complicated.
- Data lands on Amazon S3 (either uploaded there for the purpose, or because it was there all along).
- Metamarkets processes it, primarily via Hadoop and Pig, to summarize and denormalize it, and then puts it back into S3.
- Metamarkets then pulls the data into Hadoop a second time, to get it ready to be put into Druid.
- Druid is notified, and pulls the data from Hadoop at its convenience.
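The batch flow above can be sketched in a few lines of Python. This is purely illustrative; none of the function names are actual Metamarkets or Druid APIs, and the "S3" and "Hadoop" stages are stand-ins for in-memory lists.

```python
# Minimal sketch of the batch ingest flow: raw data is summarized and
# denormalized in a first Hadoop/Pig pass, written back to S3, then a
# second pass builds Druid segments. All names here are hypothetical.

def summarize_and_denormalize(raw_rows):
    """First Hadoop pass: roll raw events up by (page, hour)."""
    summary = {}
    for row in raw_rows:
        key = (row["page"], row["hour"])
        summary[key] = summary.get(key, 0) + row["count"]
    return [{"page": p, "hour": h, "count": c}
            for (p, h), c in summary.items()]

def batch_ingest(raw_rows):
    # Step 1: data "lands on S3" -- here, just an in-memory list.
    # Step 2: summarize/denormalize; the result would go back to S3.
    summarized = summarize_and_denormalize(raw_rows)
    # Step 3: a second Hadoop pass would build Druid segments from this.
    # Step 4: Druid is notified and pulls the segments at its convenience.
    return summarized

rows = [
    {"page": "/home", "hour": 0, "count": 1},
    {"page": "/home", "hour": 0, "count": 4},
    {"page": "/about", "hour": 1, "count": 2},
]
print(batch_ingest(rows))
```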
By “get data ready to be put into Druid” I mean:
- Build the data segments (recall that Druid manages data in rather large segments).
- Note metadata about the segments.
That metadata is what goes into the MySQL database, which also retains information about shards that have been invalidated. (That part is needed because of Druid’s MVCC approach.)
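The MVCC point can be made concrete with a toy version of the segment-metadata table. The idea, sketched below under my own assumptions about the schema, is that each segment row carries an interval and a version; a newer version of an interval supersedes older ones rather than overwriting them in place.

```python
# Hypothetical sketch of Druid-style segment metadata. Each row records a
# time interval and a version string; MVCC means queries see only the
# highest-version, still-"used" segment for each interval.

segments = [
    {"id": "seg1", "interval": ("2013-01-01", "2013-01-02"), "version": "v1", "used": True},
    {"id": "seg2", "interval": ("2013-01-01", "2013-01-02"), "version": "v2", "used": True},
    {"id": "seg3", "interval": ("2013-01-02", "2013-01-03"), "version": "v1", "used": True},
]

def visible_segments(segments):
    """For each interval, keep only the highest-version used segment."""
    best = {}
    for seg in segments:
        if not seg["used"]:
            continue
        cur = best.get(seg["interval"])
        if cur is None or seg["version"] > cur["version"]:
            best[seg["interval"]] = seg
    return sorted(best.values(), key=lambda s: s["interval"])

# seg2 supersedes seg1 for the Jan 1 interval; seg1 is retained in the
# table (it may still be needed by in-flight queries) but no longer visible.
print([s["id"] for s in visible_segments(segments)])
```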
By “build the data segments” I mean:
- Make the sharding decisions.
- Arrange data columnarly within each shard.
- Build a compressed bitmap for each shard.
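The last two steps can be illustrated together. In the sketch below, a plain Python set of row positions stands in for a real compressed bitmap (Druid actually uses a compressed format; the point here is only the shape of the data structure, not the compression).

```python
# Illustrative sketch: arrange rows column-by-column within a shard, then
# build a bitmap per dimension value. A Python set of row positions stands
# in for a real compressed bitmap.

rows = [
    {"page": "/home", "country": "US", "count": 3},
    {"page": "/about", "country": "CA", "count": 1},
    {"page": "/home", "country": "CA", "count": 2},
]

# Columnar layout: one list per column, rather than one dict per row.
columns = {name: [r[name] for r in rows] for name in rows[0]}

def bitmap_index(column_values):
    """Map each dimension value to the set of row positions containing it."""
    index = {}
    for pos, value in enumerate(column_values):
        index.setdefault(value, set()).add(pos)
    return index

page_index = bitmap_index(columns["page"])
country_index = bitmap_index(columns["country"])

# Filtering "page = /home AND country = CA" is now a bitmap intersection:
hits = page_index["/home"] & country_index["CA"]
print(sorted(hits))
```

The payoff of the bitmap representation is that multi-dimension filters reduce to set intersections, which compressed bitmaps make cheap.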
When things are being done that way, Druid may be regarded as comprising three kinds of servers:
- Actual data storage.
- Query brokers, which also have local cache.
- Coordination/management, including administrative interfaces and so on.
This is in addition to the aforementioned Zookeeper and MySQL.
It occurs to me that I don’t know whether that local cache is the only RAM tier, which is a pretty major point. Oh well …
The alternative is that data just streams into Druid. In that case:
- The various Hadoop pre-processing steps in the batch ingest process don’t happen.
- Instead, it’s assumed and required that the data already be structured properly — in terms of summarization and denormalization — for Druid.
- An additional tier of Druid data storage servers is involved; in particular, that tier is what receives the streaming data.
- Metamarkets calls the data storage servers that receive data “Real-Time”, while the others are “Historical”.
- The Real-Time servers persist data, on an append-only basis, into proper Druid segments. After a while, these are flushed to the Historical servers.
- The Druid query broker sends queries to the Real-Time or Historical servers as appropriate.
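That routing decision can be sketched as a split on time: everything older than the handoff point has been flushed to Historical servers, while the newest data is still on Real-Time servers. The function and constant names below are my own illustration, not the actual Druid broker API.

```python
# Hedged sketch of broker routing between Real-Time and Historical tiers.
# Intervals are modeled as half-open hour ranges [start, end); everything
# before HANDOFF_HOUR has already been flushed to Historical servers.

HANDOFF_HOUR = 20  # hypothetical handoff point: hours 0-19 are Historical

def route_query(query_start, query_end):
    """Split [query_start, query_end) into (tier, interval) pieces."""
    plan = []
    if query_start < HANDOFF_HOUR:
        plan.append(("historical", (query_start, min(query_end, HANDOFF_HOUR))))
    if query_end > HANDOFF_HOUR:
        plan.append(("real-time", (max(query_start, HANDOFF_HOUR), query_end)))
    return plan

# A query spanning the handoff point fans out to both tiers; the broker
# would merge the partial results (and cache the historical portion).
print(route_query(15, 23))
```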