This is part of a three-post series.
The canonical Metamarkets batch ingest pipeline is a bit complicated.
- Data lands on Amazon S3 (either uploaded there for the purpose, or because it was there all along).
- Metamarkets processes it, primarily via Hadoop and Pig, to summarize and denormalize it, and then puts it back into S3.
- Metamarkets then pulls the data into Hadoop a second time, to get it ready to be put into Druid.
- Druid is notified, and pulls the data from Hadoop at its convenience.
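The batch flow above can be sketched in a few lines of Python. This is purely illustrative; none of the function names are actual Metamarkets or Druid APIs, and the "S3" and "Hadoop" stages are stand-ins for in-memory lists.

```python
# Minimal sketch of the batch ingest flow: raw data is summarized and
# denormalized in a first Hadoop/Pig pass, written back to S3, then a
# second pass builds Druid segments. All names here are hypothetical.

def summarize_and_denormalize(raw_rows):
    """First Hadoop pass: roll raw events up by (page, hour)."""
    summary = {}
    for row in raw_rows:
        key = (row["page"], row["hour"])
        summary[key] = summary.get(key, 0) + row["count"]
    return [{"page": p, "hour": h, "count": c}
            for (p, h), c in summary.items()]

def batch_ingest(raw_rows):
    # Step 1: data "lands on S3" -- here, just an in-memory list.
    # Step 2: summarize/denormalize; the result would go back to S3.
    summarized = summarize_and_denormalize(raw_rows)
    # Step 3: a second Hadoop pass would build Druid segments from this.
    # Step 4: Druid is notified and pulls the segments at its convenience.
    return summarized

rows = [
    {"page": "/home", "hour": 0, "count": 1},
    {"page": "/home", "hour": 0, "count": 4},
    {"page": "/about", "hour": 1, "count": 2},
]
print(batch_ingest(rows))
```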
By “get data ready to be put into Druid” I mean:
- Build the data segments (recall that Druid manages data in rather large segments).
- Note metadata about the segments.
That metadata is what goes into the MySQL database, which also retains information about shards that have been invalidated. (That part is needed because of Druid’s MVCC approach.)
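The MVCC point can be made concrete with a toy version of the segment-metadata table. The idea, sketched below under my own assumptions about the schema, is that each segment row carries an interval and a version; a newer version of an interval supersedes older ones rather than overwriting them in place.

```python
# Hypothetical sketch of Druid-style segment metadata. Each row records a
# time interval and a version string; MVCC means queries see only the
# highest-version, still-"used" segment for each interval.

segments = [
    {"id": "seg1", "interval": ("2013-01-01", "2013-01-02"), "version": "v1", "used": True},
    {"id": "seg2", "interval": ("2013-01-01", "2013-01-02"), "version": "v2", "used": True},
    {"id": "seg3", "interval": ("2013-01-02", "2013-01-03"), "version": "v1", "used": True},
]

def visible_segments(segments):
    """For each interval, keep only the highest-version used segment."""
    best = {}
    for seg in segments:
        if not seg["used"]:
            continue
        cur = best.get(seg["interval"])
        if cur is None or seg["version"] > cur["version"]:
            best[seg["interval"]] = seg
    return sorted(best.values(), key=lambda s: s["interval"])

# seg2 supersedes seg1 for the Jan 1 interval; seg1 is retained in the
# table (it may still be needed by in-flight queries) but no longer visible.
print([s["id"] for s in visible_segments(segments)])
```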
By “build the data segments” I mean:
- Make the sharding decisions.
- Arrange data columnarly within each shard.
- Build a compressed bitmap for each shard.
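The last two steps can be illustrated together. In the sketch below, a plain Python set of row positions stands in for a real compressed bitmap (Druid actually uses a compressed format; the point here is only the shape of the data structure, not the compression).

```python
# Illustrative sketch: arrange rows column-by-column within a shard, then
# build a bitmap per dimension value. A Python set of row positions stands
# in for a real compressed bitmap.

rows = [
    {"page": "/home", "country": "US", "count": 3},
    {"page": "/about", "country": "CA", "count": 1},
    {"page": "/home", "country": "CA", "count": 2},
]

# Columnar layout: one list per column, rather than one dict per row.
columns = {name: [r[name] for r in rows] for name in rows[0]}

def bitmap_index(column_values):
    """Map each dimension value to the set of row positions containing it."""
    index = {}
    for pos, value in enumerate(column_values):
        index.setdefault(value, set()).add(pos)
    return index

page_index = bitmap_index(columns["page"])
country_index = bitmap_index(columns["country"])

# Filtering "page = /home AND country = CA" is now a bitmap intersection:
hits = page_index["/home"] & country_index["CA"]
print(sorted(hits))
```

The payoff of the bitmap representation is that multi-dimension filters reduce to set intersections, which compressed bitmaps make cheap.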
When things are being done that way, Druid may be regarded as comprising three kinds of servers:
- Actual data storage.
- Query brokers, which also have local cache.
- Coordination/management, including administrative interfaces and so on.
This is in addition to the aforementioned Zookeeper and MySQL.
It occurs to me that I don’t know whether that local cache is the only RAM tier, which is a pretty major point. Oh well …
The alternative is that data just streams into Druid. In that case:
- The various Hadoop pre-processing steps in the batch ingest process don’t happen.
- Instead, it’s assumed and required that the data already be structured properly — in terms of summarization and denormalization — for Druid.
- An additional tier of Druid data storage servers is involved; in particular, that tier is what receives the streaming data.
- Metamarkets calls the data storage servers that receive data “Real-Time”, while the others are “Historical”.
- The Real-Time servers persist data, on an append-only basis, into proper Druid segments. After a while, these are flushed to the Historical servers.
- The Druid query broker sends queries to the Real-Time or Historical servers as appropriate.
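That routing decision can be sketched as a split on time: everything older than the handoff point has been flushed to Historical servers, while the newest data is still on Real-Time servers. The function and constant names below are my own illustration, not the actual Druid broker API.

```python
# Hedged sketch of broker routing between Real-Time and Historical tiers.
# Intervals are modeled as half-open hour ranges [start, end); everything
# before HANDOFF_HOUR has already been flushed to Historical servers.

HANDOFF_HOUR = 20  # hypothetical handoff point: hours 0-19 are Historical

def route_query(query_start, query_end):
    """Split [query_start, query_end) into (tier, interval) pieces."""
    plan = []
    if query_start < HANDOFF_HOUR:
        plan.append(("historical", (query_start, min(query_end, HANDOFF_HOUR))))
    if query_end > HANDOFF_HOUR:
        plan.append(("real-time", (max(query_start, HANDOFF_HOUR), query_end)))
    return plan

# A query spanning the handoff point fans out to both tiers; the broker
# would merge the partial results (and cache the historical portion).
print(route_query(15, 23))
```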