This is part of a three-post series:
My clients at Metamarkets are planning to open source part of their technology, called Druid, which is described in the Druid section of Metamarkets’ blog. Exactly when this will happen is a bit unclear; I know the target date under NDA, but it’s not set in stone. But if you care, you can probably contact the company to get involved earlier than the official unveiling.
I imagine that open-source Druid will be pretty bare-bones in its early days. Code was first checked in early in 2011, and Druid seems to have averaged around 1 full-time developer since then. What’s more, it’s not obvious that all the features I’m citing here will be open-sourced; indeed, some of the ones I’m describing probably won’t be.
In essence, Druid is a distributed analytic DBMS. Druid’s design choices are best understood when you recall that it was invented to support Metamarkets’ large-scale, RAM-speed, internet marketing/personalization SaaS (Software as a Service) offering. In particular:
- Druid tries to use RAM well.
- Druid tries to stay up all the time.
- Druid has multi-valued fields. (They’re numeric, but of course you can use encoding tricks to make them effectively more general.)
- Druid’s big limitation is its assumption that there’s literally only one (denormalized) table per query; you can’t even join to dimension tables.
- SQL is a bit of an afterthought; I would expect Druid’s SQL functionality to be pretty stripped-down out of the gate.
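To make the single-table constraint concrete: since queries can’t join to dimension tables, dimension attributes have to be folded into each fact row before the data is loaded. Here is a minimal sketch of that denormalization step; all field names (`campaign_id`, `clicks`, and so on) are made up for illustration and are not from Druid itself.

```python
# Hypothetical illustration: a single-table engine like Druid sees only one
# denormalized table per query, so dimension attributes are merged into each
# fact row at ingestion time rather than joined at query time.

# A small "dimension table", keyed by campaign_id (names are invented).
campaign_dim = {
    7: {"campaign_name": "spring_sale", "channel": "email"},
    8: {"campaign_name": "retargeting", "channel": "display"},
}

def denormalize(fact_rows, dim):
    """Fold dimension attributes into each fact row before loading."""
    for row in fact_rows:
        enriched = dict(row)
        enriched.update(dim.get(row["campaign_id"], {}))
        yield enriched

facts = [
    {"timestamp": "2012-01-01T00:00:00Z", "campaign_id": 7, "clicks": 3},
    {"timestamp": "2012-01-01T00:01:00Z", "campaign_id": 8, "clicks": 1},
]

loaded = list(denormalize(facts, campaign_dim))
# Each loaded row now carries campaign_name and channel alongside the metrics.
```

The trade-off is the usual one: queries stay simple and fast, at the cost of wider rows and the need to reload data if a dimension attribute changes.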
Interestingly, the single-table/multi-valued choice is echoed at WibiData, which deals with similar data sets. However, WibiData’s use cases are different from Metamarkets’, and in most respects the WibiData architecture is quite different from that of Metamarkets/Druid.
As with many DBMSs, much of what’s interesting about Druid is how it organizes and chunks data. Most important, Druid has MVCC (Multi-Version Concurrency Control) on a segment-by-segment basis. That is, an update requires a new version of the whole segment to be written; while that happens, reads can continue unabated on the old version.
Obviously, this is more suited for streaming or batch-load scenarios than for ones with many single-row updates.
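The segment-level MVCC idea can be sketched in a few lines: a writer builds a complete new version of a segment off to the side, then swaps it in atomically, while readers keep whatever version they already hold. This is a toy illustration of the general copy-on-write pattern, not Druid’s actual implementation.

```python
import threading

class SegmentStore:
    """Toy sketch of segment-level MVCC: a writer builds a brand-new
    version of a segment, then atomically swaps it in; readers keep
    using whichever version they grabbed."""

    def __init__(self):
        self._segments = {}      # segment_id -> immutable tuple of rows
        self._lock = threading.Lock()

    def read(self, segment_id):
        # Grabbing the reference is atomic, and the tuple itself never
        # mutates, so a reader is unaffected by a concurrent rewrite.
        return self._segments.get(segment_id, ())

    def rewrite(self, segment_id, new_rows):
        new_version = tuple(new_rows)  # build the full new segment first
        with self._lock:               # then swap it in atomically
            self._segments[segment_id] = new_version

store = SegmentStore()
store.rewrite("2012-01-01/2012-01-02", [("t1", 3), ("t2", 1)])
old = store.read("2012-01-01/2012-01-02")
store.rewrite("2012-01-01/2012-01-02", [("t1", 3), ("t2", 1), ("t3", 5)])
# `old` still sees the two-row version; a fresh read sees three rows.
```

Note what this buys you: readers never block, but the unit of update is an entire segment, which is exactly why the approach favors streaming and batch loads over single-row updates.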
Other Druid specifics include:
- A Druid table must have a timestamp column.
- Druid data is stored in columns, in timestamp order.
- Druid data is commonly chunked into segments of 5-10 million rows. Data is partitioned by time and then perhaps also by some other dimension.
- There can be two sets of data storage servers, one for data that has arrived recently, the other for older data (e.g. >1 hour old). In that case, data is first persisted on one set of servers, then flushed to the other.
- Druid data is structured the same way in memory as on disk (memory mapping). More precisely, there seems to be memory mapping between generic persistent storage and virtual memory, with the operating system taking care of figuring out which parts of virtual memory need to be in actual RAM.
- Druid keeps compressed bitmap indexes on the various dimensions, on a segment-by-segment basis.
- Druid uses dictionary/token compression, with a separate dictionary for each segment. Token length is dynamic, based on column cardinality. Max length is 31 bits, which is rarely a problem, since a column within a single segment rarely has anywhere near 2^31 distinct values.
- You can have different replication factors for different segments. You can read from all replicas.
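The per-segment dictionary encoding and bitmap indexing described above can be illustrated together. This is a deliberately simplified sketch (plain Python sets stand in for compressed bitmaps, and no actual bit-width packing is done), not Druid’s on-disk format.

```python
def encode_segment(rows, column):
    """Toy per-segment dictionary encoding plus a bitmap index for one
    column, in the spirit of the description above (not Druid's format)."""
    dictionary = {}   # value -> integer token; token width would depend on cardinality
    tokens = []       # the column, stored as tokens in row order
    bitmaps = {}      # value -> set of row positions (stand-in for a compressed bitmap)
    for pos, row in enumerate(rows):
        value = row[column]
        token = dictionary.setdefault(value, len(dictionary))
        tokens.append(token)
        bitmaps.setdefault(value, set()).add(pos)
    return dictionary, tokens, bitmaps

rows = [
    {"country": "US"}, {"country": "DE"}, {"country": "US"}, {"country": "FR"},
]
dictionary, tokens, bitmaps = encode_segment(rows, "country")
# dictionary: {"US": 0, "DE": 1, "FR": 2}; tokens: [0, 1, 0, 2]
# bitmaps["US"] == {0, 2}, so a filter like country = 'US' touches only those rows
```

Because each segment gets its own dictionary and bitmaps, a segment is a self-contained unit, which fits neatly with rewriting whole segments under MVCC and replicating them independently.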
For more on Druid, please see my post on Metamarkets’ back-end technology.