September 28, 2015

Introduction to Cloudera Kudu

This is part of a three-post series on Kudu, a new data storage system from Cloudera.

Part 1 (this post) is an overview of Kudu technology.
Part 2 is a lengthy dive into how Kudu writes and reads data.
Part 3 is a brief speculation as to Kudu’s eventual market significance.

Cloudera is introducing a new open source project, Kudu,* which from Cloudera’s standpoint is meant to eventually become the single best underpinning for analytics on the Hadoop stack. I’ve spent multiple hours discussing Kudu with Cloudera, mainly with Todd Lipcon. Any errors are of course entirely mine.

*Like the impala, the kudu is a kind of antelope. I knew that, because I enjoy word games. What I didn’t know — and which is germane to the naming choice — is that the kudu has stripes. 🙂

For starters:

Kudu is an alternative to HDFS (Hadoop Distributed File System), or to HBase.
Kudu is meant to be the underpinning for Impala, Spark and other analytic frameworks or engines.
Kudu is not meant for OLTP (OnLine Transaction Processing), at least in any foreseeable release. For example:
- Kudu doesn’t support multi-row transactions.
- There are no active efforts to front-end Kudu with an engine that is fast at single-row queries.
- Kudu is rather columnar, except for transitory in-memory stores.
Kudu’s core design points are that it should:
- Accept data very quickly.
- Immediately make that data available for analytics.
More specifically, Kudu is meant to accept, along with slower forms of input:
- Lots of fast random writes, e.g. of web interactions.
- Streams, viewed as a succession of inserts.
- Updates and inserts alike.
The core “real-time” use cases for which Kudu is designed are, unsurprisingly:
- Low-latency business intelligence.
- Predictive model scoring.
Kudu is designed to work fine with spinning disk, and indeed has been tested to date mainly on disk-only nodes. Even so, Kudu’s architecture is optimized for the assumption that there will be at least some flash on the node.
Kudu is designed primarily to support relational/SQL processing. However, Kudu also has a nested-data roadmap, which of course starts with supporting the analogous capabilities in Impala.

Also, it might help clarify Kudu’s status and positioning if I add:

Kudu is in its early days — heading out to open source and beta now, with maturity still quite a way off. Many obviously important features haven’t been added yet.
Kudu is expected to be run with replication, at a tunable replication factor that is usually 3. Replication is via the Raft protocol.
Kudu and HDFS can run on the same nodes. If they do, they are almost entirely separate from each other, with the main exception being some primitive workload management to help them share resources.
Permanent advantages of older alternatives over Kudu are expected to include:
- Legacy. Older, tuned systems may work better over some HDFS formats than over Kudu.
- Pure batch updates. Preparing data for immediate access has overhead.
- Ultra-high update volumes. Kudu doesn’t have a roadmap to completely catch up in write speeds with NoSQL or in-memory SQL DBMS.
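
The replication point above implies some simple arithmetic. Raft commits a write once a majority of replicas have acknowledged it, so the usual replication factor of 3 keeps working with one replica down. Here is a minimal sketch of that arithmetic in Python; the function names are made up for illustration and are not part of any Kudu API.

```python
def majority(replication_factor: int) -> int:
    """Smallest number of replicas that forms a Raft majority."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replication_factor - majority(replication_factor)


if __name__ == "__main__":
    for rf in (1, 3, 5):
        print(f"replication factor {rf}: "
              f"majority={majority(rf)}, tolerated failures={tolerated_failures(rf)}")
```

Note that an even replication factor buys nothing extra: 4 replicas tolerate the same single failure as 3.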

Kudu’s data organization story starts:

Storage is right on the server (this is of course also the usual case for HDFS).
On any one server, Kudu data is broken up into a number of “tablets”, typically 10-100 tablets per node.
Inserts arrive into something called a MemRowSet and are soon flushed to something called a DiskRowSet. Much as in Vertica:
- MemRowSets are managed by an in-memory row store.
- DiskRowSets are managed by a persistent column store.*
- In essence, queries are internally federated between the in-memory and persistent stores.
Each DiskRowSet contains a separate file for each column in the table.
DiskRowSets are tunable in size. 32 MB currently seems like the optimal figure.
Page size default is 256K, but can be dropped as low as 4K.
DiskRowSets feature columnar compression, with a variety of standard techniques.
- All compression choices are specific to a particular DiskRowSet.
- So, in the case of dictionary/token compression, is the dictionary itself; each DiskRowSet has its own.
- Thus, data is decompressed before being operated on by a query processor.
- Also, selected columns or an entire DiskRowSet can be block-compressed.
Tables and DiskRowSets do not expose any kind of RowID. Rather, tables have primary keys in the usual RDBMS way.
Kudu can partition data in the three usual ways: randomly, by range or by hash.
Kudu does not (yet) have a slick and well-tested way to broadcast-replicate a small table across all nodes.
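
To make the MemRowSet/DiskRowSet split concrete, here is a toy sketch in Python of an in-memory row store that flushes into column-organized row sets, with a scan that is federated across both. It is an illustration under loud assumptions, not Kudu's implementation: the class `ToyTablet`, the `flush_threshold` knob (counted in rows rather than megabytes) and the choice of the first column as primary key are all hypothetical, and updates, deletes and duplicate-key checks are ignored.

```python
from typing import Any, Dict, Iterator, List, Tuple

Row = Tuple[Any, ...]


class ToyTablet:
    """Toy model: an in-memory row store flushed into columnar row sets."""

    def __init__(self, columns: List[str], flush_threshold: int = 3) -> None:
        self.columns = columns
        self.flush_threshold = flush_threshold  # hypothetical knob, in rows
        self.mem_row_set: Dict[Any, Row] = {}   # primary key -> whole row
        # Each "disk" row set keeps one separate list per column.
        self.disk_row_sets: List[Dict[str, List[Any]]] = []

    def insert(self, row: Row) -> None:
        key = row[0]  # assumption: the first column is the primary key
        self.mem_row_set[key] = row
        if len(self.mem_row_set) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Pivot the buffered rows into columns and start a fresh buffer."""
        if not self.mem_row_set:
            return
        rows = [self.mem_row_set[k] for k in sorted(self.mem_row_set)]
        columnar = {name: [r[i] for r in rows]
                    for i, name in enumerate(self.columns)}
        self.disk_row_sets.append(columnar)
        self.mem_row_set = {}

    def scan(self, wanted: List[str]) -> Iterator[Row]:
        """Read only the wanted columns, from flushed and unflushed data alike."""
        for row_set in self.disk_row_sets:
            yield from zip(*(row_set[name] for name in wanted))
        positions = [self.columns.index(name) for name in wanted]
        for key in sorted(self.mem_row_set):
            row = self.mem_row_set[key]
            yield tuple(row[i] for i in positions)


tablet = ToyTablet(["id", "url", "ms"])
for r in [(1, "/a", 10), (2, "/b", 20), (3, "/c", 30), (4, "/d", 40)]:
    tablet.insert(r)
result = list(tablet.scan(["id", "ms"]))
print(result)
```

In this run the first three rows end up in one flushed row set and the fourth is still in memory, yet the scan returns all four. The scan also never touches the `url` column of the flushed data, which is the point of keeping a separate file per column.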

*I presume there are a few ways in which Kudu’s efficiency or overhead seem more row-store-like than columnar. Still, Kudu seems to meet the basic requirements to be called a columnar system.
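
The remark above that the dictionary is specific to each DiskRowSet can also be sketched. In the Python below, each row set builds its own dictionary for a column, so the same integer code means different things in different row sets, and values are decoded before they are compared. The function names and the sample data are hypothetical; this is generic dictionary encoding, not Kudu's actual on-disk format.

```python
from typing import List, Tuple


def dict_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Encode one column of one row set as (dictionary, codes)."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in values]


def dict_decode(dictionary: List[str], codes: List[int]) -> List[str]:
    """Turn codes back into values using that row set's own dictionary."""
    return [dictionary[c] for c in codes]


# Two row sets holding the same column get independent dictionaries.
rowset_a = dict_encode(["US", "DE", "US", "FR"])
rowset_b = dict_encode(["JP", "US", "JP"])

# Codes are not comparable across row sets, so decode before evaluating
# a predicate such as country == "US".
us_count = sum(
    value == "US"
    for dictionary, codes in (rowset_a, rowset_b)
    for value in dict_decode(dictionary, codes)
)
print(rowset_a, rowset_b, us_count)
```

Here code 0 stands for "DE" in the first row set and for "JP" in the second, which is why a query processor cannot work on the raw codes across row sets.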

Categories: Business intelligence, Cloudera, Columnar database management, Database compression, Databricks, Spark and BDAS, Hadoop, HBase, Predictive modeling and advanced analytics, Solid-state memory, SQL/Hadoop integration


Comments

7 Responses to “Introduction to Cloudera Kudu”

Asis Mohanty on October 6th, 2015 11:27 pm

Good overview. Few design aspect is similar as Cassandra Memtable (MemRowSets) & SSTable (DiskRowSets)…
Strata+Hadoop World New York 2015 | Cloudera VISION on October 9th, 2015 7:48 pm

[…] watch Todd’s actual talk on the StrataConf web site. And Curt Monash wrote a lengthy (three-part) blog post that explains Kudu, with considerable input from […]
初见Kudu | biaolog on November 17th, 2015 2:15 am

[…] http://www.dbms2.com/2015/09/28/introduction-to-cloudera-kudu/ […]
Cloudera 5.5 | DBMS 2 : DataBase Management System Services on November 19th, 2015 6:54 am

[…] and Kudu are being donated to Apache. This actually was already announced Tuesday. (Due to Apache’s […]
