My clients at Hadapt are coming out with a Version 2 to be available in Q1 2013, and perhaps slipstreaming some of the features before then. At that point, it will be reasonable to regard Hadapt as offering:
- A very tight integration between an RDBMS-based analytic platform and Hadoop …
- … that is decidedly immature as an analytic RDBMS …
- … but which strongly improves the SQL capabilities of Hadoop (vs., say, the alternative of using Hive).
Solr is in the mix as well.
Hadapt+Hadoop is positioned much more as “better than Hadoop” than as “a better scale-out RDBMS” – and rightly so, given its limitations when viewed strictly from an analytic RDBMS standpoint. I.e., Hadapt is meant for enterprises that want to do several of:
- Dump multi-structured data into Hadoop.
- Refine or just move some of it into an RDBMS.
- Bring in data from other RDBMS.
- Process all of the above via Hadoop MapReduce.
- Process all of the above via SQL.
- Use full-text indexes on the data.
Hadapt has 6 or so production customers, a dozen or so more coming online soon, 35 or so employees (mainly in Cambridge or Poland), reasonable amounts of venture capital, and the involvement of a variety of industry luminaries. Hadapt’s biggest installation seems to have tens of terabytes of relational data and hundreds of terabytes of multi-structured data; Hadapt is very confident in its ability to scale an order of magnitude beyond that with the Version 2 product, and reasonably confident it could go even further.
At the highest level, Hadapt works like this:
- You have a full installation of Hadoop, in either a Cloudera or (new) MapR flavor.
- You also have an RDBMS on every node (PostgreSQL for now), NameNode servers and the like perhaps excepted. Hadapt unites these into a single scale-out analytic RDBMS.
- You can use standard Hadoop interfaces to get at both parts of this (and Solr).
- You can use standard SQL to get at the RDBMS part.
- You can also use standard SQL to get at the rest of the Hadoop data; i.e., Hadapt can among other things be a substitute for Hive, interfacing in similar ways.
- You can package up Hadoop MapReduce processes (among other things) as SQL functions; inputs can be views* drawn from the relational store and the rest of Hadoop alike.
The latter is a new Version 2 feature called HDK (Hadapt Development Kit), and is the reason I’m using the term “analytic platform” for Hadapt.
*I’m using the term “view” a little loosely here.
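To make the HDK idea concrete, here is a toy sketch of the concept – my own illustration, not Hadapt’s actual HDK API or syntax. It models packaging a MapReduce job behind a function whose input “view” unions rows from a relational store with records parsed out of raw Hadoop files; all names and data are hypothetical.

```python
# Hypothetical sketch of "a MapReduce job callable like a SQL function":
# the input view draws on relational rows and Hadoop-resident records alike.
from collections import defaultdict

def run_mapreduce(rows, map_fn, reduce_fn):
    """Minimal in-process MapReduce: map each row, group by key, reduce."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_fn(row):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# A "view" (loosely speaking) unioning relational rows with
# multi-structured records parsed from Hadoop, e.g. log lines.
relational_rows = [{"domain": "example.com", "hits": 3},
                   {"domain": "dbms2.com", "hits": 5}]
hadoop_records = [{"domain": "example.com", "hits": 4}]
combined_view = relational_rows + hadoop_records

# The map/reduce pair such a SQL function might wrap: hits per domain.
def map_hits(row):
    yield row["domain"], row["hits"]

def reduce_hits(key, values):
    return sum(values)

print(run_mapreduce(combined_view, map_hits, reduce_hits))
# {'example.com': 7, 'dbms2.com': 5}
```

The point of the sketch is only the shape of the interface: SQL callers see a function over a view, while the body is free to be arbitrary MapReduce-style processing.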
The Hadapt SQL execution story starts:
- When you send a SQL query to Hadapt, Hadapt plans how to run it across multiple nodes, using the local DBMS on each node for the single-node portions of the work.
- Hadapt has the same scale-out strategy for long-running queries as before, which is to let Hadoop MapReduce handle data shuffling among nodes. Hadapt reports that even in this case, it’s much faster than Hive.
- But in Version 2, Hadapt directly plans and manages shorter SQL queries, with interactive response time. (A few seconds at most, sub-second if you’re lucky.)
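The two execution paths above can be caricatured as a routing decision – again my own illustration, with a made-up cost metric, not Hadapt’s actual planner logic:

```python
# Toy sketch of the split between the two execution paths: cheap queries
# run on the direct, interactive path; expensive ones fall back to
# MapReduce-managed shuffling. Threshold and metric are invented.
def choose_execution_path(estimated_rows_shuffled, threshold=1_000_000):
    """Route a query by a crude cost estimate (rows to shuffle)."""
    if estimated_rows_shuffled < threshold:
        return "direct"      # Version 2 path: seconds or less
    return "mapreduce"       # long-running path: Hadoop handles the shuffle

print(choose_execution_path(50_000))       # direct
print(choose_execution_path(10_000_000))   # mapreduce
```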
To exploit and celebrate its much-faster-than-Hive query response, Hadapt has a new partnership with my clients at Tableau.
If we leave aside the other kinds of processing and view Hadapt strictly as an RDBMS, it’s primitive versus the current MPP (Massively Parallel Processing) state of the art, but ahead of where some other vendors were at similar points in their company history.
- In contrast to early versions of what are now some well-established scale-out RDBMS, there’s no “fat head” bottleneck in Hadapt. Hadapt reshuffles data directly between query steps, with only the final result set necessarily being collected at a single node (to be shipped back to whoever asked for it).
- Ironically for a company started by Daniel Abadi, Hadapt has no relational columnar storage capability, and isn’t great at compression.
- Generally, Hadapt is too new to have much in the way of concurrency support, workload management, and so on, beyond whatever it inherits from Hadoop.
- In one countervailing advantage versus competitors, Hadapt does have Hadoop-like (and indeed Hadoop-based) capabilities to re-plan a query on the fly when, for example, a node is operating at diminished throughput. (The default replication factor in Hadapt, as in HDFS, is 3, so the planner usually has some options.)
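The last point – why 3-way replication gives the planner options – can be sketched as follows. This is an illustrative model of replica-aware scheduling in general, with assumed node names and speeds, not a description of Hadapt internals:

```python
# Each data block lives on three nodes (replication factor 3, as in HDFS),
# so work assigned to a degraded node can be redirected to another replica.
def assign_blocks(block_replicas, node_speed):
    """Pick, per block, the fastest replica-holding node."""
    return {block: max(nodes, key=lambda n: node_speed.get(n, 0.0))
            for block, nodes in block_replicas.items()}

replicas = {"blk1": ["n1", "n2", "n3"], "blk2": ["n2", "n3", "n4"]}
speeds = {"n1": 1.0, "n2": 0.2, "n3": 0.9, "n4": 1.0}  # n2 is degraded
print(assign_blocks(replicas, speeds))
# {'blk1': 'n1', 'blk2': 'n4'}
```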
Hadapt is far from the only vendor trying to integrate interactive SQL and Hadoop; indeed, I have multiple clients taking different approaches to the same problem. But if you’re looking for a single data store that’s useful for a lot of different purposes — well, that’s pretty much the essence of the Hadapt design.