I’ve talked with my clients at Hadapt a couple of times recently. News highlights include:
- The Hadapt 1.0 product is going “Early Access” today.
- General availability of Hadapt 1.0 is targeted for an officially unspecified time frame, but it’s soon.
- Hadapt raised a nice round of venture capital.
- Hadapt added Sharmila Mulligan to the board.
- Dave Kellogg is in the picture too, albeit not as involved as Sharmila.
- Hadapt has moved to Cambridge, which is preferable to the Yale environs for obvious reasons. (First location = space they’re borrowing from their investors at Bessemer.)
- Headcount is in the low teens, with a target of doubling fast.
The Hadapt product story hasn’t changed significantly from what it was before. Specific points I can add include:
- With one exception to date, Hadapt beta customers have used PostgreSQL as the underlying DBMS, rather than some faster columnar system.
- Sure, you want to process data on the nodes where it resides on the cluster. But if the data is replicated 3X or so, that gives you good flexibility to be adaptive by deciding which of the three copies you’ll operate against.
- In Hadapt Version 1.0, scheduling and workload management are pretty much Hadoop’s. However …
- … an improvement in scheduling is being actively researched.
- In general, Hadapt’s design philosophy for executing SQL is to use MapReduce to get data to the proper nodes, while using the underlying DBMS for node-specific operations such as:
- Initial retrieval from disk.
- Joins and aggregations on data residing at (or visiting) a specific node.
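That division of labor can be illustrated with a minimal Python sketch, simulated in-process. The table layout, node count, and function names here are all invented for illustration; the real system would be shuffling via MapReduce and running actual SQL on each node’s PostgreSQL instance.

```python
from collections import defaultdict

def shuffle(rows, key, n_nodes):
    """Simulate the MapReduce role: route each row to the node that
    owns its key, via hash partitioning."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[hash(row[key]) % n_nodes].append(row)
    return nodes

def local_aggregate(node_rows, key, value):
    """Simulate the node-local DBMS role: in real Hadapt this would be
    SQL (SELECT key, SUM(value) ... GROUP BY key) pushed to PostgreSQL."""
    totals = defaultdict(int)
    for row in node_rows:
        totals[row[key]] += row[value]
    return dict(totals)

# Toy data; hash partitioning on "dept" means each node's per-key
# totals are already complete, so no cross-node merge per key is needed.
rows = [{"dept": "a", "amt": 1}, {"dept": "b", "amt": 2}, {"dept": "a", "amt": 3}]
partials = {n: local_aggregate(r, "dept", "amt")
            for n, r in shuffle(rows, "dept", 3).items()}
```

The point of the sketch is the split itself: the "shuffle" step is the only place rows cross node boundaries, and everything after it is node-local work that a DBMS is good at.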
A very busy Daniel Abadi also took the time to walk me through how Hadapt does joins. More precisely, what we discussed about joins includes some of the last features being added to Hadapt 1.0; many of the pieces are still missing from early-access Hadapt 1.0, and some may even slip out of the Hadapt 1.0 GA version. As Dan tells it, there are five kinds of joins in Hadapt:
- Co-partitioned join. Both tables being joined happen to be partitioned on the join key. Happy happy joy joy. The tables are joined locally on each node, with the results aggregated via MapReduce.
- Directed join. One of the tables being joined happens to be partitioned on the join key. MapReduce distributes the other table along the join key, joins happen locally, and MapReduce does the rest.
- Broadcast join. One of the tables is broadcast in its entirety to every node. Joins then happen locally, and MapReduce does the rest.
- Split semijoin. One of the tables is projected to the join key and a row ID, and then distributed via MapReduce. Joins then happen locally. Later on, the joined rows are completed with the help of a second projection on the first table. MapReduce does the rest.
- Distributed/parallel hash join. Sometimes, Hadapt indeed joins just as Hadoop/Hive would.
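To make one of these concrete, here is a minimal in-process sketch of the directed join, under assumed toy tables (names and row shapes are mine, not Hadapt’s): "orders" is already partitioned on the join key, "custs" is redistributed along that key, and the join itself runs locally on each node.

```python
from collections import defaultdict

N_NODES = 3
node_of = lambda key: hash(key) % N_NODES  # which node owns a given key

# "orders" happens to already be partitioned on cust_id (the join key).
orders_by_node = defaultdict(list)
for order in [("c1", 100), ("c2", 200), ("c1", 50)]:
    orders_by_node[node_of(order[0])].append(order)

# The MapReduce step: redistribute "custs" along the join key, so each
# customer row lands on the node that holds its orders.
custs_by_node = defaultdict(list)
for cust in [("c1", "Alice"), ("c2", "Bob")]:
    custs_by_node[node_of(cust[0])].append(cust)

# Local join on each node -- the part the per-node DBMS would handle.
result = []
for n in range(N_NODES):
    lookup = dict(custs_by_node[n])  # cust_id -> name
    for cust_id, amt in orders_by_node[n]:
        if cust_id in lookup:
            result.append((cust_id, lookup[cust_id], amt))
```

The other variants differ mainly in what gets moved: a co-partitioned join skips the redistribution step entirely, a broadcast join sends the whole small table everywhere, and a split semijoin ships only the join key plus a row ID before fleshing rows out afterward.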
Highlights of Hadapt’s performance story include:
- Dan contends that using a DBMS rather than HDFS (Hadoop Distributed File System) for I/O always gives a performance advantage.
- DBMS local-node join performance can be presumed to be superior as well.
- Of course, Dan also thinks that using a columnar DBMS would extend Hadapt’s performance advantage further, but most of the specifics of what Hadapt has told me about why they don’t routinely use a columnar DBMS yet are NDA.
- Even beta Hadapt/PostgreSQL outperforms Hadoop/Hive by almost 10X at Hadapt’s relatively small number of beta customer sites.