Hadapt, the HadoopDB company, is finally launching; its product is based on the HadoopDB project, albeit with code rewritten from scratch. As you may recall, the core idea of HadoopDB is to put a DBMS on every node, and use MapReduce to talk to the whole database. The idea is to get the same SQL/MapReduce integration as you get if you use Hive, but with much better performance* and perhaps somewhat better SQL functionality.** Advantages vs. a DBMS-based analytic platform that includes MapReduce — e.g. Aster Data — are less clear.
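The split-execution idea can be sketched in a few lines. This is a hypothetical illustration, not Hadapt's code: each "node" is a single-node DBMS (sqlite3 stands in here), SQL is pushed down to run locally on each node, and a MapReduce-style step merges the partial results.

```python
import sqlite3

def make_node(rows):
    """Create one 'node': an in-memory DBMS holding a shard of the table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

# Three shards of a sales table, one per node.
nodes = [
    make_node([("east", 10), ("west", 5)]),
    make_node([("east", 7), ("west", 3)]),
    make_node([("west", 2)]),
]

# "Map" phase: push the partial aggregate into each node's local DBMS,
# so most of the work happens inside the per-node database engine.
partials = [
    node.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
    for node in nodes
]

# "Reduce" phase: merge the per-node partial sums into the final answer.
totals = {}
for rows in partials:
    for region, subtotal in rows:
        totals[region] = totals.get(region, 0) + subtotal

print(totals)  # {'east': 17, 'west': 10}
```

The point of the design is visible even in this toy: the GROUP BY runs inside each local DBMS, and only small partial aggregates cross the network, which is where the claimed advantage over scanning raw files with Hive comes from.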
**It seems assured that Hadapt will eventually have more SQL coverage than Hive does today.
It’s still early days for the Hadapt company. Funding is on the angel level. There seem to be six employees — Yale professor Daniel Abadi, CEO Justin Borgman, Chief Scientist Kamil Bajda-Pawlikowski,* and three other coders. The Hadapt product will go into beta at an unspecified future time; there currently are a couple of alpha users/design partners. The Hadapt company, a Yale spin-off, obviously needs to move from Connecticut soon. I wasn’t able to detect any particular outside experience in the form of directors or advisors. And Hadapt’s marketing efforts are still somewhat ragged. So basically, the reasons for believing in Hadapt pretty much boil down to:
- Daniel Abadi is a star.**
- Hadapt’s own tests show that Hadapt is a whole lot faster than Hive.
*Bajda-Pawlikowski is one of the two Abadi students who did the HadoopDB work. It turns out he had a number of years of coding experience before entering graduate school. (The other student, Azza Abouzeid, is pursuing an academic career.)
As you might have guessed from the name, the Hadapt guys are proud that their technology is “adaptive,” which communicates their fond belief that Hadapt’s query optimization and planning are more modern and cool than other folks’. In particular, Daniel suggested that Hadapt is more thoughtful than most DBMS are about looking at the size of intermediate result sets and then replanning queries accordingly.
However, the really cool adaptivity point is that Hadapt watches the performance of individual nodes, and takes that into account in query replanning. Daniel asserts, credibly, that this is a Really Good Feature to have in cloud and/or virtualized environments, where Hadapt might not have full control and use of its nodes. I’d add that it could also give Hadapt a lot of flexibility to be run on clusters of non-identical machines.
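To make the node-awareness point concrete, here is a toy sketch (my illustration, not Hadapt's actual algorithm) of performance-aware work assignment: remaining table partitions are dealt out in proportion to each node's observed throughput, so a degraded node — the common case in cloud or virtualized environments — is handed a smaller share on the next scheduling round.

```python
def assign_partitions(num_partitions, observed_rates):
    """Split partition IDs across nodes proportionally to observed rates.

    observed_rates: {node_name: rows-per-second measured so far}
    Returns {node_name: [partition ids]}.
    """
    total_rate = sum(observed_rates.values())
    assignment = {node: [] for node in observed_rates}
    # Order nodes fastest-first so any leftover partitions land on fast nodes.
    nodes = sorted(observed_rates, key=observed_rates.get, reverse=True)
    quotas = {n: int(num_partitions * observed_rates[n] / total_rate)
              for n in nodes}
    pid = 0
    for n in nodes:
        for _ in range(quotas[n]):
            assignment[n].append(pid)
            pid += 1
    # Hand any rounding remainder to the fastest nodes, round-robin.
    i = 0
    while pid < num_partitions:
        assignment[nodes[i % len(nodes)]].append(pid)
        pid += 1
        i += 1
    return assignment

# One node is running at half speed (e.g. a noisy cloud neighbor):
rates = {"node_a": 100.0, "node_b": 100.0, "node_c": 50.0}
plan = assign_partitions(10, rates)
print({n: len(p) for n, p in plan.items()})  # node_c gets the smallest share
```

The same throughput feedback would also let a cluster of deliberately non-identical machines be used efficiently, which is the flexibility point above.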
On the negative side, Hadapt will not at first have any awareness of how its underlying DBMS are optimized; it will plan for VectorWise the same way it does for PostgreSQL. In that regard, this is a DATAllegro 1.0 story. If I understood correctly, Hadapt has specific connectors for a couple of DBMS (probably exactly those two), and can also talk JDBC to anything. PostgreSQL was apparently 5X faster than MySQL when tested (with either MyISAM or InnoDB); Daniel snorted about, for example, MySQL’s apparent fondness for nested-loop joins over hybrid hash. On the other hand, he was more circumspect about his reasons for favoring VectorWise over, to name another open source columnar DBMS, Infobright.
And finally, a couple of other points:
- Hadapt will be closed source, although it will of course rely on large amounts of other people’s open source software. Pay no attention to the importance Daniel previously ascribed to HadoopDB’s open source nature.
- Hadapt decompresses data before moving it from node to node, and also before doing non-SQL MapReduce operations on it. Pay no attention to the years Daniel spent insisting that columnar DBMS absolutely must operate on data in compressed form.