Two of the more interesting approaches for integrating Hadoop and MapReduce with relational DBMS come from my clients at Teradata Aster (via SQL/MR and SQL-H) and Hadapt. In both cases, the story starts:
- You can dump any kind of data you want into Hadoop’s file system.
- You can have data in a scale-out RDBMS to get good performance on analytic SQL.
- You can access all the data (not just the relationally stored part) via SQL.
- You can do MapReduce on all the data (not just the Hadoop-stored part).
To varying degrees, Hadapt and Aster each offer three kinds of advantage over Hadoop-with-Hive:
- SQL performance is (much) better.
- SQL functionality is better.
- At least some of your employees — the “business analysts” — can invoke MapReduce processes through SQL, if somebody else (e.g. your techies or the vendor’s) coded them up in the first place.
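That division of labor — techies code the MapReduce logic once, analysts merely invoke it — can be sketched in plain Python. The function names and the sessionization example below are illustrative assumptions, not any vendor's actual SQL/MR API; the point is only that the analyst's call site stays as simple as a SQL function invocation.

```python
from collections import defaultdict

# A techie (or the vendor) writes the MapReduce logic once...
def sessionize_mapper(row):
    # Emit (key, value) pairs: here, timestamps keyed by user.
    yield (row["user"], row["ts"])

def sessionize_reducer(user, timestamps, gap=30):
    # Count sessions: a new session starts after `gap` seconds of silence.
    sessions = 0
    last = None
    for ts in sorted(timestamps):
        if last is None or ts - last > gap:
            sessions += 1
        last = ts
    return {"user": user, "sessions": sessions}

def run_mapreduce(rows, mapper, reducer, **kwargs):
    # Shuffle: group mapper output by key, then reduce each group.
    groups = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            groups[key].append(value)
    return [reducer(k, vs, **kwargs) for k, vs in groups.items()]

# ...and an analyst just invokes it, much as they would call a
# prepackaged SQL/MR function from inside a SELECT statement.
clicks = [
    {"user": "a", "ts": 0}, {"user": "a", "ts": 10},
    {"user": "a", "ts": 100}, {"user": "b", "ts": 5},
]
result = run_mapreduce(clicks, sessionize_mapper, sessionize_reducer, gap=30)
```

The analyst never touches the mapper or reducer internals — only the invocation, which is the same pitch both vendors make for SQL-invoked MapReduce.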
Of course, there are plenty of differences. Those start:
- Teradata Aster is at a whole different stage of corporate and product maturity than Hadapt (even if some crucial Aster/Hadoop features are brand new).
- Aster and Hadoop clusters are separate, even if they can be run on different nodes in the same appliance. Hadapt’s RDBMS runs on the same nodes as HDFS (Hadoop Distributed File System), or optionally MapR’s HDFS alternative.
- The Aster approach involves two kinds of MapReduce. If you want to do MapReduce involving data stored in the Aster RDBMS, you should use Aster’s SQL/MR, not Hadoop MapReduce.
- Teradata Aster encourages appliance deployment (although commodity hardware and even the cloud are options). Hadapt encourages Hadoop-style commodity hardware. I imagine there’s a considerable software price difference as well.
As for use cases — for starters, please note that a large fraction of analytic inquiries are ultimately about people. And when you’re looking at people, there are a whole lot of data sources you can consult. Many are clearly relational; increasingly, however, some are not. What’s more, people are hard to assess and understand, so you may want to take multiple tries at refining your analysis.
So right there you have an argument for flexible investigative or iterative analytics, over multi-structured (and relational) data. And if you think about how to combine information from all those data sources — well, it’s likely that SOME of the analytic steps will be a lot like joins.
That sure sounds like Hadoop/RDBMS integration to me.
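One of those join-like steps can be sketched concretely: hash the relational side on a shared key, then stream the ragged, multi-structured side against it. The customer and event records below are invented for illustration.

```python
# Relational side: customer records, as they might live in the RDBMS.
customers = [
    {"id": 1, "name": "Alice", "segment": "premium"},
    {"id": 2, "name": "Bob", "segment": "basic"},
]

# Multi-structured side: event records with varying fields,
# as they might land raw in HDFS.
events = [
    {"customer_id": 1, "type": "click", "page": "/home"},
    {"customer_id": 1, "type": "search", "query": "hadoop"},
    {"customer_id": 2, "type": "click", "page": "/pricing"},
]

# The join-like analytic step: build a hash table on the relational
# key, then probe it with each event record.
by_id = {c["id"]: c for c in customers}
joined = [
    {**by_id[e["customer_id"]], **e}
    for e in events
    if e["customer_id"] in by_id
]
```

Whether that step runs as SQL in the RDBMS or as a MapReduce job over HDFS is exactly the choice these integrated architectures are meant to give you.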