After I criticized the marketing of the Aster/Cloudera partnership, my clients at Aster Data and Cloudera ganged up on me and tried to persuade me I was wrong. Be that as it may, that conversation and others were helpful to me in understanding the core thesis:
- There are a lot of big datasets out there, where “big” commonly means “petabyte scale.”
- Owners of that much data commonly like to store it using free or quasi-free software, especially if the data isn’t structured in such a way that relational tables are a great fit in the first place. HDFS (Hadoop Distributed File System) is the default choice. (Of course, there always are exceptions.)
- Some kinds of analytics can be done perfectly well in Hadoop.
- Some kinds of analytics, of course, cannot be done well in Hadoop, with the most obvious examples being:
    - Queries that involve serious joins.
    - Anything that requires lower latency than Hadoop provides.
- When doing analytics in Hadoop on data stored in HDFS, you will often want to include data you're storing in your relational DBMS.
So Cloudera is promising fast, bidirectional connectors between Hadoop/HDFS and various DBMSs, such as Aster Data nCluster, and will provide them on a services basis even before the productized versions ship. Here “fast” should and in multiple cases does mean “fully parallel,” with all data-owning nodes on either side (Hadoop or the DBMS) more or less equally involved. Indeed, Aster is (I think for the first time) bypassing its loader nodes, instead sending Hadoop data straight to its workers.
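The “fully parallel” idea can be sketched in a few lines: rows are hash-partitioned by key, so each Hadoop-side partition can route its rows directly to the DBMS worker that owns the matching hash bucket, with no central loader node in the path. This is a minimal illustration of the routing concept, not Cloudera's or Aster's actual protocol; all names here (`owning_worker`, `parallel_transfer`) are hypothetical.

```python
from collections import defaultdict

def owning_worker(key, num_workers):
    """Hypothetical hash-partitioning rule: which DBMS worker owns this key."""
    return hash(key) % num_workers

def parallel_transfer(hadoop_partitions, num_workers):
    """Sketch of a fully parallel load: every Hadoop-side partition sends
    each of its rows straight to the owning DBMS worker, rather than
    funneling everything through a single loader node."""
    workers = defaultdict(list)  # worker id -> rows it receives
    for partition in hadoop_partitions:
        for key, value in partition:
            workers[owning_worker(key, num_workers)].append((key, value))
    return dict(workers)
```

Because every partition knows the partitioning rule, all data-owning nodes on both sides stay busy at once; the aggregate transfer rate scales with the number of nodes instead of being capped by a loader tier.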