DBMS/Hadoop integration is a confusing subject. My post on the Cloudera/Aster Data partnership awaits some clarification in the comment thread. A conversation with Vertica left me unsure about some Hadoop/Vertica Year 2 details as well, although I’m doing better after a follow-up call. On the plus side, we also covered some rather cool Hadoop/Vertica product futures, and those seemed easier to understand.
I say “Year 2” because Hadoop/Vertica integration has been going on since last year. Indeed, Vertica says that there are now over 25 users of the Hadoop/Vertica combination and hence of Vertica’s Hadoop connector. Vertica is now introducing, for immediate GA, a new version of its Hadoop connector. So far as I understood:
- Vertica’s Hadoop connector now works with Vertica 4.0.
- Vertica’s Hadoop connector now works with Pig.
- Vertica’s Hadoop connector can now let Vertica do aggregation, whereas in the past Hadoop would have run a bunch of Vertica queries and performed the aggregation itself. I think this technically sets up the paradox that sometimes being less parallel gives better performance, but only because the heavy lifting will already have been done, in parallel, on Vertica.
- Vertica’s Hadoop connector now has smarts about how data is hash-distributed in Vertica.
- Vertica’s Hadoop connector can make single calls from Hadoop or Pig to load data from Vertica, as opposed to — well, I guess as opposed to making calls to each Vertica node separately.
- Vertica’s Hadoop connector now lets Hadoop write more easily to Vertica. Hadoop can write to the Vertica table of its choice — even if the table doesn’t yet exist, because in that case Vertica creates it on the fly. (Note that this capability wouldn’t have made sense before Vertica 4.0, because there wasn’t a simple CREATE TABLE capability in Vertica — manual intervention was needed to choose the table’s physical layout.)
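To make the aggregation and hash-distribution points a bit more concrete, here is a toy sketch in plain Python. It is not Vertica's or Hadoop's actual API. The three-node layout and the names `node_for`, `distribute`, `aggregate_on_client`, and `aggregate_on_nodes` are all hypothetical, made up purely for illustration. The idea it tries to show is that when rows are hash-distributed on the grouping key, each node can sum its own rows and hand back only per-key totals, so far fewer rows cross the wire than when the caller pulls everything and aggregates itself.

```python
from collections import defaultdict
import zlib

NUM_NODES = 3  # hypothetical cluster size, chosen only for this sketch


def node_for(key: str) -> int:
    """Hypothetical hash-distribution rule: stable hash of the key, mod node count."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES


def distribute(rows):
    """Place each (key, value) row on the node its key hashes to."""
    nodes = [[] for _ in range(NUM_NODES)]
    for key, value in rows:
        nodes[node_for(key)].append((key, value))
    return nodes


def aggregate_on_client(nodes):
    """Old style: pull every raw row from every node, then sum per key outside the database."""
    totals = defaultdict(int)
    rows_moved = 0
    for node_rows in nodes:
        for key, value in node_rows:
            totals[key] += value
            rows_moved += 1
    return dict(totals), rows_moved


def aggregate_on_nodes(nodes):
    """New style: each node sums its own rows; only per-key totals leave the database."""
    totals = {}
    rows_moved = 0
    for node_rows in nodes:
        partial = defaultdict(int)
        for key, value in node_rows:
            partial[key] += value
        # Rows are hash-distributed on the key, so no key lives on two nodes
        # and the partial results can be merged without any further addition.
        totals.update(partial)
        rows_moved += len(partial)
    return totals, rows_moved


ROWS = [("us", 5), ("uk", 3), ("us", 7), ("de", 2), ("uk", 1), ("us", 4)]
NODES = distribute(ROWS)
CLIENT_TOTALS, CLIENT_MOVED = aggregate_on_client(NODES)
PUSHED_TOTALS, PUSHED_MOVED = aggregate_on_nodes(NODES)
```

On this toy data both approaches produce the same totals, but the pull-everything version moves all six rows while the pushed-down version moves only three, one per distinct key. A real connector has far more to worry about (query generation, input splits, failures), so treat this only as a picture of why letting the database aggregate first can win.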
In addition, inspired by a large banking customer, Vertica is announcing some cool Hadoop integration futures:
- Vertica-formatted data will be stored on HDFS (Hadoop Distributed File System).
- It will get there via parallel backup — i.e., you will be able to back up Vertica to HDFS.
- Libraries will be exposed to let HDFS read and write the Vertica-formatted data, for purposes like ETL, long-running analytics, etc.
As for those 25+ (perhaps 27 or 28) Vertica/Hadoop users:
- 15 or more of them connect to Cloudera’s Hadoop distribution, free or paid. (Some may just use Apache Hadoop.)
- Some number of them indeed do connect to Cloudera Enterprise.
- Most of them are doing ETL.
- Some of the ETL is of text.