I talked with Omer Trajman of Vertica Monday night about Vertica’s MapReduce integration, part of its Vertica 3.5 release. Highlights included:
- By “integrating Vertica and MapReduce,” Vertica means “integrating Vertica and Hadoop.”
- Vertica’s Hadoop integration is based on Cloudera’s DBInputFormat.
- Omer called out for me several features of Vertica’s Hadoop integration that didn’t just come from Cloudera, namely:
  - Cloudera’s DBInputFormat assumes the database runs on a single computer, or on a single head node of an MPP system. Vertica’s technology, however, runs on peer parallel nodes with no head, and so Vertica adapted the DBInputFormat technology accordingly.
  - Vertica lets you push down Map functions to the database. Omer reports a roughly even division among users and prospects between those who want to do this and those who don’t.
  - Vertica lets you run Reduce functions (or Map functions, if you don’t push them down to the database) on a different cluster from the one running the database software. Vertica asserts that its customers and prospects all want to do this. This is the big difference between Vertica’s MapReduce integration and Aster’s or Greenplum’s. (Aster would also say that Vertica’s weaker MapReduce/SQL programming integration is a big difference as well.)
  - Indeed, Vertica lets you Reduce into a DBMS other than Vertica, if you choose.
  - Vertica gives you flexibility on the size of the Map and Reduce clusters. Omer agreed with me when I said there were some limits on how fast one can add or subtract nodes in a Vertica grid, because there’s data redistribution involved. By contrast, one can add, change, or delete Hadoop clusters extremely quickly.
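To make the Map push-down trade-off concrete, here is a minimal sketch in Python. It is an illustration only, not Vertica’s or Hadoop’s actual API: an in-memory SQLite database stands in for the DBMS, and the `weblog` table, its columns, and the helper functions `map_outside_db` and `reduce_sum` are all hypothetical names made up for the example. The point it tries to show is that pushing the Map step (here a filter and projection) into the database yields the same Reduce result while shipping fewer rows to the compute cluster.

```python
import sqlite3

# Assumption: in-memory SQLite is only a stand-in for the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weblog (url TEXT, status INTEGER, bytes INTEGER)")
conn.executemany(
    "INSERT INTO weblog VALUES (?, ?, ?)",
    [
        ("/a", 200, 100),
        ("/b", 404, 0),
        ("/a", 200, 300),
        ("/c", 500, 0),
        ("/b", 200, 50),
    ],
)


def map_outside_db(row):
    """Map step run outside the database: keep successful hits, emit (url, bytes)."""
    url, status, nbytes = row
    if status == 200:
        yield url, nbytes


def reduce_sum(pairs):
    """Reduce step: total the values for each key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


# Option 1: no push-down. Every row leaves the database and is mapped afterwards.
all_rows = conn.execute("SELECT url, status, bytes FROM weblog").fetchall()
mapped_outside = [pair for row in all_rows for pair in map_outside_db(row)]

# Option 2: push-down. The same filter and projection run as SQL inside the
# database, so only the rows that survive the Map step are shipped out.
pushed_rows = conn.execute(
    "SELECT url, bytes FROM weblog WHERE status = 200"
).fetchall()

result_outside = reduce_sum(mapped_outside)
result_pushed = reduce_sum(pushed_rows)

print("rows shipped without push-down:", len(all_rows))
print("rows shipped with push-down:", len(pushed_rows))
print("same answer:", result_outside == result_pushed)
```

Whether push-down is worth it presumably depends on how selective the Map function is and on how much spare capacity the database nodes have, which may be part of why Omer sees users split on the question.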
Apparently, the use cases for Vertica/Hadoop integration to date lie in algorithmic trading and two kinds of web analytics. Specifically:
- One or more Vertica customers are using MapReduce in production to do relatively simple transforms of web log data.
- Vertica customers are experimenting with, but have not yet put into production, more sophisticated pattern analysis of web log data.
- Financial services customers are using MapReduce for a lot of experimentation in discovering new algorithms. The idea is that DBMS/MapReduce integration offers rapid prototyping of algorithmic ideas. Those that pan out are then reimplemented for production, presumably in some kind of CEP (Complex Event Processing) system. These users seem to be the ones pushing down a lot of Map functions to the Vertica DBMS.
By the way, Vertica is based on C-Store, the research project on which Daniel Abadi did his Ph.D. thesis work. Daniel recently wrote:
To me, it is far more efficient from a performance and a “green” perspective to push the computation to the data. Hence, I am not a fan of decoupling the compute grid and the data grid.
Not coincidentally, Daniel also recently wrote:
If the VectorWise/Ingres solution does get released open source, I believe they will be an excellent column-store storage engine for HadoopDB. I have already requested an academic preview edition of their software to play with.
The VectorWise guys also told me they are looking forward to seeing how the two projects work together.