I talked with Peter Boncz and Marcin Zukowski of VectorWise last Wednesday, but didn’t get around to writing about VectorWise immediately. Since then, VectorWise and its partner Ingres have gotten considerable coverage, especially from an enthusiastic Daniel Abadi. Basic facts that you may already know include:
- VectorWise, the product, will be an open-source columnar analytic DBMS. (But that’s not quite true. Pending productization, it’s more accurate to call the VectorWise technology a row/column hybrid.)
- VectorWise is due to be introduced in 2010. (Peter Boncz said that to me more clearly than I’ve seen in other coverage.)
- VectorWise and Ingres have a deal in which Ingres will at least be the exclusive seller of the VectorWise technology, and hopefully will buy the whole company.
- Notwithstanding that it was once named something like “MonetDB,” VectorWise actually is not the same thing as MonetDB, another open source columnar analytic DBMS from the same research group.
- The MonetDB and VectorWise research groups consist in large part of academics in Holland, specifically at CWI (Centrum voor Wiskunde en Informatica). But Ingres has a research group working on the project too. (Right now there are about seven “highly experienced” people each on the VectorWise and Ingres sides, although at least the VectorWise folks aren’t all full-time. More are being added.)
- Ingres and VectorWise haven’t agreed exactly how VectorWise and Ingres Classic will play together in the Ingres product line. (All of the obvious possibilities are still on the table.)
- VectorWise is shared-everything, just as Ingres is. But plans — still tentative — are afoot to integrate VectorWise with MapReduce in Daniel Abadi’s HadoopDB project.
The MonetDB project is led by Martin Kersten, with whom I chatted at SIGMOD in June (standing up and not taking notes, so I may have some details wrong). I get the impression, based on that conversation, my VectorWise call, and other data:
- Martin has been researching analytic DBMS (mainly but not only relational) since the late 1970s, and has been based at CWI since 1985.
- Peter Boncz has been either second in command of that crew or close to it.
- Martin Kersten, Peter Boncz, and the CWI/MonetDB team in general have gotten all sorts of computer science glory for their work.
- Martin has enjoyed generous, stable government research funding for his group, but has found commercializing the technology more difficult than he might have at, say, Stanford. The figure of 15 MonetDB researchers comes to mind, although I see from Martin’s bio that he oversees a team of ~55 in total.
- One early attempt at commercializing MonetDB turned into a company called Data Distilleries that was sold to SPSS. Peter Boncz was chief architect of Data Distilleries.
- Besides VectorWise, there are at least two other recent spin-off companies from the MonetDB project. One is a zero-headcount shell, set up to facilitate MonetDB project members (and others) consulting to users of the open source MonetDB technology. The other is in stealth mode, focusing on some vertical market.
I further get the impression that VectorWise was actually Marcin Zukowski’s Ph.D. project, with Peter Boncz as his advisor. VectorWise also boasts another Peter Boncz student, who wrote about updating column stores.
As one might expect from the name, VectorWise does vector processing. I.e., the hard part of Marcin’s work was developing vectorized algorithms for one SQL operation after another. Vectorization, pipelining, and FPGAs might all seem to go together — XtremeData certainly seems to think so — but the VectorWise folks preferred to develop for Intel CPUs anyway, for pretty much the usual reasons. Another major theme is trying to get the right things into CPU cache, because in their opinion RAM is just sooooo painfully slow.
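To illustrate what vectorization buys, here is a minimal sketch in Python — my own illustration, not VectorWise’s actual C code; the function names and the 1,024-value vector size are assumptions for the example. The point is that each operator is invoked once per vector of roughly a thousand values rather than once per tuple, amortizing interpretation overhead (and, in a real engine, keeping the working vector in CPU cache):

```python
VECTOR_SIZE = 1024  # illustrative; chosen so a vector fits comfortably in cache

def select_gt_tuple_at_a_time(column, threshold):
    # Classic iterator model: conceptually one "operator call" per row,
    # so per-call overhead is paid for every single tuple.
    out = []
    for value in column:
        if value > threshold:
            out.append(value)
    return out

def select_gt_vectorized(column, threshold):
    # Vectorized model: one operator call per vector of ~1,000 values;
    # the inner loop is a tight scan over a small array (merely simulated
    # here in Python, but compiled to a cache-friendly loop in a real engine).
    out = []
    for start in range(0, len(column), VECTOR_SIZE):
        vector = column[start:start + VECTOR_SIZE]
        out.extend(v for v in vector if v > threshold)
    return out
```

Both functions compute the same result; the difference a real engine exploits is the per-call bookkeeping, which the vectorized form pays ~1,000x less often.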
Our discussion of VectorWise’s compression was interesting. Highlights included:
- The design requirement is that decompression run at a rate of 3 gigabytes/second or so. That way the system is faster overall than if it scanned uncompressed data at 1 gigabyte/second, which I gather is the alternative.
- VectorWise takes 4-5 CPU cycles to decompress a tuple.
- VectorWise says it sacrificed compression ratio to achieve speed. That said, VectorWise claims 3-4X compression on TPC-H data, which is no worse than what ParAccel reported, and enjoys higher compression rates on other kinds of data.
- VectorWise decompresses data before manipulating it, and claims that the advantages of operating on compressed data are only significant if — like Vertica but apparently unlike VectorWise — the database stores each column in multiple sort orders.
- VectorWise’s compression is mainly on numerical and numerical-like (e.g. date) datatypes. An exception is that VectorWise uses dictionary compression on string data when it makes sense to do so.
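To make the dictionary-compression point concrete, here is a toy sketch — my own illustration of the general technique, not VectorWise’s implementation. Each distinct string is replaced by a small integer code, and decoding is a single array lookup per value, which is the kind of operation that can plausibly cost only a few CPU cycles per tuple in compiled code:

```python
def dict_compress(values):
    # Store each distinct string once in a dictionary; the column itself
    # becomes a list of small integer codes, which is what makes
    # repetitive string columns shrink substantially.
    dictionary = sorted(set(values))
    code_of = {s: i for i, s in enumerate(dictionary)}
    codes = [code_of[s] for s in values]
    return dictionary, codes

def dict_decompress(dictionary, codes):
    # Decoding is one array lookup per value -- a tight loop with no
    # branching, cheap enough to run at memory-scan speeds when compiled.
    return [dictionary[c] for c in codes]
```

The speed-over-ratio tradeoff described above corresponds to preferring schemes like this, where decode is a trivial lookup, over heavier general-purpose compressors.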
Other notes include:
- VectorWise has technology akin to Microsoft SQL Server’s Shared Scans, in which multiple queries that require similar table scans don’t have to repeat all the redundant scanning work. I need to get better at figuring out which other analytic DBMS do similar things.
- While VectorWise hasn’t yet been open-sourced, its code is in the hands of some other academic institutions, used mainly for computer science research (as opposed to, say, as a data store for some kind of scientific experiment).
- VectorWise’s scalability has only been tested up to eight cores.
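The shared-scan idea above can be sketched as follows — a toy model of the general concept, not VectorWise’s or SQL Server’s actual implementation (real systems also let queries attach to a scan already in flight, which this sketch omits):

```python
def shared_scan(blocks, predicates):
    # One physical pass over the table: each block is read once and served
    # to every concurrent query, instead of one full scan per query. With
    # N queries outstanding, this costs ~1 pass of I/O rather than N.
    results = [[] for _ in predicates]
    for block in blocks:                      # the single shared pass
        for qi, pred in enumerate(predicates):
            results[qi].extend(v for v in block if pred(v))
    return results
```

The payoff is purely in avoided redundant I/O: each query still sees exactly the rows its own predicate selects.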