I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That’s why, for example, I went to XLDB3 in Lyon, France, at my own expense. Eight months ago, I wrote about issues in scientific data management. Here’s some of what has transpired since then.
The main new activity I know of has been in the open source SciDB project.
- A company called Zetics has been started to commercialize SciDB. As of now, the entire staff seems to be CEO Marilyn Matz, techie Paul Brown, and part of Mike Stonebraker. Marilyn says Zetics has some venture capital, but even under NDA didn’t tell me who it was from. Zetics does not have its own web site.
- Marilyn tells me there are 20-25 contributors to SciDB, led by Paul Brown and Mike Stonebraker. Brown is full-time. Persistent Systems has been donating the efforts of a few of its employees. Some LSST folks have been doing SciDB work backed by grant money. Most or all of the rest seem to be purer volunteers. Some Russians have been particularly active.
- Release 0.5 of SciDB is expected in June. Release 1.0 is expected in September. This is a rewrite; prior demo code has been scrapped. Perhaps not coincidentally, it’s also a small slip from prior project plans.
- The array data model is an example of what’s being implemented first. (Duh — you can’t have a DBMS without a data model.) Support for uncertainty is an example of what’s been deferred until later.
- As has been clear since XLDB3 last August, one major target market for SciDB is genomic research.
- It’s obvious that the oil and gas industry, with all its geospatial data, should be interested in SciDB. But there’s not much activity in that regard; outreach is evidently needed. If you can think of somebody in that sector (or anywhere else) who should be alerted to SciDB, please ping them.
- Interest from web analytics users in SciDB seems to have receded a bit from the days when eBay almost funded the project.
In other scientific data management news:
- Microsoft put out a book called The Fourth Paradigm on scientific database management. The whole thing can be downloaded, very officially, as a giant PDF. I think it’s worth skimming. I don’t think it’s worth actually reading. (I did read it.)
- XLDB4 will be at Stanford October 5-7. Unlike prior XLDBs, it will have an open (i.e., no invitation required) part.
Finally, you surely are aware of the whole “Climategate” mess, in which major climate researchers’ email was hacked and many unkind conclusions were drawn. Well, one of the most technical parts of the disclosure was a long series of Read Me files, in which an unfortunate programmer lamented the difficulty of reconstructing published results from the files at hand. These turned out to illustrate a classic problem that SciDB or alternatives are meant to solve:
- Raw data was impossible to use without various adjustments to regularize it (the word “regridding” comes up a lot, for example). Massaging was needed before analytics could be done on it.
- The raw data was thrown out or lost, and could not be reconstructed. (Why they couldn’t have asked the data’s suppliers for it again is unclear, since it wasn’t original experimental data.)
- It was thus impossible to massage the data in any new or improved way.
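To make the point concrete, here is a minimal sketch of the alternative: keep the raw data untouched and treat every regridded version as something you can recompute from it. Everything in it is hypothetical and chosen for illustration, including the station readings, the `regrid` function, and simple averaging as the regridding rule. It has nothing to do with the actual climate code or with SciDB's interfaces.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw station readings: (latitude, longitude, temperature in C).
# The tuple is immutable and is never overwritten by derived results.
RAW_READINGS = (
    (51.2, 0.3, 10.0),
    (51.8, 0.9, 12.0),
    (53.1, 1.4, 8.0),
    (58.7, 6.2, 4.0),
)


def regrid(readings, cell_degrees):
    """Average readings into square lat/lon cells of the given size.

    Plain averaging is an assumption made for this sketch; real regridding
    schemes are more elaborate.
    """
    cells = defaultdict(list)
    for lat, lon, value in readings:
        # Floor division assigns each reading to the cell that contains it.
        key = (int(lat // cell_degrees), int(lon // cell_degrees))
        cells[key].append(value)
    return {key: mean(values) for key, values in cells.items()}


# Two different "massagings" derived from the same untouched raw data.
coarse = regrid(RAW_READINGS, 5.0)
fine = regrid(RAW_READINGS, 1.0)

print(coarse)
print(fine)
```

Because `RAW_READINGS` is never modified, a new or improved regridding is just another call to `regrid` with different parameters, which is exactly what was impossible once the raw data was gone.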