October 3, 2009
- A data model based on multidimensional arrays, not sets of tuples
- A storage model based on versions and not update in place
- Built-in support for provenance (lineage), workflows, and uncertainty
- Scalability to 100s of petabytes and 1,000s of nodes with high degrees of tolerance to failures
- Support for “external” data objects so that data sets can be queried and manipulated without ever having to be loaded into the database
- Open source in order to foster a community of contributors and to ensure that data is never “locked up” — a critical requirement for scientists
- I think that’s a dream/wish list. A lot of good could be done without meeting each of those six requirements in full.
- I think at least some of the XLDB/SciDB leaders realize this.
- In my opinion, a highly useful subset of the dream/wish list is achievable in the reasonably near-to-intermediate term, in either of two ways:
- Through a Hadoop-centric open source effort, especially since HadoopDB opens up the possibility of letting DBMS creators offload MPP scaling challenges to somebody else.
- From commercial MPP software-only (as opposed to appliance) DBMS vendors. I think they can develop the needed technology. I also think it could be in their business interest to make licensing arrangements of the sort that the scientific and research communities would need.
- Talking about “scientific” big data is unhelpfully vague. Let’s just focus on multi-dimensional measurement- or model-centric data, from disciplines such as seismology (under the Earth’s surface), climatology (over the surface), and astronomy (outer space). That would also include disciplines whose three-spatial-dimensions-plus-time data comes from inside a laboratory or other man-made environment, such as high-energy physics, fluid dynamics, and so on.
- One place in all that where there should be a commercial-company market is in oil/gas extraction. And by the way, the energy industry is increasing its uptake of data warehousing technology faster these days than any other sector I can think of, except perhaps for …
- … web companies that do log file analysis. Facebook’s log data has arrays-within-arrays reminiscent of the scientists’. eBay has been a major backer of XLDB/SciDB. It’s far from fully known yet just how much overlap there is between log-file-analyzers’ data management needs and those of big-data scientists. But there clearly are at least some commonalities.
- I don’t get the impression that scientists focused on modeling — e.g. climate-predictors — have been big participants in XLDB. That’s a pity for at least two reasons. First, modeling is at the heart of some of the most important global issues scientists address (e.g., climate change). Second, it might be an area of particularly rich overlap with commercial data management needs.
Now let’s step back and consider approximately what is meant by the requirements listed above.
- The requirements for an array structure are evidently pretty deep. You can glean some of the reasons from the scientific database use cases posted on the SciDB website. In particular:
- Coordinate data naturally fits into arrays.
- Coordinate data also naturally fits into geospatial ranges and the like.
- The “grid” for the array can be imprecise — or calculated via transformation — for a whole lot of different reasons.
- Different measurements may be available for different points in the array. (I think this may be the essence of the array-valued-arrays requirement.)
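To make those array points concrete, here’s a minimal sketch — NumPy is my own choice of illustration, not anything the SciDB use cases prescribe — of a coordinate grid where different measurements are available at different points, and where range queries are just slices:

```python
import numpy as np

# A 4x4 spatial grid; each cell may hold a temperature reading.
# NaN marks grid points where that instrument took no measurement.
temperature = np.full((4, 4), np.nan)
temperature[0, 1] = 11.2
temperature[2, 3] = 13.7

# A second measurement (salinity), observed at a different subset of points.
salinity = np.full((4, 4), np.nan)
salinity[1, 1] = 35.1

# Coordinate-range queries are array slices:
region = temperature[0:2, 0:2]          # a geospatial sub-rectangle
observed = region[~np.isnan(region)]    # only the points actually measured
```

In a relational encoding the same data would be a set of (x, y, measurement, value) tuples, and the “rectangle of nearby points” structure that scientists query against would have to be reconstructed at query time.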
- Some reasons scientists want versioning and support for data provenance are pretty obvious — you never want to lose the record of what the instrument readings said, or ever were believed to say. But it goes further. Data is “cooked” — i.e., transformed/reduced — and stored in huge volumes. So you’d like to later on be able to go back to the raw data and re-cook it.
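A toy sketch of the raw-vs-cooked idea — the names and structure here are my own illustration, not SciDB’s design: raw readings are kept immutable, each cooked version records the transformation (lineage) that produced it, and re-cooking just runs a revised recipe over the untouched raw data.

```python
# Raw instrument readings: stored once, never updated in place.
raw_readings = [10.0, 10.4, 99.9, 15.0]   # 99.9 is a glitch reading

versions = []  # each cooked version records its provenance

def cook(raw, *, glitch_threshold, recipe_name):
    """Reduce raw data to a cooked version, recording lineage."""
    cooked = [r for r in raw if r < glitch_threshold]
    versions.append({
        "recipe": recipe_name,
        "params": {"glitch_threshold": glitch_threshold},
        "source": "raw_readings",   # lineage pointer back to the raw data
        "data": cooked,
    })
    return cooked

v1 = cook(raw_readings, glitch_threshold=50.0, recipe_name="v1")
# Later, a stricter de-glitching rule: re-cook from the raw data,
# without ever having overwritten the earlier version.
v2 = cook(raw_readings, glitch_threshold=12.0, recipe_name="v2")
```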
- The workflow requirement seems to stem in many cases from data movement needs, that in turn in some cases stem from political issues. I haven’t yet understood why workflow would actually need to be baked into a scientific DBMS.
- By the time the database management systems we’re talking about could conceivably be ready, the need will be at least in the 10s of petabytes. 100s of petabytes is a reasonable design goal.
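Back-of-envelope arithmetic on what that design goal implies per node (the specific figures are purely illustrative):

```python
petabyte = 10 ** 15
total_bytes = 100 * petabyte       # the 100 PB design point
nodes = 1000                       # a 1,000-node cluster
per_node = total_bytes // nodes    # storage each node must manage
# That works out to 100 terabytes per node — which is why failure
# tolerance and node-level rebuild times matter so much at this scale.
```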
- Not that I’ve run any numbers on the matter, but it seems plausible that query fault-tolerance will be needed, at least in some cases.
- In many sciences (astronomy seems to be an exception), the default choice is to keep data in files rather than a DBMS. For example, CERN has a 10 terabyte or so Oracle database holding just the metadata for a vastly larger collection of data files. Even if the pendulum swings toward greater use of DBMS, the ongoing need for external file access is pretty obvious.
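The CERN pattern — a modest metadata database in front of a vastly larger file collection — can be sketched roughly like this; sqlite3 and the schema are my own stand-ins, not CERN’s actual setup:

```python
import sqlite3

# The metadata catalog is small enough to live in a DBMS; it points at
# external files that are queried in place, never loaded into the database.
catalog = sqlite3.connect(":memory:")
catalog.execute("""
    CREATE TABLE datasets (
        path TEXT,          -- location of the external data file
        instrument TEXT,
        start_time TEXT,
        bytes INTEGER
    )
""")
catalog.execute(
    "INSERT INTO datasets VALUES (?, ?, ?, ?)",
    ("/archive/run042.dat", "detector-A", "2009-09-30T12:00:00", 2 * 10**12),
)
catalog.commit()

# A query touches only the metadata; the multi-terabyte file itself
# would be opened lazily, only if its contents are actually needed.
rows = catalog.execute(
    "SELECT path FROM datasets WHERE instrument = 'detector-A'"
).fetchall()
```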
- I suspect that the insistence on open source is part legitimate, part knee-jerk excessive.
- “Free” is the best possible price, of course.
- Beyond cash cost, scientists want data access to be free of licensing encumbrance. There are two main reasons. First, people might want to manage subsets or copies of data remotely from its central repository, for a variety of reasons. Not all of those scenarios would be easy to accommodate, so any closed-source licensing would have to be very comprehensive (e.g., global or at least continent-wide “site” licensing).
- Second, they want assurance that data will always be accessible, even if licenses expire. That seems a little overwrought. Yes, moving data from one multi-petabyte repository to another could be a bit slow. But it’s not an eventuality to panic about.
- As for actual community development — scientists sure have a variety of exotic data management needs. But I’m not sure how much talent or resources there are among scientists to do true DBMS development (as opposed to, say, refining some UDFs). Yes, one XLDB attendee was both an astronomer and a PostgreSQL Major Contributor, but he seemed like an exception. On the other hand, it’s not entirely implausible that, in the right framework, some people with database talent could be recruited to donate some time to the general advancement of science.
- I don’t know much about management of uncertain data, and will duck that subject for now.