October 3, 2009

Martin Kersten on issues in scientific data management

Martin Kersten emailed a response to my post on issues in scientific data management. With his permission, I’ve lightly edited it, and am posting it below.

Dear Curt,

Thanks for the very nice story and perception on the XLDB meeting. It is a balanced view.

More philosophically I would add a few points:

1) A data management system architecture is a large collection of compromises amongst a number of competing parameters.

data management (hardware, data structure, algorithms, optimizers, languages) –> value-for-application

Given the cost of developing and maintaining a DBMS, we see only a few parameter constellations in the current product offerings. And scientists have a hard time exploring the uncharted land, because of the effort required and the uncertain benefits. (The same holds for researchers in the R&D labs of vendors.)

2) The research community needs a focus to move ahead. The array DBMS is such a focus, because it identifies an omission in the type structure being managed at all levels of a system. Articulating this in the community will help to steer effort.

3) The recent ‘hype’ for going to a HadoopDB-like approach should be positioned carefully. It is so far a single-point experiment for a limited query domain space, carefully carved out to avoid all the issues that plague a distributed DBMS. Within this space the techniques come from a different operating system functionality. [Not sure what he means by this.] It does not change the DBMS itself, and as such it is a repetition of middleware solutions to handle a cluster of independent MySQL instances.

This paper might be worth a look: http://ic2.epfl.ch/labos/publications/freenix2004.pdf

Generalizing it to a complete solution calls for, e.g., massive replication, so that you do not have to ship data around during query execution. This is paid for with more expensive updates, which is feasible only in certain domains. [I'd frame that point just as saying Hadoop-based solutions are unlikely to do as well at reducing data shipping as the better MPP DBMS.]
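The replication trade-off in that last point can be made concrete with a toy cost model. This is only an illustrative sketch, not anything from Martin's email or from HadoopDB itself: the function names, the uniform-placement assumption, and the "one write per replica" update cost are all hypothetical simplifications.

```python
def expected_remote_fetches(num_nodes, replicas, lookups):
    """Expected number of lookups that must ship data from another node.

    Assumption (hypothetical): each partition is stored on `replicas`
    distinct nodes chosen uniformly, and each lookup runs on a random node,
    so the chance the data is already local is replicas / num_nodes.
    """
    if not 1 <= replicas <= num_nodes:
        raise ValueError("replicas must be between 1 and num_nodes")
    return lookups * (num_nodes - replicas) / num_nodes


def update_writes(replicas, updates):
    """Physical writes needed when every update must reach every replica."""
    return replicas * updates


if __name__ == "__main__":
    # Compare no replication, partial replication, and full replication
    # on a 10-node cluster with 1000 lookups and 100 updates.
    for r in (1, 5, 10):
        print(r, expected_remote_fetches(10, r, 1000), update_writes(r, 100))
```

Under these assumptions, full replication drives data shipping to zero but multiplies every update by the number of nodes, which is why the approach only looks attractive for read-heavy domains.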


3 Responses to “Martin Kersten on issues in scientific data management”


  2. Jerome Pineau on October 3rd, 2009 1:15 pm

    I think, unless I’m misinterpreting, that Martin is making the same point I made in my last blog post on MR: that it primarily addresses very specific domains (search, aka Google/Yahoo), and that, being “outside” the engine, it is more of an OS component in its current form (file management/connectors in and out of the engine, Vertica-style) than an integral piece of the DBMS engine. I haven’t read the referenced paper (yet), but this one is fairly eye-opening IMHO: http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf

    Thanks for reposting this!

