May 22, 2010

Notes on SciDB and scientific data management

I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That’s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about issues in scientific data management. Here’s some of what has transpired since then.

The main new activity I know of has been in the open source SciDB project.  

In other scientific data management news,

Finally, you surely are aware of the whole “Climategate” mess, in which major climate researchers’ email was hacked and many unkind conclusions were drawn. Well, one of the most technical parts of the disclosure was a long series of Read Me files, in which an unfortunate programmer lamented the difficulty of reconstructing published results from the files at hand. These turned out to illustrate a classic problem that SciDB or alternatives are meant to solve:


5 Responses to “Notes on SciDB and scientific data management”

  1. Michael on May 26th, 2010 4:17 pm

    Why has interest from “web analytics users” receded recently? Could this be due to the increased interest in Hadoop/Cassandra and similar products?

  2. Curt Monash on June 3rd, 2010 6:11 am


    SciDB is for analytics; Cassandra is for OLTP, hold the “T”, which I called HVSP in

    Hadoop is a closer competitor, as are RDBMS, MapReduce-enabled or otherwise.

  3. Michael McIntire on July 31st, 2010 1:27 pm

    What is driving the move to Hadoop and other non-relational platforms is the cost and culture of RDBMS implementations.

    The culture problem is related to data management systems forcing data to be transformed into a private and internal form, and all the process that fronts it. Dimensional Modeling is an example. Let’s stop physicalizing dimensional design because that’s what RDBMS products support.

    On the cost front, the cost of generating data declines at roughly the inverse of Moore’s law, not counting non-native per-transaction data growth (I’m collecting more and more data about every event).

    On the analytics side of this problem – it takes many more scans of the full dataset to get a single metric, so the cost of this function grows non-linearly in relation to the data size.

    So – Data costs are declining at the same rate as hardware costs. Data Analytics costs are RISING per unit of data. Put quite simply, at the upper end of the data size spectrum – data owners cannot afford to buy data management software.

  4. Juan on September 28th, 2016 8:01 am

    I need a database with advanced statistical functions, or a statistics program that works transparently on a very large database (ideally a distributed one).
    What software do you suggest?
    Spark (with a suitable underlying file system) could be the solution in the future, but it only lets you do basic things. You can’t fit mixed-effects models or Bayesian models. SciDB has the same problem: you can only use the functions implemented in it, and there are few of them. You can also design your own algorithms, but it’s gonna be quite difficult.

    R or similar programs let you import data from a database, but you can’t perform large operations properly; you can only get summaries or work in chunks.

  5. Curt Monash on October 3rd, 2016 10:12 pm

    Hi Juan,

    As per, I’m not sure that I have a good answer for you.
