November 15, 2008

The query from hell, and other stories

I write about a lot of products whose core job boils down to “Make queries run fast.” Without exception, their vendors tout stories of remarkable performance gains over conventional/incumbent DBMS (reported improvement is usually at least 50-fold, and commonly 100-500+). They further claim at least 2-3X better performance than their close competitors. In making these claims, vendors usually stress that their results come from live customer benchmarks. In few if any of the cases, I judge, are they lying outright. So what’s going on?

Categories: Benchmarks and POCs, Data warehousing

MySQL is being used in an IBM Lotus appliance

Apparently, IBM is rolling out an appliance for small businesses. MySQL is under the covers. The appliance won’t have a keyboard or monitor, so there won’t be a lot of database administration going on.

Before Solid and solidDB were acquired by IBM, one of the things Solid was proudest of was some embedded apps in which solidDB ran for years in boxes without keyboards or monitors.

I still think it’s a pity that IBM isn’t using solidDB as broadly as the technology deserves. Even so, this is a nice endorsement of MySQL for reliable zero-DBA mid-range use.

Categories: DBMS product categories, IBM and DB2, Mid-range, MySQL, solidDB

Big scientific databases need to be stored somehow

A year ago, Mike Stonebraker observed that conventional DBMS don’t necessarily do a great job on scientific data, and further pointed out that different kinds of science might call for different data access methods. Even so, some of the largest databases around are scientific ones, and they have to be managed somehow. For example:

Microsoft just put out an overwrought press release. The substance seems to be that Pan-STARRS (a Jim Gray legacy also discussed in an August 2008 Computerworld article) is adding 1.4 terabytes of image data per night, and that one not-so-new database adds 15 terabytes per year of some kind of computer simulation output used to analyze protein folding. Both run on SQL Server, of course.
Kognitio has an astronomical database too, at Cambridge University, adding half a terabyte of data per night.
Oracle is used for a McGill University proteomics database called CellMapBase. A figure of 50 terabytes of “mass storage” is cited, which doesn’t include tape backup and so on.
The Large Hadron Collider, once it actually starts functioning, is projected to generate 15 petabytes of data annually, which will be initially stored on tape and then distributed to various computing centers around the world.
Netezza is proud of its ability to serve images and the like quickly, although off the top of my head I’m not thinking of a major customer it has in that area. (Then again, if you just sell software, your academic discount can approach 100%; if, like Netezza, you have an actual cost of goods sold, that’s not as appealing an option.)
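The figures above mix per-night and per-year rates, so they are hard to compare at a glance. Here is a rough back-of-the-envelope sketch that puts them on a common terabytes-per-year basis. Two assumptions are mine rather than the vendors': the nightly figures are applied to all 365 nights of the year (which surely overstates what a telescope actually produces), and a petabyte is taken as 1,000 terabytes. The labels are just illustrative names.

```python
# Rough annualized ingest rates for the figures quoted in the list above.
# Assumptions (not from the sources): nightly rates hold 365 nights a year,
# which is an upper bound for telescopes, and 1 PB = 1,000 TB.
NIGHTS_PER_YEAR = 365
TB_PER_PB = 1000

annual_tb = {
    "Pan-STARRS (SQL Server)": 1.4 * NIGHTS_PER_YEAR,
    "Protein folding simulations (SQL Server)": 15.0,
    "Cambridge astronomy (Kognitio)": 0.5 * NIGHTS_PER_YEAR,
    "Large Hadron Collider (projected)": 15 * TB_PER_PB,
}

# Print smallest to largest so the spread in scale is easy to see.
for name, tb in sorted(annual_tb.items(), key=lambda item: item[1]):
    print(f"{name:42s} {tb:>10,.1f} TB/year")
```

Even under these generous assumptions, the two astronomy databases come out in the hundreds of terabytes per year, while the projected Large Hadron Collider output is more than an order of magnitude beyond either.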

Long-term, I imagine that the most suitable DBMS for these purposes will be MPP systems with strong datatype extensibility — e.g., DB2, PostgreSQL-based Greenplum, PostgreSQL-based Aster nCluster, or maybe Oracle.