Database diversity
Discussion of choices and variety in database management system architecture. Related subjects include:
Three broad categories of data
People often try to draw a distinction between:
- Traditional data of the sort that’s stored in relational databases, aka “structured.”
- Everything else, aka “unstructured” or “semi-structured” or “complex.”
There are plenty of problems with these formulations, not the least of which is that the supposedly “unstructured” data is the kind that actually tends to have interesting internal structures. But of the many reasons why these distinctions don’t tend to work very well, I think the most important one is that:
Databases shouldn’t be divided into just two categories. Even as a rough-cut approximation, they should be divided into three, namely:
- Human/Tabular data –i.e., human-generated data that fits well into relational tables or arrays
- Human/Nontabular data — i.e., all other data generated by humans
- Machine-Generated data
Even that trichotomy is grossly oversimplified, for reasons such as:
- These categories overlap.
- There are kinds of data that get into fuzzy border zones.
- Not all data in each category has all the same properties.
But at least as a starting point, I think this basic categorization has some value. Read more
| Categories: Database diversity, Investment research and trading, Log analysis, Telecommunications, Web analytics | 19 Comments |
The legit part of the NoSQL idea
I’ve written some snarky things about the “NoSQL” concept – or at least the moniker. (Carl Olofson’s term “non-schematic databases” seems less bad.) Yet I’m actually favorable about the increasing use of SQL alternatives. Perhaps I should pull those thoughts together. Read more
| Categories: Data models and architecture, Database diversity, Hadoop, NoSQL, Theory and architecture | 20 Comments |
NoSQL Q and A
Neal Leavitt is writing an article for IEEE on NoSQL. So he’s circulated a long list of questions, encouraging people to answer as many or few as they choose. Unfortunately, most of the questions are technically meaningless, in that they implicitly rely on the false assumption that there is such a thing as a single or at least reasonably well-defined NoSQL technology. (I imagine most of his questions are really about key-value stores.) Nonetheless, I took a crack at a number of them before getting bored. Anybody else want to pitch in too? Read more
| Categories: Data models and architecture, Database diversity, NoSQL, Theory and architecture | 10 Comments |
How 30+ enterprises are using Hadoop
MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:
- Some are in heavy production with Hadoop, and closely engaged with Cloudera. Others are active Hadoop users but are very secretive. Yet others signed up for initial Hadoop training last week.
- Some have Hadoop clusters in the thousands of nodes. Many have Hadoop clusters in the 50-100 node range. Others are just prototyping Hadoop use. And one seems to be “OEMing” a small Hadoop cluster in each piece of equipment sold.
- Many export data from Hadoop to a relational DBMS; many others just leave it in HDFS (Hadoop Distributed File System), e.g. with Hive as the query language, or in exactly one case Jaql.
- Some are household names, in web businesses or otherwise. Others seem to be pretty obscure.
- Industries include financial services, telecom (Asia only, and quite new), bioinformatics (and other research), intelligence, and lots of web and/or advertising/media.
- Application areas mentioned — and these overlap in some cases — include:
- Log and/or clickstream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance
HadoopDB
Despite a thoughtful heads-up from Daniel Abadi at the time of his original posting about HadoopDB, I’m just getting around to writing about it now. HadoopDB is a research project carried out by a couple of Abadi’s students. Further research is definitely planned. But it seems too early to say that HadoopDB will ever get past the “research and oh by the way the code is open sourced” stage and become a real code line — whether commercialized, open source, or both.
The basic idea of HadoopDB is to put copies of a DBMS at different nodes of a grid, and use Hadoop to parcel work among them. Major benefits when compared with massively parallel DBMS are said to be:
- Open/cheap/free
- Query fault-tolerance
- The related concept of tolerating node degradation that isn’t an outright node failure.
HadoopDB has actually been built with PostgreSQL. That version achieved performance well below that of a commercial DBMS “DBX”, where X=2. Column-store guru Abadi has repeatedly signaled his intention to try out HadoopDB with VectorWise at the nodes instead. (Recall that VectorWise is shared-everything.) It will be interesting to see how that configuration performs.
The real opportunity for HadoopDB, however, in my opinion may lie elsewhere. Read more
Introduction to the XLDB and SciDB projects
Before I write anything else about the overlapping efforts known as XLDB and SciDB, I probably should explain and disambiguate what they are as best I can. XLDB was organized and still is run by guys who want to solve a scientific problem in eXtremely Large DataBase Management, most especially Jacek Becla of SLAC (the organization previously known as Stanford Linear Accelerator Center). Becla’s original motivation was that he needs a DBMS to manage what will be 55 petabytes of raw image data and 100 petabytes of astronomical data total for LSST (Large Synoptic Survey Telescope). Read more
| Categories: Data models and architecture, Database diversity, eBay, Michael Stonebraker, Open source, Petabyte-scale data management, Scientific research, Theory and architecture | 2 Comments |
NoSQL?
Eric Lai emailed today to ask what I thought about the NoSQL folks, and especially whether I thought their ideas were useful for enterprises in general, as opposed to just Web 2.0 companies. That was the first I heard of NoSQL, which seems to be a community discussing SQL alternatives popular among the cloud/big-web-company set, such as BigTable, Hadoop, Cassandra and so on. My short answers are:
- In most cases, no.
- Most of these technologies are designed for simple, high-volume OLTP (OnLine Transaction Processing.) Most large enterprises have an established way of doing OLTP, probably via relational database management systems. Why change?
- MapReduce is an exception, in that it’s designed for analytics. MapReduce may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS.
- There’s one big countervailing factor to all these generalities — schema flexibility.
As for the longer form, let me start by noting that there are two main kinds of reason for not liking SQL. Read more
The Explosion in DBMS Choice
If there’s one central theme to DBMS2, it’s that modern DBMS alternatives should in many cases be used instead of the traditional market leaders. So it was only a matter of time before somebody sponsored a white paper on that subject. The paper, sponsored by EnterpriseDB, is now posted along with my other recent white papers. Its conclusion — summarizing what kinds of database management system you should use in which circumstances — is reproduced below.
Many new applications are built on existing databases, adding new features to already-operating systems. But others are built in connection with truly new databases. And in the latter cases, it’s rare that a market-leading product is the best choice. Mid-range DBMS (for OLTP) or specialty data warehousing systems (for analytics) are usually just as capable, and much more cost-effective. Exceptions arise mainly in three kinds of cases:
- Small enterprises with very limited staff.
- Large enterprises that have negotiated heavily-discounted deals for a market-leading product.
- Super-high-end OLTP apps that need absolute top throughput (or security certifications, etc.)
Otherwise, the less costly products are typically the wiser choice. Read more
| Categories: Database diversity | 7 Comments |
My own data management software taxonomy
On a recent webcast, I presented an 11-node data management software taxonomy, updating a post commenting on Mike Stonebraker’s. It goes:
1. High-end OLTP/general-purpose DBMS
2. Mid-range OLTP/general-purpose DBMS
3. Row-based analytic RDBMS
4. Column- or array-based analytic RDBMS
5. Text search engines
6. XML and OO DBMS (but these may merge with search)
7. RDF and other graphical DBMS (but these may merge with relational)
8. Event/stream processing engines (aka CEP)
9. Embedded DBMS for devices
10. Sub-DBMS file managers (e.g. MapReduce/Hadoop)
11. Science DBMS
Obviously, this is a work in progress. In particular, while there’s clearly more than one kind of analytic DBMS, partitioning them into categories is not easy.
| Categories: Database diversity | 5 Comments |
Webcast on database diversity Wednesday April 9 2 pm Eastern
Once or twice a year, EnterpriseDB sponsors a webcast for me. The last two were super well-attended. And most people stayed to the end, which is generally an encouraging sign!
The emphasis this time is on alternatives to the market-leading DBMS. I’ll highlight the advantages of both data warehousing specialists and general-purpose mid-range DBMS (naturally focusing on the latter, given who the sponsor is). The provocative title is taken from a January, 2008 post — What leading DBMS vendors don’t want you to realize. If you read every word of this blog, there probably won’t be much new for you.
But I’d love to have you listen in and perhaps ask a question anyway!
You can register on EnterpriseDB’s webcast page, which also has an archived webcast I did for them in October, 2007.
| Categories: Database diversity, EnterpriseDB and Postgres Plus, Mid-range | 1 Comment |
