Database diversity
Discussion of choices and variety in database management system architecture. Related subjects include:
The Naming of the Foo
Let’s start from some reasonable premises.
- No technology category name is ever perfect.
- It’s particularly hard to describe NoSQL (Not Only SQL) accurately, given the basic confusion as to what NoSQL is all about.
- That said, it seems pretty clear that NoSQL is about making big websites (and perhaps other cloud-like installations) run and scale.
- Dwight Merriman (founder/CEO of MongoDB vendor 10gen) is heading in the right direction when he says that the unifying ideas of NoSQL are that you do away with transactions and joins. But if he’s ever said something like “NoSQL is Foo without joins and transactions,” I don’t know what Foo is.
- Actually, I do know what Foo is – Foo is what happens when lots of people want to get small amounts each of information in or out of a database at the same time. I just don’t know what Foo is called.
- Obviously, Foo is a lot like OLTP (OnLine Transaction Processing). However, it would be pretty silly for Foo to actually be OLTP, given that one of the core points of NoSQL is that you don’t have transactions.
- It not just the “T” part of OLTP that’s fried. Calling something “OnLine” only makes sense as long as offline is an option, and offline transaction processing has been obsolete for a very long time.*
*Sure, if you strain you can talk yourself into exceptions. But the point stands.
So we need a name for Foo, where Foo is what happens when lots of people want to get small amounts each of information in or out of a database at the same time. Thus, three major subcategories of more-or-less disk-based Foo are:
- No-compromises ACID-compliant relational OLTP
- Sharded MySQL
- NoSQL
There may be some more purely memory-centric versions too, but let’s put those aside for the moment.
Absent a better idea, I can squeeze Foo into yet another four-letter acronym:
HVSP (High-Volume Simple Processing)
That’s as imperfect as any other category name, and an awkward mouthful to boot. So I’d love to hear a better one; if you have such, please share it! In the mean time, I think “HVSP” has merit because:
- The “Processing” part should be noncontroversial.
- “High-Volume” is inherent to the challenge. If RDBMS scale well enough for your use case, using something less powerful is probably silly.* Similarly, while Oracle shines at high-volume OLTP workloads, there are many cheaper DBMS that do a fine job of OLTP at lower volumes.
- “Simple” is the core principle of NoSQL systems, which drop joins and transactions as being too much foofarah. That only makes sense at all under the assumption that you have bone-simple queries and updates, so that programming around the lack of joins and transactions isn’t all that much of a burden.
- Something similar is true of sharded MySQL.
- Less obviously, “simple” is a core principle of relational OLTP as well. The point of the relational model is to cap the complexity of data operations, or more precisely to hide that complexity from programmers.
- And overloading the word “simple” a bit, it’s fair to say that if you’re reading or writing one record at a time, you’re doing something relatively simple, at least as opposed to what you do in analytic processing. The OLTP vs. OLAP distinction is preserved in this name change.
- The whole thing matches my definition above, namely “what happens when lots of people want to get small amounts each of information in or out of a database at the same time.”
*Assuming, of course, that rows-and-tables are a good metaphor for your data structure in the first place.
Systems I’m leaving out of the HVSP and hence also NoSQL categories include:
- Hadoop and other batch-oriented MapReduce. Hadoop isn’t part of NoSQL. I’m pretty sure that Cloudera CEO Mike Olson agrees with me.
- More generally, non-SQL data stores that don’t meet the HVSP criteria. Dave Kellogg stretches things when he claims that MarkLogic is a NoSQL system. (But then, that was in a post where he seemingly praised a train wreck of an article.)
But hey – what good is a categorization if it doesn’t leave some things out?
| Categories: Data models and architecture, Database diversity, Hadoop, MapReduce, Mark Logic, NoSQL, OLTP, Theory and architecture | 23 Comments |
Three broad categories of data
People often try to draw a distinction between:
- Traditional data of the sort that’s stored in relational databases, aka “structured.”
- Everything else, aka “unstructured” or “semi-structured” or “complex.”
There are plenty of problems with these formulations, not the least of which is that the supposedly “unstructured” data is the kind that actually tends to have interesting internal structures. But of the many reasons why these distinctions don’t tend to work very well, I think the most important one is that:
Databases shouldn’t be divided into just two categories. Even as a rough-cut approximation, they should be divided into three, namely:
- Human/Tabular data –i.e., human-generated data that fits well into relational tables or arrays
- Human/Nontabular data — i.e., all other data generated by humans
- Machine-Generated data
Even that trichotomy is grossly oversimplified, for reasons such as:
- These categories overlap.
- There are kinds of data that get into fuzzy border zones.
- Not all data in each category has all the same properties.
But at least as a starting point, I think this basic categorization has some value. Read more
| Categories: Database diversity, Investment research and trading, Log analysis, Telecommunications, Web analytics | 9 Comments |
The legit part of the NoSQL idea
I’ve written some snarky things about the “NoSQL” concept – or at least the moniker. (Carl Olofson’s term “non-schematic databases” seems less bad.) Yet I’m actually favorable about the increasing use of SQL alternatives. Perhaps I should pull those thoughts together. Read more
| Categories: Data models and architecture, Database diversity, Hadoop, NoSQL, Theory and architecture | 20 Comments |
NoSQL Q and A
Neal Leavitt is writing an article for IEEE on NoSQL. So he’s circulated a long list of questions, encouraging people to answer as many or few as they choose. Unfortunately, most of the questions are technically meaningless, in that they implicitly rely on the false assumption that there is such a thing as a single or at least reasonably well-defined NoSQL technology. (I imagine most of his questions are really about key-value stores.) Nonetheless, I took a crack at a number of them before getting bored. Anybody else want to pitch in too? Read more
| Categories: Data models and architecture, Database diversity, NoSQL, Theory and architecture | 10 Comments |
How 30+ enterprises are using Hadoop
MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:
- Some are in heavy production with Hadoop, and closely engaged with Cloudera. Others are active Hadoop users but are very secretive. Yet others signed up for initial Hadoop training last week.
- Some have Hadoop clusters in the thousands of nodes. Many have Hadoop clusters in the 50-100 node range. Others are just prototyping Hadoop use. And one seems to be “OEMing” a small Hadoop cluster in each piece of equipment sold.
- Many export data from Hadoop to a relational DBMS; many others just leave it in HDFS (Hadoop Distributed File System), e.g. with Hive as the query language, or in exactly one case Jaql.
- Some are household names, in web businesses or otherwise. Others seem to be pretty obscure.
- Industries include financial services, telecom (Asia only, and quite new), bioinformatics (and other research), intelligence, and lots of web and/or advertising/media.
- Application areas mentioned — and these overlap in some cases — include:
- Log and/or clickstream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance
HadoopDB
Despite a thoughtful heads-up from Daniel Abadi at the time of his original posting about HadoopDB, I’m just getting around to writing about it now. HadoopDB is a research project carried out by a couple of Abadi’s students. Further research is definitely planned. But it seems too early to say that HadoopDB will ever get past the “research and oh by the way the code is open sourced” stage and become a real code line — whether commercialized, open source, or both.
The basic idea of HadoopDB is to put copies of a DBMS at different nodes of a grid, and use Hadoop to parcel work among them. Major benefits when compared with massively parallel DBMS are said to be:
- Open/cheap/free
- Query fault-tolerance
- The related concept of tolerating node degradation that isn’t an outright node failure.
HadoopDB has actually been built with PostgreSQL. That version achieved performance well below that of a commercial DBMS “DBX”, where X=2. Column-store guru Abadi has repeatedly signaled his intention to try out HadoopDB with VectorWise at the nodes instead. (Recall that VectorWise is shared-everything.) It will be interesting to see how that configuration performs.
The real opportunity for HadoopDB, however, in my opinion may lie elsewhere. Read more
Introduction to the XLDB and SciDB projects
Before I write anything else about the overlapping efforts known as XLDB and SciDB, I probably should explain and disambiguate what they are as best I can. XLDB was organized and still is run by guys who want to solve a scientific problem in eXtremely Large DataBase Management, most especially Jacek Becla of SLAC (the organization previously known as Stanford Linear Accelerator Center). Becla’s original motivation was that he needs a DBMS to manage what will be 55 petabytes of raw image data and 100 petabytes of astronomical data total for LSST (Large Synoptic Survey Telescope). Read more
| Categories: Data models and architecture, Database diversity, Michael Stonebraker, Open source, Petabyte-scale data management, Scientific research, Theory and architecture, eBay | 1 Comment |
NoSQL?
Eric Lai emailed today to ask what I thought about the NoSQL folks, and especially whether I thought their ideas were useful for enterprises in general, as opposed to just Web 2.0 companies. That was the first I heard of NoSQL, which seems to be a community discussing SQL alternatives popular among the cloud/big-web-company set, such as BigTable, Hadoop, Cassandra and so on. My short answers are:
- In most cases, no.
- Most of these technologies are designed for simple, high-volume OLTP (OnLine Transaction Processing.) Most large enterprises have an established way of doing OLTP, probably via relational database management systems. Why change?
- MapReduce is an exception, in that it’s designed for analytics. MapReduce may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS.
- There’s one big countervailing factor to all these generalities — schema flexibility.
As for the longer form, let me start by noting that there are two main kinds of reason for not liking SQL. Read more
The Explosion in DBMS Choice
If there’s one central theme to DBMS2, it’s that modern DBMS alternatives should in many cases be used instead of the traditional market leaders. So it was only a matter of time before somebody sponsored a white paper on that subject. The paper, sponsored by EnterpriseDB, is now posted along with my other recent white papers. Its conclusion — summarizing what kinds of database management system you should use in which circumstances — is reproduced below.
Many new applications are built on existing databases, adding new features to already-operating systems. But others are built in connection with truly new databases. And in the latter cases, it’s rare that a market-leading product is the best choice. Mid-range DBMS (for OLTP) or specialty data warehousing systems (for analytics) are usually just as capable, and much more cost-effective. Exceptions arise mainly in three kinds of cases:
- Small enterprises with very limited staff.
- Large enterprises that have negotiated heavily-discounted deals for a market-leading product.
- Super-high-end OLTP apps that need absolute top throughput (or security certifications, etc.)
Otherwise, the less costly products are typically the wiser choice. Read more
| Categories: Database diversity | 6 Comments |
My own data management software taxonomy
On a recent webcast, I presented an 11-node data management software taxonomy, updating a post commenting on Mike Stonebraker’s. It goes:
1. High-end OLTP/general-purpose DBMS
2. Mid-range OLTP/general-purpose DBMS
3. Row-based analytic RDBMS
4. Column- or array-based analytic RDBMS
5. Text search engines
6. XML and OO DBMS (but these may merge with search)
7. RDF and other graphical DBMS (but these may merge with relational)
8. Event/stream processing engines (aka CEP)
9. Embedded DBMS for devices
10. Sub-DBMS file managers (e.g. MapReduce/Hadoop)
11. Science DBMS
Obviously, this is a work in progress. In particular, while there’s clearly more than one kind of analytic DBMS, partitioning them into categories is not easy.
| Categories: Database diversity | 5 Comments |
