Memcached-based company NorthScale launches
NorthScale, a start-up based around memcached, has just launched, two weeks after Todd Hoff’s post arguing that the MySQL/memcached combo is passé. NorthScale wouldn’t necessarily disagree with Todd; its position is that what you really should use instead is NorthScale’s combo of memcached and MemBase, a memcached-like DBMS …
… or something like that. I don’t intend to write seriously about NorthScale until I have a better idea of what MemBase is.
In the meantime,
- VentureBeat put up a solid post on NorthScale’s company history and so on
- Om Malik bought into the NorthScale memcached pitch
- TechCrunch has a low-quality post about NorthScale (although it wasn’t as error-riddled as the same author’s post about nStein, which Seth Grimes properly blasted)
| Categories: Cache, Clustering, NoSQL, Parallelization | Leave a Comment |
Toward a NoSQL taxonomy
I talked Friday with Dwight Merriman, founder of 10gen (the MongoDB company). He more or less convinced me of his definition of NoSQL systems, which in my adaptation goes:
NoSQL = HVSP (High Volume Simple Processing) without joins or explicit transactions
Within that realm, Dwight offered a two-part taxonomy of NoSQL systems, according to their data model and replication/sharding strategy. I’d be happier, however, with at least three parts to the taxonomy:
- How data looks logically on a single node
- How data is stored physically on a single node
- How data is distributed, replicated, and reconciled across multiple nodes, and whether applications have to be aware of how the data is partitioned among nodes/shards. Read more
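The third part of that taxonomy can be made concrete with a minimal Python sketch of a store that partitions keys across nodes while keeping the application unaware of where each key lives. Everything here is an illustrative assumption rather than anything Dwight or a specific product described: the `ShardedStore` class, its method names, and the choice of simple hash-modulo partitioning are all hypothetical, and replication and reconciliation are left out entirely.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that spreads keys across in-memory 'nodes'.

    Hypothetical example: each shard is just a dict standing in for a node.
    """

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def shard_for(self, key):
        # Hash the key so placement is stable across runs, then map the
        # hash onto one of the shards (plain hash-modulo partitioning).
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key, default=None):
        return self.shards[self.shard_for(key)].get(key, default)


# The caller never names a shard; routing happens inside the store.
store = ShardedStore(num_shards=4)
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))
```

A system in which the application itself had to call something like `shard_for` before every read or write would sit at the other end of the "aware of partitioning" spectrum.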
| Categories: Cassandra, Data models and architecture, NoSQL, Parallelization, RDF and graphs, Structured documents, Theory and architecture | 4 Comments |
The Naming of the Foo
Let’s start from some reasonable premises.
- No technology category name is ever perfect.
- It’s particularly hard to describe NoSQL (Not Only SQL) accurately, given the basic confusion as to what NoSQL is all about.
- That said, it seems pretty clear that NoSQL is about making big websites (and perhaps other cloud-like installations) run and scale.
- Dwight Merriman (founder/CEO of MongoDB vendor 10gen) is heading in the right direction when he says that the unifying ideas of NoSQL are that you do away with transactions and joins. But if he’s ever said something like “NoSQL is Foo without joins and transactions,” I don’t know what Foo is.
- Actually, I do know what Foo is – Foo is what happens when lots of people want to get small amounts each of information in or out of a database at the same time. I just don’t know what Foo is called.
- Obviously, Foo is a lot like OLTP (OnLine Transaction Processing). However, it would be pretty silly for Foo to actually be OLTP, given that one of the core points of NoSQL is that you don’t have transactions.
- It’s not just the “T” part of OLTP that’s fried. Calling something “OnLine” only makes sense as long as offline is an option, and offline transaction processing has been obsolete for a very long time.*
*Sure, if you strain you can talk yourself into exceptions. But the point stands.
So we need a name for Foo, where Foo is what happens when lots of people want to get small amounts each of information in or out of a database at the same time. Thus, three major subcategories of more-or-less disk-based Foo are:
- No-compromises ACID-compliant relational OLTP
- Sharded MySQL
- NoSQL
There may be some more purely memory-centric versions too, but let’s put those aside for the moment.
Absent a better idea, I can squeeze Foo into yet another four-letter acronym:
HVSP (High-Volume Simple Processing)
That’s as imperfect as any other category name, and an awkward mouthful to boot. So I’d love to hear a better one; if you have such, please share it! In the meantime, I think “HVSP” has merit because:
- The “Processing” part should be noncontroversial.
- “High-Volume” is inherent to the challenge. If RDBMS scale well enough for your use case, using something less powerful is probably silly.* Similarly, while Oracle shines at high-volume OLTP workloads, there are many cheaper DBMS that do a fine job of OLTP at lower volumes.
- “Simple” is the core principle of NoSQL systems, which drop joins and transactions as being too much foofarah. That only makes sense at all under the assumption that you have bone-simple queries and updates, so that programming around the lack of joins and transactions isn’t all that much of a burden.
- Something similar is true of sharded MySQL.
- Less obviously, “simple” is a core principle of relational OLTP as well. The point of the relational model is to cap the complexity of data operations, or more precisely to hide that complexity from programmers.
- And overloading the word “simple” a bit, it’s fair to say that if you’re reading or writing one record at a time, you’re doing something relatively simple, at least as opposed to what you do in analytic processing. The OLTP vs. OLAP distinction is preserved in this name change.
- The whole thing matches my definition above, namely “what happens when lots of people want to get small amounts each of information in or out of a database at the same time.”
*Assuming, of course, that rows-and-tables are a good metaphor for your data structure in the first place.
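As a rough illustration of why the lack of joins is only a small burden when queries are bone-simple, here is a minimal Python sketch of an application-side "join" done as two single-record lookups. The data, the key names, and the `post_with_author` function are all hypothetical, and plain dictionaries stand in for whatever key-value store is actually in use.

```python
# Hypothetical data: two "tables" held as plain key-value maps.
users = {
    "u1": {"name": "Ada"},
    "u2": {"name": "Grace"},
}
posts = {
    "p1": {"author_id": "u1", "title": "Hello"},
    "p2": {"author_id": "u2", "title": "World"},
}


def post_with_author(post_id):
    """Do by hand what a SQL join would do: two single-record lookups."""
    post = posts[post_id]              # first lookup, by primary key
    author = users[post["author_id"]]  # second lookup, keyed by the first result
    return {"title": post["title"], "author": author["name"]}


print(post_with_author("p1"))
```

The hand-written version stays trivial only because each request touches one or two records; anything resembling an analytic query over many records would be far more painful to code this way.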
Systems I’m leaving out of the HVSP and hence also NoSQL categories include:
- Hadoop and other batch-oriented MapReduce. Hadoop isn’t part of NoSQL. I’m pretty sure that Cloudera CEO Mike Olson agrees with me.
- More generally, non-SQL data stores that don’t meet the HVSP criteria. Dave Kellogg stretches things when he claims that MarkLogic is a NoSQL system. (But then, that was in a post where he seemingly praised a train wreck of an article.)
But hey – what good is a categorization if it doesn’t leave some things out?
| Categories: Data models and architecture, Database diversity, Hadoop, MapReduce, Mark Logic, NoSQL, OLTP, Theory and architecture | 23 Comments |
Some NoSQL links
I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I’m poking around a bit reading stuff on the subjects. Here are some links I found.
- A little over a year ago, Julian Browne put up a great post on Eric Brewer’s CAP conjecture/theorem, which provides much of the impetus to relax the traditional requirement for atomicity/consistency.
- Even more directly inspirational to NoSQL technology development were two seminal papers: Google’s on BigTable and Amazon’s on Dynamo. (That said, I’m having trouble getting myself to actually read them from start to finish, especially since they’ve been superseded by subsequent technology development.)
- 10gen (the MongoDB guys) hosted a NoSQL conference yesterday. Much blogging has ensued. The best post I’ve seen so far was by Adam Marcus. I find the graph database notes near the bottom particularly interesting.
- Mark Callaghan hit back against the NoSQL movement hype, and in particular against the “MySQL/memcached is passé” meme. On the other hand, he also bemoaned many failings of MySQL. On the third hand, he praised or at least expressed hope for a variety of MySQL-related technologies, including Tokutek’s TokuDB and Continuent’s Tungsten.
- In connection with that debate, Mark Rendle offered a funny rant, mainly pro-NoSQL, in the style of a Socratic dialogue.
- John Quinn of Digg recently described Digg’s move from MySQL to Cassandra, and outlined a lot of features Digg was adding to Cassandra, all of which it is open-sourcing.
- The NoSQL guys maintain their own long list of NoSQL-related links.
| Categories: Amazon and its cloud, Cassandra, Continuent, Google, MySQL, NoSQL, Open source, RDF and graphs, Tokutek | 5 Comments |
Cassandra and the NoSQL scalable OLTP argument
Todd Hoff put up a provocative post on High Scalability called MySQL and Memcached: End of an Era? The post itself focuses on observations like:
- Facebook invented and is adopting Cassandra.
- Twitter is adopting Cassandra.
- Digg is adopting Cassandra.
- LinkedIn invented and is adopting Voldemort.
- Gee, it seems as if the super-scalable website biz has moved beyond MySQL/Memcached.
But in addition, he provides a lot of useful links, which DBMS-oriented folks such as myself might have previously overlooked. Read more
| Categories: Cassandra, Data models and architecture, NoSQL, OLTP, Open source, Parallelization, Specific users, Theory and architecture | 11 Comments |
Data exploration vs. data visualization
I’ve tended to conflate data exploration and data visualization, and I’m far from alone in doing so. But a recent Economist article is a useful reminder that they aren’t exactly the same thing. Read more
| Categories: Analytic technologies, Business intelligence | 4 Comments |
Another reason to expect number-crunching and big-data management to converge
Dan Olds argues that Oracle is likely to pursue commercially-substantive high performance computing (HPC), emphasis mine: Read more
| Categories: Analytic technologies, Data warehousing, Exadata, Oracle, Theory and architecture | Leave a Comment |
Notes on Sybase Adaptive Server Enterprise
It had been a very long time since I was remotely up to speed on Sybase’s main OLTP DBMS, Adaptive Server Enterprise (ASE). Raj Rathee, however, was kind enough to fill me in a few days ago. Highlights of our chat included: Read more
| Categories: Cache, In-memory DBMS, Memory-centric data management, Sybase | Leave a Comment |
Chris Bird’s blog is brilliant, and update-in-place is increasingly passé
I wouldn’t say every post in Chris Bird’s occasionally-updated blog is brilliant. I wouldn’t even say every post is readable. But I’d still recommend his blog to just about anybody who reads here as, at a minimum, a consciousness-raiser.
One of the two posts inspiring me to mention this is a high-level one on “technical debt”, reminding us why things don’t always get done right the first time, and further reminding us that circling back to fix them sooner rather than later is usually wise. The other connects two observations that individually have great merit (at least if you don’t take them to extremes):
- Update-in-place is passé
- So is elaborate up-front database design
Specific points of interest here include: Read more
| Categories: Theory and architecture | 7 Comments |
February 2010 data warehouse DBMS news roundup
February is usually a busy month for data warehouse DBMS product releases, product announcements, and other real or contrived data warehouse DBMS news, and it can get pretty confusing trying to keep those categories of “news” apart.* This year is no exception, although several vendors – including Teradata and Netezza – are taking “rolling thunder” approaches, doing some of their announcements this month while holding others back for March or April.
*I probably have it worse than most people in that regard, because my clients run tentative feature lists and announcement schedules by me well in advance, which may get changed multiple times before the final dates roll around. I also occasionally miss some detail, if it wasn’t in a pre-briefing but gets added at the end.
Anyhow, the three big themes of this month’s announcements are probably:
- Integrating different kinds of analytic processing into databases and DBMS.
- Taking advantage of hardware advances.
- Playing catchup in areas where small vendors’ products weren’t mature yet.
| Categories: Analytic technologies, Aster Data, Data warehousing, Netezza, Teradata, Vertica Systems | Leave a Comment |
