Analysis of issues in parallel computing, especially parallelized database management. Related subjects include:
I recently complained that the Gartner Magic Quadrant for Data Warehouse DBMS conflates many use cases into one set of rankings. So perhaps now would be a good time to offer some thoughts on how to tell use cases apart. Assuming you know that you really want to manage your analytic database with a relational DBMS, the first questions you ask yourself could be:
- How big is your database? How big is your budget?
- How do you feel about appliances?
- How do you feel about the cloud?
- What are the size and shape of your workload?
- How fresh does the data need to be?
Let’s drill down. Read more
I must start by apologizing for giving a quote in a press release whose contents I deplore. Unlike occasions on which I’ve posted about inaccurate quotes, in this case the fault is mine. The quote is quite accurate. And NuoDB didn’t mislead me about the release’s contents; I just neglected to ask.
NuoDB evidently subscribes to the marketing fallacy:
- Big DBMS companies hit people repeatedly with marketing cudgels.
- We want to be a big DBMS company.
- Therefore we will hit people repeatedly with marketing cudgels too.
But to my taste, NuoDB’s worst travesty is not the deafening drumroll before launch (I asked off their mailing list months before), nor the claim that NuoDB’s launch would be a “big day” for the database industry (annoying but ordinary hype), nor the emergent flock of birds foofarah, nor even NuoDB’s overwrought benchmark marketing (distressingly many vendors do that).
Rather, I think NuoDB’s greatest marketing offense to date is its Codd-imitating “12 rules” for cloud database management. Read more
NuoDB has an interesting NewSQL story. NuoDB’s core design goals seem to be:
- Very flexible topology, including:
- Local replicas.
- Remote replicas.
- Easy deployment and management.
GenieDB is one of the newer and smaller NewSQL companies. GenieDB’s story is focused on wide-area replication and uptime, coupled to claims about ease and the associated low TCO (Total Cost of Ownership).
GenieDB is in my same family of clients as Cirro.
The GenieDB product is more interesting if we conflate the existing GenieDB Version 1 and a soon-forthcoming (mid-year or so) Version 2. On that basis:
- GenieDB has three tiers.
- GenieDB’s top tier is the usual MySQL front-end.
- GenieDB’s bottom tier is either Berkeley DB or a conventional MySQL storage engine.
- GenieDB’s bottom tier stores your entire database at every node.
- If you replicate locally, GenieDB’s middle tier operates a distributed cache.
- If you replicate wide-area, GenieDB’s middle tier allows active-active/multi-master replication.
The heart of the GenieDB story is probably wide-area replication. Specifics there include: Read more
|Categories: Cache, Cloud computing, Clustering, GenieDB, Market share and customer counts, MySQL, NewSQL||3 Comments|
I plan to write about several NewSQL vendors soon, but first here’s an overview post. Like “NoSQL”, the term “NewSQL” has an identifiable, recent coiner — Matt Aslett in 2011 — yet a somewhat fluid meaning. Wikipedia suggests that NewSQL comprises three things:
- OLTP- (OnLine Transaction Processing)/short-request-oriented SQL DBMS that are newer than MySQL.
- Innovative MySQL engines.
- Transparent sharding systems that can be used with, for example, MySQL.
I think that’s a pretty good working definition, and will likely remain one unless or until:
- SQL-oriented and NoSQL-oriented systems blur indistinguishably.
- MySQL (or PostgreSQL) laps the field with innovative features.
To date, NewSQL adoption has been limited.
- NewSQL vendors I’ve written about in the past include Akiban, Tokutek, CodeFutures (dbShards), Clustrix, Schooner (Membrain), VoltDB, ScaleBase, and ScaleDB, with GenieDB and NuoDB coming soon.
- But I’m dubious whether, even taken together, all those vendors have as many customers or production references as any of 10gen, Couchbase, DataStax, or Cloudant.*
That said, the problem may lie more on the supply side than in demand. Developing a competitive SQL DBMS turns out to be harder than developing something in the NoSQL state of the art.
Data/database virtualization seems to be a hot subject right now, and vendors of a broad variety of different technologies are all claiming to be in the space. A terminological mess has ensued, as Monash’s First and Third Laws of Commercial Semantics are borne out in spades.
If something is like “virtualization”, then it should resemble hypervisors such as VMware. To me:
- The core feature of a hypervisor is that it allows many somethings to run and coexist where ordinarily only one something would come into play. Here the “many somethings” are virtual machines and what’s going on inside them, and the “one something” is the ordinary operating system/hardware computing stack.
- A core feature of original VMware was that the “many somethings” could be quite different — for example, the operating environments of numerous different hardware systems you wanted to decommission, or of new systems that you didn’t want to buy quite yet.
- Important features of hypervisors include:
- The ability to have multiple virtual machines run side by side at once, safely.
- Flexible and powerful workload management if the virtual machines do contend for resources.
- Easy management.
- The negative feature of having sufficiently low overhead.
Anything that claims to be “like virtualization” should be viewed in that light. Read more
|Categories: Clustering, Data integration and middleware, ScaleDB, Theory and architecture, Transparent sharding||5 Comments|
Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level:
- Rather than persisting data to disk after every step, as MapReduce does, Spark instead writes to something called RDDs (Resilient Distributed Datasets), which can live in memory.
- Rather than being restricted to maps and reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order. All the primitives are parallel with respect to the RDDs.
- Shark is a lot like Hive, only rewritten (in significant parts) and running over Spark.
- There’s an approach to launching tasks quickly — ~5 milliseconds or so — that I unfortunately didn’t grasp.
The key concept here seems to be the RDD. Any one RDD:
- Is a collection of Java objects, which should have the same or similar structure.
- Can be partitioned/distributed and shuffled/redistributed across the cluster.
- Doesn’t have to be entirely in memory at once.
Otherwise, there’s a lot of flexibility; an RDD can be a set of tuples, a collection of XML documents, or whatever other reasonable kind of dataset you want. And I gather that:
- At the moment, RDDs expire at the end of a job.
- This restriction will be lifted in a future release.
|Categories: BDAS, Spark, and Shark, Data models and architecture, Hadoop, MapReduce, Memory-centric data management, Open source, Parallelization||5 Comments|
UC Berkeley’s AMPLab is working on a software stack that:
- Is meant (among other goals) to improve upon Hadoop …
- … but also to interoperate with it, and which in fact …
- … uses significant parts of Hadoop.
- Seems to have the overall name BDAS (Berkeley Data Analytics System).
The whole thing has $30 million in projected funding (half government, half industry) and a 6-year plan (which they’re 2 years into).
Specific projects of note in all that include:
- Mesos, a cluster manager. I don’t know much about Mesos, but it seems to be in production use, most notably at Twitter supporting Storm.
- Spark, a replacement for MapReduce and the associated execution stack.
- Shark, a replacement for Hive.
|Categories: BDAS, Spark, and Shark, ClearStory Data, Hadoop, MapReduce, Parallelization, Specific users||9 Comments|
I’ve been known to gripe that covering big companies such as Microsoft is hard. Still, Doug Leland of Microsoft’s SQL Server team checked in for phone calls in August and again today, and I think I got enough to be worth writing about, albeit at a survey level only,
Subjects I’ll mention include:
- Parallel Data Warehouse
- Columnar data management
- In-memory data management (Hekaton)
One topic I can’t yet comment about is MOLAP/ROLAP, which is a pity; if anybody can refute my claim that ROLAP trumps MOLAP, it’s either Microsoft or Oracle.
Microsoft’s slides mentioned Yahoo refining a 6 petabyte Hadoop cluster into a 24 terabyte SQL Server “cube”, which was surprising in light of Yahoo’s history as an Oracle reference.
|Categories: Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Hadoop, Hortonworks, In-memory DBMS, MapReduce, Market share and customer counts, Microsoft and SQL*Server, Oracle, Yahoo||9 Comments|
My clients at Couchbase checked in.
- After multiple delays, Couchbase 2.0 is well into beta, with general availability being delayed by the holiday season as much as anything else.
- Couchbase (the company) now has >350 subscription customers, almost all for Couchbase (the product) — which is to say for what was known as Membase, which is basically a persistent version of Memcached.
- There also are many users of open source Couchbase, most famously LinkedIn.
- Orbitz is a much-mentioned flagship paying Couchbase customer.
- Couchbase customers mainly seem to be replacing a caching layer, Memcached or otherwise.
- Couchbase headcount is just under 100.
The big changes in Couchbase 2.0 versus the previous (1.8.x) version are:
- JSON storage, including secondary indexes.
- Multi-data-center replication.
- A back-end change from SQLite to a heavily forked version of CouchDB, called Couchstore.
Couchbase 2.0 is upwards-compatible with prior versions of Couchbase (and hence with Memcached), but not with CouchDB.
Technology notes on Couchbase 2.0 include: Read more
|Categories: Basho and Riak, Cache, Cassandra, Clustering, Couchbase, MapReduce, Market share and customer counts, MongoDB and 10gen, NoSQL, Open source, Structured documents||4 Comments|