When databases are too big to manage via a single server, responsibility for them is spread among multiple servers. There are numerous names for this strategy, or versions of it — all of them at least somewhat problematic. The most common terms include:
- (Shared-nothing) MPP (Massively Parallel Processing), often used to describe analytic DBMS. On the whole, these terms have worked pretty well, but they have issues even so. First, “MPP” means different things to different marketers. Second, most ostensibly “shared-nothing” systems aren’t really “shared-nothing.” They generally support at least storage arrays, if not storage-area networks (SANs); indeed, in a couple of cases (most notably EMC Greenplum), SAN support is prominent in their marketing message.
- (Horizontal) partitioning and/or data distribution. These have significant problems. “Partitioning” and “distribution” are easily confused with each other, not least because the term “partitioning” is used in different ways by different DBMS product vendors.
- Sharding, commonly used to describe scaled-out MySQL in Internet Request Processing use cases. This one has the advantage of being concise, but is beginning to mean two different things, in that it is used both when the data is REALLY in separate databases on different machines (i.e., the application has to explicitly reference the shard it wants to talk to) and also when the database is transparently distributed (e.g. via dbShards).
- Coherent caching and/or distributed shared memory, describing cases when data is in RAM. Besides being RAM-specific, these terms can be vague as to whether the same data is recopied onto different systems, or whether they are focused on letting (relatively) large in-memory data stores be spread across a cluster.
I plan to start using the term transparent sharding to denote a data management strategy in which data is assigned to multiple servers (or CPUs, cores, etc.), yet looks to programmers and applications as if it were managed by just one. Thus,
- dbShards and ScaleBase feature transparent sharding (this is the case which inspired me to introduce the term).
- Anything which has ever reasonably been called a “shared-nothing” MPP DBMS features transparent sharding.
- Memcached features transparent sharding. So, I imagine, do other caching systems I am less familiar with.
- Shared-disk DBMS do not feature transparent sharding, even if their query work can be scaled out across multiple servers. (But Oracle Exadata does, because of its server tier.)
At the moment, I don’t see much benefit to introducing the term “transparent sharding” into discussions of analytic DBMS; rather, it’s targeted at the Internet Request Processing case. But as the terms “MPP” and “shared-nothing” get ever less useful, I could at some point change my mind.*
*One reason not to switch terms: “MPP” is marvelously concise.
What do you think of this terminology? Comments will be extremely welcome. But please be so kind as to recall one thing — no technology category definition can ever be perfect.