Theory and architecture
Analysis of design choices in databases and database management systems. Related subjects include:
As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.
DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.
In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.
Buyer inertia is a greater concern.
- A significant minority of enterprises are highly committed to their enterprise DBMS standards.
- Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
- FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.
A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.
- First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
- Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
- Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
- Meanwhile, Hadoop has emerged as serious competitor for at least some analytic data management, especially but not only at internet companies.
Otherwise I’d say: Read more
I’m commonly asked to assess vendor claims of the kind:
- “Our system lets you do multiple kinds of processing against one database.”
- “Otherwise you’d need two or more data managers to get the job done, which would be a catastrophe of unthinkable proportion.”
So I thought it might be useful to quickly review some of the many ways organizations put multiple data stores to work. As usual, my bottom line is:
- The most extreme vendor marketing claims are false.
- There are many different choices that make sense in at least some use cases each.
Horses for courses
It’s now widely accepted that different data managers are better for different use cases, based on distinctions such as:
- Short-request vs. analytic.
- SQL vs. non-SQL (NoSQL or otherwise).
- Expensive/heavy-duty vs. cheap/easy-to-support.
Vendors are part of this consensus; already in 2005 I observed
For all practical purposes, there are no DBMS vendors left advocating single-server strategies.
Vendor agreement has become even stronger in the interim, as evidenced by Oracle/MySQL, IBM/Netezza, Oracle’s NoSQL dabblings, and various companies’ Hadoop offerings.
Multiple data stores for a single application
We commonly think of one data manager managing one or more databases, each in support of one or more applications. But the other way around works too; it’s normal for a single application to invoke multiple data stores. Indeed, all but the strictest relational bigots would likely agree: Read more
The pessimist thinks the glass is half-empty.
The optimist thinks the glass is half-full.
The engineer thinks the glass was poorly designed.
Most of what I wrote in Part 1 of this post was already true 15 years ago. But much gets added in the modern era, considering that:
- Clusters will have node hiccups more often than single nodes will. (Duh.)
- Networks are relatively slow even when uncongested, and furthermore congest unpredictably.
- In many applications, it’s OK to sacrifice even basic-seeming database functionality.
And so there’s been innovation in numerous cluster-related subjects, two of which are:
- Distributed query and update. When a database is distributed among many modes, how does a request access multiple nodes at once?
- Fault-tolerance in long-running jobs.When a job is expected to run on many nodes for a long time, how can it deal with failures or slowdowns, other than through the distressing alternatives:
- Start over from the beginning?
- Keep (a lot of) the whole cluster’s resources tied up, waiting for things to be set right?
Distributed database consistency
When a distributed database lives up to the same consistency standards as a single-node one, distributed query is straightforward. Performance may be an issue, however, which is why we have seen a lot of:
- Analytic RDBMS innovation.
- Short-request applications designed to avoid distributed joins.
- Short-request clustered RDBMS that don’t allow fully-general distributed joins in the first place.
But in workloads with low-latency writes, living up to those standards is hard. The 1980s approach to distributed writing was two-phase commit (2PC), which may be summarized as: Read more
|Categories: Clustering, CouchDB, Data warehousing, Databricks, Spark and BDAS, Facebook, Hadoop, MapReduce, Sybase, Theory and architecture, VoltDB and H-Store||1 Comment|
Writing data management or analysis software is hard. This post and its sequel are about some of the reasons why.
When systems work as intended, writing and reading data is easy. Much of what’s hard about data management is dealing with the possibility — really the inevitability — of failure. So it might be interesting to survey some of the many ways that considerations of failure come into play. Some have been major parts of IT for decades; others, if not new, are at least newly popular in this cluster-oriented, RAM-crazy era. In this post I’ll focus on topics that apply to single-node systems; in the sequel I’ll emphasize topics that are clustering-specific.
Major areas of failure-aware design — and these overlap greatly — include:
- Backup and restore. In its simplest form, this is very basic stuff. That said — any decent database management system should let backups be made without blocking ongoing database operation, with the least performance impact possible.
- Logging, rollback and replay. Logs are essential to DBMS. And since they’re both ubiquitous and high-performance, logs are being used in ever more ways.
- Locking, latching, transactions and consistency. Database consistency used to be enforced in stern and pessimistic ways. That’s changing, big-time, in large part because of the requirements of …
- … distributed database operations. Increasingly, modern distributed database systems are taking the approach of getting work done first, then cleaning up messes when they occur.
- Redundancy and replication. Parallel computing creates both a need and an opportunity to maintain multiple replicas of data at once, in very different ways than the redundancy and replication of the past.
- Fault-tolerant execution. When one node is inoperative, inaccessible, overloaded or just slow, you may not want a whole long multi-node job to start over. A variety of techniques address this need.
In a single-server, disk-based configuration, techniques for database fault-tolerance start: Read more
After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.
- My claim that Spark will replace Hadoop MapReduce got much Twitter attention — including some high-profile endorsements — and also some responses here.
- My MemSQL post led to a vigorous comparison of MemSQL vs. VoltDB.
- My post on hardware and storage spawned a lively discussion of Hadoop hardware pricing; even Cloudera wound up disagreeing with what I reported Cloudera as having said. Sadly, there was less response to the part about the partial (!) end of Moore’s Law.
- My Cloudera/SQL/Impala/Hive apparently was well-balanced, in that it got attacked from multiple sides via Twitter & email. Apparently, I was too hard on Impala, I was too hard on Hive, and I was too hard on boxes full of cardboard file cards as well.
- My post on the Intel/Cloudera deal garnered a comment reminding us Dell had pushed the Intel distro.
- My CitusDB post picked up a few clarifying comments.
Here is a catch-all post to complete the set. Read more
One of my lesser-known clients is Citus Data, a largely Turkish company that is however headquartered in San Francisco. They make CitusDB, which puts a scale-out layer over a collection of fully-functional PostgreSQL nodes, much like Greenplum and Aster Data before it. However, in contrast to those and other Postgres-based analytic MPP (Massively Parallel Processing) DBMS:
- CitusDB does not permanently fork PostgreSQL; Citus Data has committed to always working with the latest PostgreSQL release, or at least with one that’s less than a year old.
- Citus Data never made the “fat head” mistake — if a join can’t be executed directly on the CitusDB data-storing nodes, it can’t be executed in CitusDB at all.
- CitusDB follows the modern best-practice of having many virtual nodes on each physical node. Default size of a virtual node is one gigabyte. Each virtual node is technically its own PostgreSQL table.*
- Citus Data has already introduced an open source column-store option for PostgreSQL, which CitusDB of course exploits.
*One benefit to this strategy, besides the usual elasticity and recovery stuff, is that while PostgreSQL may be single-core for any given query, a CitusDB query can use multiple cores by virtue of hitting multiple PostgreSQL tables on each node.
Citus has thrown a few things against the wall; for example, there are two versions of its product, one which involves HDFS (Hadoop Distributed File System) and one of which doesn’t. But I think Citus’ focus will be scale-out PostgreSQL for at least the medium-term future. Citus does have actual customers, and they weren’t all PostgreSQL users previously. Still, the main hope — at least until the product is more built-out — is that existing PostgreSQL users will find CitusDB easy to adopt, in technology and price alike.
|Categories: Aster Data, Citus Data, Columnar database management, Data warehousing, Database compression, Greenplum, Hadoop, Parallelization, PostgreSQL, SQL/Hadoop integration, Transparent sharding, Workload management||6 Comments|
I stopped by MemSQL last week, and got a range of new or clarified information. For starters:
- Even though MemSQL (the product) was originally designed for OLTP (OnLine Transaction Processing), MemSQL (the company) is now focused on analytic use cases …
- … which was the point of introducing MemSQL’s flash-based columnar option.
- One MemSQL customer has a 100 TB “data warehouse” installation on Amazon.
- Another has “dozens” of terabytes of data spread across 500 machines, which aggregate 36 TB of RAM.
- At customer Shutterstock, 1000s of non-MemSQL nodes are monitored by 4 MemSQL machines.
- A couple of MemSQL’s top references are also Vertica flagship customers; one of course is Zynga.
- MemSQL reports encountering Clustrix and VoltDB in a few competitive situations, but not NuoDB. MemSQL believes that VoltDB is still hampered by its traditional issues — Java, reliance on stored procedures, etc.
On the more technical side: Read more
|Categories: Clustering, Clustrix, Columnar database management, Data warehousing, Database compression, In-memory DBMS, MemSQL, NewSQL, NuoDB, Specific users, Vertica Systems, VoltDB and H-Store, Workload management, Zynga||17 Comments|
There’s much confusion about Cloudera’s SQL plans and beliefs, and the company has mainly itself to blame. That said, here’s what I think is going on.
- Hive is good at some tasks and terrible at others.
- Hive is good at batch data transformation.
- Hive is bad at ad-hoc query, unless you really, really need Hive’s scale and low license cost. One example, per Eli Collins: Facebook has a 500 petabyte Hive warehouse, but jokes that on a good day an analyst can run 6 queries against it.
- Impala is meant to be good at what Hive is bad at – i.e., fast-response query. (Cloudera mentioned reliable 100 millisecond response times for at least one user.)
- Impala is also meant to be good at what Hive is good at, and will someday from Cloudera’s standpoint completely supersede Hive, but Cloudera is in no hurry for that day to arrive. Hive is more mature. Hive still has more SQL coverage than Impala. There’s a lot of legacy investment in Hive. Cloudera gets little business advantage if a customer sunsets Hive.
- Impala is already decent at some tasks analytic RDBMS are commonly used for. Cloudera insists that some queries run very quickly on Impala. I believe them.
- Impala is terrible at others, including some of the ones most closely associated with the concept of “data warehousing”. Data modeling is a big zero right now. Impala’s workload management, concurrency and all that are very immature.
- There are some use cases for which SQL-on-Hadoop blows away analytic RDBMS, for example ones involving data transformations – perhaps on multi-structured data – that are impractical in RDBMS.
And of course, as vendors so often do, Cloudera generally overrates both the relative maturity of Impala and the relative importance of the use cases in which its offerings – Impala or otherwise – shine.
- A survey of SQL/Hadoop integration (February, 2014)
- The cardinal rules of DBMS development (March, 2013)
|Categories: Cloudera, Data warehousing, Facebook, Hadoop, SQL/Hadoop integration, Workload management||4 Comments|
Spark is on the rise, to an even greater degree than I thought last month.
- Numerous clients and other companies I talk with have adopted Spark, plan to adopt Spark, or at least think it’s likely they will. In particular:
- A number of analytic-stack companies are joining ClearStory in using Spark. Most of the specifics are confidential, but I hope some will be announced soon.
- MapR has joined Cloudera in supporting Spark, and indeed — unlike Cloudera — is supporting the full Spark stack.
- Mike Olson of Cloudera is on record as predicting that Spark will be the replacement for Hadoop MapReduce. Just about everybody seems to agree, except perhaps for Hortonworks folks betting on the more limited and less mature Tez. Spark’s biggest technical advantages as a general data processing engine are probably:
- The Directed Acyclic Graph processing model. (Any serious MapReduce-replacement contender will probably echo that aspect.)
- A rich set of programming primitives in connection with that model.
- Support also for highly-iterative processing, of the kind found in machine learning.
- Flexible in-memory data structures, namely the RDDs (Resilient Distributed Datasets).
- A clever approach to fault-tolerance.
- Spark is a major contender in streaming.
- There’s some cool machine-learning innovation using Spark.
- Spark 1.0 will drop by mid-May, Apache voters willin’ an’ the creek don’ rise. Publicity will likely ensue, with strong evidence of industry support.*
*Yes, my fingerprints are showing again.
The most official description of what Spark now contains is probably the “Spark ecosystem” diagram from Databricks. However, at the time of this writing it is slightly out of date, as per some email from Databricks CEO Ion Stoica (quoted with permission):
… but if I were to redraw it, SparkSQL will replace Shark, and Shark will eventually become a thin layer above SparkSQL and below BlinkDB.
With this change, all the modules on top of Spark (i.e., SparkStreaming, SparkSQL, GraphX, and MLlib) are part of the Spark distribution. You can think of these modules as libraries that come with Spark.
|Categories: Cloudera, Complex event processing (CEP), Databricks, Spark and BDAS, Hadoop, Hortonworks, MapR, MapReduce, Predictive modeling and advanced analytics, SQL/Hadoop integration, Yahoo||14 Comments|
When I’m asked to talk to academics, the requested subject is usually a version of “What should we know about what’s happening in the actual market/real world?” I then try to figure out what the scholars could stand to hear that they perhaps don’t already know.
In the current case (Berkeley next Tuesday), I’m using the title “Necessary complexity”. I actually mean three different but related things by that, namely:
- No matter how cool an improvement you have in some particular area of technology, it’s not very useful until you add a whole bunch of me-too features and capabilities as well.
- Even beyond that, however, the simple(r) stuff has already been built. Most new opportunities are in the creation of complex integrated stacks, in part because …
- … users are doing ever more complex things.
While everybody on some level already knows all this, I think it bears calling out even so.
I previously encapsulated the first point in the cardinal rules of DBMS development:
Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.
That’s if things go extremely well.
Rule 2: You aren’t an exception to Rule 1.
- Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
- Mixed workload management is harder than you’re assuming it is.
- Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
My recent post about MongoDB is just one example of same.
Examples of the second point include but are hardly limited to: Read more