The cardinal rules of DBMS development
Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.
That’s if things go extremely well.
Rule 2: You aren’t an exception to Rule 1.
- Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
- Mixed workload management is harder than you’re assuming it is.
- Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
DBMS with Hadoop underpinnings …
… aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well aware of this fact. Read more
Hadoop 2.0/YARN is the first big step in evolving Hadoop beyond a strict Map/Reduce paradigm, in that it at least allows for the possibility of non- or beyond-MapReduce processing engines. While YARN didn’t meet its target of general availability around year-end 2012, Arun Murthy of Hortonworks told me recently that:
- Yahoo is a big YARN user.
- There are other — paying — YARN users.
- YARN general availability is now targeted for well before the end of 2013.
Arun further told me about Tez, the next-generation Hadoop processing engine he’s working on, which he also discussed in a recent blog post:
With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG [Directed Acyclic Graph] of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others. The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.
This is similar to the approach of BDAS Spark:
Rather than being restricted to Maps and Reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order.
although Tez won’t match Spark’s richer list of primitive operations.
More specifically, there will be six primitive Tez operations:
- HDFS (Hadoop Distributed File System) input and output.
- Sorting on input and output (I’m not sure why that’s two operations rather than one).
- Shuffling of input and output (ditto).
A Map step would compound HDFS input, output sorting, and output shuffling; a Reduce step compounds — you guessed it! — input sorting, input shuffling, and HDFS output.
I can’t think of much in the way of algorithms that would be logically impossible in MapReduce yet possible in Tez. Rather, the main point of Tez seems to be performance, performance consistency, response-time consistency, and all that good stuff. Specific advantages that Arun and I talked about included:
- The requirement for materializing (onto disk) intermediate results that you don’t want to is gone. (Yay!)
- Hadoop jobs will step on each other’s toes less. Instead of Maps and Reduces from unrelated jobs getting interleaved, all the operations from a single job will by default be executed in one chunk. (Even so, I see no reason to expect early releases of Tez to do a great job on highly concurrent mixed workload management.)
- Added granularity brings opportunities for additional performance enhancements, for example in the area of sorting. (Arun loves sorts.)
|Categories: Databricks, Spark and BDAS, Hadoop, Hortonworks, MapReduce, Workload management, Yahoo||14 Comments|
From time to time I advise a software vendor on how, whether, or to what extent it should offer its technology in open source. In summary, I believe:
- The formal differences between “open source” and “closed source” strategies are of secondary importance.
- The attitudinal and emotional differences between “open source” and “closed source” approaches can be large.
- A pure closed source strategy can make sense.
- A closed source strategy with important open source aspects can make sense.
- A pure open source strategy will only rarely win.
An “open source software” business model and strategy might include:
- Software given away for free.
- Demand generation to encourage people to use the free version of the software.
- Subscription pricing for additional proprietary software and support.
- Direct sales, and further marketing, to encourage users of the free stuff to upgrade to a paid version.
A “closed source software” business model and strategy might include:
- Demand generation.
- Free-download versions of the software.
- Subscription pricing for software (increasingly common) and support (always).
- Direct sales, and associated marketing.
Those look pretty similar to me.
Of course, there can still be differences between open and closed source. In particular: Read more
One elephant went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.
Two elephants went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.
Three elephants went out to play
— Popular children’s song
It’s Strata week, with much Hadoop news, some of which I’ve been briefed on and some of which I haven’t. Rather than delve into fine competitive details, let’s step back and consider some generalities. First, about Hadoop distributions and distro providers:
- Conceptually, the starting point for a “Hadoop distribution” is some version of Apache Hadoop.
- Hortonworks is still focused on Hadoop 1 (without YARN and so on), because that’s what’s regarded as production-ready. But Hortonworks does like HCatalog.
- Cloudera straddles Hadoop 1 and Hadoop 2, shipping aspects of Hadoop 2 but not recommending them for production use.
- Some of the newer distros seem to be based on Hadoop 2, if the markitecture slides are to be believed.
- Optionally, the version numbers of different parts of Hadoop in a distribution could be a little mismatched, if the distro provider takes responsibility for testing them together.
- Cloudera seems more willing to do that than Hortonworks.
- Different distro providers may choose different sets of Apache Hadoop subprojects to include.
- Cloudera seems particularly expansive in what it is apt to include. Perhaps not coincidentally, Cloudera folks started various Hadoop subprojects.
- Optionally, distro providers’ additional proprietary code can be included, to be used either in addition to or instead of Apache Hadoop code. (In the latter case, marketing can then ensue about whether this is REALLY a Hadoop distribution.)
- Hortonworks markets from a “more open source than thou” stance, even though:
- It is not a purist in that regard.
- That marketing message is often communicated by Hortonworks’ very closed-source partners.
- Several distro providers, notably Cloudera, offer management suites as a big part of their proprietary value-add. Hortonworks, however, is focused on making open-source Ambari into a competitive management tool.
- Performance is another big area for proprietary code, especially from vendors who look at HDFS (Hadoop Distributed File System) and believe they can improve on it.
- I conjecture packaging/installation code is often proprietary, but that’s a minor issue that doesn’t get mentioned much.
- Hortonworks markets from a “more open source than thou” stance, even though:
- Optionally, third parties’ code can be provided, open or closed source as the case may be.
Most of the same observations could apply to Hadoop appliance vendors.
|Categories: Cloudera, Data warehouse appliances, EMC, Greenplum, Hadoop, Hortonworks, IBM and DB2, Intel, MapR, Market share and customer counts||3 Comments|
1. It boggles my mind that some database technology companies still don’t view compression as a major issue. Compression directly affects storage and bandwidth usage alike — for all kinds of storage (potentially including RAM) and for all kinds of bandwidth (network, I/O, and potentially on-server).
Trading off less-than-maximal compression so as to minimize CPU impact can make sense. Having no compression at all, however, is an admission of defeat.
2. People tend to misjudge Hadoop’s development pace in either of two directions. An overly expansive view is to note that some people working on Hadoop are trying to make it be all things for all people, and to somehow imagine those goals will soon be achieved. An overly narrow view is to note an important missing feature in Hadoop, and think there’s a big business to be made out of offering it alone.
At this point, I’d guess that Cloudera and Hortonworks have 500ish employees combined, many of whom are engineers. That allows for a low double-digit number of 5+ person engineering teams, along with a number of smaller projects. The most urgently needed features are indeed being built. On the other hand, a complete monument to computing will not soon emerge.
3. Schooner’s acquisition by SanDisk has led to the discontinuation of Schooner’s SQL DBMS SchoonerSQL. Schooner’s flash-optimized key-value store Membrain continues. I don’t have details, but the Membrain web page suggests both data store and cache use cases.
4. There’s considerable personnel movement at Boston-area database technology companies right now. Please ping me directly if you care.
I’ve been known to gripe that covering big companies such as Microsoft is hard. Still, Doug Leland of Microsoft’s SQL Server team checked in for phone calls in August and again today, and I think I got enough to be worth writing about, albeit at a survey level only,
Subjects I’ll mention include:
- Parallel Data Warehouse
- Columnar data management
- In-memory data management (Hekaton)
One topic I can’t yet comment about is MOLAP/ROLAP, which is a pity; if anybody can refute my claim that ROLAP trumps MOLAP, it’s either Microsoft or Oracle.
Microsoft’s slides mentioned Yahoo refining a 6 petabyte Hadoop cluster into a 24 terabyte SQL Server “cube”, which was surprising in light of Yahoo’s history as an Oracle reference.
|Categories: Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Hadoop, Hortonworks, In-memory DBMS, MapReduce, Market share and customer counts, Microsoft and SQL*Server, Oracle, Yahoo||10 Comments|
To a first approximation, HCatalog is the thing that will make Hadoop play nicely with all kinds of ETLT (Extract/Transform/Load/Transform). However, HCatalog is both less and more than that:
- Less, because some of the basic features people expect in HCatalog may well wind up in other projects instead.
- More, in that HCatalog could (also) be regarded as a manifestation of efforts to make Hadoop/Hive more like a DBMS.
The base use case for HCatalog is:
- You have an HDFS file, and know its record format.
- You write DDL (Data Description Language) to create a virtual table whose columns look a lot like the HDFS file’s record fields. The language for this is HiveQL, Hive’s SQL-like language.
- You interact with that table in any of variety of ways (batch operations only), including but not limited to:
Major variants on that include: Read more
A lot of confusion seems to have built around the facts:
- Hadoop MapReduce is being opened up into something called MapReduce 2 (MRv2).
- Something called YARN (Yet Another Resource Negotiator) is involved.
- One purpose of the whole thing is to make MapReduce not be required for Hadoop.
- MPI (Message Passing Interface) was mentioned as a paradigmatic example of a MapReduce alternative, yet the MPI/YARN/Hadoop effort is somehow troubled.
- Cloudera shipped YARN in June, yet simultaneously warned people away from actually using it.
Here’s my best effort to make sense of all that, helped by a number of conversations with various Hadoop companies, but most importantly a chat Friday with Arun Murthy and other Hortonworks folks.
- YARN, as an aspect of Hadoop, has two major kinds of benefits:
- The ability to use programming frameworks other than MapReduce.
- Scalability, no matter what programming framework you use.
- The YARN availability story goes:
- YARN is in alpha.
- YARN is expected to be in production at year-end, give or take.
- Cloudera made the marketing decision to include YARN in its June Hadoop distribution release anyway, but advised that it was for experimentation rather than production.
- Hortonworks, in its own June release, only shipped code it advised putting into production.
- My take on the YARN/MPI story goes something like this:
- Numerous people have told me of YARN/MPI delays.
- One person suggested that Greenplum is taking the lead in YARN/MPI integration, but has gotten slow and reclusive, apparently due to some big company-itis.
- I find that credible because of the Greenplum/SAS/MPI connection.
- If I understood Arun correctly, the latency story on Hadoop MapReduce is approximately:
- Arun says that Hadoop’s reputation for taking 10s of seconds to start a Hadoop job is old news. It takes a low single-digit number of seconds.
- However, starting all that Java does take 100s of milliseconds at best — 200 milliseconds in an ideal case, 500 milliseconds more realistically, and that’s just on a single server.
- Thus, if you want human real-time interaction, Hadoop MapReduce is not and likely never will be the way to go. Getting Hadoop MapReduce latencies under a few seconds is likely to be more trouble than it’s worth — because of MapReduce, not because of Hadoop.
- In particular — instead of incurring the overhead of starting processes up, Arun thinks low-latency needs should be met in a different way, namely by serving them from already-running processes. The examples he kept mentioning were the event processing projects Storm (out of Twitter, via an acquisition) and S4 (out of Yahoo).
In my recent series of Hadoop posts, there were several cases where I had to choose between recommending that enterprises:
- Go with the most advanced features any vendor was credibly advocating.
- Be more cautious, and only adopt features that have been solidly proven in the field.
I favored the more advanced features each time. Here’s why.
To a first approximation, I divide Hadoop use cases into two major buckets, only one of which I was addressing with my comments:
1. Analytic data management.* Here I favored features over reliability because they are more important, for Hadoop as for analytic RDBMS before it. When somebody complains about an analytic data store not being ready for prime time, never really working, or causing them to tear their hair out, what they usually mean is that:
- It couldn’t do the work that needed doing …
- … with reasonable performance and turnaround time …
- … without undue effort in administration and/or programming.
Those complaints are much, much, more frequent than “It crashed”. So it was for Netezza, DATAllegro, Greenplum, Aster Data, Vertica, Infobright, et al. So it also is for Hadoop. And how does one address those complaints? By performance and feature enhancements, of the kind that the Hadoop community is introducing at high speed. Read more
|Categories: Buying processes, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, HBase, Hortonworks, Open source||Leave a Comment|
This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production.
- In general, how “enterprise-ready” is Hadoop (this post)?
- HBase 0.92.
The posts depend on each other in various ways.
Cloudera, Hortonworks, and MapR all claim, in effect, “Our version of Hadoop is enterprise-ready, unlike those other guys’.” I’m dubious.
- Hortonworks has considerably fewer features than Cloudera, along with less of a production or support track record. (Edit: HCatalog may be a significant exception.)
- I doubt Cloudera really believes or can support the apparent claim in its CDH 4 press release that Hadoop is now suitable for every enterprise, whereas last month it wasn’t.
- While MapR was early with some nice enterprise features, such as high availability or certain management UI elements — quickly imitated in Cloudera Enterprise — I don’t think it has any special status as “enterprise-ready” either.
That said, “enterprise-ready Hadoop” really is an important topic.
So what does it mean for something to be “enterprise-ready”, in whole or in part? Common themes in distinguishing between “enterprise-class” and other software include:
- Usable by our existing staff.
- Sufficiently feature-rich.
- Integrates well with the rest of our environment.
- Fits well into our purchasing and vendor relations model.
- Sufficiently reliable, proven, and secure — which is to say, “safe”.
For Hadoop, as for most things, these concepts overlap in many ways. Read more
|Categories: Buying processes, Cloudera, Clustering, Hadoop, HBase, Hortonworks, MapR, MapReduce, Open source||7 Comments|