Discussion of relational database management systems that are offered through some version of open source licensing. Related subjects include:
Hadoop World/Strata is this week, so of course my clients at Cloudera will have a bunch of announcements. Without front-running those, I think it might be interesting to review the current state of the Cloudera product line. Details may be found on the Cloudera product comparison page. Examining those details helps, I think, with understanding where Cloudera does and doesn’t place sales and marketing focus, which given Cloudera’s Hadoop market stature is in my opinion an interesting thing to analyze.
So far as I can tell (and there may be some errors in this, as Cloudera is not always accurate in explaining the fine details):
- CDH (Cloudera Distribution … Hadoop) contains a lot of Apache open source code.
- Cloudera has a much longer list of Apache projects that it thinks comprise “Core Hadoop” than, say, Hortonworks does.
- Specifically, that list currently is: Hadoop, Flume, HCatalog, Hive, Hue, Mahout, Oozie, Pig, Sentry, Sqoop, Whirr, ZooKeeper.
- In addition to those projects, CDH also includes HBase, Impala, Spark and Cloudera Search.
- Cloudera Manager is closed-source code, much of which is free to use. (I.e., “free like beer” but not “free like speech”.)
- Cloudera Navigator is closed-source code that you have to pay for (free trials and the like excepted).
- Cloudera Express is Cloudera’s favorite free subscription offering. It combines CDH with the free part of Cloudera Manager. Note: Cloudera Express was previously called Cloudera Standard, and that terminology is still reflected in parts of Cloudera’s website.
- Cloudera Enterprise is the umbrella name for Cloudera’s three favorite paid offerings.
- Cloudera Enterprise Basic Edition contains:
- All the code in CDH and Cloudera Manager, and I guess Accumulo code as well.
- Commercial licenses for all that code.
- A license key to use the entirety of Cloudera Manager, not just the free part.
- Support for the “Core Hadoop” part of CDH.
- Support for Cloudera Manager. Note: Cloudera is lazy about saying this explicitly, but it seems obvious.
- The code for Cloudera Navigator, but that’s moot, as the corresponding license key for Cloudera Navigator is not part of the package.
- Cloudera Enterprise Data Hub Edition contains:
- Everything in Cloudera Basic Edition.
- A license key for Cloudera Navigator.
- Support for all of HBase, Accumulo, Impala, Spark, Cloudera Search and Cloudera Navigator.
- Cloudera Enterprise Flex Edition contains everything in Cloudera Basic Edition, plus support for one of the extras in Data Hub Edition.
In analyzing all this, I’m focused on two particular aspects:
- The “zero, one, many” system for defining the editions of Cloudera Enterprise.
- The use of “Data Hub” as a general marketing term.
|Categories: Cloudera, Data warehousing, Databricks, Spark and BDAS, Hadoop, HBase, Hortonworks, Open source, Pricing||2 Comments|
As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.
DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.
In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.
Buyer inertia is a greater concern.
- A significant minority of enterprises are highly committed to their enterprise DBMS standards.
- Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
- FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.
A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.
- First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
- Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
- Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
- Meanwhile, Hadoop has emerged as serious competitor for at least some analytic data management, especially but not only at internet companies.
Otherwise I’d say: Read more
After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.
- My claim that Spark will replace Hadoop MapReduce got much Twitter attention — including some high-profile endorsements — and also some responses here.
- My MemSQL post led to a vigorous comparison of MemSQL vs. VoltDB.
- My post on hardware and storage spawned a lively discussion of Hadoop hardware pricing; even Cloudera wound up disagreeing with what I reported Cloudera as having said. Sadly, there was less response to the part about the partial (!) end of Moore’s Law.
- My Cloudera/SQL/Impala/Hive apparently was well-balanced, in that it got attacked from multiple sides via Twitter & email. Apparently, I was too hard on Impala, I was too hard on Hive, and I was too hard on boxes full of cardboard file cards as well.
- My post on the Intel/Cloudera deal garnered a comment reminding us Dell had pushed the Intel distro.
- My CitusDB post picked up a few clarifying comments.
Here is a catch-all post to complete the set. Read more
Hortonworks did a business-oriented round of outreach, talking with at least Derrick Harris and me. Notes from my call — for which Rob Bearden didn’t bother showing up — include, in no particular order:
- Hortonworks denies advanced acquisition discussions with either Microsoft and Intel. Of course, that doesn’t exactly contradict the widespread story of Intel having made an acquisition offer. Edit: I have subsequently heard, very credibly, that the denial was untrue.
- As vendors usually do, Hortonworks denies the extreme forms of Cloudera’s suggestion that Hortonworks competitive wins relate to price slashing. But Hortonworks does believe that its license fees often wind up being lower than Cloudera’s, due especially to Hortonworks offering few extra-charge items than Cloudera.
- Hortonworks used a figure of ~75 subscription customers. Edit: That figure turns out in retrospect to have been inflated. This does not include OEM sales through, for example, Teradata, Microsoft Azure, or Rackspace. However, that does include …
- … a small number of installations hosted in the cloud — e.g. ~2 on Amazon Web Services — or otherwise remotely. Also, testing in the cloud seems to be fairly frequent, and the cloud can also be a source of data ingested into Hadoop.
- Since Hortonworks a couple of times made it seem that Rackspace was an important partner, behind only Teradata and Microsoft, I finally asked why. Answers boiled down to a Rackspace Hadoop-as-a-service offering, plus joint work to improve Hadoop-on-OpenStack.
- Other Hortonworks reseller partners seem more important in terms of helping customers consume HDP (Hortonworks Data Platform), rather than for actually doing Hortonworks’ selling for it. (This is unsurprising — channel sales rarely are a path to success for a product that is also appropriately sold by a direct force.)
- Hortonworks listed its major industry sectors as:
- Web and retailing, which it identifies as one thing.
- Health care (various subsectors).
- Financial services, which it called “competitive” in the kind of tone that usually signifies “we lose a lot more than we win, and would love to change that”.
In Hortonworks’ view, Hadoop adopters typically start with a specific use case around a new type of data, such as clickstream, sensor, server log, geolocation, or social. Read more
I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger — but at my request we cycled through a bunch of other topics as well. Company-specific notes include:
- Hortonworks founder J. Eric “Eric14″ Baldeschwieler is no longer at Hortonworks, although I imagine he stays closely in touch. What he’s doing next is unspecified, except by the general phrase “his own thing”. (Derrick Harris has more on Eric’s departure.)
- John Kreisa still is at Hortonworks, just not as marketing VP. Think instead of partnerships and projects.
- ~250 employees.
- ~70-75 subscription customers.
Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:
- 10ish nodes for a typical starting cluster.
- 100ish nodes for a typical “data lake” committed adoption.
- Teradata UDA (Unified Data Architecture)* customers sometimes (typically?) jumping straight to a data lake scenario.
- A few users in the 10s of 1000s of nodes. (Obviously Yahoo is one.)
- HBase used in >50% of installations.
- Hive probably even more than that.
- Hortonworks is seeing a fair amount of interest in Windows Hadoop deployments.
*By the way — Teradata seems serious about pushing the UDA as a core message.
Ecosystem notes, in Hortonworks’ perception, included:
- Cloudera is obviously Hortonworks’ biggest distro competitor. Next is IBM, presumably in its blue-forever installed base. MapR is barely on the radar screen; Pivotal’s likely rise hasn’t yet hit sales reports.
- Hortonworks evidently sees a lot of MicroStrategy and Tableau, and some Platfora and Datameer, the latter two at around the same level of interest.
- Accumulo is a big deal in the Federal government, and has gotten a few health care wins as well. Its success is all about security. (Note: That’s all consistent with what I hear elsewhere.)
I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.
Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:
- It’s been in preview/release candidate/commercial beta mode for weeks.
- Q3 is the goal; H2 is the emphatic goal.
- Yahoo’s been in production with YARN >8 months, and has no MapReduce 1 clusters left. (Yahoo has >35,000 Hadoop nodes.)
- The last months of delays have been mainly about sprucing up various APIs and protocols, which may need to serve for a similar multi-year period as Hadoop 1’s have. But there also was some YARN stabilization into May.
Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and it’s more than a year between Cloudera starting supporting them and when Hortonworks is offering Hadoop 2.0.
Hortonworks’ approach to doing SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include: Read more
My clients at WibiData:
- Think they’re an application software company …
- … but actually are talking about what I call analytic application subsystems.
- Haven’t announced or shipped any of those either …
- … but will shortly.
- Have meanwhile shipped some cool enabling technology.
- Name their products after sushi restaurants.
Yeah, I like these guys.
If you’re building an application that “obviously” calls for a NoSQL database, and which has a strong predictive modeling aspect, then WibiData has thought more cleverly about what you need than most vendors I can think of. More precisely, WibiData has thought cleverly about your data management, movement, crunching, serving, and integration. For pure modeling sophistication, you should look elsewhere — but WibiData will gladly integrate with or execute those models for you.
WibiData’s enabling technology, now called Kiji, is a collection of modules, libraries, and so on — think Spring — running over Hadoop/HBase. Except for some newfound modularity, it is much like what I described at the time of WibiData’s launch or what WibiData further disclosed a few months later. Key aspects include:
- A way to define schemas in HBase, including ones that change as rapidly as consumer-interaction applications require.
- An analytic framework called “Produce/Gather”, which can execute at human real-time speeds (via its own execution engine) or with higher throughput in batch mode (by invoking Hadoop MapReduce).
- Enough load capabilities, Hive interaction, and so on to get data into the proper structure in Kiji in the first place.
|Categories: Hadoop, HBase, NoSQL, Open source, Predictive modeling and advanced analytics, WibiData||4 Comments|
Last week, I edited press releases back-to-back-to-back for three clients, all with announcements at this week’s Percona Live. The ones with embargoes ending today are Tokutek and GenieDB.
Tokutek’s news is that they’re open sourcing much of TokuDB, but holding back hot backup for their paid version. I approve of this strategy — “doesn’t lose data” is an important feature, and well worth paying for.
I kid, I kid. Any system has at least a bad way to do backups — e.g. one that involves slowing performance, or perhaps even requires taking applications offline altogether. So the real points of good backup technology are:
- To keep performance steady.
- To make the whole thing as easy to manage as possible.
GenieDB is announcing a Version 2, which is basically a performance release. So in lieu of pretending to have much article-worthy news, GenieDB is taking the opportunity to remind folks of its core marketing messages, with catchphrases such as “multi-regional self-healing MySQL”. Good choice; indeed, I wish more vendors would adopt that marketing tactic.
Along the way, I did learn a bit more about GenieDB. In particular:
- GenieDB is now just backed by a hacked version of InnoDB (no more Berkeley DB Java Edition).
- Why hacked? Because GenieDB appends a Lamport timestamp to every row, which somehow leads to a need to modify how indexes and caching work.
- Benefits of the chamge include performance and simpler (for the vendor) development.
- An arguable disadvantage of the switch is that GenieDB no longer can use Berkeley DB’s key-value interface — but MySQL now has one of those too.
I also picked up some GenieDB company stats I didn’t know before — 9 employees and 2 paying customers.
|Categories: GenieDB, Market share and customer counts, MySQL, NewSQL, Open source, Tokutek and TokuDB||3 Comments|
From time to time I advise a software vendor on how, whether, or to what extent it should offer its technology in open source. In summary, I believe:
- The formal differences between “open source” and “closed source” strategies are of secondary importance.
- The attitudinal and emotional differences between “open source” and “closed source” approaches can be large.
- A pure closed source strategy can make sense.
- A closed source strategy with important open source aspects can make sense.
- A pure open source strategy will only rarely win.
An “open source software” business model and strategy might include:
- Software given away for free.
- Demand generation to encourage people to use the free version of the software.
- Subscription pricing for additional proprietary software and support.
- Direct sales, and further marketing, to encourage users of the free stuff to upgrade to a paid version.
A “closed source software” business model and strategy might include:
- Demand generation.
- Free-download versions of the software.
- Subscription pricing for software (increasingly common) and support (always).
- Direct sales, and associated marketing.
Those look pretty similar to me.
Of course, there can still be differences between open and closed source. In particular: Read more
1. It boggles my mind that some database technology companies still don’t view compression as a major issue. Compression directly affects storage and bandwidth usage alike — for all kinds of storage (potentially including RAM) and for all kinds of bandwidth (network, I/O, and potentially on-server).
Trading off less-than-maximal compression so as to minimize CPU impact can make sense. Having no compression at all, however, is an admission of defeat.
2. People tend to misjudge Hadoop’s development pace in either of two directions. An overly expansive view is to note that some people working on Hadoop are trying to make it be all things for all people, and to somehow imagine those goals will soon be achieved. An overly narrow view is to note an important missing feature in Hadoop, and think there’s a big business to be made out of offering it alone.
At this point, I’d guess that Cloudera and Hortonworks have 500ish employees combined, many of whom are engineers. That allows for a low double-digit number of 5+ person engineering teams, along with a number of smaller projects. The most urgently needed features are indeed being built. On the other hand, a complete monument to computing will not soon emerge.
3. Schooner’s acquisition by SanDisk has led to the discontinuation of Schooner’s SQL DBMS SchoonerSQL. Schooner’s flash-optimized key-value store Membrain continues. I don’t have details, but the Membrain web page suggests both data store and cache use cases.
4. There’s considerable personnel movement at Boston-area database technology companies right now. Please ping me directly if you care.
Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level:
- Rather than persisting data to disk after every step, as MapReduce does, Spark instead writes to something called RDDs (Resilient Distributed Datasets), which can live in memory.
- Rather than being restricted to maps and reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order. All the primitives are parallel with respect to the RDDs.
- Shark is a lot like Hive, only rewritten (in significant parts) and running over Spark.
- There’s an approach to launching tasks quickly — ~5 milliseconds or so — that I unfortunately didn’t grasp.
The key concept here seems to be the RDD. Any one RDD:
- Is a collection of Java objects, which should have the same or similar structure.
- Can be partitioned/distributed and shuffled/redistributed across the cluster.
- Doesn’t have to be entirely in memory at once.
Otherwise, there’s a lot of flexibility; an RDD can be a set of tuples, a collection of XML documents, or whatever other reasonable kind of dataset you want. And I gather that:
- At the moment, RDDs expire at the end of a job.
- This restriction will be lifted in a future release.
|Categories: Data models and architecture, Databricks, Spark and BDAS, Hadoop, MapReduce, Memory-centric data management, Open source, Parallelization, SQL/Hadoop integration||10 Comments|