- Continuuity toured in 2012 and touted its “app server for Hadoop” technology.
- Continuuity recently changed its name to Cask and went open source.
- Cask’s product is now called CDAP (Cask Data Application Platform). It’s still basically an app server for Hadoop and other “big data” — ouch do I hate that phrase — data stores.
- Cask and Cloudera partnered.
- I got a more technical Cask briefing this week.
- App servers are a notoriously amorphous technology. The focus of how they’re used can change greatly every couple of years.
- Partly for that reason, I was unimpressed by Continuuity’s original hype-filled positioning.
So far as I can tell:
- Cask’s current focus is to orchestrate job flows, with lots of data mappings.
- This is supposed to provide lots of developer benefits, for fairly obvious reasons. Those are pitched in terms of an integration story, more in a “free you from the mess of a many-part stack” sense than strictly in terms of data integration.
- CDAP already has a GUI to monitor what’s going on. A GUI to specify workflows is coming very soon.
- CDAP doesn’t consume a lot of cycles itself, and hence isn’t a real risk for unpleasant overhead, if “overhead” is narrowly defined. Rather, performance drags could come from …
- … sub-optimal choices in data mapping, database design or workflow composition.
Hortonworks, IBM, EMC Pivotal and others have announced a project called “Open Data Platform” to do … well, I’m not exactly sure what. Mainly, it sounds like:
- An attempt to minimize the importance of any technical advantages Cloudera or MapR might have.
- A face-saving way to admit that IBM’s and Pivotal’s insistence on having their own Hadoop distributions has been silly.
- An excuse for press releases.
- A source of an extra logo graphic to put on marketing slides.
Edit: Now there’s a press report saying explicitly that Hortonworks is taking over Pivotal’s Hadoop distro customers (which basically would mean taking over the support contracts and then working to migrate them to Hortonworks’ distro).
The claim is being made that this announcement solves some kind of problem about developing to multiple versions of the Hadoop platform, but to my knowledge that’s a problem rarely encountered in real life. When you already have a multi-enterprise open source community agreeing on APIs (Application Programming interfaces), what API inconsistency remains for a vendor consortium to painstakingly resolve?
Anyhow, it now seems clear that if you want to use a Hadoop distribution, there are three main choices:
- Cloudera’s flavor, whether as software (from Cloudera) or in an appliance (e.g. from Oracle).
- MapR’s flavor, as software from MapR.
- Hortonworks’ flavor, from a number of vendors, including Hortonworks, IBM, Pivotal, Teradata et al.
In saying that, I’m glossing over a few points, such as: Read more
|Categories: Amazon and its cloud, Cloudera, EMC, Emulation, transparency, portability, Greenplum, Hadoop, Hortonworks, IBM and DB2, MapR, Open source||11 Comments|
I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:
1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:
- Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
- Replicating an entire database into different geographies, often for reasons of latency and/or availability/ disaster recovery,
The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.
What made me think of this was a phone call with MongoDB in which I learned that the limit on number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.
2. Three years ago I posted about agile (predictive) analytics. One of the points was:
… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.
Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.
3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with: Read more
- Cloudera continued to improve various aspects of its product line, especially Impala with a Version 2.0. Good for them. One should always be making one’s products better.
- Cloudera announced a variety of partnerships with companies one would think are opposed to it. Not all are Barney. I’m now hard-pressed to think of any sustainable-looking relationship advantage Hortonworks has left in the Unix/Linux world. (However, I haven’t heard a peep about any kind of Cloudera/Microsoft/Windows collaboration.)
- Cloudera is getting more cloud-friendly, via a new product — Cloudera Director. Probably there are or will be some cloud-services partnerships as well.
Notes on Cloudera Director start:
- It’s closed-source.
- Code and support are included in any version of Cloudera Enterprise.
- It’s a management tool. Indeed, Cloudera characterized it to me as a sort of manager of Cloudera Managers.
What I have not heard is any answer for the traditional performance challenge of Hadoop-in-the-cloud, which is:
- Hadoop, like most analytic RDBMS, tightly couples processing and storage in a shared-nothing way.
- Standard cloud architectures, however, decouple them, thus mooting a considerable fraction of Hadoop performance engineering.
Maybe that problem isn’t — or is no longer — as big a deal as I’ve been told.
Hadoop World/Strata is this week, so of course my clients at Cloudera will have a bunch of announcements. Without front-running those, I think it might be interesting to review the current state of the Cloudera product line. Details may be found on the Cloudera product comparison page. Examining those details helps, I think, with understanding where Cloudera does and doesn’t place sales and marketing focus, which given Cloudera’s Hadoop market stature is in my opinion an interesting thing to analyze.
So far as I can tell (and there may be some errors in this, as Cloudera is not always accurate in explaining the fine details):
- CDH (Cloudera Distribution … Hadoop) contains a lot of Apache open source code.
- Cloudera has a much longer list of Apache projects that it thinks comprise “Core Hadoop” than, say, Hortonworks does.
- Specifically, that list currently is: Hadoop, Flume, HCatalog, Hive, Hue, Mahout, Oozie, Pig, Sentry, Sqoop, Whirr, ZooKeeper.
- In addition to those projects, CDH also includes HBase, Impala, Spark and Cloudera Search.
- Cloudera Manager is closed-source code, much of which is free to use. (I.e., “free like beer” but not “free like speech”.)
- Cloudera Navigator is closed-source code that you have to pay for (free trials and the like excepted).
- Cloudera Express is Cloudera’s favorite free subscription offering. It combines CDH with the free part of Cloudera Manager. Note: Cloudera Express was previously called Cloudera Standard, and that terminology is still reflected in parts of Cloudera’s website.
- Cloudera Enterprise is the umbrella name for Cloudera’s three favorite paid offerings.
- Cloudera Enterprise Basic Edition contains:
- All the code in CDH and Cloudera Manager, and I guess Accumulo code as well.
- Commercial licenses for all that code.
- A license key to use the entirety of Cloudera Manager, not just the free part.
- Support for the “Core Hadoop” part of CDH.
- Support for Cloudera Manager. Note: Cloudera is lazy about saying this explicitly, but it seems obvious.
- The code for Cloudera Navigator, but that’s moot, as the corresponding license key for Cloudera Navigator is not part of the package.
- Cloudera Enterprise Data Hub Edition contains:
- Everything in Cloudera Basic Edition.
- A license key for Cloudera Navigator.
- Support for all of HBase, Accumulo, Impala, Spark, Cloudera Search and Cloudera Navigator.
- Cloudera Enterprise Flex Edition contains everything in Cloudera Basic Edition, plus support for one of the extras in Data Hub Edition.
In analyzing all this, I’m focused on two particular aspects:
- The “zero, one, many” system for defining the editions of Cloudera Enterprise.
- The use of “Data Hub” as a general marketing term.
|Categories: Cloudera, Data warehousing, Databricks, Spark and BDAS, Hadoop, HBase, Hortonworks, Open source, Pricing||2 Comments|
The genesis of this post is that:
- Hortonworks is trying to revitalize the Apache Storm project, after Storm lost momentum; indeed, Hortonworks is referring to Storm as a component of Hadoop.
- Cloudera is talking up what I would call its human real-time strategy, which includes but is not limited to Flume, Kafka, and Spark Streaming. Cloudera also sees a few use cases for Storm.
- This all fits with my view that the Current Hot Subject is human real-time data freshness — for analytics, of course, since we’ve always had low latencies in short-request processing.
- This also all fits with the importance I place on log analysis.
- Cloudera reached out to talk to me about all this.
Of course, we should hardly assume that what the Hadoop distro vendors favor will be the be-all and end-all of streaming. But they are likely to at least be influential players in the area.
In the parts of the problem that Cloudera emphasizes, the main tasks that need to be addressed are: Read more
|Categories: Cloudera, Complex event processing (CEP), Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Health care, Hortonworks, Kafka and Confluent, Log analysis, Specific users, Splunk, Web analytics||9 Comments|
I’ve talked with many companies recently that believe they are:
- Focused on building a great data management and analytic stack for log management …
- … unlike all the other companies that might be saying the same thing …
- … and certainly unlike expensive, poorly-scalable Splunk …
- … and also unlike less-focused vendors of analytic RDBMS (which are also expensive) and/or Hadoop distributions.
At best, I think such competitive claims are overwrought. Still, it’s a genuinely important subject and opportunity, so let’s consider what a great log management and analysis system might look like.
Much of this discussion could apply to machine-generated data in general. But right now I think more players are doing product management with an explicit conception either of log management or event-series analytics, so for this post I’ll share that focus too.
A short answer might be “Splunk, but with more analytic functionality and more scalable performance, at lower cost, plus numerous coupons for free pizza.” A more constructive and bottoms-up approach might start with: Read more
My client Teradata bought my (former) clients Revelytix and Hadapt.* Obviously, I’m in confidentiality up to my eyeballs. That said — Teradata truly doesn’t know what it’s going to do with those acquisitions yet. Indeed, the acquisitions are too new for Teradata to have fully reviewed the code and so on, let alone made strategic decisions informed by that review. So while this is just a guess, I conjecture Teradata won’t say anything concrete until at least September, although I do expect some kind of stated direction in time for its October user conference.
*I love my business, but it does have one distressing aspect, namely the combination of subscription pricing and customer churn. When your customers transform really quickly, or even go out of existence, so sometimes does their reliance on you.
I’ve written extensively about Hadapt, but to review:
- The HadoopDB project was started by Dan Abadi and two grad students.
- HadoopDB tied a bunch of PostgreSQL instances together with Hadoop MapReduce. Lab benchmarks suggested it was more performant than the coyly named DBx (where x=2), but not necessarily competitive with top analytic RDBMS.
- Hadapt was formed to commercialize HadoopDB.
- After some fits and starts, Hadapt was a Cambridge-based company. Former Vertica CEO Chris Lynch invested even before he was a VC, and became an active chairman. Not coincidentally, Hadapt had a bunch of Vertica folks.
- Hadapt decided to stick with row-based PostgreSQL, Dan Abadi’s previous columnar enthusiasm notwithstanding. Not coincidentally, Hadapt’s performance never blew anyone away.
- Especially after the announcement of Cloudera Impala, Hadapt’s SQL-on-Hadoop positioning didn’t work out. Indeed, Hadapt laid off most or all of its sales and marketing folks. Hadapt pivoted to emphasize its schema-on-need story.
- Chris Lynch, who generally seems to think that IT vendors are created to be sold, shopped Hadapt aggressively.
As for what Teradata should do with Hadapt: Read more
|Categories: Aster Data, Citus Data, Cloudera, Columnar database management, Data warehousing, Hadapt, Hadoop, MapReduce, Oracle, SQL/Hadoop integration, Teradata||7 Comments|
After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.
- My claim that Spark will replace Hadoop MapReduce got much Twitter attention — including some high-profile endorsements — and also some responses here.
- My MemSQL post led to a vigorous comparison of MemSQL vs. VoltDB.
- My post on hardware and storage spawned a lively discussion of Hadoop hardware pricing; even Cloudera wound up disagreeing with what I reported Cloudera as having said. Sadly, there was less response to the part about the partial (!) end of Moore’s Law.
- My Cloudera/SQL/Impala/Hive apparently was well-balanced, in that it got attacked from multiple sides via Twitter & email. Apparently, I was too hard on Impala, I was too hard on Hive, and I was too hard on boxes full of cardboard file cards as well.
- My post on the Intel/Cloudera deal garnered a comment reminding us Dell had pushed the Intel distro.
- My CitusDB post picked up a few clarifying comments.
Here is a catch-all post to complete the set. Read more
My California trip last week focused mainly on software — duh! — but I had some interesting hardware/storage/architecture discussions as well, especially in the areas of:
- Rack- or data-center-scale systems.
- The real or imagined demise of Moore’s Law.
I also got updated as to typical Hadoop hardware.
If systems are designed at the whole-rack level or higher, then there can be much more flexibility and efficiency in terms of mixing and connecting CPU, RAM and storage. The Google/Facebook/Amazon cool kids are widely understood to be following this approach, so others are naturally considering it as well. My most interesting of several mentions of that point was when I got the chance to talk with Berkeley computer architecture guru Dave Patterson, who’s working on plans for 100-petabyte/terabit-networking kinds of systems, for usage after 2020 or so. (If you’re interested, you might want to contact him; I’m sure he’d love more commercial sponsorship.)
One of Dave’s design assumptions is that Moore’s Law really will end soon (or at least greatly slow down), if by Moore’s Law you mean that every 18 months or so one can get twice as many transistors onto a chip of the same area and cost than one could before. However, while he thinks that applies to CPU and RAM, Dave thinks flash is an exception. I gathered that he thinks the power/heat reasons for Moore’s Law to end will be much harder to defeat than the other ones; note that flash, because of what it’s used for, has vastly less power running through it than CPU or RAM do.
|Categories: Amazon and its cloud, Buying processes, Cloudera, Facebook, Google, Intel, Memory-centric data management, Pricing, Solid-state memory||19 Comments|