I made a remarkably rumpled video appearance yesterday with SiliconAngle honchos John Furrier and Dave Vellante. (Excuses include <3 hours sleep, and then a scrambling reaction to a schedule change.) Topics covered included, with approximate timechecks:
- 0:00 Introductory pabulum, and some technical difficulties
- 2:00 More introduction
- 3:00 Dynamic schemas and data model churn
- 6:00 Surveillance and privacy
- 13:00 Hadoop, especially the distro wars
- 22:00 BI innovation
- 23:30 More on dynamic schemas and data model churn
Edit: Some of my remarks were transcribed.
- I posted on dynamic schemas and data model churn a few days ago.
- I capped off a series on privacy and surveillance a few days ago.
- I commented on various Hadoop distributions in June.
|Categories: Business intelligence, ClearStory Data, Data warehousing, Hadoop, MapR, MapReduce, Surveillance and privacy||Leave a Comment|
I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger — but at my request we cycled through a bunch of other topics as well. Company-specific notes include:
- Hortonworks founder J. Eric “Eric14” Baldeschwieler is no longer at Hortonworks, although I imagine he stays closely in touch. What he’s doing next is unspecified, except by the general phrase “his own thing”. (Derrick Harris has more on Eric’s departure.)
- John Kreisa still is at Hortonworks, just not as marketing VP. Think instead of partnerships and projects.
- ~250 employees.
- ~70-75 subscription customers.
Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:
- 10ish nodes for a typical starting cluster.
- 100ish nodes for a typical “data lake” committed adoption.
- Teradata UDA (Unified Data Architecture)* customers sometimes (typically?) jumping straight to a data lake scenario.
- A few users in the 10s of 1000s of nodes. (Obviously Yahoo is one.)
- HBase used in >50% of installations.
- Hive probably even more than that.
- Hortonworks is seeing a fair amount of interest in Windows Hadoop deployments.
*By the way — Teradata seems serious about pushing the UDA as a core message.
Ecosystem notes, in Hortonworks’ perception, included:
- Cloudera is obviously Hortonworks’ biggest distro competitor. Next is IBM, presumably in its blue-forever installed base. MapR is barely on the radar screen; Pivotal’s likely rise hasn’t yet hit sales reports.
- Hortonworks evidently sees a lot of MicroStrategy and Tableau, and some Platfora and Datameer, the latter two at around the same level of interest.
- Accumulo is a big deal in the Federal government, and has gotten a few health care wins as well. Its success is all about security. (Note: That’s all consistent with what I hear elsewhere.)
I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.
Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:
- It’s been in preview/release candidate/commercial beta mode for weeks.
- Q3 is the goal; H2 is the emphatic goal.
- Yahoo’s been in production with YARN >8 months, and has no MapReduce 1 clusters left. (Yahoo has >35,000 Hadoop nodes.)
- The last months of delays have been mainly about sprucing up various APIs and protocols, which may need to serve for a similar multi-year period as Hadoop 1’s have. But there also was some YARN stabilization into May.
Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and more than a year will have passed between Cloudera starting to support them and Hortonworks offering Hadoop 2.0.
Hortonworks’ approach to doing SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include: Read more
- Cloudera changed CEOs last week. Tom Reilly, late of ArcSight, is the new guy (I don’t know him), while Mike Olson’s titles become Chairman and Chief Strategy Officer. Mike told me Friday that Reilly had secretly been working with him for months.
- Mike shared good-sounding numbers with me. But little is for public disclosure except the stat >400 employees.
- There are always rumors of infighting at Cloudera, perhaps because from its earliest days Cloudera has been a place where tempers are worn on sleeves. That said, Mike denied stories of problems between him and COO Kirk Dunn, and greatly praised Kirk’s successes at large-account sales.
- Cloudera now self-identifies pretty clearly as an analytic data management company. The vision is multiple execution engines – MapReduce, Impala, something more memory-centric, etc. – talking to any of a variety of HDFS file formats. While some formats may be optimized for specific engines – e.g. Parquet for Impala – anything can work with more or less anything.*
- Mike told me that Cloudera didn’t have any YARN users in production, but thought there would be some by year-end. Even so, he thinks it’s fair to say that Cloudera users have substantial portions of Hadoop 2 in production, for example NameNode failover and HDFS (Hadoop Distributed File System) performance enhancements. Ditto HCatalog.
*Of course, there will always be exceptions. E.g., some formats can be updated on a short-request basis, while others can only be written to via batch conversions.
- There’s a widespread belief that Hortonworks is being shopped. Numerous folks, including me, believe the rumor of an Intel offer for $700 million. Higher figures and alternate buyers aren’t as widely believed.
- Views of MapR market traction, never high, are again on the downswing.
- IBM Big Insights seems to have some traction.
- In case there was any remaining doubt — DBMS vendors are pretty unanimous in agreeing that it makes sense to have Hadoop too. To my knowledge SAP hasn’t been as clear about showing a markitecture incorporating Hadoop as most of the others have … but then, SAP’s markitecture is generally less clear than other vendors’.
- Folks I talk with are generally wondering where and why Datameer lost its way. That still leaves Datameer ahead of other first-generation Hadoop add-on vendors (Karmasphere, Zettaset, et al.), in that I rarely hear them mentioned at all.
- I visited with my client Platfora. Things seem to be going very well.
- My former client Revelytix seems to have racked up some nice partnerships. (I had something to do with that.)
|Categories: Cloudera, Data warehousing, Datameer, Hadoop, Hortonworks, IBM and DB2, Intel, MapR, Market share and customer counts, Platfora, SAP AG, Zettaset||11 Comments|
One elephant went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.
Two elephants went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.
Three elephants went out to play
– Popular children’s song
It’s Strata week, with much Hadoop news, some of which I’ve been briefed on and some of which I haven’t. Rather than delve into fine competitive details, let’s step back and consider some generalities. First, about Hadoop distributions and distro providers:
- Conceptually, the starting point for a “Hadoop distribution” is some version of Apache Hadoop.
  - Hortonworks is still focused on Hadoop 1 (without YARN and so on), because that’s what’s regarded as production-ready. But Hortonworks does like HCatalog.
  - Cloudera straddles Hadoop 1 and Hadoop 2, shipping aspects of Hadoop 2 but not recommending them for production use.
  - Some of the newer distros seem to be based on Hadoop 2, if the markitecture slides are to be believed.
- Optionally, the version numbers of different parts of Hadoop in a distribution could be a little mismatched, if the distro provider takes responsibility for testing them together.
  - Cloudera seems more willing to do that than Hortonworks.
- Different distro providers may choose different sets of Apache Hadoop subprojects to include.
  - Cloudera seems particularly expansive in what it is apt to include. Perhaps not coincidentally, Cloudera folks started various Hadoop subprojects.
- Optionally, distro providers’ additional proprietary code can be included, to be used either in addition to or instead of Apache Hadoop code. (In the latter case, marketing can then ensue about whether this is REALLY a Hadoop distribution.)
  - Hortonworks markets from a “more open source than thou” stance, even though:
    - It is not a purist in that regard.
    - That marketing message is often communicated by Hortonworks’ very closed-source partners.
  - Several distro providers, notably Cloudera, offer management suites as a big part of their proprietary value-add. Hortonworks, however, is focused on making open-source Ambari into a competitive management tool.
  - Performance is another big area for proprietary code, especially from vendors who look at HDFS (Hadoop Distributed File System) and believe they can improve on it.
  - I conjecture packaging/installation code is often proprietary, but that’s a minor issue that doesn’t get mentioned much.
- Optionally, third parties’ code can be provided, open or closed source as the case may be.
Most of the same observations could apply to Hadoop appliance vendors.
|Categories: Cloudera, Data warehouse appliances, EMC, Greenplum, Hadoop, Hortonworks, IBM and DB2, Intel, MapR, Market share and customer counts||3 Comments|
With Strata/Hadoop World being next week, there is much Hadoop discussion. One theme of the season is BI over Hadoop. I have at least 5 clients claiming they’re uniquely positioned to support that (most of whom partner with a 6th client, Tableau); the first 2 whose offerings I’ve actually written about are Teradata Aster and Hadapt. More generally, I’m hearing “Using Hadoop is hard; we’re here to make it easier for you.”
If enterprises aren’t yet happily running business intelligence against Hadoop, what are they doing with it instead? I took the opportunity to ask Cloudera, whose answers didn’t contradict anything I’m hearing elsewhere. As Cloudera tells it (approximately — this part of the conversation* was rushed): Read more
|Categories: Business intelligence, Cloudera, EAI, EII, ETL, ELT, ETLT, Hadoop, HBase, Health care, Investment research and trading, MapR, Market share and customer counts, Telecommunications, Web analytics||4 Comments|
I talked with Cloudera yesterday about an unannounced technology, and took the opportunity to ask some non-embargoed questions as well. In particular, I requested an update to what I wrote last year about typical Hadoop hardware.
Cloudera thinks the picture now is:
- 2-socket servers, with 4- or 6-core chips.
- Increasing number of spindles, with 12 2-TB spindles being common.
- 48 gigs of RAM is most common, with 64-96 fairly frequent.
- A couple of 1GigE networking ports.
Discussion around that included:
- Enterprises had been running out of storage space; hence the increased amount of storage.
- Even more storage can be stuffed on a node, and at times is. But at a certain point there’s so much data on a node that recovery from node failure is too forbidding.
- There are some experiments with 10 GigE.
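To make the storage and node-recovery points above a bit more concrete, here is a rough back-of-envelope sketch. It is not from Cloudera. The 3x replication factor is the HDFS default; the function names, the assumption that re-replication traffic spreads evenly across surviving nodes, and the guess that each node can devote half of one 1GigE port to it are all mine, purely for illustration.

```python
# Back-of-envelope numbers for the node profile described above.
# Assumptions (NOT from the post): 3x HDFS replication (the HDFS default),
# re-replication load spread evenly over surviving nodes, and each node
# giving a fixed share of one 1GigE link to recovery traffic.

def raw_tb_per_node(spindles=12, tb_per_spindle=2.0):
    """Raw disk capacity of one node, in TB."""
    return spindles * tb_per_spindle


def usable_tb(nodes, replication=3, spindles=12, tb_per_spindle=2.0):
    """Cluster capacity after dividing raw space by the replication factor."""
    return nodes * raw_tb_per_node(spindles, tb_per_spindle) / replication


def rereplication_hours(node_tb, surviving_nodes, link_gbit=1.0, link_share=0.5):
    """Hours to re-copy one failed node's data, worst case (node was full)."""
    bytes_per_sec_per_node = link_gbit * 1e9 / 8 * link_share
    aggregate_bytes_per_sec = bytes_per_sec_per_node * surviving_nodes
    return node_tb * 1e12 / aggregate_bytes_per_sec / 3600


if __name__ == "__main__":
    node_tb = raw_tb_per_node()
    print("raw TB per node:", node_tb)
    print("usable TB, 10 nodes:", usable_tb(10))
    for survivors in (9, 99):
        hours = rereplication_hours(node_tb, survivors)
        print(f"re-replication with {survivors} survivors: {hours:.1f} hours")
```

Under these assumptions, doubling the storage on a node doubles the recovery time at a fixed cluster size and network speed, which is one way to read the "too forbidding" remark and the interest in 10 GigE.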
|Categories: Cloudera, Data warehouse appliances, Hadoop, MapR, Solid-state memory, Storage||7 Comments|
My clients at Hadapt are coming out with a Version 2 to be available in Q1 2013, and perhaps slipstreaming some of the features before then. At that point, it will be reasonable to regard Hadapt as offering:
- A very tight integration between an RDBMS-based analytic platform and Hadoop …
- … that is decidedly immature as an analytic RDBMS …
- … but which strongly improves the SQL capabilities of Hadoop (vs., say, the alternative of using Hive).
Solr is in the mix as well.
Hadapt+Hadoop is positioned much more as “better than Hadoop” than “a better scale-out RDBMS”, and rightly so, due to its limitations when viewed strictly from an analytic RDBMS standpoint. I.e., Hadapt is meant for enterprises that want to do several of:
- Dump multi-structured data into Hadoop.
- Refine or just move some of it into an RDBMS.
- Bring in data from other RDBMS.
- Process all of the above via Hadoop MapReduce.
- Process all of the above via SQL.
- Use full-text indexes on the data.
Hadapt has 6 or so production customers, a dozen or so more coming online soon, 35 or so employees (mainly in Cambridge or Poland), reasonable amounts of venture capital, and the involvement of a variety of industry luminaries. Hadapt’s biggest installation seems to have 10s of terabytes of relational data and 100s of TBs of multi-structured; Hadapt is very confident in its ability to scale an order of magnitude beyond that with the Version 2 product, and reasonably confident it could go even further.
At the highest level, Hadapt works like this: Read more
|Categories: Analytic technologies, Cloudera, Columnar database management, Data models and architecture, Data warehousing, Hadapt, Hadoop, MapR, MapReduce, Market share and customer counts, SQL/Hadoop integration, Text||4 Comments|
This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production.
- In general, how “enterprise-ready” is Hadoop (this post)?
- HBase 0.92.
The posts depend on each other in various ways.
Cloudera, Hortonworks, and MapR all claim, in effect, “Our version of Hadoop is enterprise-ready, unlike those other guys’.” I’m dubious.
- Hortonworks has considerably fewer features than Cloudera, along with less of a production or support track record. (Edit: HCatalog may be a significant exception.)
- I doubt Cloudera really believes or can support the apparent claim in its CDH 4 press release that Hadoop is now suitable for every enterprise, whereas last month it wasn’t.
- While MapR was early with some nice enterprise features, such as high availability or certain management UI elements — quickly imitated in Cloudera Enterprise — I don’t think it has any special status as “enterprise-ready” either.
That said, “enterprise-ready Hadoop” really is an important topic.
So what does it mean for something to be “enterprise-ready”, in whole or in part? Common themes in distinguishing between “enterprise-class” and other software include:
- Usable by our existing staff.
- Sufficiently feature-rich.
- Integrates well with the rest of our environment.
- Fits well into our purchasing and vendor relations model.
- Sufficiently reliable, proven, and secure — which is to say, “safe”.
For Hadoop, as for most things, these concepts overlap in many ways. Read more
|Categories: Buying processes, Cloudera, Clustering, Hadoop, HBase, Hortonworks, MapR, MapReduce, Open source||7 Comments|
I visited my clients at Cloudera and Hortonworks last week, along with scads of other companies. A few of the takeaways were:
- Cloudera now has 220 employees.
- Cloudera now has over 100 subscription customers.
- Over the past year, Cloudera has more than doubled in size by every reasonable metric.
- Over half of Cloudera’s customers use HBase, vs. a figure of 18+ last July.
- Omer Trajman — who by the way has made a long-overdue official move into technical marketing — can no longer keep count of how many petabyte-scale Hadoop clusters Cloudera supports.
- Cloudera gets the majority of its revenue from subscriptions. However, professional services and training continue to be big businesses too.
- Cloudera has trained over 12,000 people.
- Hortonworks is training people too.
- Hortonworks now has 70 employees, and plans to have 100 or so by the end of this quarter.
- A number of those Hortonworks employees are executives who come from seriously profit-oriented backgrounds. Hortonworks clearly has capitalist intentions.
- Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
- There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.
- I met with 4 young-company clients who I regard as building vertical analytic stacks (WibiData, MarketShare, MetaMarkets, and ClearStory). All 4 are heavily dependent on Hadoop. (The same isn’t as true of older companies who built out a lot of technology before Hadoop was invented.)
- There should be more HBase information at HBaseCon on May 22.
- If MapR still has momentum, nobody I talked with has noticed.
|Categories: Amazon and its cloud, ClearStory Data, Cloud computing, Cloudera, Hadoop, HBase, Hortonworks, MapR, MapReduce, Market share and customer counts, Petabyte-scale data management, WibiData||1 Comment|
Forrester has released its Q1 2012 Forrester Wave: Enterprise Hadoop Solutions. (Googling turns up a direct link, but in case that doesn’t prove stable, here also is a registration-required link from IBM’s Conor O’Mahony.) My comments include:
- The Forrester Wave’s relative vendor rankings are meaningless, in that the document compares apples, peaches, almonds, and peanuts. Apparently, it covers any vendor that incorporates a distribution of Apache Hadoop MapReduce into something it offers, and that offered at least two (not necessarily full production) references for same.
- The Forrester Wave for “enterprise Hadoop” contradicts itself on the subject of Hortonworks.
- The Forrester Wave for “enterprise Hadoop” is correct when it says “Hortonworks … has Hadoop training and professional services offerings that are still embryonic.”
- Peculiarly, the Forrester Wave for “enterprise Hadoop” also says “Hortonworks offers an impressive Hadoop professional services portfolio”. Hortonworks will likely win one or more nice partnership deals with vendors in adjacent fields, but even so its professional services capabilities are … well, a good word might be “embryonic”.
- Forrester Waves always seem to have weird implicit definitions of “data warehousing”. This one is no exception.
- Forrester gave top marks in “Functionality” to 11 of 13 “enterprise Hadoop” vendors. This seems odd.
- I don’t know why MapR, which doesn’t like HDFS (Hadoop Distributed File System), got top marks in “Subproject integration”.
- Forrester gave top marks in “Storage” to Datameer. It also gave higher marks to MapR than to EMC Greenplum, even though EMC Greenplum’s technology is a superset of MapR’s. Very strange. (Edit: Actually, as per a comment below, there is some uncertainty about the EMC/MapR relationship.)
- Forrester gave higher marks in “Acceleration and optimization” to Hortonworks than to Cloudera and IBM, and higher marks yet to Pentaho. Very odd.
- I’m not sure what Forrester is calling a “Distributed EDW file store connector”, but it sounds like something that Cloudera has provided via partnership to a number of analytic DBMS vendors.
- Forrester’s “Strategy” rankings seem to correlate to a metric of “We’re a large enough vendor to go in N directions at once”, for various values of N.
- Forrester is correct to rank Cloudera’s “Adoption” as being stronger than EMC/Greenplum’s or MapR’s. But Hortonworks’ strong mark for “Adoption” baffles me.
|Categories: Cloudera, Data warehousing, EMC, Greenplum, Hadoop, Hortonworks, MapR, MapReduce, Pentaho||11 Comments|