Discussion of open source MapReduce implementation Hadoop. Related subjects include:
I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:
1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:
- Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
- Replicating an entire database into different geographies, often for reasons of latency and/or availability/ disaster recovery,
The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.
What made me think of this was a phone call with MongoDB in which I learned that the limit on number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.
2. Three years ago I posted about agile (predictive) analytics. One of the points was:
… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.
Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.
3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with: Read more
I commonly write about real or apparent technical differentiation, in a broad variety of domains. But actually, computers only do a couple of kinds of things:
- Accept instructions.
- Execute them.
And hence almost all IT product differentiation fits into two buckets:
- Easier instruction-giving, whether that’s in the form of a user interface, a language, or an API.
- Better execution, where “better” usually boils down to “faster”, “more reliable” or “more reliably fast”.
As examples of this reductionism, please consider:
- Application development is of course a matter of giving instructions to a computer.
- Database management systems accept and execute data manipulation instructions.
- Data integration tools accept and execute data integration instructions.
- System management software accepts and executes system management instructions.
- Business intelligence tools accept and execute instructions for data retrieval, navigation, aggregation and display.
Similar stories are true about application software, or about anything that has an API (Application Programming Interface) or SDK (Software Development Kit).
Yes, all my examples are in software. That’s what I focus on. If I wanted to be more balanced in including hardware or data centers, I might phrase the discussion a little differently — but the core points would still remain true.
What I’ve said so far should make more sense if we combine it with the observation that differentiation is usually restricted to particular domains. Read more
Datameer checked in, having recently announced general availability of Datameer 5.0. So far as I understood, Datameer is still clearly in the investigative analytics business, in that:
- Datameer does business intelligence, but not at human real-time speeds. Datameer query durations are sometimes sub-minute, but surely not sub-second.
- Datameer also does lightweight predictive analytics/machine learning — k-means clustering, decision trees, and so on.
Key aspects include:
- Datameer runs straight against Hadoop.
- Like many other analytic offerings, Datameer is meant to be “self-service”, for line-of-business business analysts, and includes some “data preparation”. Datameer also has had some data profiling since Datameer 4.0.
- The main way of interacting with Datameer seems to be visual analytic programming. However, I Datameer has evolved somewhat away from its original spreadsheet metaphor.
- Datameer’s primitives resemble those you’d find in SQL (e.g. JOINs, GROUPBYs). More precisely, that would be SQL with a sessionization extension; e.g., there’s a function called GROUPBYGAP.
- Datameer lets you write derived data back into Hadoop.
|Categories: Business intelligence, Databricks, Spark and BDAS, Datameer, Hadoop, Log analysis, Market share and customer counts, Predictive modeling and advanced analytics, Web analytics||5 Comments|
- Cloudera continued to improve various aspects of its product line, especially Impala with a Version 2.0. Good for them. One should always be making one’s products better.
- Cloudera announced a variety of partnerships with companies one would think are opposed to it. Not all are Barney. I’m now hard-pressed to think of any sustainable-looking relationship advantage Hortonworks has left in the Unix/Linux world. (However, I haven’t heard a peep about any kind of Cloudera/Microsoft/Windows collaboration.)
- Cloudera is getting more cloud-friendly, via a new product — Cloudera Director. Probably there are or will be some cloud-services partnerships as well.
Notes on Cloudera Director start:
- It’s closed-source.
- Code and support are included in any version of Cloudera Enterprise.
- It’s a management tool. Indeed, Cloudera characterized it to me as a sort of manager of Cloudera Managers.
What I have not heard is any answer for the traditional performance challenge of Hadoop-in-the-cloud, which is:
- Hadoop, like most analytic RDBMS, tightly couples processing and storage in a shared-nothing way.
- Standard cloud architectures, however, decouple them, thus mooting a considerable fraction of Hadoop performance engineering.
Maybe that problem isn’t — or is no longer — as big a deal as I’ve been told.
Hadoop World/Strata is this week, so of course my clients at Cloudera will have a bunch of announcements. Without front-running those, I think it might be interesting to review the current state of the Cloudera product line. Details may be found on the Cloudera product comparison page. Examining those details helps, I think, with understanding where Cloudera does and doesn’t place sales and marketing focus, which given Cloudera’s Hadoop market stature is in my opinion an interesting thing to analyze.
So far as I can tell (and there may be some errors in this, as Cloudera is not always accurate in explaining the fine details):
- CDH (Cloudera Distribution … Hadoop) contains a lot of Apache open source code.
- Cloudera has a much longer list of Apache projects that it thinks comprise “Core Hadoop” than, say, Hortonworks does.
- Specifically, that list currently is: Hadoop, Flume, HCatalog, Hive, Hue, Mahout, Oozie, Pig, Sentry, Sqoop, Whirr, ZooKeeper.
- In addition to those projects, CDH also includes HBase, Impala, Spark and Cloudera Search.
- Cloudera Manager is closed-source code, much of which is free to use. (I.e., “free like beer” but not “free like speech”.)
- Cloudera Navigator is closed-source code that you have to pay for (free trials and the like excepted).
- Cloudera Express is Cloudera’s favorite free subscription offering. It combines CDH with the free part of Cloudera Manager. Note: Cloudera Express was previously called Cloudera Standard, and that terminology is still reflected in parts of Cloudera’s website.
- Cloudera Enterprise is the umbrella name for Cloudera’s three favorite paid offerings.
- Cloudera Enterprise Basic Edition contains:
- All the code in CDH and Cloudera Manager, and I guess Accumulo code as well.
- Commercial licenses for all that code.
- A license key to use the entirety of Cloudera Manager, not just the free part.
- Support for the “Core Hadoop” part of CDH.
- Support for Cloudera Manager. Note: Cloudera is lazy about saying this explicitly, but it seems obvious.
- The code for Cloudera Navigator, but that’s moot, as the corresponding license key for Cloudera Navigator is not part of the package.
- Cloudera Enterprise Data Hub Edition contains:
- Everything in Cloudera Basic Edition.
- A license key for Cloudera Navigator.
- Support for all of HBase, Accumulo, Impala, Spark, Cloudera Search and Cloudera Navigator.
- Cloudera Enterprise Flex Edition contains everything in Cloudera Basic Edition, plus support for one of the extras in Data Hub Edition.
In analyzing all this, I’m focused on two particular aspects:
- The “zero, one, many” system for defining the editions of Cloudera Enterprise.
- The use of “Data Hub” as a general marketing term.
|Categories: Cloudera, Data warehousing, Databricks, Spark and BDAS, Hadoop, HBase, Hortonworks, Open source, Pricing||1 Comment|
I’m on record as noting and agreeing with an industry near-consensus that Spark, rather than Tez, will be the replacement for Hadoop MapReduce. I presumed that Hortonworks, which is pushing Tez, disagreed. But Shaun Connolly of Hortonworks suggested a more nuanced view. Specifically, Shaun tweeted thoughts including:
Tez vs Spark = Apples vs Oranges.
Spark is general-purpose engine with elegant APIs for app devs creating modern data-driven apps, analytics, and ML algos.
Tez is a framework for expressing purpose-built YARN-based DAGs; its APIs are for ISVs & engine/tool builders who embed it
[For example], Hive embeds Tez to convert its SQL needs into purpose-built DAGs expressed optimally and leveraging YARN
That said, I haven’t yet had a chance to understand what advantages Tez might have over Spark in the use cases that Shaun relegates it to.
- The Twitter discussion with Shaun was a spin-out from my research around streaming for Hadoop.
|Categories: Data warehousing, Databricks, Spark and BDAS, Hadoop, Hortonworks, Predictive modeling and advanced analytics||6 Comments|
The genesis of this post is that:
- Hortonworks is trying to revitalize the Apache Storm project, after Storm lost momentum; indeed, Hortonworks is referring to Storm as a component of Hadoop.
- Cloudera is talking up what I would call its human real-time strategy, which includes but is not limited to Flume, Kafka, and Spark Streaming. Cloudera also sees a few use cases for Storm.
- This all fits with my view that the Current Hot Subject is human real-time data freshness — for analytics, of course, since we’ve always had low latencies in short-request processing.
- This also all fits with the importance I place on log analysis.
- Cloudera reached out to talk to me about all this.
Of course, we should hardly assume that what the Hadoop distro vendors favor will be the be-all and end-all of streaming. But they are likely to at least be influential players in the area.
In the parts of the problem that Cloudera emphasizes, the main tasks that need to be addressed are: Read more
|Categories: Cloudera, Complex event processing (CEP), Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Health care, Hortonworks, Log analysis, Specific users, Splunk, Web analytics||4 Comments|
1. I wish I had some good, practical ideas about how to make a political difference around privacy and surveillance. Nothing else we discuss here is remotely as important. I presumably can contribute an opinion piece to, more or less, the technology publication(s) of my choice; that can have a small bit of impact. But I’d love to do better than that. Ideas, anybody?
2. A few thoughts on cloud, colocation, etc.:
- The economies of scale of colocation-or-cloud over operating your own data center are compelling. Most of the reasons you outsource hardware manufacture to Asia also apply to outsourcing data center operation within the United States. (The one exception I can think of is supply chain.)
- The arguments for cloud specifically over colocation are less persuasive. Colo providers can even match cloud deployments in rapid provisioning and elastic pricing, if they so choose.
- Surely not coincidentally, I am told that Rackspace is deemphasizing cloud, reemphasizing colocation, and making a big deal out of Open Compute. In connection with that, Rackspace has pulled back from its leadership role in OpenStack.
- I’m hearing much more mention of Amazon Redshift than I used to. It seems to have a lot of traction as a simple and low-cost option.
- I’m hearing less about Elastic MapReduce than I used to, although I imagine usage is still large and growing.
- In general, I get the impression that progress is being made in overcoming the inherent difficulties in cloud (and even colo) parallel analytic processing. But it all still seems pretty vague, except for the specific claims being made for traction of Redshift, EMR, and so on.
- Teradata recently told me that in colocation pricing, it is common for floor space to be everything, with power not separately metered. But I don’t think that trend is a big deal, as it is not necessarily permanent.
- Cloud hype is of course still with us.
- Other than the above, I stand by my previous thoughts on appliances, clusters and clouds.
3. As for the analytic DBMS industry: Read more
I’ve talked with many companies recently that believe they are:
- Focused on building a great data management and analytic stack for log management …
- … unlike all the other companies that might be saying the same thing …
- … and certainly unlike expensive, poorly-scalable Splunk …
- … and also unlike less-focused vendors of analytic RDBMS (which are also expensive) and/or Hadoop distributions.
At best, I think such competitive claims are overwrought. Still, it’s a genuinely important subject and opportunity, so let’s consider what a great log management and analysis system might look like.
Much of this discussion could apply to machine-generated data in general. But right now I think more players are doing product management with an explicit conception either of log management or event-series analytics, so for this post I’ll share that focus too.
A short answer might be “Splunk, but with more analytic functionality and more scalable performance, at lower cost, plus numerous coupons for free pizza.” A more constructive and bottoms-up approach might start with: Read more
I spent a day with Teradata in Rancho Bernardo last week. Most of what we discussed is confidential, but I think the non-confidential parts and my general impressions add up to enough for a post.
First, let’s catch up with some personnel gossip. So far as I can tell:
- Scott Gnau runs most of Teradata’s development, product management, and product marketing, the big exception being that …
- … Darryl McDonald run the apps part (Aprimo and so on), and no longer is head of marketing.
- Oliver Ratzesberger runs Teradata’s software development.
- Jeff Carter has returned to his roots and runs the hardware part, in place of Carson Schmidt.
- Aster founders Mayank Bawa and Tasso Argyros have left Teradata (perhaps some earn-out period ended).
- Carson is temporarily running Aster development (in place of Mayank), and has some sort of evangelism role waiting after that.
- With the acquisition of Hadapt, Teradata gets some attention from Dan Abadi. Also, they’re retaining Justin Borgman.
The biggest change in my general impressions about Teradata is that they’re having smart thoughts about the cloud. At least, Oliver is. All details are confidential, and I wouldn’t necessarily expect them to become clear even in October (which once again is the month for Teradata’s user conference). My main concern about all that is whether Teradata’s engineering team can successfully execute on Oliver’s directives. I’m optimistic, but I don’t have a lot of detail to support my good feelings.
In some quick-and-dirty positioning and sales qualification notes, which crystallize what we already knew before:
- The Teradata 1xxx series is focused on cost-per-bit.
- The Teradata 2xxx series is focused on cost-per-query. It is commonly Teradata’s “lead” product, at least for new customers.
- The Teradata 6xxx series is supposed to be able to do “everything”.
- The Teradata Aster “Discovery Analytics” platform is sold mainly to customers who have a specific high-value problem to solve. (Randy Lea gave me a nice round dollar number, but I won’t share it.) I like that approach, as it obviates much of the concern about “Wait — is this strategic for us long-term, given that we also have both Teradata database and Hadoop clusters?”
Also: Read more
|Categories: Aster Data, Data warehouse appliances, Data warehousing, Hadapt, Hadoop, MapReduce, Solid-state memory, Teradata||2 Comments|