I believe in all of the following trends:
- Hadoop is a Big Deal, and here to stay.
- Spark, for most practical purposes, is becoming a big part of Hadoop.
- Most servers will be operated away from user premises, whether via SaaS (Software as a Service), co-location, or “true” cloud computing.
Trickier is the meme that Hadoop is “the new OS”. My thoughts on that start:
- People would like this to be true, although in most cases only as one of several cluster computing platforms.
- Hadoop, when viewed as an operating system, is extremely primitive.
- Even so, the greatest awkwardness I’m seeing when different software shares a Hadoop cluster isn’t actually in scheduling, but rather in data interchange.
There is also a minor issue that if you distribute your Hadoop work among extra nodes you might have to pay a bit more to your Hadoop distro support vendor. Fortunately, the software industry routinely solves more difficult pricing problems than that.
- Hortonworks’ subscription revenues for the 9 months ended last September 30 appear to be:
- $11.7 million from everybody but Microsoft, …
- … plus $7.5 million from Microsoft, …
- … for a total of $19.2 million.
- Hortonworks states subscription customer counts (as per Page 55 this includes multiple “customers” within the same organization) of:
- 2 on April 30, 2012.
- 9 on December 31, 2012.
- 25 on April 30, 2013.
- 54 on September 30, 2013.
- 95 on December 31, 2013.
- 233 on September 30, 2014.
- Per Page 70, Hortonworks’ total September 30, 2014 customer count was 292, including professional services customers.
- Non-Microsoft subscription revenue in the quarter ended September 30, 2014 seems to have been $5.6 million, or $22.5 million annualized. This suggests Hortonworks’ average subscription revenue per non-Microsoft customer is a little over $100K/year.
- This IPO looks to be a sharply “down round” vs. Hortonworks’ Series D financing earlier this year.
- In March and June, 2014, Hortonworks sold stock that subsequently was converted into 1/2 a Hortonworks share each at $12.1871 per share.
- The tentative top of the offering’s price range is $14/share.
- That’s also slightly down from the Series C price in mid-2013.
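The arithmetic behind the revenue bullets above can be checked with a short sketch. The dollar figures and the 233 customer count come from the bullets; treating Microsoft as exactly one of those 233 subscription customers is an assumption, and the function names are purely illustrative. Note that four times the rounded $5.6 million gives $22.4 million rather than $22.5 million, presumably because the quoted quarterly figure is itself rounded. Dividing by the end-of-quarter count gives a bit under $100K per customer; a lower average customer count during the quarter would push the figure higher.

```python
# Figures quoted in the bullets above, in millions of US dollars.
NON_MSFT_9MO = 11.7   # nine months ended September 30, 2014
MSFT_9MO = 7.5
NON_MSFT_Q3 = 5.6     # quarter ended September 30, 2014

# Subscription customers at September 30, 2014, per the bullets.
SUBSCRIPTION_CUSTOMERS = 233
# Assumption: Microsoft is exactly one of those customers.
NON_MSFT_CUSTOMERS = SUBSCRIPTION_CUSTOMERS - 1


def nine_month_total():
    """Total subscription revenue for the nine months, in $ millions."""
    return round(NON_MSFT_9MO + MSFT_9MO, 1)


def annualized(quarterly_millions):
    """Naive annualization: one quarter times four."""
    return quarterly_millions * 4


def revenue_per_customer_k(annual_millions, customers):
    """Average annual revenue per customer, in $ thousands."""
    return annual_millions * 1000 / customers


if __name__ == "__main__":
    annual = annualized(NON_MSFT_Q3)
    print(nine_month_total())
    print(round(annual, 1))
    print(round(revenue_per_customer_k(annual, NON_MSFT_CUSTOMERS), 1))
```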
And, perhaps of interest only to me — there are approximately 50 references to YARN in the Hortonworks S-1, but only 1 mention of Tez.
I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:
1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:
- Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
- Replicating an entire database into different geographies, often for reasons of latency and/or availability/disaster recovery.
The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.
What made me think of this was a phone call with MongoDB in which I learned that the limit on number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.
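To make the distinction in point 1 concrete, here is a minimal sketch of the two placement strategies and of the single-master versus multi-master split. The region names, the record shape and the `home_region` field are all hypothetical; this is not how MongoDB or any other specific product decides placement.

```python
# Hypothetical set of geographies a deployment spans.
REGIONS = ["eu", "us", "apac"]


def partitioned_placement(record):
    """Geo-partitioning: a record is stored only in its home region,
    e.g. for data privacy regulatory compliance."""
    return [record["home_region"]]


def replicated_placement(record):
    """Full replication: every region holds a copy of every record,
    e.g. for latency or availability/disaster recovery."""
    return list(REGIONS)


def writable_regions(record, multi_master):
    """Under full replication, which copies accept first writes.

    multi_master=True  -> any copy can take the write (active-active).
    multi_master=False -> only the single master for this record can.
    """
    if multi_master:
        return list(REGIONS)
    return [record["home_region"]]


if __name__ == "__main__":
    customer = {"id": 42, "home_region": "eu"}
    print(partitioned_placement(customer))
    print(replicated_placement(customer))
    print(writable_regions(customer, multi_master=False))
```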
2. Three years ago I posted about agile (predictive) analytics. One of the points was:
… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.
Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.
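For readers who haven't met bandit testing, a minimal epsilon-greedy sketch follows. It is one common textbook strategy, not a description of what WibiData or its clients actually run; the offer names, response rates and the 10% exploration rate are made up for illustration.

```python
import random


class EpsilonGreedy:
    """Mostly show the best-looking offer, but keep exploring the others."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in arms}
        self.wins = {arm: 0.0 for arm in arms}

    def estimate(self, arm):
        """Observed response rate for an arm (0.0 before any trials)."""
        pulls = self.pulls[arm]
        return self.wins[arm] / pulls if pulls else 0.0

    def choose(self):
        arms = list(self.pulls)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(arms)        # explore
        return max(arms, key=self.estimate)     # exploit

    def update(self, arm, reward):
        self.pulls[arm] += 1
        self.wins[arm] += reward


def simulate(true_rates, rounds=5000, seed=0):
    """Run the bandit against hypothetical true response rates."""
    bandit = EpsilonGreedy(true_rates, seed=seed)
    world = random.Random(seed + 1)
    for _ in range(rounds):
        arm = bandit.choose()
        reward = 1.0 if world.random() < true_rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit


if __name__ == "__main__":
    result = simulate({"offer_a": 0.05, "offer_b": 0.10, "offer_c": 0.50})
    print(result.pulls)
```

The point of the design is that every trial both serves a customer and refreshes the model, which is the "immediately gain new information" effect described above.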
3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with:
I commonly write about real or apparent technical differentiation, in a broad variety of domains. But actually, computers only do a couple of kinds of things:
- Accept instructions.
- Execute them.
And hence almost all IT product differentiation fits into two buckets:
- Easier instruction-giving, whether that’s in the form of a user interface, a language, or an API.
- Better execution, where “better” usually boils down to “faster”, “more reliable” or “more reliably fast”.
As examples of this reductionism, please consider:
- Application development is of course a matter of giving instructions to a computer.
- Database management systems accept and execute data manipulation instructions.
- Data integration tools accept and execute data integration instructions.
- System management software accepts and executes system management instructions.
- Business intelligence tools accept and execute instructions for data retrieval, navigation, aggregation and display.
Similar stories are true about application software, or about anything that has an API (Application Programming Interface) or SDK (Software Development Kit).
Yes, all my examples are in software. That’s what I focus on. If I wanted to be more balanced in including hardware or data centers, I might phrase the discussion a little differently — but the core points would still remain true.
What I’ve said so far should make more sense if we combine it with the observation that differentiation is usually restricted to particular domains.
Datameer checked in, having recently announced general availability of Datameer 5.0. So far as I understood, Datameer is still clearly in the investigative analytics business, in that:
- Datameer does business intelligence, but not at human real-time speeds. Datameer query durations are sometimes sub-minute, but surely not sub-second.
- Datameer also does lightweight predictive analytics/machine learning — k-means clustering, decision trees, and so on.
Key aspects include:
- Datameer runs straight against Hadoop.
- Like many other analytic offerings, Datameer is meant to be “self-service”, for line-of-business business analysts, and includes some “data preparation”. Datameer also has had some data profiling since Datameer 4.0.
- The main way of interacting with Datameer seems to be visual analytic programming. However, I gather Datameer has evolved somewhat away from its original spreadsheet metaphor.
- Datameer’s primitives resemble those you’d find in SQL (e.g. JOINs, GROUPBYs). More precisely, that would be SQL with a sessionization extension; e.g., there’s a function called GROUPBYGAP.
- Datameer lets you write derived data back into Hadoop.
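The sessionization idea behind a function like GROUPBYGAP can be sketched generically: events belong to the same session until the gap between consecutive events exceeds some threshold. The code below is plain gap-based sessionization under that assumption; Datameer's actual GROUPBYGAP signature and semantics aren't described here, and the function name and 30-minute threshold are illustrative only.

```python
def sessionize(timestamps, max_gap):
    """Split event times into sessions.

    A new session starts whenever the gap since the previous event
    is larger than max_gap (same units as the timestamps).
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_gap:
            sessions[-1].append(t)   # close enough: same session
        else:
            sessions.append([t])     # gap too large: new session
    return sessions


if __name__ == "__main__":
    # Event times in seconds for one hypothetical user; 1800 s = 30 minutes.
    clicks = [0, 10, 25, 2000, 2010, 5000]
    for session in sessionize(clicks, max_gap=1800):
        print(session)
```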
It seems reasonable to wonder whether analytic data management is headed for the cloud. In no particular order:
- Amazon Redshift appears to be prospering.
- So are some SaaS (Software as a Service) business intelligence vendors.
- Amazon Elastic MapReduce is still around.
- Snowflake Computing launched with a cloud strategy.
- Cazena, with vague intentions for cloud data warehousing, destealthed.
- Cloudera made various cloud-related announcements.
- Data is increasingly machine-generated, and machine-generated data commonly originates off-premises.
- The general argument for cloud-or-at-least-colocation has compelling aspects.
- Analytic workloads can be “bursty”, and so could benefit from true cloud elasticity.
I talked with the Snowflake Computing guys Friday. For starters:
- Snowflake is offering an analytic DBMS on a SaaS (Software as a Service) basis.
- The Snowflake DBMS is built from scratch (as opposed to, for example, being based on PostgreSQL or Hadoop).
- The Snowflake DBMS is columnar and append-only, as has become common for analytic RDBMS.
- Snowflake claims excellent SQL coverage for a 1.0 product.
- Snowflake, the company, has:
- 50 people.
- A similar number of current or past users.
- 5 referenceable customers.
- 2 techie founders out of Oracle, plus Marcin Zukowski.
- Bob Muglia as CEO.
Much of the Snowflake story can be summarized as cloud/elastic/simple/cheap.*
*Excuse me — inexpensive. Companies rarely like their products to be labeled as “cheap”.
In addition to its purely relational functionality, Snowflake accepts poly-structured data. Notes on that start:
- Ingest formats are JSON, XML or AVRO for now.
- I gather that the system automagically decides which fields/attributes are sufficiently repeated to be broken out as separate columns; also, there’s a column for the documents themselves.
I don’t know enough details to judge whether I’d call that an example of schema-on-need.
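One way to picture "sufficiently repeated fields get their own columns" is the sketch below. It is emphatically not Snowflake's mechanism, which isn't described here: the 80% threshold, the restriction to top-level JSON keys, and the `_document` column name are all assumptions made for illustration.

```python
import json
from collections import Counter


def choose_columns(documents, min_fraction=0.8):
    """Top-level keys that appear in at least min_fraction of documents."""
    counts = Counter(key for doc in documents for key in doc)
    threshold = min_fraction * len(documents)
    return sorted(key for key, n in counts.items() if n >= threshold)


def shred(raw_documents, min_fraction=0.8):
    """Break frequent keys out as columns; keep each full document too."""
    documents = [json.loads(raw) for raw in raw_documents]
    columns = choose_columns(documents, min_fraction)
    rows = []
    for raw, doc in zip(raw_documents, documents):
        row = {col: doc.get(col) for col in columns}
        row["_document"] = raw   # hypothetical column holding the document itself
        rows.append(row)
    return columns, rows


if __name__ == "__main__":
    raw = [
        '{"user": 1, "ts": 5, "x": 1}',
        '{"user": 2, "ts": 6}',
        '{"user": 3, "ts": 7, "y": 2}',
    ]
    columns, rows = shred(raw)
    print(columns)
    print(rows[0])
```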
A key element of Snowflake’s poly-structured data story seems to be lateral views. I’m not too clear on that concept, but I gather:
- Cloudera continued to improve various aspects of its product line, especially Impala with a Version 2.0. Good for them. One should always be making one’s products better.
- Cloudera announced a variety of partnerships with companies one would think are opposed to it. Not all are Barney. I’m now hard-pressed to think of any sustainable-looking relationship advantage Hortonworks has left in the Unix/Linux world. (However, I haven’t heard a peep about any kind of Cloudera/Microsoft/Windows collaboration.)
- Cloudera is getting more cloud-friendly, via a new product — Cloudera Director. Probably there are or will be some cloud-services partnerships as well.
Notes on Cloudera Director start:
- It’s closed-source.
- Code and support are included in any version of Cloudera Enterprise.
- It’s a management tool. Indeed, Cloudera characterized it to me as a sort of manager of Cloudera Managers.
What I have not heard is any answer for the traditional performance challenge of Hadoop-in-the-cloud, which is:
- Hadoop, like most analytic RDBMS, tightly couples processing and storage in a shared-nothing way.
- Standard cloud architectures, however, decouple them, thus mooting a considerable fraction of Hadoop performance engineering.
Maybe that problem isn’t — or is no longer — as big a deal as I’ve been told.
Hadoop World/Strata is this week, so of course my clients at Cloudera will have a bunch of announcements. Without front-running those, I think it might be interesting to review the current state of the Cloudera product line. Details may be found on the Cloudera product comparison page. Examining those details helps, I think, with understanding where Cloudera does and doesn’t place sales and marketing focus, which, given Cloudera’s Hadoop market stature, is in my opinion an interesting thing to analyze.
So far as I can tell (and there may be some errors in this, as Cloudera is not always accurate in explaining the fine details):
- CDH (Cloudera Distribution … Hadoop) contains a lot of Apache open source code.
- Cloudera has a much longer list of Apache projects that it thinks comprise “Core Hadoop” than, say, Hortonworks does.
- Specifically, that list currently is: Hadoop, Flume, HCatalog, Hive, Hue, Mahout, Oozie, Pig, Sentry, Sqoop, Whirr, ZooKeeper.
- In addition to those projects, CDH also includes HBase, Impala, Spark and Cloudera Search.
- Cloudera Manager is closed-source code, much of which is free to use. (I.e., “free like beer” but not “free like speech”.)
- Cloudera Navigator is closed-source code that you have to pay for (free trials and the like excepted).
- Cloudera Express is Cloudera’s favorite free subscription offering. It combines CDH with the free part of Cloudera Manager. Note: Cloudera Express was previously called Cloudera Standard, and that terminology is still reflected in parts of Cloudera’s website.
- Cloudera Enterprise is the umbrella name for Cloudera’s three favorite paid offerings.
- Cloudera Enterprise Basic Edition contains:
- All the code in CDH and Cloudera Manager, and I guess Accumulo code as well.
- Commercial licenses for all that code.
- A license key to use the entirety of Cloudera Manager, not just the free part.
- Support for the “Core Hadoop” part of CDH.
- Support for Cloudera Manager. Note: Cloudera is lazy about saying this explicitly, but it seems obvious.
- The code for Cloudera Navigator, but that’s moot, as the corresponding license key for Cloudera Navigator is not part of the package.
- Cloudera Enterprise Data Hub Edition contains:
- Everything in Cloudera Basic Edition.
- A license key for Cloudera Navigator.
- Support for all of HBase, Accumulo, Impala, Spark, Cloudera Search and Cloudera Navigator.
- Cloudera Enterprise Flex Edition contains everything in Cloudera Basic Edition, plus support for one of the extras in Data Hub Edition.
In analyzing all this, I’m focused on two particular aspects:
- The “zero, one, many” system for defining the editions of Cloudera Enterprise.
- The use of “Data Hub” as a general marketing term.
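The "zero, one, many" structure is easy to see if the support entitlements listed above are written down as data. This sketch encodes only what the list says is supported in each edition of Cloudera Enterprise; the edition keys and the function name are illustrative, and, as noted, the underlying details may contain errors.

```python
# Supported in every Cloudera Enterprise edition, per the list above.
BASE_SUPPORT = frozenset({"Core Hadoop", "Cloudera Manager"})

# The extras that Data Hub Edition adds support for.
EXTRAS = frozenset({
    "HBase", "Accumulo", "Impala", "Spark",
    "Cloudera Search", "Cloudera Navigator",
})


def supported_components(edition, flex_choice=None):
    """Return the set of components covered by support.

    Basic    -> zero extras
    Flex     -> exactly one extra (flex_choice)
    Data Hub -> all (many) extras
    """
    if edition == "Basic":
        return set(BASE_SUPPORT)
    if edition == "Flex":
        if flex_choice not in EXTRAS:
            raise ValueError("Flex Edition needs exactly one valid extra")
        return set(BASE_SUPPORT) | {flex_choice}
    if edition == "Data Hub":
        return set(BASE_SUPPORT) | set(EXTRAS)
    raise ValueError(f"unknown edition: {edition}")


if __name__ == "__main__":
    for name, choice in [("Basic", None), ("Flex", "Impala"), ("Data Hub", None)]:
        extras = supported_components(name, choice) - BASE_SUPPORT
        print(name, len(extras))
```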
As planned, I’m getting more active in predictive modeling. Anyhow …
1. I still believe most of what I said in a July, 2013 predictive modeling catch-all post. However, I haven’t heard as much subsequently about Ayasdi as I had expected to.
2. The most controversial part of that post was probably the claim:
I think the predictive modeling state of the art has become:
- Cluster in some way.
- Model separately on each cluster.
- It is always possible to instead go with a single model formally.
- A lot of people think accuracy, ease-of-use, or both are better served by a true single-model approach.
- Conversely, if you have a single model that’s pretty good, it’s natural to look at the subset of the data for which it works poorly and examine that first. Voila! You’ve just done a kind of clustering.
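The "cluster, then model separately on each cluster" recipe can be sketched in a few lines of standard-library Python. The choices here (a tiny 1-D k-means for the clustering step, ordinary least squares lines for the per-cluster models, and the toy data) are stand-ins picked for brevity, not anybody's production method; the sketch also assumes k is at least 2 and that every cluster ends up with at least two distinct x values.

```python
from statistics import mean


def kmeans_1d(xs, k=2, iterations=20):
    """Tiny 1-D k-means; returns (label per point, cluster centers)."""
    lo, hi = min(xs), max(xs)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(xs)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in xs]
        for c in range(k):
            members = [x for x, label in zip(xs, labels) if label == c]
            if members:
                centers[c] = mean(members)
    return labels, centers


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    xs, ys = zip(*points)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sxx
    return slope, y_bar - slope * x_bar


def fit_per_cluster(points, k=2):
    """Step 1: cluster on x. Step 2: fit a separate model per cluster."""
    labels, centers = kmeans_1d([x for x, _ in points], k)
    models = {}
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        models[c] = fit_line(members)
    return centers, models


def predict(x, centers, models):
    """Route x to its nearest cluster, then use that cluster's model."""
    c = min(models, key=lambda i: abs(x - centers[i]))
    slope, intercept = models[c]
    return slope * x + intercept


if __name__ == "__main__":
    # Toy data: two segments that behave in opposite ways.
    data = [(x, 2 * x) for x in range(5)] + [(x, 100 - 3 * x) for x in range(10, 15)]
    centers, models = fit_per_cluster(data)
    print(predict(3, centers, models), predict(12, centers, models))
```

A single straight line fit to all ten toy points would serve neither segment well, which is the intuition behind modeling each cluster separately.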
3. Nutonian is now a client. I just had my first meeting with them this week. To a first approximation, they’re somewhat like KXEN (sophisticated math, non-linear models, ease of modeling, quasi-automagic feature selection), but with differences that start: