Telecommunications – DBMS 2 : DataBase Management System Services

Cask and CDAP

Curt Monash — Thu, 05 Mar 2015 15:00:13 +0000

For starters:

Continuuity toured in 2012 and touted its “app server for Hadoop” technology.
Continuuity recently changed its name to Cask and went open source.
Cask’s product is now called CDAP (Cask Data Application Platform). It’s still basically an app server for Hadoop and other “big data” — ouch do I hate that phrase — data stores.
Cask and Cloudera partnered.
I got a more technical Cask briefing this week.

Also:

App servers are a notoriously amorphous technology. The focus of how they’re used can change greatly every couple of years.
Partly for that reason, I was unimpressed by Continuuity’s original hype-filled positioning.

So far as I can tell:

Cask’s current focus is to orchestrate job flows, with lots of data mappings.
This is supposed to provide lots of developer benefits, for fairly obvious reasons. Those are pitched in terms of an integration story, more in a “free you from the mess of a many-part stack” sense than strictly in terms of data integration.
CDAP already has a GUI to monitor what’s going on. A GUI to specify workflows is coming very soon.
CDAP doesn’t consume a lot of cycles itself, and hence isn’t a real risk for unpleasant overhead, if “overhead” is narrowly defined. Rather, performance drags could come from …
… sub-optimal choices in data mapping, database design or workflow composition.

I didn’t push the competition point hard (the call was generally a bit rushed due to a hard stop on my side), but:

Cask thinks it doesn’t have much in the way of exact or head-to-head competitors, but cites Spring and WibiData/Kiji as coming closest.
I’d think that data integration vendors who use Hadoop as an execution engine (Informatica, Syncsort and many more) would be in the mix as well.
Cask disclaimed competition with Teradata Revelytix, on the theory that Cask is focused on operational/”real-time” use cases, while Revelytix Loom is focused on data science/investigative analytics.

To reiterate part of that last bullet — like much else we’re hearing about these days, CDAP is focused on operational apps, perhaps with a streaming aspect.

To some extent CDAP can be viewed as restoring the programmer/DBA distinction to the non-SQL and streaming worlds. That is:

Somebody creates a data mapping “pattern”.
Programmers (including perhaps the creator) write to that pattern.
Somebody (perhaps the creator) tweaks the mapping to optimize performance, or to reflect changes in the underlying data management.
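
That division of labor can be sketched in a few lines of Python. This is an illustration only, not CDAP's actual API: `UserProfileStore`, `InMemoryProfiles`, `KeyPrefixedProfiles` and `record_login` are all hypothetical names. The idea it shows is that application code is written once against the pattern, while the mapping underneath can be swapped.

```python
from typing import Optional, Protocol


class UserProfileStore(Protocol):
    """The 'pattern': the only interface application code is allowed to use."""

    def put(self, user_id: str, field: str, value: str) -> None: ...

    def get(self, user_id: str, field: str) -> Optional[str]: ...


class InMemoryProfiles:
    """First mapping (hypothetical): one nested dict per user."""

    def __init__(self) -> None:
        self._rows = {}

    def put(self, user_id, field, value):
        self._rows.setdefault(user_id, {})[field] = value

    def get(self, user_id, field):
        return self._rows.get(user_id, {}).get(field)


class KeyPrefixedProfiles:
    """Remapped version (hypothetical): a flat key-value layout."""

    def __init__(self) -> None:
        self._cells = {}

    def put(self, user_id, field, value):
        self._cells[f"{user_id}:{field}"] = value

    def get(self, user_id, field):
        return self._cells.get(f"{user_id}:{field}")


def record_login(store: UserProfileStore, user_id: str, when: str) -> None:
    """Programmer-side code: written against the pattern, not a specific store."""
    store.put(user_id, "last_login", when)
```

Calling `record_login` with either store class behaves the same from the programmer's point of view, which is the point of the programmer/DBA split described above.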

Further notes on CDAP data access include:

Cask is proud that a pattern can literally be remapped from one data store to another, although I wonder how often that is likely to happen in practice.
Also, a single “row” can reference multiple data stores.
Cask’s demo focused on imposing a schema on a log file, something you might do incrementally as you decide to extract another field of information. This is similar to major use cases for schema-on-need and for Splunk.
For most SQL-like access and operations, CDAP relies on Hive, even to external data stores or non-tabular data. Cask is working with Cloudera on Impala access.
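
To make the "impose a schema on a log file, incrementally" idea concrete, here is a minimal Python sketch. It is not what Cask demonstrated; the log lines, field names and regular expressions are all invented for illustration. The raw lines never change; only the set of declared fields grows.

```python
import re

# Hypothetical log lines in a common-log-like layout, invented for illustration.
LOG_LINES = [
    '10.0.0.1 - - [05/Mar/2015:15:00:13 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [05/Mar/2015:15:00:14 +0000] "POST /login HTTP/1.1" 302 0',
]

# Schema version 1: only the client address has been declared so far.
SCHEMA_V1 = {"client": re.compile(r"^(\S+)")}

# Schema version 2: later, two more fields are extracted from the same raw lines.
SCHEMA_V2 = {
    **SCHEMA_V1,
    "status": re.compile(r'" (\d{3}) '),
    "path": re.compile(r'"(?:GET|POST) (\S+)'),
}


def apply_schema(line, schema):
    """Return one dict per line; fields whose pattern does not match are None."""
    row = {}
    for name, pattern in schema.items():
        match = pattern.search(line)
        row[name] = match.group(1) if match else None
    return row


rows_v1 = [apply_schema(line, SCHEMA_V1) for line in LOG_LINES]
rows_v2 = [apply_schema(line, SCHEMA_V2) for line in LOG_LINES]
```

Here `rows_v1` exposes only `client`, while `rows_v2` adds `status` and `path` without the log itself having been reloaded or rewritten.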

Examples of things that Cask supposedly makes easy include:

Chunking streaming data by time (e.g. 1 minute buckets).
Encryption.
Generating database stats (histograms and so on).
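
For the first of those, chunking by time, the underlying operation is just truncating each event's timestamp to the start of its bucket. A minimal sketch in plain Python follows; it says nothing about how CDAP itself does this, and the function names and sample events are made up.

```python
from collections import Counter
from datetime import datetime, timezone


def minute_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its 1-minute bucket."""
    return ts.replace(second=0, microsecond=0)


def count_per_minute(timestamps):
    """Count events per 1-minute bucket."""
    return Counter(minute_bucket(ts) for ts in timestamps)


events = [
    datetime(2015, 3, 5, 15, 0, 13, tzinfo=timezone.utc),
    datetime(2015, 3, 5, 15, 0, 59, tzinfo=timezone.utc),
    datetime(2015, 3, 5, 15, 1, 2, tzinfo=timezone.utc),
]
counts = count_per_minute(events)
```

The first two events land in the 15:00 bucket and the third in the 15:01 bucket.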

Tidbits as to how Cask perceives or CDAP plays with other technologies include:

Kafka is hot.
Spark Streaming is hot enough to be on the CDAP roadmap.
Cask believes that its administrative tools don’t conflict with Cloudera Manager or Ambari, because they’re more specific to an application, job or dataset.
CDAP is built on Twill, which is a thread-like abstraction over YARN that Cask contributed to Apache. Mesos is in the picture as well, as a YARN alternative.
Cask is seeing some interest in Flink. (Flink is basically a Spark alternative out of Germany, which I’ve been dismissing as unneeded.)

Cask has ~40 people, multiple millions of dollars in trailing revenue, and — naturally — high expectations for future growth. I neglected, however, to ask how that revenue was split between subscription, professional services and miscellaneous. Cask expects to finish 2015 with a healthy two-digit number of customers.

Cask’s customers seem concentrated in usual-suspect internet-related sectors, although Cask gave it a bit of an enterprise-y spin by specifically citing SaaS (Software as a Service) and telecom. When I asked who else seems to be a user or interested based on mailing list activity, Cask mentioned a lot of financial services and some health care as well.

Related link

Cask doesn’t have the obvious .com URL.

Notes on machine-generated data, year-end 2014

Curt Monash — Thu, 01 Jan 2015 03:49:37 +0000

Most IT innovation these days is focused on machine-generated data (sometimes just called “machine data”), rather than human-generated. So as I find myself in the mood for another survey post, I can’t think of any better idea for a unifying theme.

1. There are many kinds of machine-generated data. Important categories include:

Web, network and other IT logs.
Game and mobile app event data.
CDRs (telecom Call Detail Records).
“Phone-home” data from large numbers of identical electronic products (for example set-top boxes).
Sensor network output (for example from a pipeline or other utility network).
Vehicle telemetry.
Health care data, in hospitals.
Digital health data from consumer devices.
Images from public-safety camera networks.
Stock tickers (if you regard them as being machine-generated, which I do).

That’s far from a complete list, but if you think about those categories you’ll probably capture most of the issues surrounding other kinds of machine-generated data as well.

2. Technology for better information and analysis is also technology for privacy intrusion. Public awareness of privacy issues is focused in a few areas, mainly:

Government snooping on the contents of communications.
Communication traffic analysis.
Photos and videos (airport scanners, public cameras, etc.)
Commercial ad targeting.
Traditional medical records.

Other areas, however, continue to be overlooked, with the two biggies in my opinion being:

The potential to apply marketing-like psychographic analysis in other areas, such as hiring decisions or criminal justice.
The ability to track people’s movements in great detail, which will be increased greatly yet again as the market matures — and some think this will happen soon — for consumer digital health.

My core arguments about privacy and surveillance seem as valid as ever.

3. The natural database structures for machine-generated data vary wildly. Weblog data structure is often remarkably complex. Log data from complex organizations (e.g. IT shops or hospitals) might comprise many streams, each with a different (even if individually simple) organization. But in the majority of my example categories, record structure is very simple and repeatable. Thus, there are many kinds of machine-generated data that can, at least in principle, be handled well by a relational DBMS …

4. … at least to some extent. In a further complication, much machine-generated data arrives as a kind of time series. Many (but not all) time series call for a strong commitment to event-series styles of analytics. Event series analytics are a challenge for relational DBMS, but Vertica and others have tried to step up with various kinds of temporal predicates or datatypes. Event series are also a challenge for business intelligence vendors, and a potentially significant driver for competitive rebalancing in the BI market.
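
One typical event-series operation is splitting a timestamped stream into sessions separated by inactivity gaps, which depends on the ordering of rows rather than on their values alone. The Python sketch below is only an illustration of that style of analysis; the 30-minute threshold and the function name are assumptions, not anything from Vertica or another vendor.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, max_gap=timedelta(minutes=30)):
    """Group timestamps into sessions; a gap above max_gap starts a new session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap:
            sessions[-1].append(ts)  # close enough to the previous event
        else:
            sessions.append([ts])  # first event, or gap too large
    return sessions


clicks = [
    datetime(2015, 1, 1, 9, 0),
    datetime(2015, 1, 1, 9, 10),
    datetime(2015, 1, 1, 10, 0),
    datetime(2015, 1, 1, 10, 5),
]
sessions = sessionize(clicks)
```

With these sample clicks, the 50-minute gap between 9:10 and 10:00 splits the stream into two sessions of two events each.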

5. Event series even aside, I wish I understood more about business intelligence for non-tabular data. I plan to fix that.

6. Streaming and memory-centric processing are closely related subjects. What I wrote recently about them for Hadoop still applies: Spark, Kafka, etc. is still the base streaming case going forward; Storm is still around as an alternative; Tachyon or something like it will change the game somewhat. But not all streaming machine-generated data needs to land in Hadoop at all. As noted above, relational data stores (especially memory-centric ones) can suffice. So can NoSQL. So can Splunk.

Not all these considerations are important in all use cases. For one thing, latency requirements vary greatly. For example:

High-frequency trading is an extreme race; microseconds matter.
Internet interaction applications increasingly require data freshness to the last click or other user action. Computational latency requirements can go down to the single-digit milliseconds. Real-time ad auctions have a race aspect that may drive latency lower yet.
Minute-plus response can be fine for individual remote systems. Sometimes they ping home more rarely than that.

There’s also still plenty of true batch mode, but — and I say this as part of a conversation that’s been underway for over 40 years — interactive computing is preferable whenever feasible.

7. My views about predictive analytics are still somewhat confused. For starters:

The math and technology of predictive modeling both still seem pretty simple …
… but sometimes achieve mind-blowing results even so.
There’s a lot of recent innovation in predictive modeling, but adoption of the innovative stuff is still fairly tepid.
Adoption of the simple stuff is strong in certain market sectors, especially ones connected to customer understanding, such as marketing or anti-fraud.

So I’ll mainly just link to some of my past posts on the subject, and otherwise leave discussion of predictive analytics to another day.

WibiData has some innovative ideas in predictive experimentation.
Nutonian has some innovative ideas in non-linear modeling for pattern detection/root-cause analysis.
It’s still at the anecdotal level, but there have been interesting ideas in the rapid retraining of models.
Ayasdi reminded us that there’s room for innovation in clustering.
My Thanksgiving round-up post points to a lot of my prior comments on predictive modeling.

Finally, back in 2011 I tried to broadly categorize analytics use cases. Based on that and also on some points I just raised above, I’d say that a ripe area for breakthroughs is problem and anomaly detection and diagnosis, specifically for machines and physical installations, rather than in the marketing/fraud/credit score areas that are already going strong. That’s an old discipline; the concept of statistical process control dates back before World War II. Perhaps they’re underway; the Conviva retraining example linked above is certainly imaginative. But I’d like to see a lot more in the area.
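
For readers who haven't seen statistical process control, its simplest form is a control chart: derive limits from readings taken while the process was healthy, then flag later readings that fall outside them. The Python sketch below uses the conventional three-sigma limits; the sensor readings are invented, and real installations would use considerably more elaborate rules.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Lower and upper control limits from readings taken while the process was healthy."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(readings, limits):
    """Indices of readings that fall outside the control limits."""
    low, high = limits
    return [i for i, x in enumerate(readings) if x < low or x > high]


# Hypothetical sensor readings, invented for illustration.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
limits = control_limits(baseline)
flagged = out_of_control([10.0, 10.3, 11.5, 9.9], limits)
```

Only the third new reading (11.5) falls outside the limits here, so `flagged` contains just its index.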

Even more important, of course, could be some kind of revolution in predictive modeling for medicine.

A few numbers from MapR

Curt Monash — Wed, 10 Dec 2014 06:55:20 +0000

MapR put out a press release aggregating some customer information; unfortunately, the release is a monument to vagueness. Let me start by saying:

I don’t know for sure, but I’m guessing Derrick Harris was incorrect in suspecting that this release was a reaction to my recent post about Hortonworks’ numbers. For one thing, press releases usually don’t happen that quickly.
And as should be obvious from the previous point — notwithstanding that MapR is a client, I had no direct involvement in this release.
In general, I advise clients and other vendors to put out the kind of aggregate of customer success stories found in this release. However, I would like to see more substance than MapR offered.

Anyhow, the key statement in the MapR release is:

… the number of companies that have a paid subscription for MapR now exceeds 700.

Unfortunately, that includes OEM customers as well as direct ones; I imagine MapR’s direct customer count is much lower.

In one gesture to numerical conservatism, MapR did indicate by email that it counts by overall customer organization, not by department/cluster/contract (i.e., not the way Hortonworks does).

The MapR press release also said:

As of November 2014, MapR has one or more customers in eight vertical markets that have purchased more than one million dollars of MapR software and services. These vertical markets are advertising/media, financial services, healthcare, internet, information technology, retail, security, and telecom.

Since the word “each” isn’t in that quote, we don’t even know whether MapR is referring to individual big customers or just general sector penetration. We also don’t know whether the revenue is predominantly subscription or some other kind of relationship.

MapR also indicated that the average customer more than doubled its annualized subscription rate vs. a year ago; the comparable figure — albeit with heavy disclaimers — from Hortonworks was 25%.

Misconceptions about privacy and surveillance

Curt Monash — Mon, 15 Sep 2014 17:07:56 +0000

Everybody is confused about privacy and surveillance. So I’m renewing my efforts to consciousness-raise within the tech community. For if we don’t figure out and explain the issues clearly enough, there isn’t a snowball’s chance in Hades our lawmakers will get it right without us.

How bad is the confusion? Well, even Edward Snowden is getting it wrong. A Wired interview with Snowden says:

“If somebody’s really watching me, they’ve got a team of guys whose job is just to hack me,” he says. “I don’t think they’ve geolocated me, but they almost certainly monitor who I’m talking to online. Even if they don’t know what you’re saying, because it’s encrypted, they can still get a lot from who you’re talking to and when you’re talking to them.”

That is surely correct. But the same article also says:

“We have the means and we have the technology to end mass surveillance without any legislative action at all, without any policy changes.” The answer, he says, is robust encryption. “By basically adopting changes like making encryption a universal standard—where all communications are encrypted by default—we can end mass surveillance not just in the United States but around the world.”

That is false, for a myriad of reasons, and indeed is contradicted by the first excerpt I cited.

What privacy/surveillance commentators evidently keep forgetting is:

There are many kinds of privacy-destroying information. I think people frequently overlook just how many kinds there are.
Many kinds of organization capture that information, can share it with each other, and gain benefits from eroding or destroying privacy. Similarly, I think people overlook just how pervasive the incentive is to snoop.
Privacy is invaded through a variety of analytic techniques applied to that information.

So closing down a few vectors of privacy attack doesn’t solve the underlying problem at all.

Worst of all, commentators forget that the correct metric for danger is not just harmful information use, but chilling effects on the exercise of ordinary liberties. But in the interest of space, I won’t reiterate that argument in this post.

Perhaps I can refresh your memory why each of those bulleted claims is correct. Major categories of privacy-destroying information (raw or derived) include:

The actual content of your communications — phone calls, email, social media posts and more.
The metadata of your communications — who you communicate with, when, how long, etc.
What you read, watch, surf to or otherwise pay attention to.
Your purchases, sales and other transactions.
Video images, via stationary cameras, license plate readers in police cars, drones or just ordinary consumer photography.
Monitoring via the devices you carry, such as phones or medical monitors.
Your health and physical state, via those devices, but also inferred from, for example, your transactions or search engine entries.
Your state of mind, which can be inferred to various extents from almost any of the other information areas.
Your location and movements, ditto. Insurance companies also want to put monitors in cars to track your driving behavior in detail.

Of course, these categories overlap. For example, information about your movements can be derived not just from your mobile phone, but also from your transactions, from surveillance cameras, and from the health-monitoring devices that are likely to become much more pervasive in the future.

So who has reason to invade your privacy? Unfortunately, the answer boils down to “just about everybody”. In particular:

Any internet or telecom business would like to know, in great detail, what you are doing with their offerings, along with any other information that might influence what you’re apt to buy or do next.
Anybody who markets or sells to consumers wants to know similar things.
Similar things are true of anybody who worries about credit or insurance risk.
Anybody who worries about fraud wants to know who you’re connected to, and also wants to match you against any known patterns of fraud-related behavior.
Anybody who hires employees wants to know who might be likely to work hard, get sick or quit.
Similarly, they’d like to know who does or might engage in employee misconduct.
Medical researchers and caregivers have some of the most admirable reasons for wanting to violate privacy.

And that’s even without mentioning the most obvious suspects: law enforcement and national security of many kinds, who can be presumed, in at least certain cases, to be able to get any information that’s available to any other organization.

Finally, my sense is:

People appreciate the potential of fancy-schmantzy language and image recognition.
The graph analysis done on telecom metadata is so simple that people generally “get” what’s going on.
Despite all the “big data analytics” hype, commentators tend to forget just how powerful machine learning/predictive analytics privacy intrusions could be. Those psychographic clustering techniques devised to support advertising and personalization could be applied in much more sinister ways as well.

Related links

The crucial point about chilling effects was laid out in two July, 2013 posts, and some public policy recommendations around the same time. Those four posts are a great starting point for the non-technical “bottom line” part of the discussion. A January, 2014 post adds some more political context.
A January, 2011 post on the technology of privacy threats adds detail to many of the points above.
A February, 2014 post on various metadata-related confusions notes some egregious governmental spin.

Confusion about metadata

Curt Monash — Sun, 23 Feb 2014 06:50:05 +0000

A couple of points that arise frequently in conversation, but that I don’t seem to have made clearly online.

“Metadata” is generally defined as “data about data”. That’s basically correct, but it’s easy to forget how many different kinds of metadata there are. My list of metadata kinds starts with:

Data about data structure. This is the classical sense of the term. But please note:
- In a relational database, structural metadata is rather separate from the data itself.
- In a document database, each document might carry structure information with it.
Other inputs to core data management functions. Two major examples are:
- Column statistics that inform RDBMS optimizers.
- Value ranges that inform partition pruning or, more generally, data skipping.
Inputs to ancillary data management functions — for example, security privileges.
Support for human decisions about data — for example, information about authorship or lineage.
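
The "value ranges that inform data skipping" item is easy to show in miniature. In the Python sketch below, each block of rows carries its own minimum and maximum, and a scan consults that metadata before touching the rows. The `Block` class and function names are hypothetical; no particular DBMS stores its metadata exactly this way.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A chunk of stored rows plus its value-range metadata."""
    min_value: int
    max_value: int
    rows: list


def make_block(rows):
    """Build a block and record the value range of its rows."""
    return Block(min(rows), max(rows), rows)


def scan_equal(blocks, target):
    """Return matching rows and the number of blocks actually read."""
    hits, blocks_read = [], 0
    for block in blocks:
        if target < block.min_value or target > block.max_value:
            continue  # the range metadata proves no row here can match
        blocks_read += 1
        hits.extend(r for r in block.rows if r == target)
    return hits, blocks_read


blocks = [make_block([1, 4, 7]), make_block([10, 12, 19]), make_block([20, 25, 31])]
hits, blocks_read = scan_equal(blocks, 12)
```

Looking up the value 12 reads only the middle block; the other two are skipped on the strength of their recorded ranges alone.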

What’s worse, the past year’s most famous example of “metadata”, telephone call metadata, is misnamed. This so-called metadata, much loved by the NSA (National Security Agency), is just data, e.g. in the format of a CDR (Call Detail Record). Calling it metadata implies that it describes other data — the actual contents of the phone calls — that the NSA strenuously asserts don’t actually exist.

And finally, the first bullet point above has a counter-intuitive consequence — all common terminology notwithstanding, relational data is less structured than document data. Reasons include:

Relational databases usually just hold strings — or maybe numbers — with structural information being held elsewhere.
Some document databases store structural metadata right with the document data itself.
Some document databases store data in the form of (name, value) pairs. In some cases additional structure is imposed by naming conventions.
Actual text documents carry the structure imposed by grammar and syntax.

Related links

A lengthy survey of metadata kinds, biased to Hadoop (August, 2012)
Metadata as derived data (May, 2011)
Dataset management (May, 2013)
Structured/unstructured … multi-structured/poly-structured (May, 2011)

Some stuff I’m thinking about (early 2014)

Curt Monash — Sun, 02 Feb 2014 18:51:49 +0000

From time to time I like to do “what I’m working on” posts. From my recent blogging, you probably already know that includes:

Hadoop (always, and please see below).
Analytic RDBMS (ditto).
NoSQL and NewSQL.
Specifically, SQL-on-Hadoop.
Schema-on-need.
Spark and other memory-centric technology, including streaming.
Public policy, mainly but not only in the area of surveillance/privacy.
General strategic advice for all sizes of tech company.

Other stuff on my mind includes but is not limited to:

1. Certain categories of buying organizations are inherently leading-edge.

Internet companies have adopted Hadoop, NoSQL, NewSQL and all that en masse. Often, they won’t even look at things that are conventional or expensive.
US telecom companies have been buying 1 each of every DBMS on the market since pre-relational days.
Financial services firms — specifically algorithmic traders and broker-dealers — have been in their own technical world for decades …
… as have national-security agencies …
… as have pharmaceutical research departments.

Fine. But what really intrigues me is when more ordinary enterprises also put leading-edge technologies into production. I pester everybody for examples of that.

2. In particular, I hope to figure out where Hadoop is or soon will be getting major adoption.

Widespread Hadoop adoption at ordinary large enterprises is, I think, inevitable and imminent. But it hasn’t quite happened yet.
I think that part of the “enterprise data hub” story is a great bet to come true — Hadoop is becoming a key destination for data to land and be transformed. MapReduce was invented for data transformation; Hadoop was invented to do MapReduce; data transformation workloads have already been moving from expensive analytic RDBMS to cheaper Hadoop.
I also think Hadoop — enhanced with Spark or whatever — will win as a platform for sophisticated predictive modeling; Hadoop’s (and Spark’s) flexibility is at least as useful for the purpose as RDBMS’ SQL execution speed.
I’m still skeptical about ordinary enterprises’ adoption of Hadoop as a business intelligence platform, but it’s definitely another area to track.

3. Analytic RDBMS and data warehouse appliance pricing is always a big deal. Hadoop’s great price advantage doesn’t have to be permanent, and in fact there are a number of fairly low-cost RDBMS offerings, such as petascale Vertica, the Teradata 1000 series, or Infobright.

Speaking of that, it turns out Teradata now publishes per-terabyte pricing. Please note that those are uncompressed prices; actual prices can be assumed to be lower, at least for databases that compress well.

Analytic RDBMS prices are still shaking out.

4. As I previously noted, ensemble models have become the norm for machine learning. I want to learn more about the implications of that.

One conjecture — everything we learned in school about statistics is wrong, or at least it’s less important than we thought. Predictive modeling is not mainly about least squares, regressions, curve-fitting, etc. Rather, it’s first and foremost about data segmentation and clustering, with all the curve-fitting stuff being secondary.

Besides fitting — as it were — what I hear, this hypothesis also matches common sense. How do businesses use predictive modeling? For each customer/prospect/site-visitor/whatever, they decide which of a limited number of possible actions to take. At its core, that’s an exercise in segmentation.

5. I think data integration is getting a lot smarter than it was. Hadoop-based transformation is the obvious example. But there’s also ClearStory’s data intelligence pitch. (And yes, I know I need to talk with Paxata. There’s been a lot of ball-dropping on that one, including by me.)

6. There’s a meta-theme in the above — stuff that’s not exactly a DBMS or DBMS-like data store. Streaming fits into that. So does smart data integration. So, arguably, does Spark. So do data grids, another of those topics I’d like to know more about but haven’t nailed down yet.

Data management is getting ever more complex.

Trends in predictive modeling

Curt Monash — Fri, 20 Sep 2013 12:10:36 +0000

I talked with Teradata about a bunch of stuff yesterday, including this week’s announcements in in-database predictive modeling. The specific news was about partnerships with Fuzzy Logix and Revolution Analytics. But what I found more interesting was the surrounding discussion. In a nutshell:

Teradata is finally seeing substantial interest in in-database modeling, rather than just in-database scoring (which has been important for years) and in-database data preparation (which is a lot like ELT, or Extract/Load/Transform).
Teradata is seeing substantial interest in R.
It seems as if similar groups of customers are interested in both parts of that, such as:
- Usual-suspect consumer marketing sectors (telecom, credit card, retail).*
- Semiconductor manufacturing.**
Parallelized SAS modeling on Teradata seems to be limited by the small number of algorithms that are parallelized. (SAS scoring, I presume, is a different matter.)

This is the strongest statement of perceived demand for in-database modeling I’ve heard. (Compare Point #3 of my July predictive modeling post.) And it fits with what I’ve been hearing about R.

*That’s very similar to the list of sectors for SAS HPA.

**To support their extremely high focus on product quality, semiconductor manufacturers have been using state-of-the-art analytic tools for at least 30 years.

In-database modeling is a performance feature, and performance can have several kinds of benefit, which may be summarized as “cheaper”, “better”, and “previously impractical”. My impression is that in-database modeling is pretty far toward the “previously impractical” end of the spectrum; enterprises don’t adopt a new way of predictive modeling until they want to create models that the old way can’t get done.

Basically, I think that models are increasingly:

Richer and more diverse than before. (See for example Point #5 of my July predictive modeling post.)
Developed in a more experimental and quickly-iterative way than before.

I think the first point pretty much implies the second, but the converse isn’t as clear; one can tweak old-style models in quick-turnaround fashion even more easily than one can develop the more complex newer styles.

And finally: I’m not hearing that modeling — even when it’s parallel and in-database fast — is commonly done on a complete many-terabyte dataset. It’s not a question I always remember to ask; for example, I didn’t bring it up with Teradata. But when I do, I rarely hear of models being trained on more than a few terabytes of data each.

The Hemisphere program

Curt Monash — Tue, 03 Sep 2013 08:04:52 +0000

Another surveillance slide deck has emerged, as reported by the New York Times and other media outlets. This one is for the Hemisphere program, which apparently:

Stores CDRs (Call Detail Records), many or all of which are collected via …
… some kind of back door into the AT&T switches that many carriers use. (See Slide 2.)
Has also included “subscriber information” for AT&T phones since July, 2012.
Contains “long distance and international” CDRs back to 1987.
Currently adds 4 billion CDRs per day.
Is administered by a Federal drug-related law enforcement agency but …
… is used to combat many non-drug-related crimes as well. (See Slides 21-26.)

Other notes include:

The agencies specifically mentioned on Slide 16 as making numerous Hemisphere requests are the DEA (Drug Enforcement Administration) and DHS (Department of Homeland Security).
“Roaming” data giving city/state is mentioned in the deck, but more precise geo-targeting is not.

I’ve never gotten a single consistent figure, but typical CDR size seems to be in the 100s of bytes range. So I conjecture that Project Hemisphere spawned one of the first petabyte-scale databases ever.
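
The back-of-the-envelope arithmetic behind that conjecture can be written out as follows. The 250-byte CDR size is an assumption standing in for "100s of bytes", and the calculation assumes the current 4 billion/day rate holds constant, which is one of the unknowns about the program.

```python
CDRS_PER_DAY = 4_000_000_000  # current daily rate cited from the slide deck
BYTES_PER_CDR = 250           # assumption: "100s of bytes"; no exact figure is known

bytes_per_day = CDRS_PER_DAY * BYTES_PER_CDR
bytes_per_year = bytes_per_day * 365

TB = 10**12
PB = 10**15

tb_per_day = bytes_per_day / TB      # about 1 TB of raw CDRs per day
pb_per_year = bytes_per_year / PB    # about 0.365 PB per year
days_to_one_pb = PB / bytes_per_day  # about 1000 days to reach a petabyte
```

Under those assumptions the raw data crosses a petabyte in under three years at the current rate, before any indexing, replication or compression is taken into account.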

Hemisphere Project unknowns start:

Is that “back door into AT&T switches” inference really reliable? (I’m basing it on just a few words in the deck, and such decks can have inaccuracies in them.)
Just which calls’ metadata is currently being collected?
How long has this approximate rate of CDR collection been going on; can we just extrapolate back from the current 4 billion calls/day?

It seems that a primary use case for Project Hemisphere is to guess what phone numbers baddies are using, especially those of disposable “burner” cell phones that are otherwise very hard to trace. (The key benefit mentioned to such analysis is that those new phones can then be tapped.) There aren’t many details as to how the phone numbers are inferred, but since almost nothing is initially known about the target phone numbers except calling patterns, those are surely a huge part of the puzzle. In particular, it doesn’t seem to have been disclosed which other databases, if any, are linked into the analysis. There is no hint in the deck that the Hemisphere program directly collects telephone call contents. Rather, it’s used to help determine which telephone numbers to tap.

The government apparently trains its people to keep Hemisphere secret, to the point of lying about it, even though Slide 2 states that Hemisphere is “an unclassified program”.

Slides 8-12 generally emphasize the Hemisphere program’s secrecy.
Slide 10 seems to advocate outright deception. Specifically — and this is both complicated and ironic — it seems to say that the government should get subpoenas for information it already had without subpoena, so that those subpoenas can be the claimed source of the information when applying for yet other subpoenas.

So it seems as if Hemisphere is yet another example of the pattern:

The US government has long lied about how far it invades privacy …
… and about the assistance it receives from the telecom/technology industry in doing so.
Little tangible harm has been done by those invasions, except to those who clearly deserved it.

Up to a point, this is reassuring. But it still bodes badly for a future in which there are many more ways surveillance can be used to hurt us than were possible before.

Hortonworks business notes

Curt Monash — Sat, 24 Aug 2013 11:07:53 +0000

Hortonworks did a business-oriented round of outreach, talking with at least Derrick Harris and me. Notes from my call — for which Rob Bearden didn’t bother showing up — include, in no particular order:

Hortonworks denies advanced acquisition discussions with either Microsoft or Intel. Of course, that doesn’t exactly contradict the widespread story of Intel having made an acquisition offer. Edit: I have subsequently heard, very credibly, that the denial was untrue.
As vendors usually do, Hortonworks denies the extreme forms of Cloudera’s suggestion that Hortonworks’ competitive wins relate to price slashing. But Hortonworks does believe that its license fees often wind up being lower than Cloudera’s, due especially to Hortonworks offering fewer extra-charge items than Cloudera.
Hortonworks used a figure of ~75 subscription customers. Edit: That figure turns out in retrospect to have been inflated. This does not include OEM sales through, for example, Teradata, Microsoft Azure, or Rackspace. However, that does include …
… a small number of installations hosted in the cloud — e.g. ~2 on Amazon Web Services — or otherwise remotely. Also, testing in the cloud seems to be fairly frequent, and the cloud can also be a source of data ingested into Hadoop.
Since Hortonworks made it seem a couple of times that Rackspace was an important partner, behind only Teradata and Microsoft, I finally asked why. Answers boiled down to a Rackspace Hadoop-as-a-service offering, plus joint work to improve Hadoop-on-OpenStack.
Other Hortonworks reseller partners seem more important in terms of helping customers consume HDP (Hortonworks Data Platform), rather than for actually doing Hortonworks’ selling for it. (This is unsurprising — channel sales rarely are a path to success for a product that is also appropriately sold by a direct force.)
Hortonworks listed its major industry sectors as:
- Web and retailing, which it identifies as one thing.
- Media.
- Telecommunications.
- Health care (various subsectors).
- Financial services, which it called “competitive” in the kind of tone that usually signifies “we lose a lot more than we win, and would love to change that”.

In Hortonworks’ view, Hadoop adopters typically start with a specific use case around a new type of data, such as clickstream, sensor, server log, geolocation, or social.

These use cases can be any of a true new application, an enhancement to an existing application, or a general investigative analytics environment.
This adoption is typically driven by a line-of-business group, but IT is a key influencer, and IT usually winds up running the project.
Overall, this accounts for 70% of Hortonworks’ business by some metric.

The other 30% Hortonworks sees is efficiency-oriented — i.e., a cheaper way to store and/or process data.

Hortonworks assigns ELT (Extract/Load/Transform) to this group. Based in part on a subsequent conversation with Cloudera, I gather that batch ELT offload — especially but not only from large Teradata installations — is a significant fraction of the total.
“Data lake” and similar buzzwords fall into this group, as does “re-architecting”.
Hortonworks asserts that adopters from the 70% rapidly move to this kind of use as well, while Teradata customers typically start out in this part.
Unsurprisingly, this part is IT all the way.

One customer apparently estimates its fully burdened Hadoop costs at $900/terabyte/year.

Edit: I followed up on these efficiency-oriented use cases in a conversation with Cloudera.

And finally: One of my favorite things to ask is “When you win, why do you win?”, at least when I think the vendor won’t just reiterate their core marketing messages. Hortonworks gave a great, threefold answer:

Its relationships with Teradata, Microsoft, et al.
Its promise that it can get specific customer-requested features into Apache Hadoop on a specific timeframe. (Yes, the Contribution Olympics are still with us.)
Its claim of greater experience with truly huge clusters — not just Yahoo, but I don’t know who its other examples are.

Related link

A few weeks ago, I talked with Hortonworks at length about technology and other subjects.

The refactoring of everything

Curt Monash — Sat, 20 Jul 2013 16:13:02 +0000

I’ll start with three observations:

Computer systems can’t be entirely tightly coupled — nothing would ever get developed or tested.
Computer systems can’t be entirely loosely coupled — nothing would ever get optimized, in performance and functionality alike.
In an ongoing trend, there is and will be dramatic refactoring as to which connections wind up being loose or tight.

As written, that’s probably pretty obvious. Even so, it’s easy to forget just how pervasive the refactoring is and is likely to be. Let’s survey some examples first, and then speculate about consequences.

NoSQL and schema-on-read.
- The main point of relational DBMS is the Ted Codd guarantee, which says that applications and database designs can be loosely coupled. The price is that database designs for different applications are tightly coupled into one comprehensive schema.
- NoSQL and dynamic schemas turn that around. For any one application, application design and database design are tightly coupled; but database designs for different apps are often unrelated.
- What you think of these alternatives probably has a lot to do with what you think about separating the developer and DBA job functions. If you like that separation, the relational approach should look good; if you don’t, then dynamic schemas may be more suitable.
BI with dedicated data stores. Instead of just running against relational DBMS, various business intelligence tools feature proprietary data stores, often memory-centric. Two examples I’ve written about are Platfora and QlikView, but “in-memory BI” goes far beyond those two vendors.
BI integrated into operational apps. A trend that’s been developing for years is the tight integration of BI into operational apps. I’ve written mainly about Workday’s version of this, but it’s at least as big an issue going forward in competition among SAP, Oracle, and most other application vendors you can think of.
SAP HANA and competitors. It’s an understatement to call SAP HANA “overhyped”. But technology will surely someday get to the point that SAP implies is already here, with a lot of the silo-merging that that suggests. For databases that grow at much less than Moore’s Law speeds, it will be possible to integrate in-memory database capabilities that previously called for a variety of disk-based specialty systems.
Analytic application subsystems. Customer-facing analytic applications will have a whole different standard of completeness than more traditional back-office transactional ones. The base case is analytic subsystems loosely coupled to a variety of front-end technologies.
Other “real-time analytics”. I expected the short-request/analytic distinction to blur, but even so I’m astonished by the number of NoSQL and NewSQL vendors who’ve adopted “real-time analytics” as a core message. (For more on that, I refer you again to my recent webinar on the topic.)
Appliances. For a variety of technical and business reasons, vendors love selling appliances, aka “engineered systems”. Frankly, some of the integrations between hardware, operating system, and other software are tighter than others, so only some of this is a true refactoring. Anyhow, appliance stories can be heard from a large fraction of the computer industry, including for example:
- Apple and other mobile device makers.
- IBM — PureThisAndThat.
- Oracle — ExaEverything.
- Teradata.
- Microsoft — Xbox, Surface, Parallel Data Warehouse, etc.
- SAP — HANA appliances
- The whole telecom equipment industry.
Cluster management. We have entered the era of cluster computing. This has several consequences:
- Software designed to run on dedicated clusters often has cluster management software integrated in.
- Virtualization has evolved to break the old pairing between applications and the specific servers on which they run.
- OpenStack and similar cloud stacks are trying to take the evolution further.
SaaS/IaaS/DBaaS/PaaS. Software as a Service, in all its acronymic variations, can integrate software, hardware, the surrounding bricks and mortar, service, and everything else. Conversely, different SaaS systems can be a lot more stand-offish from each other than multiple vendors’ packaged apps, all running in the same data center, perhaps even on the same DBMS.
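
To make the first example above (NoSQL and schema-on-read) a bit more concrete, here is a minimal Python sketch of the two coupling patterns. Everything in it is an illustrative assumption rather than anything from a real system: the `customer` table, the `raw_documents` list, and the field names `name` and `full_name` are all made up, and an in-memory SQLite database plus a list of JSON strings merely stand in for a relational DBMS and a document store.

```python
import json
import sqlite3

# Schema-on-write (relational): one table definition is fixed up front and
# shared by every application. The "customer" table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customer (id, name) VALUES (?, ?)", (1, "Ada"))


def relational_names(connection):
    """The application names only the columns it needs, nothing else."""
    rows = connection.execute("SELECT name FROM customer ORDER BY id")
    return [row[0] for row in rows]


# Schema-on-read (dynamic schema): documents are stored as-is, and each
# application imposes its own structure when reading. Field names are
# hypothetical.
raw_documents = [
    json.dumps({"id": 1, "name": "Ada", "clicks": [3, 5]}),
    json.dumps({"id": 2, "full_name": "Grace"}),  # different shape, still stored
]


def document_names(documents):
    """The reader, not the store, decides how to interpret each document."""
    names = []
    for raw in documents:
        doc = json.loads(raw)
        names.append(doc.get("name") or doc.get("full_name"))
    return names


# A database design change (a new column) leaves the relational
# application code untouched: application and database design are loosely coupled.
conn.execute("ALTER TABLE customer ADD COLUMN email TEXT")
print(relational_names(conn))
print(document_names(raw_documents))
```

In the relational half, the shared schema absorbs the change and the query keeps working. In the document half, nothing stops a second shape of record from being stored, but the application code has to know about both shapes, which is the tight application/database coupling described above.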

I could keep going on for a while; for example, I haven’t said anything yet about “intelligent storage”; indeed, I haven’t even mentioned analytic platforms or their SQL-on-Hadoop cousins. But hopefully I’ve run through enough different cases to justify the slightly hyperbolic title of this post.

So what are some possible implications? My candidates start:

As previously noted, I expect most computing to eventually wind up on a combination of appliances, clusters and/or clouds. That applies even to organizations whose workloads are small enough to run on single servers, because most of their computing (except for personal devices) will either be SaaS, or else the kind of public-facing internet applications that already tend to be in the cloud.
Also as previously noted, I expect traditional databases — i.e. ones that focus on human-generated data — to eventually wind up in RAM. I imagine that there will be both relational and dynamic-schema APIs to the memory-centric DBMS that manage them.
The current and near-future technology stacks underneath application suites such as Oracle’s, SAP’s or Infor’s are of little importance to their ultimate success. (Yes, that applies even to the wonder of HANA.) Much more significant will be the subsequent cloud/SaaS generation.
Similarly, I think there is plenty of opportunity for large new application software companies, SaaS or not as the case may be, just as there usually is in connection with major technological change.
Notwithstanding the various points of integration between analytics and short-request processing, there will also be analytics-only technology stacks for a long time to come.

The IT industry seems likely to remain interesting for a long time to come.