The genesis of this post is that:
- Hortonworks is trying to revitalize the Apache Storm project, after Storm lost momentum; indeed, Hortonworks is referring to Storm as a component of Hadoop.
- Cloudera is talking up what I would call its human real-time strategy, which includes but is not limited to Flume, Kafka, and Spark Streaming. Cloudera also sees a few use cases for Storm.
- This all fits with my view that the Current Hot Subject is human real-time data freshness — for analytics, of course, since we’ve always had low latencies in short-request processing.
- This also all fits with the importance I place on log analysis.
- Cloudera reached out to talk to me about all this.
Of course, we should hardly assume that what the Hadoop distro vendors favor will be the be-all and end-all of streaming. But they are likely to at least be influential players in the area.
In the parts of the problem that Cloudera emphasizes, the main tasks that need to be addressed are: Read more
|Categories: Cloudera, Complex event processing (CEP), Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Health care, Hortonworks, Log analysis, Specific users, Splunk, Web analytics||1 Comment|
1. I wish I had some good, practical ideas about how to make a political difference around privacy and surveillance. Nothing else we discuss here is remotely as important. I presumably can contribute an opinion piece to, more or less, the technology publication(s) of my choice; that can have a small bit of impact. But I’d love to do better than that. Ideas, anybody?
2. A few thoughts on cloud, colocation, etc.:
- The economies of scale of colocation-or-cloud over operating your own data center are compelling. Most of the reasons you outsource hardware manufacture to Asia also apply to outsourcing data center operation within the United States. (The one exception I can think of is supply chain.)
- The arguments for cloud specifically over colocation are less persuasive. Colo providers can even match cloud deployments in rapid provisioning and elastic pricing, if they so choose.
- Surely not coincidentally, I am told that Rackspace is deemphasizing cloud, reemphasizing colocation, and making a big deal out of Open Compute. In connection with that, Rackspace has pulled back from its leadership role in OpenStack.
- I’m hearing much more mention of Amazon Redshift than I used to. It seems to have a lot of traction as a simple and low-cost option.
- I’m hearing less about Elastic MapReduce than I used to, although I imagine usage is still large and growing.
- In general, I get the impression that progress is being made in overcoming the inherent difficulties in cloud (and even colo) parallel analytic processing. But it all still seems pretty vague, except for the specific claims being made for traction of Redshift, EMR, and so on.
- Teradata recently told me that in colocation pricing, it is common for floor space to be everything, with power not separately metered. But I don’t think that trend is a big deal, as it is not necessarily permanent.
- Cloud hype is of course still with us.
- Other than the above, I stand by my previous thoughts on appliances, clusters and clouds.
3. As for the analytic DBMS industry: Read more
Many of the companies I talk with boast of freeing business analysts from reliance on IT. This, to put it mildly, is not a unique value proposition. As I wrote in 2012, when I went on a history of analytics posting kick,
- Most interesting analytic software has been adopted first and foremost at the departmental level.
- People seem to be forgetting that fact.
In particular, I would argue that the following analytic technologies started and prospered largely through departmental adoption:
- Fourth-generation languages (the analytically-focused ones, which in fact started out being consumed on a remote/time-sharing basis)
- Electronic spreadsheets
- 1990s-era business intelligence
- Fancy-visualization business intelligence
- Predictive analytics
- Text analytics
- Rules engines
What brings me back to the topic is conversations I had this week with Paxata and Metanautix. The Paxata story starts:
- Paxata is offering easy — and hopefully in the future comprehensive — “data preparation” tools …
- … that are meant to be used by business analysts rather than ETL (Extract/Transform/Load) specialists or other IT professionals …
- … where what Paxata means by “data preparation” is not specifically what a statistician would mean by the term, but rather generally refers to getting data ready for business intelligence or other analytics.
Metanautix seems to aspire to a more complete full-analytic-stack-without-IT kind of story, but clearly sees the data preparation part as a big part of its value.
If there’s anything new about such stories, it has to be on the transformation side; BI tools have been helping with data extraction since — well, since the dawn of BI. Read more
|Categories: Business intelligence, Datameer, EAI, EII, ETL, ELT, ETLT, Predictive modeling and advanced analytics, Progress, Apama, and DataDirect||8 Comments|
A significant fraction of IT professional services industry revenue comes from data integration. But as a software business, data integration has been more problematic. Informatica, the largest independent data integration software vendor, does $1 billion in revenue. INFA’s enterprise value (market capitalization after adjusting for cash and debt) is $3 billion, which puts it way short of other category leaders such as VMware, and even sits behind Tableau.* When I talk with data integration startups, I ask questions such as “What fraction of Informatica’s revenue are you shooting for?” and, as a follow-up, “Why would that be grounds for excitement?”
*If you believe that Splunk is a data integration company, that changes these observations only a little.
On the other hand, several successful software categories have, at particular points in their history, been focused on data integration. One of the major benefits of 1990s business intelligence was “Combines data from multiple sources on the same screen” and, in some cases, even “Joins data from multiple sources in a single view”. The last few years before application servers were commoditized, data integration was one of their chief benefits. Data warehousing and Hadoop both of course have a “collect all your data in one place” part to their stories — which I call data mustering — and Hadoop is a data transformation tool as well.
I’m commonly asked to assess vendor claims of the kind:
- “Our system lets you do multiple kinds of processing against one database.”
- “Otherwise you’d need two or more data managers to get the job done, which would be a catastrophe of unthinkable proportion.”
So I thought it might be useful to quickly review some of the many ways organizations put multiple data stores to work. As usual, my bottom line is:
- The most extreme vendor marketing claims are false.
- There are many different choices that make sense in at least some use cases each.
Horses for courses
It’s now widely accepted that different data managers are better for different use cases, based on distinctions such as:
- Short-request vs. analytic.
- SQL vs. non-SQL (NoSQL or otherwise).
- Expensive/heavy-duty vs. cheap/easy-to-support.
Vendors are part of this consensus; already in 2005 I observed
For all practical purposes, there are no DBMS vendors left advocating single-server strategies.
Vendor agreement has become even stronger in the interim, as evidenced by Oracle/MySQL, IBM/Netezza, Oracle’s NoSQL dabblings, and various companies’ Hadoop offerings.
Multiple data stores for a single application
We commonly think of one data manager managing one or more databases, each in support of one or more applications. But the other way around works too; it’s normal for a single application to invoke multiple data stores. Indeed, all but the strictest relational bigots would likely agree: Read more
The name of this blog comes from an August, 2005 column. 8 1/2 years later, that analysis holds up pretty well. Indeed, I’d keep the first two precepts exactly as I proposed back then:
- Task-appropriate data managers. Much of this blog is about task-appropriate data stores, so I won’t say more about them in this post.
- Drastic limitations on relational schema complexity. I think I’ve been vindicated on that one by, for example:
- NoSQL and dynamic schemas.
- Schema-on-read, and its smarter younger brother schema-on-need.
- Limitations on the performance and/or allowed functionality of joins in scale-out short-request RDBMS, and the relative lack of complaints about same.
- Funky database design from major Software as a Service (SaaS) vendors such as Workday and Salesforce.com.
- A whole lot of logs.
I’d also keep the general sense of the third precept, namely appropriately-capable data integration, but for that one the specifics do need some serious rework.
For starters, let me say: Read more
|Categories: About this blog, Business intelligence, Database diversity, EAI, EII, ETL, ELT, ETLT, Investment research and trading, NoSQL, Schema on need||2 Comments|
For quite some time, one of the most frequent marketing pitches I’ve heard is “Analytics made easy for everybody!”, where by “quite some time” I mean “over 30 years”. “Uniquely easy analytics” is a claim that I meet with the greatest of skepticism.* Further confusing matters, these claims are usually about what amounts to business intelligence tools, but vendors increasingly say “Our stuff is better than the BI that came before, so we don’t want you to call it ‘BI’ as well.”
*That’s even if your slide deck doesn’t contain a picture of a pyramid of user kinds; if there actually is such a drawing, then the chance that I believe you is effectively nil.
All those caveats notwithstanding, there are indeed at least three forms of widespread analytics:
- Fairly standalone, eas(ier) to use business intelligence tools, sometimes marketed as focusing on “data exploration” or “data discovery”.
- Charts and graphs integrated or at least well-embedded into production applications. This technology is on a long-term rise. But in some sense, integrated reporting has been around since the invention of accounting.
- Predictive analytics built into automated systems, for example ad selection. This is not what is usually meant by the “easy analytics” claim, and I’ll say no more about it in this post.
It would be nice to say that the first two bullet points represent a fairly clean operational/investigative BI split, but that would be wrong; human real-time dashboards can at once be standalone and operational.
|Categories: Business intelligence, Data integration and middleware, Data warehousing||Leave a Comment|
Ever more products try to integrate SQL with Hadoop, and discussions of them seem confused, in line with Monash’s First Law of Commercial Semantics. So let’s draw some distinctions, starting with (and these overlap):
- Are the SQL engine and Hadoop:
- Necessarily on the same cluster?
- Necessarily or at least most naturally on different clusters?
- How, if at all, is Hadoop invoked by the SQL engine? Specifically, what is the role of:
- HDFS (Hadoop Distributed File System)?
- Hadoop MapReduce?
- How, if at all, is the SQL engine invoked by Hadoop?
- If something is called a “connector”, then Hadoop and the SQL engine are most likely on separate clusters. Good features include (but these can partially contradict each other):
- A way of making data transfer maximally parallel.
- Query planning that is smart about when to process on the SQL engine and when to use Hadoop’s native SQL (Hive or otherwise).
- If something is called “SQL-on-Hadoop”, then Hadoop and the SQL engine are or should be on the same cluster, using the same nodes to store and process data. But while that’s a necessary condition, I’d prefer that it not be sufficient.
Let’s go to some examples. Read more
|Categories: Cloudera, Data integration and middleware, EAI, EII, ETL, ELT, ETLT, Hadapt, Hadoop, HBase, Hortonworks, MapReduce, Microsoft and SQL*Server, NewSQL, PostgreSQL, SQL/Hadoop integration, Teradata||33 Comments|
From time to time I like to do “what I’m working on” posts. From my recent blogging, you probably already know that includes:
- Hadoop (always, and please see below).
- Analytic RDBMS (ditto).
- NoSQL and NewSQL.
- Specifically, SQL-on-Hadoop
- Spark and other memory-centric technology, including streaming.
- Public policy, mainly but not only in the area of surveillance/privacy.
- General strategic advice for all sizes of tech company.
Other stuff on my mind includes but is not limited to:
1. Certain categories of buying organizations are inherently leading-edge.
- Internet companies have adopted Hadoop, NoSQL, NewSQL and all that en masse. Often, they won’t even look at things that are conventional or expensive.
- US telecom companies have been buying 1 each of every DBMS on the market since pre-relational days.
- Financial services firms — specifically algorithmic traders and broker-dealers — have been in their own technical world for decades …
- … as have national-security agencies …
- … as have pharmaceutical research departments.
Fine. But what really intrigues me is when more ordinary enterprises also put leading-edge technologies into production. I pester everybody for examples of that.
I’ve heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some sources of confusion.
- Spark is very new. All Spark adoption is recent.
- Databricks was founded to commercialize Spark. It is very much in stealth mode …
- … except insofar as Databricks folks are going out and trying to drum up Spark adoption.
- Ion Stoica is running Databricks, but you couldn’t tell that from his UC Berkeley bio page. Edit: After I posted this, Ion’s bio was quickly updated.
- Spark creator and Databricks CTO Matei Zaharia is an MIT professor, but actually went on leave there before he ever showed up.
- Cloudera is perhaps Spark’s most visible supporter. But Cloudera’s views of Spark’s role in the world is different from the Spark team’s.
The “What is Spark?” question may soon be just as difficult as the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like this:
- Spark is a distributed execution engine for analytic processes …
- … which works well with Hadoop.
- Spark is distinguished by a flexible in-memory data model …
- … and farms out persistence to HDFS (Hadoop Distributed File System) or other existing data stores.
- Intended analytic use cases for Spark include:
- SQL data manipulation.
- ETL-like data manipulation.
- Streaming-like data manipulation.
- Machine learning.
- Graph analytics.