Some stuff I’m thinking about (early 2014)
From time to time I like to do “what I’m working on” posts. From my recent blogging, you probably already know that includes:
- Hadoop (always, and please see below).
- Analytic RDBMS (ditto).
- NoSQL and NewSQL.
- Specifically, SQL-on-Hadoop.
- Schema-on-need.
- Spark and other memory-centric technology, including streaming.
- Public policy, mainly but not only in the area of surveillance/privacy.
- General strategic advice for all sizes of tech company.
Other stuff on my mind includes but is not limited to:
1. Certain categories of buying organizations are inherently leading-edge.
- Internet companies have adopted Hadoop, NoSQL, NewSQL and all that en masse. Often, they won’t even look at things that are conventional or expensive.
- US telecom companies have been buying 1 each of every DBMS on the market since pre-relational days.
- Financial services firms — specifically algorithmic traders and broker-dealers — have been in their own technical world for decades …
- … as have national-security agencies …
- … as have pharmaceutical research departments.
Fine. But what really intrigues me is when more ordinary enterprises also put leading-edge technologies into production. I pester everybody for examples of that.
2. In particular, I hope to figure out where Hadoop is or soon will be getting major adoption.
- Widespread Hadoop adoption at ordinary large enterprises is, I think, inevitable and imminent. But it hasn’t quite happened yet.
- I think that part of the “enterprise data hub” story is a great bet to come true — Hadoop is becoming a key destination for data to land and be transformed. MapReduce was invented for data transformation; Hadoop was invented to do MapReduce; data transformation workloads have already been moving from expensive analytic RDBMS to cheaper Hadoop.
- I also think Hadoop — enhanced with Spark or whatever — will win as a platform for sophisticated predictive modeling; Hadoop’s (and Spark’s) flexibility is at least as useful for the purpose as RDBMS’ SQL execution speed.
- I’m still skeptical about ordinary enterprises’ adoption of Hadoop as a business intelligence platform, but it’s definitely another area to track.
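To make the data-transformation point concrete, here is a minimal sketch of a map/shuffle/reduce job in plain Python. It is not Hadoop code and uses no Hadoop API; the input format (`customer_id,amount`), the sample lines, and all function names are assumptions made up for illustration. It only shows the shape of the workload: parse messy raw records, drop the bad ones, and aggregate by key.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical raw input lines in an assumed "customer_id,amount" format.
raw_lines = ["c1,10.0", "c2,5.5", "c1,2.5", "bad line", "c2,4.5"]


def map_phase(line):
    """Parse one raw line into zero or one (key, value) pairs."""
    parts = line.split(",")
    if len(parts) != 2:
        return []  # malformed record: drop it
    try:
        return [(parts[0], float(parts[1]))]
    except ValueError:
        return []  # amount is not numeric: drop it


def shuffle(pairs):
    """Group values by key, as the framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, reduce(lambda a, b: a + b, values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


totals = run_job(raw_lines)
print(totals)
```

In a real cluster the map and reduce steps would run in parallel across many machines; the sketch keeps everything in one process so the structure is easy to see.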
3. Analytic RDBMS and data warehouse appliance pricing is always a big deal. Hadoop’s great price advantage doesn’t have to be permanent, and in fact there are a number of fairly low-cost RDBMS offerings, such as petascale Vertica, the Teradata 1000 series, or Infobright.
Speaking of that, it turns out Teradata now publishes per-terabyte pricing. Please note that those prices do not account for compression; effective prices can be assumed to be lower, at least for databases that compress well.
Analytic RDBMS prices are still shaking out.
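The compression caveat above can be put as simple arithmetic. This sketch assumes the published figure is a price per terabyte of storage with no compression applied, and that a compression ratio of N means N terabytes of user data fit in one terabyte of storage. The function name and the numbers are hypothetical and are not taken from any vendor's price list.

```python
def effective_price_per_tb(list_price_per_tb, compression_ratio):
    """Price per terabyte of user data once compression is factored in.

    Assumes list_price_per_tb is quoted against uncompressed capacity and
    compression_ratio is user-data size divided by on-disk size.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return list_price_per_tb / compression_ratio


# Hypothetical numbers, purely for illustration.
list_price = 12000.0  # assumed published price per terabyte
ratio = 3.0           # assumed 3:1 compression

print(effective_price_per_tb(list_price, ratio))
```

With a ratio of 1.0 (no compression) the effective price equals the list price, which is the sense in which the published figures are an upper bound.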
4. As I previously noted, ensemble models have become the norm for machine learning. I want to learn more about the implications of that.
One conjecture — everything we learned in school about statistics is wrong, or at least it’s less important than we thought. Predictive modeling is not mainly about least squares, regressions, curve-fitting, etc. Rather, it’s first and foremost about data segmentation and clustering, with all the curve-fitting stuff being secondary.
Besides fitting — as it were — what I hear, this hypothesis also matches common sense. How do businesses use predictive modeling? For each customer/prospect/site-visitor/whatever, they decide which of a limited number of possible actions to take. At its core, that’s an exercise in segmentation.
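As a rough illustration of that conjecture, the sketch below treats the decision as pure segmentation: each customer is assigned to the nearest of a few segment centers, and the segment alone determines the action. The segment names, center coordinates, feature meanings, and actions are all invented for the example; in practice the centers would come from a clustering step rather than being hard-coded.

```python
import math

# Hypothetical segment centers in an assumed two-feature space:
# (normalized recent activity, normalized spend).
SEGMENTS = {
    "high_value": (0.9, 0.8),
    "at_risk": (0.2, 0.7),
    "casual": (0.3, 0.1),
}

# One action per segment: the "limited number of possible actions".
ACTIONS = {
    "high_value": "offer_upgrade",
    "at_risk": "send_retention_discount",
    "casual": "do_nothing",
}


def assign_segment(features):
    """Return the name of the segment whose center is closest."""
    return min(SEGMENTS, key=lambda name: math.dist(features, SEGMENTS[name]))


def choose_action(features):
    """Pick the action for a customer based only on its segment."""
    return ACTIONS[assign_segment(features)]


print(choose_action((0.85, 0.9)))
print(choose_action((0.1, 0.75)))
print(choose_action((0.25, 0.05)))
```

Note that no curve is fitted anywhere; the only modeling decision is where the segment boundaries fall.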
5. I think data integration is getting a lot smarter than it was. Hadoop-based transformation is the obvious example. But there’s also ClearStory’s data intelligence pitch. (And yes, I know I need to talk with Paxata. There’s been a lot of ball-dropping on that one, including by me.)
6. There’s a meta-theme in the above — stuff that’s not exactly a DBMS or DBMS-like data store. Streaming fits into that. So does smart data integration. So, arguably, does Spark. So do data grids, another of those topics I’d like to know more about but haven’t nailed down yet.
Data management is getting ever more complex.
Comments
3 Responses to “Some stuff I’m thinking about (early 2014)”
Can you elaborate on “widespread Hadoop adoption at ordinary large enterprises”? For example, do you expect them to write MapReduce jobs? If their primary use is SQL, R and packaged apps, then do you see Hadoop getting better on ease of use/management faster than proprietary vendors get at reducing cost?
Mark,
1. People these days use a LOT of languages, programming frameworks, add-on execution frameworks, whatever. I wouldn’t want to anoint any 1 or 2 of those as the expected dominant winner, beyond such obvious points as there will be a lot of SQL for a long time.
2. In particular, Hadoop doesn’t necessarily imply MapReduce. Spark, for example, has 15 primitives, 2 of which are Map and Reduce. Hadoop 2 has the capacity for multiple execution engines. Etc.
3. Most of the major analytic RDBMS are now owned by big enterprise technology companies with classic enterprise technology cost structures. I’d still argue that Greenplum and Vertica, for example, should be offered at low prices. But I’m probably hearing fewer “those guys at Greenplum are buying business” complaints than I used to. (And I’m not going to say more about Vertica pricing at all, due to client confidentiality.)
Hi Curt,
Have you come across implementations of an enterprise data hub or Hadoop-based transformations? What has been learned from such implementations?