From time to time I like to do “what I’m working on” posts. From my recent blogging, you probably already know that my focus includes:
- Hadoop (always, and please see below).
- Analytic RDBMS (ditto).
- NoSQL and NewSQL.
- Specifically, SQL-on-Hadoop.
- Spark and other memory-centric technology, including streaming.
- Public policy, mainly but not only in the area of surveillance/privacy.
- General strategic advice for all sizes of tech company.
Other stuff on my mind includes but is not limited to:
1. Certain categories of buying organizations are inherently leading-edge.
- Internet companies have adopted Hadoop, NoSQL, NewSQL and all that en masse. Often, they won’t even look at things that are conventional or expensive.
- US telecom companies have been buying one each of every DBMS on the market since pre-relational days.
- Financial services firms — specifically algorithmic traders and broker-dealers — have been in their own technical world for decades …
- … as have national-security agencies …
- … as have pharmaceutical research departments.
Fine. But what really intrigues me is when more ordinary enterprises also put leading-edge technologies into production. I pester everybody for examples of that.
2. In particular, I hope to figure out where Hadoop is or soon will be getting major adoption.
- Widespread Hadoop adoption at ordinary large enterprises is, I think, inevitable and imminent. But it hasn’t quite happened yet.
- I think that part of the “enterprise data hub” story is a great bet to come true — Hadoop is becoming a key destination for data to land and be transformed. MapReduce was invented for data transformation; Hadoop was invented to do MapReduce; data transformation workloads have already been moving from expensive analytic RDBMS to cheaper Hadoop.
- I also think Hadoop — enhanced with Spark or whatever — will win as a platform for sophisticated predictive modeling; Hadoop’s (and Spark’s) flexibility is at least as useful for the purpose as RDBMS’ SQL execution speed.
- I’m still skeptical about ordinary enterprises’ adoption of Hadoop as a business intelligence platform, but it’s definitely another area to track.
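To make the data-transformation point concrete, here is a minimal, framework-free sketch of the map/shuffle/reduce pattern in plain Python. The log format, field names, and numbers are all made up for illustration; real Hadoop jobs distribute each phase across a cluster, but the contract is the same:

```python
from collections import defaultdict

def map_phase(line):
    """Map step: parse one raw log line and emit a (user, bytes) pair."""
    user, _url, nbytes = line.split()
    yield user, int(nbytes)

def shuffle(pairs):
    """Shuffle step: group mapped values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: aggregate each user's total bytes."""
    return key, sum(values)

# Hypothetical raw web-log lines "landing" before transformation.
logs = [
    "alice /home 200",
    "bob /search 150",
    "alice /cart 300",
]

mapped = (pair for line in logs for pair in map_phase(line))
totals = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(totals)  # {'alice': 500, 'bob': 150}
```

The point of the sketch is how little of it is query logic: most of a transformation workload is exactly this kind of parse-group-aggregate plumbing, which is why it ports readily from an expensive analytic RDBMS to cheaper Hadoop.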
3. Analytic RDBMS and data warehouse appliance pricing is always a big deal. Hadoop’s great price advantage doesn’t have to be permanent, and in fact there are a number of fairly low-cost RDBMS offerings, such as petascale Vertica, the Teradata 1000 series, or Infobright.
Speaking of that, it turns out Teradata now publishes per-terabyte pricing. Please note that those figures are quoted per terabyte of uncompressed data; effective prices per terabyte actually stored are lower, at least for databases that compress well.
Analytic RDBMS prices are still shaking out.
4. As I previously noted, ensemble models have become the norm for machine learning. I want to learn more about the implications of that.
One conjecture — everything we learned in school about statistics is wrong, or at least it’s less important than we thought. Predictive modeling is not mainly about least squares, regressions, curve-fitting, etc. Rather, it’s first and foremost about data segmentation and clustering, with all the curve-fitting stuff being secondary.
Besides fitting — as it were — what I hear, this hypothesis also matches common sense. How do businesses use predictive modeling? For each customer/prospect/site-visitor/whatever, they decide which of a limited number of possible actions to take. At its core, that’s an exercise in segmentation.
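To illustrate the segmentation-first view, here is a minimal k-means clustering sketch in plain Python. The customer data, the (recency, spend) features, and the per-segment actions are all hypothetical; the point is that the decision logic rides on segment membership, not on any fitted curve:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means: returns final centers and the points in each segment."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    segments = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's segment.
        segments = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            segments[i].append(p)
        # Update step: recompute each center as its segment's mean.
        for i, seg in enumerate(segments):
            if seg:
                centers[i] = tuple(sum(dim) / len(seg) for dim in zip(*seg))
    return centers, segments

# Hypothetical (recency_days, monthly_spend) pairs for six customers.
customers = [(2, 90), (3, 85), (40, 10), (45, 5), (4, 95), (50, 8)]
centers, segments = kmeans(customers, k=2)

# One action per segment -- the "limited number of possible actions" idea.
for center, seg in zip(centers, segments):
    action = "upsell offer" if center[1] > 50 else "win-back email"
    print(center, action, seg)
```

With well-separated data like this, the active, high-spend customers and the lapsed, low-spend ones fall into different segments, and the business decision reduces to picking one action per segment.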
5. I think data integration is getting a lot smarter than it was. Hadoop-based transformation is the obvious example. But there’s also ClearStory’s data intelligence pitch. (And yes, I know I need to talk with Paxata. There’s been a lot of ball-dropping on that one, including by me.)
6. There’s a meta-theme in the above — stuff that’s not exactly a DBMS or DBMS-like data store. Streaming fits into that. So does smart data integration. So, arguably, does Spark. So do data grids, another of those topics I’d like to know more about but haven’t nailed down yet.
Data management is getting ever more complex.