It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.
Many different technologies purport to make data easy, or easier, to analyze; so many, in fact, that cataloguing them all is forbiddingly hard. Major claims, and some technologies that make them, include:
- “We get data into a form in which it can be analyzed.” This is the story behind, among others:
  - Most of the data integration and ETL (Extract/Transform/Load) industries, software vendors and consulting firms alike.
  - Many things that purport to be “analytic applications” or data warehouse “quick starts”.
  - “Data reduction” use cases in event processing.*
  - Text analytics tools.
- “Forget all that transformation foofaraw — just load (or write) data into our thing and start analyzing it immediately.” This at various times has been much of the story behind:
  - Relational DBMS, according to their inventor E. F. Codd.
  - MOLAP (Multidimensional OnLine Analytic Processing), also according to RDBMS inventor E. F. Codd.
  - Any kind of analytic DBMS, or general-purpose DBMS used for data warehousing.
  - Newer kinds of analytic DBMS that are faster than older kinds.
  - The “data mart spin-out” feature of certain analytic DBMS.
  - In-memory analytic data stores.
  - NoSQL DBMS that have a few analytic features.
  - TokuDB, similarly.
  - Electronic spreadsheets, from VisiCalc to Datameer.
- “Our tools help you with specific kinds of analyses or analytic displays.” This is the story underlying, among others:
  - The business intelligence industry.
  - The predictive analytics industry.
  - Algorithmic trading use cases in complex event processing.*
  - Some analytic applications.
*Complex event/stream processing terminology is always problematic.
My thoughts on all this start:
- There are many possibilities for the “right” way to manage analytic data. Generally, these are not the same as the “right” way to write the data, as that choice needs to be optimized for user experience (including performance), reliability, and of course cost.
  - I.e., it is usually best to move data from where you write it to where you (at least in part) analyze it.
- Vendors who suggest they have a complete solution for getting data ready to be analyzed are … optimists.
  - This specifically includes “magic data stores”, such as fast analytic RDBMS (on which I’m very bullish) or in-memory analytic DBMS (about which I’m more skeptical). They’re great starting points, but they’re not the whole enchilada.
- There are many ways to help with preparing data for analysis. Some of them are well-served by the industry. Some, however, are not.
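As a toy sketch of that write-side-to-analytic-side movement, the snippet below uses two in-memory SQLite databases to stand in for a transactional store and an analytic store; the table names, schema, and transformation (cents to dollars, pre-aggregation) are invented for illustration, not taken from any particular product.

```python
import sqlite3

# Hypothetical write-side store: one row per raw order event.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "acme", 1250), (2, "acme", 300), (3, "globex", 999)])

# Hypothetical analytic store: pre-aggregated, shaped for querying.
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE customer_totals (customer TEXT, total_dollars REAL)")

# Extract, transform (cents -> dollars, aggregate per customer), load.
rows = src.execute(
    "SELECT customer, SUM(amount_cents) / 100.0 FROM orders GROUP BY customer"
).fetchall()
dst.executemany("INSERT INTO customer_totals VALUES (?, ?)", rows)

print(dst.execute("SELECT * FROM customer_totals ORDER BY customer").fetchall())
# -> [('acme', 15.5), ('globex', 9.99)]
```

Real ETL tools of course handle incrementality, schema drift, and error handling; the point here is only that the analytic copy is a deliberately different shape from the write-side copy.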
1. There are many terms for all this. I once titled a post “Data that is derived, augmented, enhanced, adjusted, or cooked”. “Data munging” and “data wrangling” are in the mix too. And I’ve heard the term data preparation used several different ways.
2. Microsoft told me last week that the leading paid-for data products in their data-for-sale business are for data cleaning. (I.e., authoritative data to help with the matching/cleaning of both physical and email addresses.) Salesforce.com/data.com told me something similar a while back. This underscores the importance of data cleaning/data quality, and more generally of master data management.
Yes, I just said that data cleaning is part of master data management. Not coincidentally, I buy into the view that MDM is an attitude and a process, not just a specific technology.
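For a sense of the matching half of that cleaning problem, here is a toy email-deduplication sketch. The normalization rules (lowercase, drop dots and “+tag” suffixes in the local part) are common provider conventions assumed for illustration; real cleaning products rely on authoritative reference data rather than heuristics like these.

```python
def normalize_email(raw: str) -> str:
    """Lowercase, trim, and collapse an email address for duplicate matching."""
    local, _, domain = raw.strip().lower().partition("@")
    # Some providers ignore dots and "+tag" suffixes in the local part;
    # whether that is safe varies by provider (an assumption here).
    local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"

records = ["Jane.Doe+promo@Example.com ", "janedoe@example.com", "j.doe@example.com"]

# Group raw records by their normalized form to surface likely duplicates.
seen = {}
for r in records:
    seen.setdefault(normalize_email(r), []).append(r)

duplicates = {k: v for k, v in seen.items() if len(v) > 1}
print(duplicates)
# -> {'janedoe@example.com': ['Jane.Doe+promo@Example.com ', 'janedoe@example.com']}
```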
3. Everybody knows that Hadoop usage involves long-ish workflows, in which data keeps getting massaged and written back to the data store. But that point is not as central to how people think about Hadoop as it probably should be.
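As a minimal illustration of that massage-and-write-back pattern, the sketch below chains three stages, each reading the previous stage's output and writing its own back to the store; local JSON files stand in for the Hadoop data store, and the stage names and record shapes are invented.

```python
import json, pathlib, tempfile

# A temporary directory stands in for the shared data store.
store = pathlib.Path(tempfile.mkdtemp())

def write_stage(name, rows):
    (store / name).write_text("\n".join(json.dumps(r) for r in rows))

def read_stage(name):
    return [json.loads(line) for line in (store / name).read_text().splitlines()]

# Stage 0: raw events land in the store.
write_stage("raw", [{"user": "a", "ms": 1500}, {"user": "a", "ms": 500},
                    {"user": "b", "ms": 250}])

# Stage 1: cleanse/convert (milliseconds -> seconds), write back.
write_stage("clean", [{"user": r["user"], "sec": r["ms"] / 1000}
                      for r in read_stage("raw")])

# Stage 2: aggregate per user, write back; later jobs would read this in turn.
totals = {}
for r in read_stage("clean"):
    totals[r["user"]] = totals.get(r["user"], 0) + r["sec"]
write_stage("totals", [{"user": u, "sec": s} for u, s in sorted(totals.items())])

print(read_stage("totals"))
# -> [{'user': 'a', 'sec': 2.0}, {'user': 'b', 'sec': 0.25}]
```

In a real Hadoop deployment each stage would be a MapReduce or similar job over HDFS, but the workflow shape — repeated read, massage, write-back — is the same.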
4. One thing people have no trouble recalling is that Hadoop is a great place to dump stuff and get it out later. Depending on exactly what you have in mind, there are various metaphors for this, most of which have something to do with liquids. Most famous is “big bit bucket”, but “data refinery”, “data lake”, and “data reservoir” have also been used.
5. For years, DBMS and Hadoop vendors have bundled low-end text analytics capabilities rather than costlier state-of-the-art ones. I think that may be changing, however, mainly in the form of Attensity partnerships.
Truth be told, I’m not wholly current on text mining vendors — but when I last was, Attensity was indeed the best choice for such partnerships. And I’m not aware of any subsequent developments that would change that conclusion.
- Merv Adrian’s contrast between Hadoop and data integration tours some of the components of ETL suites. (February, 2013)
- Part of why analytic applications are usually incomplete is the set of issues discussed in this post.
- De-anonymization is an important — albeit privacy-threatening — way of making data more analyzable. (January, 2011)
- I updated my thoughts on Gartner’s Logical Data Warehouse concept earlier this month.