A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.
To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example, mine.)
Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:
- Inconsistent, in which case humans might not know how to look it up and database JOINs might fail.
- Unintegrated, in which case one application might not be able to use data that another happily maintains. (This is the classic data silo problem.)
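To make the JOIN point concrete, here’s a toy example with invented records: two systems spell the same customer differently, and a lookup keyed on raw names silently comes up empty.

```python
# Two systems' views of the same customer; names and IDs are invented.
billing = {"Acme Corp.": "cust-001"}      # billing system's spelling
orders = [("ACME Corporation", 250.0)]    # order system's spelling

# A naive join on the raw name finds nothing, so the order goes unmatched.
joined = [(billing.get(name), amount) for name, amount in orders]
print(joined)  # [(None, 250.0)] -- the variant spelling defeats the lookup
```

No error is raised; the row just fails to match, which is exactly why this class of problem can go unnoticed.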
Inconsistency can take multiple forms, including:
- Variant names.
- Variant spellings.
- Variant data structures (not to mention datatypes, formats, etc.).
Addressing the first two is the province of master data management (MDM), and also of the same data cleaning technologies that might help with outright errors. Addressing the third is the province of other data integration technology, which may also be what’s needed to break down the barriers between data silos.
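As a very rough sketch of what the cleaning side of that work looks like, the snippet below fuzzy-matches variant names against a master list using Python’s standard difflib; the master names and the similarity cutoff are invented, and real MDM products of course do far more.

```python
import difflib

# A made-up master list, of the kind an MDM hub might maintain.
MASTER_NAMES = ["Acme Corporation", "Globex Inc", "Initech LLC"]

def canonicalize(raw_name):
    """Map a variant spelling to its master record, or None if nothing is close."""
    lowered = {name.lower(): name for name in MASTER_NAMES}
    hits = difflib.get_close_matches(raw_name.lower(), list(lowered), n=1, cutoff=0.6)
    return lowered[hits[0]] if hits else None

print(canonicalize("ACME Corp"))     # Acme Corporation
print(canonicalize("Zzyzx Mining"))  # None -- no plausible master record
```

The point isn’t this particular algorithm; it’s that some canonical list has to exist for variants to be reconciled against.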
So far I’ve been assuming that data is neatly arranged in fields in some kind of database. But suppose it’s in documents or videos or something? Well, then there’s a needed step of data enhancement; even when that’s done, further data integration issues are likely to be present.
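For a trivial illustration of that enhancement step, consider pulling fielded values out of free text; the document and pattern below are invented.

```python
import re

# A toy data-enhancement pass: extract structured fields from a document
# so that a database can hold them. Everything here is illustrative.
doc = "Invoice 4521 issued to Acme Corporation on 2015-06-01 for $1,250.00"
match = re.search(r"Invoice (\d+).*on (\d{4}-\d{2}-\d{2})", doc)
invoice_id, issued_on = match.group(1), match.group(2)
print(invoice_id, issued_on)  # now fielded data, with integration work still ahead
```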
All of the above issues occur with analytic data too. In some cases it probably makes sense not to fix them until the data is shipped over for analysis. In other cases, they should be fixed earlier, but aren’t. And in hybrid cases, data is explicitly shipped to an operational data warehouse where the problems are presumably fixed.
Further, some problems are much greater in their analytic guise. Harmonization and integration among data silos are likely to be much more intense. (What is one table for analytic purposes might be many different ones operationally, for reasons that might span geography, time period, or application legacy.) Addressing those issues is the province of data integration technologies old and new. Also, data transformation and enhancement are likely to be much bigger deals in the analytic sphere, in part because of poly-structured internet data. Many Hadoop and now Spark use cases address exactly those needs.
Let’s now consider missing data. In operational cases, there are three main kinds of missing data:
- Missing values, as a special case of inaccuracy.
- Data that was only collected over certain time periods, as a special case of changing data structure.
- Data that hasn’t been derived yet, as the main case of a need for data enhancement.
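The first and third cases can be illustrated in a few lines; the records and the 8% tax rule below are invented.

```python
# Invented order records; None marks a value that was never captured,
# and the last record predates collection of the "tax" field entirely.
orders = [
    {"amount": 100.0, "tax": 8.0},
    {"amount": 250.0, "tax": None},   # missing value
    {"amount": 40.0},                 # field not collected in that period
]

# Counting only known values keeps the gap visible instead of hiding it.
known = [o["tax"] for o in orders if o.get("tax") is not None]
coverage = len(known) / len(orders)

# Enhancement: derive the missing field from a rule (the rate is invented).
for o in orders:
    if o.get("tax") is None:
        o["tax"] = round(o["amount"] * 0.08, 2)  # derived, not observed
```

Note that the derived values are estimates, not observations, which is worth tracking if the data later feeds analytics.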
All of those cases can ripple through to cause analytic headaches. But for certain inherently analytic data sets — e.g. a weblog or similar stream — the problem can be even worse. The data source might stop functioning, or might change the format in which it transmits data; but with no immediate operations compromised, it might take a while for anyone even to notice. I don’t know of any technology that does a good, simple job of addressing these problems, but I am advising one startup that plans to try.
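The basic checks such a product would need are easy to sketch, even if doing them well at scale is not; the field names and threshold below are invented.

```python
# A minimal monitor for the two failure modes just described: a feed that
# changes format, and a feed that silently stops.
EXPECTED_FIELDS = {"timestamp", "user", "url"}
MAX_SILENCE_SECS = 300  # alert if nothing arrives for five minutes

def audit_batch(records, seconds_since_last_record):
    """Return warnings for format drift and for a source gone quiet."""
    warnings = []
    for rec in records:
        drift = EXPECTED_FIELDS.symmetric_difference(rec)
        if drift:
            warnings.append(("format_drift", sorted(drift)))
    if not records and seconds_since_last_record > MAX_SILENCE_SECS:
        warnings.append(("source_silent", seconds_since_last_record))
    return warnings
```

A renamed field shows up as drift, and an empty window past the threshold shows up as silence; either way someone is told before weeks of analysis are quietly corrupted.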
Further analytics-mainly data messes can be found in three broad areas:
- Problems caused by new or changing data sources hit much faster in analytics than in operations, because analytics draws on a greater variety of data.
- Event recognition, in which most of a super-high-volume stream is discarded while the “good stuff” is kept, is more commonly a problem in analytics than in pure operations. (That said, it may arise on the boundary of operations and analytics, namely in “real-time” monitoring.)
- Analytics has major problems with data scavenger hunts, in which business analysts and data scientists don’t know what data is available for them to examine.
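Of those three, event recognition is the most mechanical to illustrate. Here’s a toy filter, with invented sensor readings and an invented threshold, that keeps the “good stuff” and discards the rest:

```python
# A toy event-recognition pass over a high-volume stream.
ALERT_THRESHOLD = 95.0  # invented cutoff for "interesting"

def interesting_events(stream):
    """Yield only the readings worth keeping; everything else is discarded."""
    for reading in stream:
        if reading["value"] >= ALERT_THRESHOLD:
            yield reading

stream = [{"sensor": "s1", "value": v} for v in (12.0, 99.5, 47.0, 96.1)]
kept = list(interesting_events(stream))  # 2 of 4 readings survive
```

Real systems differ mainly in volume and in how subtle the predicate is, not in the basic shape of the operation.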
That last area, data scavenger hunts, is the domain of a lot of analytics innovation. In particular:
- It’s central to the dubious Gartner concept of a Logical Data Warehouse, and to the more modest logical data layers I advocate as an alternative.
- It’s been part of BI since the introduction of Business Objects’ “semantic layer”. (See, for example, my recent post on Zoomdata.)
- It’s a big part of the story of startups such as Alation or Tamr.
- In a failed effort, it was part of Greenplum’s pitch some years back, as an aspect of the “enterprise data cloud”.
- It led to some of the earliest differentiated features at Gooddata.
- It’s implicit in some BI collaboration stories, in some BI/search integration, and in ClearStory’s “Data You May Like”.
Finally, suppose we return to the case of operational data, assumed to be accurately stored in fielded databases, with sufficient data integration technologies in place. There’s still a whole other kind of possible mess beyond those I cited above: applications may not be doing a good job of understanding and using the data. I could write a whole series of posts on that subject alone … but it’s going slowly. So I’ll leave that subject area for another time.