August 3, 2015

Data messes

A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.

To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example — 🙂 — mine.)

Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:

Inconsistency can take multiple forms, including: 

Addressing the first two is the province of master data management (MDM), and also of the same data cleaning technologies that might help with outright errors. Addressing the third is the province of other data integration technology, which also may be what’s needed to break down the barriers between data silos.
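
To make this concrete, below is a minimal sketch, in Python, of the kind of fuzzy matching that data cleaning and MDM tools apply far more thoroughly to catch misspelled or duplicated entries. The customer names and the similarity threshold are purely illustrative.

    # Illustrative only: crude duplicate detection over a tiny list of
    # customer names, using nothing but the Python standard library.
    from difflib import SequenceMatcher

    customers = ["Acme Corp.", "ACME Corporation", "Acme Corp", "Initech LLC"]

    def similarity(a, b):
        """Rough string similarity on lowercased names (0.0 to 1.0)."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    THRESHOLD = 0.8  # assumed cutoff; real tools tune this per field
    for i, a in enumerate(customers):
        for b in customers[i + 1:]:
            score = similarity(a, b)
            if score >= THRESHOLD:
                print(f"Possible duplicate: {a!r} ~ {b!r} (score {score:.2f})")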

So far I’ve been assuming that data is neatly arranged in fields in some kind of database. But suppose it’s in documents or videos or something? Well, then there’s a needed step of data enhancement; even when that’s done, further data integration issues are likely to be present.
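
As a toy example of what that enhancement step might look like, here's a sketch that pulls fielded values out of an unstructured document with regular expressions. The document text and the patterns are made up.

    # Hypothetical "data enhancement": extract fielded values from free text
    # so they can then flow through the usual integration steps.
    import re

    document = """
    Invoice #4711 from Initech LLC
    Date: 2015-07-31
    Total due: $1,250.00
    """

    enhanced = {
        "invoice_id": re.search(r"Invoice #(\d+)", document).group(1),
        "vendor": re.search(r"Invoice #\d+ from (.+)", document).group(1),
        "date": re.search(r"Date: (\d{4}-\d{2}-\d{2})", document).group(1),
        "amount": re.search(r"Total due: \$([\d,.]+)", document).group(1),
    }
    print(enhanced)
    # {'invoice_id': '4711', 'vendor': 'Initech LLC', 'date': '2015-07-31',
    #  'amount': '1,250.00'}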

All of the above issues occur with analytic data too. In some cases it probably makes sense not to fix them until the data is shipped over for analysis. In other cases, they should be fixed earlier, but aren't. And in hybrid cases, data is explicitly shipped to an operational data warehouse, where the problems are presumably fixed.

Further, some problems are much greater in their analytic guise. Harmonization and integration among data silos are likely to be much more intense. (What is one table for analytic purposes might be many different ones operationally, for reasons that might span geography, time period, or application legacy.) Addressing those issues is the province of data integration technologies old and new. Also, data transformation and enhancement are likely to be much bigger deals in the analytic sphere, in part because of poly-structured internet data. Many Hadoop and now Spark use cases address exactly those needs.
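
As a sketch of what such a job can look like (assuming PySpark, with the directory paths, column names, and target layout all invented for illustration):

    # A transform-and-harmonize pass over poly-structured JSON event logs,
    # of the sort many Hadoop/Spark use cases boil down to.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("harmonize-events").getOrCreate()

    # Source whose schema has drifted over time.
    raw = spark.read.json("hdfs:///raw/web_events/2015/08/*.json")

    # Pick a canonical set of columns, normalize types and values,
    # and drop records that can't be salvaged.
    events = (
        raw.withColumn("event_time", F.col("timestamp").cast("timestamp"))
           .withColumn("country", F.upper(F.coalesce(F.col("country"),
                                                     F.col("country_code"))))
           .select("event_time", "user_id", "country", "url")
           .dropna(subset=["event_time", "user_id"])
    )

    # Ship the cleaned-up version to wherever the analysis happens.
    events.write.mode("overwrite").parquet("hdfs:///curated/web_events/")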

Let’s now consider missing data. In operational cases, there are three main kinds of missing data:

All of those cases can ripple through to cause analytic headaches. But for certain inherently analytic data sets — e.g. a weblog or similar stream — the problem can be even worse. The data source might stop functioning, or might change the format in which it transmits; but with no immediate operations compromised, it might take a while to even notice. I don’t know of any technology that does a good, simple job of addressing these problems, but I am advising one startup that plans to try.
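
To illustrate the sort of simple check I have in mind, here's a hypothetical sketch that watches one daily batch of weblog data for exactly those two failure modes: the feed going quiet, and the feed silently changing format. The field names and threshold are assumptions.

    # Hypothetical monitor for an analytic feed (e.g., a daily weblog extract).
    import csv

    EXPECTED_FIELDS = {"timestamp", "user_id", "url", "referrer"}
    MIN_ROWS_PER_BATCH = 1000  # assumed "normal" daily volume

    def check_batch(path):
        problems = []
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            fields = set(reader.fieldnames or [])
            if fields != EXPECTED_FIELDS:
                problems.append(f"format drift: fields are {sorted(fields)}")
            row_count = sum(1 for _ in reader)
        if row_count < MIN_ROWS_PER_BATCH:
            problems.append(f"source may have gone quiet: only {row_count} rows")
        return problems

    # e.g., alert somebody if check_batch("/data/weblog/2015-08-03.csv")
    # returns anything at all.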

Further analytics-mainly data messes can be found in three broad areas:

That last area is the domain of a lot of analytics innovation. In particular:

Finally, suppose we return to the case of operational data, assumed to be accurately stored in fielded databases, with sufficient data integration technologies in place. There’s still a whole other kind of possible mess than those I cited above — applications may not be doing a good job of understanding and using it. I could write a whole series of posts on that subject alone … but it’s going slowly. 🙂 So I’ll leave that subject area for another time.

Comments

11 Responses to “Data messes”

  1. Chris Bird on August 3rd, 2015 12:56 pm

    Curt,

    As usual, an insightful, pithy and well-timed piece.

    But it is worse than that. We still haven’t got to grips with vocabulary – especially in an industry like travel. One might be forgiven for thinking of a direct flight as going directly from point A to point B. But it ain’t necessarily so. If there is an intermediate point, and the flight number doesn’t change and you don’t have to get off, then the flight can be described as direct. If you want to go somewhere directly, you have to go non-stop. If you don’t mind stopping then you might have a connection.

    And that’s a simple case.

    There’s no substitute for knowing what your terms mean – and realizing that others may have different terms that overlap, contradict or extend yours.

    Corporate data definitions are all well and good, but people abuse the terms and use them in outmoded ways. We treat corporate language like natural language – meaning changes with time. Mostly we hope we get the meaning right, but when dealing with situations where accuracy and precision are required, we need to make sure we have standard definitions – and that we use them.

  2. Curt Monash on August 3rd, 2015 11:08 pm

    Chris — just to be clear, that’s a data value problem you’re focusing on, not metadata, correct?

    Just the sort of addition I was hoping for. I indeed assumed that fielded data had unambiguous meanings, even if they were vague. (Example of vague but not ambiguous — a number that clearly means what it means, EXCEPT that the intended precision is unclear.)

  3. Richard Ordowich on August 4th, 2015 8:09 am

    Much of the data in organizations is messy mainly because the data was not designed following any rigorous practices. This is an example of data illiteracy.

    Few data modelers and users apply naming conventions or follow standards for writing descriptions. Few organizations maintain a controlled vocabulary of terms. Data design is a random act each time new data is added to a database or a new database is developed.

    We then try to use technology such as MDM and data cleansing to attempt to resolve the variations in this randomly created data. Mapping data silos itself becomes an unstructured exercise since most organizations do not have rigorous harmonization practices. How do you harmonize the semantics and pragmatics (context of use) of the data?

    The data mess is not a technical problem but a Data Literacy problem. We keep throwing technical solutions at a human behavioral problem.

  4. David Gruzman on August 4th, 2015 12:30 pm

    Richard,
    While I have long been a lover of RDBMS, I admit that today's data velocity and variety don't allow us to follow the relational model… I believe that following the more "high level" practices you advocate has become hardly possible. I think our software must become smarter, so that it can make sense of the data it is presented with.

  5. Aaron on August 4th, 2015 4:36 pm

    Relevant material. Note that there is a huge gap in the taxonomy here – data semantics.
    Consider a sale going from an online store to various targets, including GL, a sales mart, and an operational BI mart. In the process we create three distinct truths, since the timing of the data, and what is used to enrich it, can often differ between them (so one target may know that the customer is part of a family that shops at the site, while another may know about a refund before the others). They may or may not eventually become consistent – often sales recognizes revenue based on different rules than a GL.
    This is a fundamental justification for EDW, where a company can institute data governance and policy and come up with official truths and standards, where data is reconciled and integrated and such.
    The justification for ETL is compelling for master data and reference data, imposing standards (though it's awkward for an early startup; really, you can't redefine everything midstream, so the data semantics has longevity). For transactional, detail, and other data with more velocity, ETL is often awkward and problematic: it means shoehorning data that changes rapidly, often hiding semantics or doing other work that doesn't feel right, and it creates bottlenecks if the dev velocity is high.
    Note how similar this issue is to big data scenarios, where you have a choice of standardizing data and sharing vs. having each application do ETL-like activities on the fly.
    Richard – I think data illiteracy is just one reason for data issues. Another is time-to-market decisions; another is acquisitions or packaged software with different semantics.
    David – I suspect the RDBMS discussion is a shibboleth – almost all big data stuff can happen in an RDBMS without *maintenance and process centric* management. Mainframes became 95% stable, got taken over by operations, and froze out iterative development and innovation; the same has happened in many companies with RDBMS. Many big data projects are ways to subvert that control and get performance and flexibility, rather than to do something that would be technically difficult without clusters.

  6. David Gruzman on August 5th, 2015 8:46 am

    Aaron, it is exactly my point that RDBMS are not flexible enough for many cases. At the same time, loading, say, 5TB a day, or joining two tables with a billion records each, is hardly possible without a cluster…

  7. Application databases | Software Memories on August 7th, 2015 10:58 am

    […] my recent post on data messes, I left an IOU for a discussion of application databases. I’ve addressed parts of that […]

  8. clive boulton on August 9th, 2015 5:27 pm

    Doesn't an Object Store, a NoSQL database with schema-agnostic indexing and continuous map-reduce, 'heal this'?

    https://github.com/nestorpersist/ostore
    http://www.meetup.com/Seattle-Scalability-Meetup/photos/26152701/#437982894

  9. Elevator pitches and other self-introductions | Strategic Messaging on August 30th, 2015 4:03 am

    […] Popular problem types include chaos, confusion, and mess. […]

  10. Kingsley Idehen on September 3rd, 2015 11:40 am

    There isn’t a silver bullet for solving data integration issues. Data is always subjectively messy. Thus, you need “context lenses” through which relevant data is viewed en route to final processing by data consuming applications.

    The following principles provide an effective basis for addressing these challenges:

    1. Use HTTP URIs as Entity Names (Identifiers) — these implicitly resolve to entity description documents

    2. Use an abstract data representation language (e.g., RDF) to describe entities using subject, predicate, object based sentences — in a notation of your choice (i.e., it doesn't have to be JSON, JSON-LD, Turtle, RDF/XML, etc.).

    3. Ensure that the nature of Entity Types and Relationship Types is also described using the same abstract data representation language — this relates to Classes, Sub Classes, Transitivity, Symmetry, Equivalence, etc.
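
    For illustration, here is one way to sketch principles 1 and 2 in Python with the rdflib library; the URIs and property names are hypothetical.

        # Principle 1: an HTTP URI names the entity.
        # Principle 2: the entity is described in subject-predicate-object
        # sentences; the notation is only chosen at serialization time.
        from rdflib import Graph, Literal, Namespace, URIRef

        EX = Namespace("http://example.com/id/")
        g = Graph()

        alice = URIRef("http://example.com/id/alice")
        g.add((alice, EX.name, Literal("Alice Example")))
        g.add((alice, EX.worksFor, EX.AcmeCorp))
        # Principle 3 would describe the entity types and relationship types
        # themselves (classes, subclasses, equivalence, etc.) the same way.

        print(g.serialize(format="turtle"))  # or "xml", "nt", etc.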

    The links that follow include live examples of this approach to data integration across disparate data sources:

    Links:

    [1] http://kidehen.blogspot.com/2015/07/conceptual-data-virtualization-across.html — deals with integrating data across disparate RDBMS systems

    [2] http://kidehen.blogspot.com/2015/07/situation-analysis-never-day-goes-by.html — shows how controlled natural language can be used to harmonize disparate data

    [3] http://kidehen.blogspot.com/2014/01/demonstrating-reasoning-via-sparql.html — demonstrates reasoning and inference based on entity relationship type semantics, using SPARQL.

  11. Consumer data management | DBMS 2 : DataBase Management System Services on October 5th, 2015 2:27 am

    […] the biggest mess in all of IT is the management of individual consumers’ data. Our electronic data is […]
