February 13, 2013

It’s hard to make data easy to analyze

It’s hard to make data easy to analyze. While everybody seems to realize this (perhaps a few marketeers aside), some remarks might be useful even so.

Many different technologies purport to make data easy, or easier, to analyze; so many, in fact, that cataloguing them all is forbiddingly difficult. Major claims, and some technologies that make them, include:

*Complex event/stream processing terminology is always problematic.

My thoughts on all this start: 

Further:

1. There are many terms for all this. I once titled a post “Data that is derived, augmented, enhanced, adjusted, or cooked”. “Data munging” and “data wrangling” are in the mix too. And I’ve heard the term data preparation used several different ways.

2. Microsoft told me last week that the leading paid-for data products in their data-for-sale business are for data cleaning. (I.e., authoritative data to help with the matching/cleaning of both physical and email addresses.) Salesforce.com/data.com told me something similar a while back. This underscores the importance of data cleaning/data quality, and more generally of master data management.

Yes, I just said that data cleaning is part of master data management. Not coincidentally, I buy into the view that MDM is an attitude and a process, not just a specific technology.

3. Everybody knows that Hadoop usage involves long-ish workflows, in which data keeps getting massaged and written back to the data store. But that point is not as central to how people think about Hadoop as it probably should be.

4. One thing people have no trouble recalling is that Hadoop is a great place to dump stuff and get it out later. Depending on exactly what you have in mind, there are various metaphors for this, most of which have something to do with liquids. Most famous is “big bit bucket”, but “data refinery”, “data lake”, and “data reservoir” have also been used.

5. For years, DBMS and Hadoop vendors have bundled low-end text analytics capabilities rather than costlier state-of-the-art ones. I think that may be changing, however, mainly in the form of Attensity partnerships.

Truth be told, I’m not wholly current on text mining vendors — but when I last was, Attensity was indeed the best choice for such partnerships. And I’m not aware of any subsequent developments that would change that conclusion.
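The matching/cleaning of addresses mentioned in point 2 can be illustrated with a minimal sketch. This is not any vendor’s actual product or API; the normalization rules and function names below are hypothetical, and real cleaning products apply far richer, authoritative reference data.

```python
def normalize_email(raw):
    """Lowercase, trim, and strip a common 'mailto:' artifact."""
    email = raw.strip().lower()
    if email.startswith("mailto:"):
        email = email[len("mailto:"):]
    # Collapse Gmail-style dots in the local part -- a common, lossy heuristic.
    local, _, domain = email.partition("@")
    if domain in ("gmail.com", "googlemail.com"):
        local = local.replace(".", "")
        domain = "gmail.com"
    return local + "@" + domain

def match_against_reference(raw_emails, reference):
    """Split raw emails into (matched, unmatched) against a reference list,
    normalizing both sides first."""
    ref = {normalize_email(e) for e in reference}
    matched, unmatched = [], []
    for raw in raw_emails:
        (matched if normalize_email(raw) in ref else unmatched).append(raw)
    return matched, unmatched

raw = ["  MAILTO:Jane.Doe@GMail.com ", "bob@example.com", "eve@nowhere.org"]
reference = ["janedoe@gmail.com", "bob@example.com"]
m, u = match_against_reference(raw, reference)
print(m)  # the first two inputs match after normalization
print(u)  # ['eve@nowhere.org']
```

The point of the sketch is that even trivial matching requires normalization on both sides, which is why authoritative reference data is what people actually pay for.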
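The massage-and-write-back workflow pattern from point 3 can be sketched locally. This toy stands in for a multi-job Hadoop pipeline: each stage reads the previous stage’s output, transforms it, and writes the result back to the store. The stage names, record shapes, and in-memory “store” are invented for illustration; a real pipeline would read and write HDFS files.

```python
# A dict of named datasets stands in for the data store (e.g., HDFS paths).
store = {"raw": [
    {"user": "a", "amount": "10"},
    {"user": "b", "amount": "bad"},
    {"user": "a", "amount": "5"},
]}

def run_stage(name, source, fn):
    """Apply fn to a stored dataset and write the result back under name."""
    store[name] = fn(store[source])
    return store[name]

# Stage 1: filter out records that fail a cheap validity check.
run_stage("clean", "raw",
          lambda recs: [r for r in recs if r["amount"].isdigit()])

# Stage 2: cast and reshape the surviving records.
run_stage("typed", "clean",
          lambda recs: [{"user": r["user"], "amount": int(r["amount"])}
                        for r in recs])

# Stage 3: aggregate per user, the way a final reduce would.
def aggregate(recs):
    totals = {}
    for r in recs:
        totals[r["user"]] = totals.get(r["user"], 0) + r["amount"]
    return totals

run_stage("totals", "typed", aggregate)
print(store["totals"])  # {'a': 15}
```

Note that every intermediate dataset lands back in the store, which is exactly why these workflows get long: each massage step is a full read-transform-write pass.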


Comments

3 Responses to “It’s hard to make data easy to analyze”

  1. Keith Kohl on February 15th, 2013 9:18 am

    Great post Curt, thanks for this. I hear exactly the same thing from organizations and system integrators. I put a comment out on Merv Adrian’s blog post you referenced above about some of these problems.

    Joining data, especially where both sides of the join are large, is very difficult in a distributed environment like Hadoop. I hear this time and time again where MR developers have had to write hundreds of lines of complex Java code to get it to work.

    As mentioned in my response to that blog post, and as I remember we discussed some time ago, ETL is a very common use case in Hadoop. You state above that “Hadoop is a great place to dump stuff”, but if you want to operate on it and/or extract it, you need to perform ETL (filtering data, joining data, etc.).
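The large-versus-large join difficulty described in the comment above follows the reduce-side (shuffle) join pattern: route records from both inputs by join key, then emit the per-key cross product. The sketch below shows only the core idea in Python; the hundreds of lines of Java the comment mentions go to serialization, partitioning, and skew handling, none of which appear here.

```python
from collections import defaultdict
from itertools import product

def reduce_side_join(left, right, key):
    """Join two lists of dicts on `key`, the way a shuffle phase would."""
    buckets = defaultdict(lambda: ([], []))
    for rec in left:                      # "map" side: route by join key
        buckets[rec[key]][0].append(rec)
    for rec in right:
        buckets[rec[key]][1].append(rec)
    joined = []
    for k, (ls, rs) in buckets.items():   # "reduce" side: per-key cross product
        for l, r in product(ls, rs):
            joined.append({**l, **r})
    return joined

users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bo"}]
orders = [{"id": 1, "total": 30}, {"id": 1, "total": 12}]
print(reduce_side_join(users, orders, "id"))
# two joined rows, both for id 1
```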

  2. A Useful Product Service That Someone Should Create « Enterprising Thoughts on February 18th, 2013 2:03 am

    [...] be analyzed, it must first be prepared (consolidated, cleansed, munged, etc.). Curt Monash wrote a great post about this [...]

  3. Brian Hoover on February 19th, 2013 3:47 pm

    Making data available in the right format at the right time for analysis is a great topic, and one that deserves more discussion. Thanks for pointing it out.

    Just an observation – I agree MDM is an attitude and a process, and I believe that data quality is both a pre-requisite to MDM (standardization of reference data to enable master data matching) and is enabled by MDM (using master data as a reference validation to avoid the introduction and propagation of poor data quality).

    However, data quality is not a subset of MDM – transaction data that doesn’t fall into the domain of MDM can and should conform to data quality standards and processes and may or may not be enhanced by MDM. Sometimes the MDM vendors are overselling MDM as a panacea for all enterprise data quality and transformation needs when there are more cost effective data quality specific or ETL solutions.

    Ideally MDM should be taking place on the front-end of operational systems so that data quality is enforced before it enters the warehouse or analytical stream, not as a “fix” in preparation for analysis.
