June 4, 2011

Dirty data, stored dirt cheap

A major driver of Hadoop adoption is the “big bit bucket” use case. Users take a whole lot of data, often machine-generated data in logs of different kinds, and dump it into one place, managed by Hadoop, at open-source pricing. Hadoop hardware doesn’t need to be that costly either. And once you get that data into Hadoop, there are a whole lot of things you can do with it.

Of course, there are various outfits who’d like to sell you not-so-cheap bit buckets. Contending technologies include Hadoop appliances (which I don’t believe in), Splunk (which in many use cases I do), and MarkLogic (ditto, but often the cases are different from Splunk’s). Cloudera and IBM, among other vendors, would also like to sell you some proprietary software to go with your standard Apache Hadoop code.

So the question arises — why would you want to spend serious money to look after your low-value data? The answer, of course, is that maybe your log data isn’t so low-value. True, the signal-to-noise ratio in purely machine-generated data is rarely high (web logs and so on may be an exception). But if the signal is sufficiently important, the overall data set may have decent average value. Intelligence work is one case where the occasional black swan might justify gilded cages for the whole aviary; the same might go for other forms of especially paranoid security.

For example, I was told of one big bank that was pulling 5 GB of logs every half hour into Splunk (selected for performance), or at least planning to. The application was forensics to protect against internal fraudulent trading, something that’s been a multi-hundred-million or even multi-billion-dollar problem at various investment banks in the past. I have no idea what the retention policy on those logs is, but clearly the core application can support higher-than-Hadoop pricing.
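To put that ingest rate in perspective, here is a quick back-of-the-envelope sketch. It assumes the 5 GB per half hour arrives steadily around the clock, which I wasn't told, and it uses decimal units (1 TB = 1000 GB). The retention periods are purely illustrative, and the function name is my own.

```python
# Back-of-the-envelope log volume. Assumptions (not confirmed by the source):
# a steady 5 GB every half hour, 24 hours a day, with no compression.

GB_PER_BATCH = 5        # figure quoted in the post
BATCHES_PER_DAY = 48    # one batch every half hour


def retained_volume_gb(days: int) -> int:
    """Raw gigabytes accumulated over `days` at the assumed steady rate."""
    return GB_PER_BATCH * BATCHES_PER_DAY * days


if __name__ == "__main__":
    # Hypothetical retention periods, chosen only for illustration.
    for label, days in [("1 day", 1), ("30 days", 30), ("1 year", 365)]:
        gb = retained_volume_gb(days)
        print(f"{label:>8}: {gb:>6} GB (~{gb / 1000:.1f} TB)")
```

Under those assumptions the raw volume is 240 GB a day, or roughly 88 TB a year, before any indexing overhead or compression.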


9 Responses to “Dirty data, stored dirt cheap”

  1. Tony on June 4th, 2011 9:48 pm

    Retention policy on logs is the longer of what regulators require and what belief in hidden value justifies.

    Of course, if technologies such as those you mention do extract most of the value from logs, the remaining value might not justify retention, just as dirt from which gold or silver has been sifted is just dirt.

  2. Stefan Groschupf on June 4th, 2011 10:01 pm

    We see a lot of use cases where customers try to understand interactions (clickstream, CRM events, etc.) rather than transactions (traditional BI and DW), for example to increase conversion rates.

    Also, we see excitement around Hadoop virtually removing the limitations on storage and compute (or at least lowering the cost). This allows raw (as-is) data to be stored in structured or unstructured formats.
    Doing so eliminates heavy, time-consuming and expensive ETL processes, since the data can be modeled ‘on read’ as part of an analytics pipeline, rather than having to be pre-modeled and pre-optimized to fit into a star schema in an EDW.

  3. Hadoop confusion from Forrester Research | DBMS 2 : DataBase Management System Services on June 5th, 2011 6:13 pm

    […] is well on its way to being a surviving data-storage-plus-processing system — like an analytic DBMS or DBMS-imitating data integration tool […]

  4. You May Need This One Day « Ellie Asks Why on June 13th, 2011 2:15 pm

    […] Curt’s recent three-part series of posts about best practice use-cases for Hadoop began with Dirty Data Stored Dirt Cheap. I liked the conclusion: But if the signal is sufficiently important, the overall data set may have […]

  5. Metaphors amok | DBMS 2 : DataBase Management System Services on June 17th, 2011 4:52 am

    […] is likely to play an important role in many enterprises’ analytic ecosystems, both for its big bit bucket and parallel analytics […]

  6. More On Security & Big Data…Where Data Analytics and Security Collide | Rational Survivability on July 22nd, 2011 2:40 pm

    […] Dirty data, stored dirt cheap (dbms2.com) […]

  7. Windows Azure and Cloud Computing Posts for 7/22/2011+ - Windows Azure Blog on July 23rd, 2011 2:11 pm

    […] Dirty data, stored dirt cheap (dbms2.com) […]

  8. Some trends that will continue in 2013 | DBMS 2 : DataBase Management System Services on January 5th, 2013 8:54 am

    […] adoption. Everybody has the big bit bucket use case, largely because of machine-generated data. Even today’s technology is plenty good […]

  9. Vertica 7 | DBMS 2 : DataBase Management System Services on December 5th, 2013 2:50 pm

    […] Flex Zone is meant to be (among other things) a big bit bucket, perhaps in some cases obviating the need for Hadoop to play the same […]
