A major driver of Hadoop adoption is the “big bit bucket” use case. Users take a whole lot of data, often machine-generated log data of various kinds, and dump it into one place, managed by Hadoop, at open-source pricing. Hadoop hardware doesn’t need to be that costly either. And once you get that data into Hadoop, there are plenty of things you can do with it.
Of course, there are various outfits who’d like to sell you not-so-cheap bit buckets. Contending technologies include Hadoop appliances (which I don’t believe in), Splunk (which in many use cases I do), and MarkLogic (ditto, but often the cases are different from Splunk’s). Cloudera and IBM, among other vendors, would also like to sell you some proprietary software to go with your standard Apache Hadoop code.
So the question arises — why would you want to spend serious money to look after your low-value data? The answer, of course, is that maybe your log data isn’t so low-value. True, the signal-to-noise ratio in purely machine-generated data is rarely high (web logs and so on may be an exception). But if the signal is sufficiently important, the overall data set may have decent average value. Intelligence work is one case where the occasional black swan might justify gilded cages for the whole aviary; the same might go for other forms of especially paranoid security.
For example, I was told of one big bank that was pulling 5 GB of logs every half hour into Splunk (selected for performance), or at least planning to. The application was forensics to protect against fraudulent internal trading, something that has been a multi-hundred-million-dollar or even multi-billion-dollar problem at various investment banks in the past. I have no idea what the retention policy on those logs is, but clearly the core application can support higher-than-Hadoop pricing.
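To make the scale concrete, here is a quick back-of-the-envelope sketch of what 5 GB per half hour implies. Only that ingest figure comes from the anecdote; the retention window in the sketch is a hypothetical assumption (the post itself notes the real policy is unknown).

```python
# Back-of-the-envelope sizing for the log feed described above.
# Only the 5 GB per half hour figure comes from the anecdote; the
# retention period below is a hypothetical assumption, for illustration only.

GB_PER_HALF_HOUR = 5
HALF_HOURS_PER_DAY = 48

daily_gb = GB_PER_HALF_HOUR * HALF_HOURS_PER_DAY   # 240 GB ingested per day
annual_tb = daily_gb * 365 / 1024                  # roughly 85.5 TB per year

# Hypothetical retention window -- not stated in the original anecdote.
retention_days = 365
retained_tb = daily_gb * retention_days / 1024

print(f"Daily ingest:   {daily_gb} GB")
print(f"Annual volume:  {annual_tb:.1f} TB")
print(f"Retained under a {retention_days}-day policy: {retained_tb:.1f} TB")
```

Even at that modest-sounding half-hourly rate, the volumes add up to tens of terabytes a year, which is exactly the range where per-gigabyte pricing differences between storage options start to matter.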