I frequently observe that no market categorization is ever precise and, in particular, that bad jargon drives out good. But when it comes to “big data” or “big data analytics”, matters are worse yet. The definitive shark-jumping moment may be Forrester Research’s Brian Hopkins’ claim that:
… typical data warehouse appliances, even if they are petascale and parallel, [are] NOT big data solutions.
Nonsense almost as bad can be found in other venues.
Forrester seems to claim that “big data” is characterized by Volume, Velocity, Variety, and Variability. Others, less alliteratively inclined, might put Complexity in the mix. So far, so good; after all, much of what people call “big data” consists of collections of disparate data streams, all collected somewhere in a big bit bucket. But when people start defining “big data” to include Variety and/or Variability, they’ve gone too far.
Up to that point, Hopkins — while wrong — is far from alone. The less common part of his error is to further claim that for data to be “big”, it must be stored in a way that gives up the C (Consistency) in the CAP Theorem. Yes, the bigger the data set, the more likely it is that each datum has low individual value, and that immediate consistency isn’t strictly necessary. But there are plenty of big data use cases in which data accuracy turns out to be a good idea.
It actually is reasonable to say that Volume and Velocity of data go together. If you’re storing 5 terabytes of data per day, you have a “big data” kind of problem, whether you then keep it for 30 days or 3000. It also is reasonable to say that Variety and Variability go together; indeed, I’d guess that what Forrester means by those terms corresponds to multi-structured and poly-structured respectively, and using one of those terms is generally plenty.
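The Volume point is just arithmetic; a minimal sketch (the 5 terabytes/day rate is the figure above, while the function name and day counts are illustrative):

```python
# Back-of-the-envelope check: at a steady 5 TB/day ingest rate,
# retention period changes the totals but not the category of problem.
TB_PER_DAY = 5

def retained_volume_tb(days: int) -> int:
    """Total data retained, in TB, at a steady ingest rate."""
    return TB_PER_DAY * days

for days in (30, 3000):
    total = retained_volume_tb(days)
    print(f"{days:>4} days -> {total:,} TB (~{total / 1024:.1f} PB)")
# 30 days is already 150 TB; 3000 days is roughly 15 PB.
```

Either retention policy leaves you with a “big data” kind of storage problem.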
But while we can whittle four concepts down to two, the reduction should stop there. I say this because all four combinations are possible (and not just in edge cases):
- Data can be both big and poly-structured. For example, consider the classic Hadoop log-collection use case, or the bigger of MarkLogic’s databases, or of Splunk’s, or even the dynamic-schema parts of relational data warehouses built by Zynga and eBay. And yes, also consider some of the NoSQL-based short-request systems Hopkins was surely thinking of as well.
- Data can be both big and simply-structured. I think most of Teradata’s and Vertica’s petabyte-scale installations would fit that description, the partial counterexamples at eBay and Zynga notwithstanding.
- Data can be not-so-big and poly-structured. Consider, for example, a typical user of InterSystems Caché.
- Data can be not-so-big and simply-structured. Consider, for example, most of the traditional RDBMS world.
To pretend that those four possibilities are only two — “big data” and otherwise — is a travesty.
If the term “big data” has become useless, then what? Gartner may have switched over to “extreme data,” as reported by my clients at SAND, in honor of the multi-V stuff. That would be an improvement. Better yet would be to stop pretending that a matrix with two dimensions has only one. If what you mean is “huge, poly-structured databases”, then that’s what you should say, or something like it.
If things are bad for “big data”, they’re even worse for “big data analytics”, a term that starts out by inheriting all of big data’s problems and adds more of its own. “Big data analytics” surely means “analytics done on big data” — but nobody’s quite sure what “analytics” are. For example:
- I’m OK with “analytic processing” incorporating all of what might be called business intelligence, visualization (which sometimes now is just the new term for BI), data mining, machine learning, predictive analytics (which for some years has been the term for data mining and machine learning), planning, and yet more. However, …
- … others don’t agree, and contrast “analytics” to “OLAP” and/or to “visualization”, and seem to equate “analytics” to “predictive analytics” or something similar.
- The latter is what most people have in mind when they say “big data analytics”, but …
- … vendors who can only lay claim to the “analytics” term in its most expansive sense claim to be doing “big data analytics” as well.
So here’s what I propose.
- Nobody should ever again say that “big data” doesn’t include big relational data warehouses.
- If your definition of “big data” goes beyond Volume and perhaps Velocity to include Variety, Variability, or Complexity — please call it something else instead. “Extreme data” sounds like a snowboarding competition or something, but at least it’s not as totally erroneous as “big”.
- Never, ever use the phrase “big data analytics” unless you have modifiers near it, to show what kind of big data analytics you’re talking about, or at least to describe the special value you think you bring to the big data analytics process.
Edit: Merv Adrian of Gartner Group has a more reasonable — and wittier! — take than Forrester’s:
You won’t see us telling people “That’s not #bigdata. This is big data.” That’s Crocodile Dundee’s job.