Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions:
- Bigness — Volume, Velocity, size
- Structure — Variety, Variability, Complexity
- High-velocity “big data” problems are usually high-volume as well.*
- Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction.
But the conflation should stop there.
*Low-volume/high-velocity problems are commonly referred to as “event processing” and/or “streaming”.
When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2×2 matrix of possibilities. For want of better alternatives, my suggestions are:
- Relational big data is data of high volume that fits well into a relational DBMS.
- Multi-structured big data is data of high volume that doesn’t fit well into a relational DBMS. Alternative: Poly-structured big data.
- Conventional relational data is data of not-so-high volume that fits well into a relational DBMS. Alternatives: Ordinary/normal/smaller relational data.
- Smaller poly-structured data is data for which dynamic schema capabilities are important, but which doesn’t rise to “big data” volume.
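The 2×2 matrix behind those four terms can be sketched as a simple lookup. This is purely illustrative — the two booleans stand in for judgment calls (how high is “high volume”? how well is “fits well”?) that the definitions above deliberately leave fuzzy:

```python
# Illustrative sketch of the 2x2 terminology matrix.
# The two boolean inputs are stand-ins for fuzzy judgment calls,
# not formal definitions.

def classify(high_volume: bool, fits_relational: bool) -> str:
    """Map the two quasi-dimensions to one of the four suggested terms."""
    if high_volume and fits_relational:
        return "relational big data"
    if high_volume and not fits_relational:
        return "multi-structured big data"
    if not high_volume and fits_relational:
        return "conventional relational data"
    return "smaller poly-structured data"

# e.g. a large log-file corpus: high volume, poor relational fit
print(classify(high_volume=True, fits_relational=False))
```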
Notes on all this include:
- “Relational big data” is commonly what you need a scalable analytic relational DBMS for. But there are non-analytic use cases as well.
- The paradigmatic example of “multi-structured big data” is log files. Thus, multi-structured big data is commonly what you need a big bit bucket for.
- One might want to equate non-analytic relational big data technology to “NewSQL”. However, I’m struggling to think of a database size range in which the entire NewSQL industry can match Oracle’s market share alone.
- One might want to equate non-analytic multi-structured big data technology to “NoSQL”. However:
- “NoSQL” is also used to encompass not-so-big-data use cases, such as prototyping in MongoDB.
- “NoSQL” has non-ACID/low(er)-data-integrity connotations that aren’t appropriate for all non-relational systems.
- Up to a point, you can analyze relational big data in a conventional relational DBMS, but an analytic RDBMS will usually win on TCO (Total Cost of Ownership). In particular, reasonable thresholds for moving an analytic database off Oracle might be:
- 1-2 terabytes if you’ve never bought anything past Oracle Standard Edition.
- 5-10 terabytes if you’re already paying for Oracle Enterprise Edition.
- A lot higher than that if you actually find Oracle Exadata to be cost-effective.
- Depending on where one sets the bar for “big”, the market share leader in “big bit bucket” use cases is either Splunk or Hadoop.
- If we look at multi-structured big data management overall, MarkLogic joins the list of market share contenders, as do various NoSQL alternatives.
- It is wrong to say that the large web companies invented “big data” technology. But it is more reasonable to say they invented much of “multi-structured big data” management. In particular (and this is just a partial list), Google, Amazon, Yahoo, Facebook, et al. can reasonably be credited with Hadoop, Cassandra, HBase, and various predecessors to same.
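The log-file point above is worth making concrete. A minimal sketch, with entirely hypothetical log records: each event type carries a different set of fields, which is exactly why such data calls for dynamic schema capabilities rather than a fixed relational table:

```python
import json

# Hypothetical log records (invented for illustration): each event
# type carries a different set of fields.
records = [
    '{"ts": "2012-01-01T00:00:00Z", "event": "click", "url": "/home"}',
    '{"ts": "2012-01-01T00:00:01Z", "event": "purchase", "sku": "A17", "amount": 9.99}',
    '{"ts": "2012-01-01T00:00:02Z", "event": "error", "code": 500}',
]

# A "big bit bucket" or document store keeps each record as-is.
# A relational load would instead need either a wide, sparse table
# covering the union of all fields, or a schema change per new field.
parsed = [json.loads(r) for r in records]
all_fields = sorted({key for rec in parsed for key in rec})
print(all_fields)  # union of columns; each record uses only a subset
```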