It seems my prediction of a limited blogging schedule in December came emphatically true. I shall re-start with a collection of quick thoughts, clearing the decks for more detailed posts to follow. If you’d like to contribute thoughts on these subjects, now might be a really good time.
1. Not many terms I coin get marketing traction, but machine-generated data has grown some legs. Clients (Infobright, Cloudera) and non-clients alike have adopted it. I need to follow up with a more official description/definition of the concept. The Wikipedia article on same doesn’t get the job done yet. (Edit: Here’s my take on defining machine-generated data. Be sure to read through to Daniel Abadi’s response.)
2. Merv Adrian is going to Gartner Group. Expect great improvement in Gartner’s DBMS coverage, in areas beyond the straightforward “This is what users say they are doing” research Gartner already excels at. That said, Merv is probably not starting at Gartner soon enough to help make the 2010 analytic DBMS Magic Quadrant any better than the Gartner 2009 data warehouse database management system magic quadrant, the Gartner 2008 data warehouse database management system magic quadrant, and so on.
In particular, Merv has a good understanding of trends and technology in analytic DBMS and related markets. Judging by his Twitter stream, James Kobielus at Forrester if anything overrates the shift to general “analytic platforms.” And I of course am expected to help define the “analytic platform”/”advanced analytics”/whatever category. Taking all those analyst efforts together, it’s reasonable to expect a lot more market awareness — and also market confusion — around these areas.
3. All that plugs into a larger project I was working on before my family issues came crashing in. The enterprise data warehouse is a myth, and that’s just the first reason that the old EDW vs. data mart bifurcation is grossly inadequate for understanding analytic data management choices. So I’m working on some ideas to categorize types of data warehouse/mart/whatever according to what kind of data you have and how you use that data. Multiple industry players (OK, vendors) have offered interesting and useful feedback in this process, although I’m still waiting for Teradata and IBM. (Edit: My bad. Teradata actually had sent a helpful response some time ago.)
In connection with that effort, the last outline of analytic data use styles I did, back in October, read:
- Traditional BI
  - Reporting, dashboards, & light-weight ad-hoc query
  - (Even if you make this more into data exploration, you’re probably not stressing the underlying DBMS much more than traditional BI does)
  - (If integrated into operational apps, your DBMS choice for this may be constrained by your choice of operational apps)
- Near-real-time BI
  - E.g., dashboards w/ constant or 1-minute refresh
  - (Actually, this isn’t a great fit for most analytic DBMS yet)
  - (Also, it’s not a big market yet, except in specialized niches such as trading or network control)
- Budgeting & consolidation
  - (MOLAP is still strong here)
  - (I took out the word “planning” because it has several meanings)
- Investigative analytics*
  - Can be but doesn’t have to be long-running
  - Example technologies include:
    - Heavy ad-hoc query
    - Data mining/machine learning/predictive analytics modeling
    - Other advanced analytics
- (Advanced) operational analytics
  - Inputs to operational apps
  - Technologically similar to investigative analytics
    - Data mining/machine learning/predictive analytics scoring
    - Other advanced analytics
  - Example applications include:
    - Customer classification or scoring
    - Wholesale telecom pricing
    - Basel 3 risk analysis
- Pre-processing, staging, and ETL
- Archive & compliance
The data warehouse/mart categories weren’t in exact one-to-one correspondence with those use styles, but the connection was of course pretty close.
*I’ve really struggled with terminology in the area of data exploration (over-used already)/discovery analytics (sounds weird)/research analytics (caused confusion when I tried it). Investigative analytics is my latest try.
4. And finally — like most people, I find the terms unstructured or semi-structured data to be misleading, for at least two reasons:
- When the data is human-generated, what’s really happening is usually that the structure is just in a different place than the labels suggest — so-called structured databases commonly hold unstructured data (e.g., free-text fields), and vice-versa.
- In the case of machine-generated data, you really can start out with unstructured sets of individually unstructured logs. So what do you do then? You derive data that has some kind of structure, and do most of your operations on that.
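To make the derived-data point concrete, here’s a minimal sketch of what that derivation can look like in practice. The log format, field names, and pattern below are all invented for illustration — real machine-generated logs vary widely — but the shape of the operation is the same: parse raw lines into structured records, then run your analytics against the records rather than the raw text.

```python
import re

# Hypothetical log lines, invented for this example -- not any
# particular product's actual log format.
RAW_LOGS = [
    "2010-01-04 12:00:01 GET /index.html 200 512",
    "2010-01-04 12:00:02 POST /login 302 128",
    "malformed line with no recognizable structure",
]

# A regex with named groups imposes a schema on lines that fit it.
LOG_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3}) (?P<bytes>\d+)"
)

def derive_records(lines):
    """Parse raw log lines into structured dicts, silently skipping
    lines that don't match the expected pattern."""
    records = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            rec = m.groupdict()
            rec["status"] = int(rec["status"])  # typed fields, not strings
            rec["bytes"] = int(rec["bytes"])
            records.append(rec)
    return records

records = derive_records(RAW_LOGS)
# Most analytic operations then run on the derived, structured records.
total_bytes = sum(r["bytes"] for r in records)
```

The malformed third line is simply dropped here; a real pipeline would more likely route non-matching lines to a quarantine or re-parse step, since with machine-generated data those lines can be most of the interesting story.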
So I’ve been playing for a couple of years with the thought of introducing the term polystructured data. This is not a finished concept, because there are at least three different things I could mean by it:
- “Polystructured data is data that has considerable structure, but whose structure is in some important way unpredictable.” That’s a direct quote from a draft of a never-published paper. The paper, conceived before the days of NoSQL, was meant to be very XML-centric.
- “Polystructured data is data whose structure is apt to be interpreted in different ways at different times” — e.g., data that will variously get referenced by free text and structured searches. The example I gave illustrates part of the problem with that version, as increasingly many software vendors think it’s a dandy idea to do free-text searches across various columns of relational tables.
- “Polystructured data is data that gets restructured over time.” That’s the derived data point.
It may take a while to find, but I think there’s a pony in there somewhere.
Edit: Here’s the definition of poly-structured database I eventually came up with.