May 15, 2011

What to do about “unstructured data”

We hear much these days about unstructured or semi-structured (as opposed to) structured data. Those are misnomers, however, for at least two reasons. First, it’s not really the data that people think is un-, semi-, or fully structured; it’s databases.* Relational databases are highly structured, but the data within them is unstructured — just lists of numbers or character strings, whose only significance derives from the structure that the database imposes.

*Here I’m using the term “database” literally, rather than as a concise synonym for “database management system”. But see below.

Second, a more accurate distinction is not whether a database has one structure or none — it’s whether a database has one structure or many. The easiest way to see this is for databases that have clearly-defined schemas. A relational database has one schema (even if it is just the union of various unrelated sub-schemas); an XML database, however, can have as many schemas as it contains documents.

One small terminological problem is easily handled, namely that people don’t talk about true databases very often, at least when they’re discussing generalities; rather, they talk about data and DBMS.* So let’s talk of DBMS being “structured” singly or multiply or whatever, just as the databases they’re designed to manage are.

*And they refer to the DBMS as “databases,” because they don’t have much other use for the word.

All that said — I think that single vs. multiple database structures isn’t a bright-line binary distinction; rather, it’s a spectrum. For example:

IMS is the most structured DBMS of all. The data in an IMS database is in a hierarchy, and that’s that.
CODASYL and other kinds of what used to be called network DBMS (before the word got so overloaded) — e.g. RDB, IDMS, or TOTAL — are/were almost as structured as IMS.
Relational databases were invented because their structure was more flexible than that of linked-list databases. The whole point of relational DBMS is that you can view the data in a multitude of ways. Still, I see classical relational databases as being toward the single-structure end of the spectrum. (I say “classical” because Oracle and DB2 actually can manage combinations of XML, text, and traditional relational tables, if you choose.)
A multivalue DBMS is a little more multi-structured than a relational one, because of how a field can be filled one or multiple times.
eBay Singularity (as implemented on Teradata gear) has, in essence, two structures (that I know of). One structure is just the relational schema. The other is the structure you would get if each kind of name-value pair truly had its own column.
A Splunk collection of log data can reasonably be said to have a different structure for every type or source of log. It further can be said to have multiple structures in somewhat the same way that eBay Singularity does.
So-called document stores can be very multi-structured. MongoDB, Couchbase, et al. let you have a different structure for every document, if you choose. The same goes for XML-based MarkLogic.
HBase and Cassandra are also very multi-structured. Theoretically, each record gets to decide which column sets it does or doesn’t fit into.

As a general rule — the more structures a database can have at once, the easier it is to change those structures, even on the fly (e.g., by inserting yet another bit of self-describing data). Thus, I sometimes use the term polystructured instead of multi-structured or multistructured. Thoughts as to which term I should choose going forward would be much appreciated.

As for an actual definition — well, here’s something I drafted 3 1/2 years ago but never published:

These problems with the relational paradigm are big enough to be worth coining a word for – polystructured. Polystructured data is data with structure that:

Can be exploited to provide most of the benefits of a highly structured database (e.g., a tabular/relational one) …

… but cannot be described in the concise, consistent form such highly structured systems require.

Specifically, we’ll call a database “polystructured” if it is characterized by at least two of the following:

Data suitable for being queried by simple predicate-based matching (e.g., equality to certain values, falling with in ranges, etc.)

(Other) data suitable for being queried by more complex matching (e.g., text search relevancy rankings)

Subsets that are more neatly structured than the whole.

Equivalently, we’ll just say that polystructured data is data that has considerable structure, but whose structure is in some important way unpredictable.

NoSQL document or “column” stores would satisfy #1 and #3, as would Splunk. MarkLogic would satisfy all three criteria. #1 + #2 is sort of like what happens when text queries are allowed to go against (groups of) relational columns … and the vagueness with which I’m saying that makes me suspect that at least the unbolded/first definition doesn’t really fly.

Finally, here’s what led up to those definitions (the whole thing is from the introduction to a never-completed white paper). Please forgive any anachronisms in it. A number of the points in it have also been addressed in posts here; e.g.,

In December, 2005 I expounded on the mismatch between text data and the relational model.
In June, 2010 I elucidated the variety of data that could go into an individual’s marketing-oriented profile.
In February, 2008 I predicted that flexible-schema DBMS would gain share.

The case for polystructured data

Traditional computer databases amount to sets of records. There usually are a limited number of record formats, which each instance of a particular format containing parallel kinds of information. Business transactions, web page visits, instrument readings– whatever the nature of the information, application designers stick it into the simplest structure they think makes sense.

These records are arranged into a variety of data structures.

Log files are widely used, especially to track web site visits, in other networking uses, and for other kinds of instrument readings.

Computer user administration is commonly in LDAP (Lightweight Directory Access Protocol) format.

There are still a lot of installations of legacy “linked-list” DBMS (DataBase Management Systems) such as IBM’s IMS.

Some decision support applications use data in multidimensional arrays.

Even so, most new business applications are written over relational DBMS, in the well-known rows-and-tables paradigm.

There are good reasons for the dominance of the relational model and of rows and tables. (Strictly speaking, “relational” equates neither to “rows and tables” nor to “SQL”, but in practice the three concepts are closely linked.) In particular:

Data integrity is (fairly) easy to ensure.

From some standpoints, relational databases are flexible; you can construct almost any kind of query, without having to do any kind of database reorganization (except perhaps for performance).

SQL programmers are easy to find.

There’s simply been much more engineering effort invested in making good relational DBMS than in any other kind.

But the relational database paradigm also has some major drawbacks. Three of the big ones are:

Queries must have strictly match/fail answers; there’s no natural way for a relational DBMS to handle “somewhat relevant” hits.

Relational databases can get cumbersome when large fractions of the potential data happen to be missing. (Hence the decades-long debates about the problems with NULL values.)

While you have good flexibility in querying against any particular data structure, you do have to predefine your structure before you start accepting input.

The last point is why you wind up with all those NULL values in the first place; if a kind of information can be in any record in a set, the database is set up to assume that its present in all of them. Or if you normalize your database so highly as to avert missing values, then you wind up with a huge number of tables, making queries (and updates) complicated from both the programmer’s and the machine’s standpoint.

Text apps suffer from RDBMS’ inelegant handing of relevancy. What’s more, documents can have almost unlimited internal structures, in three senses:

They can have chapters, sections, subsections, sidebars, footnotes, and so on, in any combination.

Semantic references can link words, phrases, sentences, and paragraphs in a near-infinite number of ways.

Documents can explicitly contain fielded data, such as numbers, addresses, dates, or geo-encodings.

Another group of apps that suffer from RDBMS’ limitations are in the area of personalization and similar fine-grained marketing analysis. Analysis of web clicks throws away most kinds of path information. Analysis of written or verbal communication isn’t well-integrated with that of fielded data. Different customers and prospects give different kinds of contact information, and are “touched” by different marketing initiatives; current systems do a poor job of integrating all that scattered information.

Categories: Cassandra, Couchbase, Data models and architecture, HBase, IBM and DB2, MarkLogic, MongoDB, NoSQL, Splunk, Theory and architecture

Subscribe to our complete feed!

Comments

19 Responses to “What to do about “unstructured data””

Gary on May 15th, 2011 5:52 pm

I’ve used the terms ‘globally structured’ and ‘locally structured’ (one global set of structure metadata as opposed to needing to inspect something to determine its structure) but I like your ‘polystructured’ better.

I’d add that ‘unstructured data’ could be classed as data where the DBMS doesn’t have the ability to interpret any structure the data might have. So a word document might be unstructured in one DBMS but not in another that is capable of processing it.
Curt Monash on May 15th, 2011 5:57 pm

Gary,

Thanks for the feedback!

In practice, people talking about “unstructured” or “semi-structured” data are often trying to sell something that DOES help understand the structure of the documents or logs being referred to. So even by your suggested definition they don’t intend for the data to stay “unstructured” for long.
aditya on May 16th, 2011 2:27 am

I would like to argue that data can be both structured and unstructured, while agreeing with most of the things you mention about databases.

Something like a web page data is unstructured. But something like a machine generated record e.g. a machine event like an Alarm with a specific format is definitely structured. Both kinds of data streams should generally be handled differently. This has nothing to do with whether your DBMS is structured or not. The act of imposing a “structure” of schema on “just lists of numbers or character strings” doesnt mean that the input data was unstructured – its most likely being converted from one structure to another.
clive boulton on May 16th, 2011 3:01 am

Knowing SQL and NoSQL “databases” need to converge, to cope with the scope of consumerized enterprise convergence, yet Cassandra is far too much for majority of business developers.

Bastardized SQL Server, Oracle-on-Oracle database patterns, blimey makes Btrieve-Pervasive SQL look easy.
Curt Monash on May 16th, 2011 4:02 am

I’d argue that a web page has a LOT of structure — much more than a single entry in a server log.

It’s just that the next web page might well not have the same structure (perhaps unless they’re rigorously generated by the same CMS).
Foursquare hits 10 million users « V E X E D on May 16th, 2011 5:05 am

[…] of Social in Shopping Foursquare hits 10 Million Users JSON: The JavaScript subset that isn’t What to do about “unstructured data” Viking Direct has six-fold increase in traffic thanks to YouTube 4 Free E-Books on Learning […]
Mashood on May 16th, 2011 5:58 am

We need not bother about the unstructured nature of data. We only need to see how or how much of it do we require to do analysis and/or make decisions. We can always live with the “un-structured” data as long as our basic need of decision making is satisfied.

Probably we do think of having data associated with other data to make it more meaningful. Add context to a data element and we are ready to go!
Dave Duggal on May 16th, 2011 7:34 am

Curt makes a good point re: structured relative to what. The term ‘unstructured data’ was certainly a reaction/response to ‘Classical’ relational databases that turns out were not so relational and couldn’t support the demand for reporting against new data types.

Moreover, the notion of defining schemas in advance constrains more interesting introspection of stored data. Static schemas effectively preclude run-time context. In reality, relations between bits of information is emergent. Standardizing relations gets consistency at the expense of relevance.
stewart levin on May 16th, 2011 11:23 am

A lot of detail here. I thought structured data was that contained in a structure aka database. Everything else was unstructured. exchange cached web pages.

Last I heard about that was a conversation with a Varonis telemarketer about tracking access to unstructured data.
aditya on May 16th, 2011 1:05 pm

In my earlier comment, i would like to replace web page with any text file or “content”. Web page was prob not a good example, but the comment still holds.

I think we probably need to define “structure” now – which i think is something that be understood or deciphered with the help of a “schema” (or design or definition). Is that right?
Curt Monash on May 16th, 2011 8:25 pm

Anant Jhingran of IBM responded very well to the term “polystructured” last week. Judging by his post at http://jhingran.typepad.com/anant_jhingrans_musings/2011/05/ibms-big-data-capabilities.html, he may still feel that way.
Curt Monash on May 16th, 2011 8:40 pm

@aditya,

A page of text has internal structure too. There are paragraphs, sentences, clauses, subject-verb-object relationships and so on.
shrikanth shankar on May 17th, 2011 1:37 am

When you *process* data dont you usually impose structure on it? To some degree its about late-binding of structure (unstructured) vs early binding of structure (structured). I like the term poly-structured. You can view a log file as a sequence of many different kinds of records . The particular query you run, picks one of these. Others (for e.g. Splunk see http://blogs.splunk.com/2010/02/10/sql-splunk-splunkmse/) have made similar points. This where M-R can really shine. Your mappers can parse your log files in many different ways. The R portion is a different discussion
Curt Monash on May 17th, 2011 2:47 am

I notice people putting a hyphen into poly-structured. I guess it does look better that way than polystructured, actually …
Dave Duggal on May 17th, 2011 7:19 am

Shrikanth – Yes! Structure has a temporal quality. When fully dynamical, as in our system, structure is synonymous with cohesion (structure as a run-time construct).
Anant Jhingran on May 17th, 2011 1:49 pm

Curt, more on this debate at http://jhingran.typepad.com/anant_jhingrans_musings/2011/05/more-on-poly-structured-data-and-its-relationship-with-hadoop-big-data.html (i know i should shorten it, sorry!)
What to do About Unstructured Data | ttl :: Time to Live on May 20th, 2011 2:08 pm

[…] we should try to define it first. Read more … This entry was posted on Friday, May 20th, 2011 at 2:07 pmand is filed under Analytics, […]
Search Facets » Big Data and the Rogue Dentists of Kowloon on June 13th, 2011 7:39 am

[…] of analyzing Big Variety, aka poly-structured data (adopting the term Curt Monash has advanced to better describe unstructured or semi-structured data.) With the release of Endeca Latitude 2 today, we have new approaches to analyze Big Variety, at […]
Jeffrey Moran on February 2nd, 2021 5:30 pm

Good info. What data capture system do you recommend for unstructured data? We are looking at several options (one we are considering is above) but are needing to get feedback from those with experience. Thank you

Leave a Reply

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

What to do about “unstructured data”

Comments

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin