December 9, 2005

Relational DBMS versus text data

There seems to be tremendous confusion about “search,” “meaning,” “semantics,” the suitability of relational DBMS to manage text data, and similar subjects. Here are some observations that may help sort some of that out.

1. Relational database theorists like to talk about the “meaning” or “semantics” of data as being in the database (specifically its metadata, and more specifically its constraints). This is at best a very limited use of the words “meaning” or “semantics,” and has little to do with understanding the meaning of plain English (or other language) phrases, sentences, paragraphs, etc. that may be stored in the database. Hugh Darwen is right and his fellow relational theorists are confused.

2. The standard way to manage text is via a full-text index, designed like this: For hundreds of thousands of words, the index maintains a list of which documents the word appears in, and at what positions in the document it appears. This is a columnar, memory-centric approach, that doesn’t work well with the architecture of mainstream relational products. Oracle pulled off a decent single-server integration nonetheless, although performance concerns linger to this day. Others, like Sybase, which attempted a Verity integration, couldn’t make it work reasonably at all. Microsoft, which started from the Sybase architecture, didn’t even try, or if they tried it wasn’t for long; Microsoft’s text search strategy has been multi-server more or less from the getgo.

3. Notwithstanding point #2, Oracle, IBM, Microsoft, and others have SQL DBMS extended to handle text via the SQL3 (or SQL/MM ) standard. (Truth be told, I get the names and sequencing of the SQL standard versions mixed up.) From this standpoint, the full text of a document is in a single column, and one can write WHERE clauses on that column using a rich set of text search operators.

But while such SQL statements formally fit into the relational predicate logic model, the fit is pretty awkward. Text search functions aren’t two-valued binary yes/no types of things; rather, they give scores, e.g. with 101 possible values (the integers from 0 – 100). Compounding them into a two-valued function typically throws away information, especially since that compounding isn’t well understood (which is why it’s so hard to usefully federate text searches across different corpuses).

4. Something even trickier is going on. Text search can be carried out against many different kinds of things. One increasingly useful target is the tables of a relational database. Where a standard SQL query might have trouble finding all the references in a whole database to a particular customer organization or product line or whatever, a text search can do a better job. This kind of use is becoming increasingly frequent. And while it works OK against relational products, it doesn’t fit into the formal relational model at all (at least not without a tremendous amount of contortion).

5. Relational DBMS typically manage the data they index. Text search systems often don’t. But that difference is almost a small one compared with some of the others mentioned above, especially since it’s a checkmark item for leading RDBMS to have some sort of formal federation capability.

Comments

13 Responses to “Relational DBMS versus text data”

  1. Text Technologies»Blog Archive » Misunderstandings of text management on December 12th, 2005 5:35 pm

    [...] One of the biggest is in the area of text. They fail to see how text data management is fundamentally different from tabular data management. Here’s a little article explaining why text doesn’t fit well into the relational model. • • • [...]

  2. Avi Rappoport on February 9th, 2006 6:16 pm

    I agree with this entirely — wrote an article on the related topic of searching text and how DBMSs just tend to be the wrong model, http://www.searchtools.com/info/database-search.html. I will link to this as well, it as you have more of the database architecture details.

  3. EnterpriseSearchEngines.net » The problem of text in relational databases on April 18th, 2006 6:27 pm

    [...] Curt Monash has some observations about search, meaning, semantics, the suitability of relational DBMS to manage text data and similar subjects: [...]

  4. Text Technologies»Blog Archive » FAST said to be pursuing BI on January 28th, 2007 7:05 am

    [...] EDIT:   Point #4 of my post on the mismatch between relational databases and text search is pretty relevant here. • • • [...]

  5. The 4 main approaches to datatype extensibility | DBMS2 -- DataBase Management System Services on June 27th, 2008 2:26 pm

    [...] Text search is a huge business on the web, and a separate big business in enterprises. And text doesn’t fit well into the relational paradigm at [...]

  6. Complete Rewrite on October 23rd, 2008 7:48 am

    Text Search and the Relational Model…

    Since text search is one of my top areas of expertise, I hope I can explain to you why the relational model is perfectly capable of capturing the structure of text. I’ll start at the very bottom, explaining what text search really is….

  7. imitatio creatio » Blog Archive » SQL przyszłości on December 2nd, 2008 8:48 pm

    [...] Po pierwsze, SQL/MM Full-Text na razie jest tylko w ułamkowej części implementowany przez niektóre silniki bazodanowe. Po drugie, relacyjne bazy danych mają swoje ograniczenia. Widoczne są one także na polu wyszukiwania “ze zrozumieniem” – bo w sumie do tego dążymy jak się chwilę zastanowić. Pisze o tym np. Curt Monash w artykule Relational DBMS versus text data. [...]

  8. What to do about “unstructured data” | DBMS 2 : DataBase Management System Services on May 15th, 2011 4:54 pm

    [...] In December, 2005 I expounded on the mismatch between text data and the relational model. [...]

  9. Text data management, Part 1: Confusion | DBMS 2 : DataBase Management System Services on October 10th, 2011 8:58 pm

    [...] are probably fewer people now than there were six years ago who need to be told that text and relational database management are very different things. Other misconceptions, however, appear to be on the rise. Specific points that are commonly [...]

  10. Approximate query results | DBMS 2 : DataBase Management System Services on July 12th, 2012 1:30 pm

    [...] First, some queries don’t have binary results, even in principle. Notably, text queries are answered via relevancy rankings, which fit badly into the relational model. [...]

  11. Scott Meyer on July 12th, 2012 5:15 pm

    “Text search functions aren’t two-valued binary yes/no types of things”

    Actually, they are. It is just that the volume of results one gets from such operators applied to a text corpus is so large as to be useless to a human being, the usual consumer. Instead the results are sorted by relevance, an expensive total ordering which was computed at index build time and embodied in document id order. The human sees only the most relevant portion of an exact result set.

    Arguably, it is IR which has it right (every relationship that actually occurs is automatically indexed) and the RDBMS world which got it wrong (between tables, it is unclear which relationships actually exist and indexes are created by hand).

  12. Curt Monash on July 12th, 2012 5:52 pm

    Scott,

    Text search is, in principle, a function from a set of documents to a set of possible relevancy scores, which set has cardinality a lot higher than 2.

    You are correct that an early stage in evaluation is apt to be a binary filter, but that doesn’t contradict my point.

  13. fitness training Winter Park on February 20th, 2014 12:22 am

    Good answers in return of this question with real arguments and telling all concerning that.

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.