There seems to be tremendous confusion about “search,” “meaning,” “semantics,” the suitability of relational DBMS to manage text data, and similar subjects. Here are some observations that may help sort some of that out.
1. Relational database theorists like to talk about the “meaning” or “semantics” of data as being in the database (specifically its metadata, and more specifically its constraints). This is at best a very limited use of the words “meaning” or “semantics,” and has little to do with understanding the meaning of plain English (or other language) phrases, sentences, paragraphs, etc. that may be stored in the database. Hugh Darwen is right and his fellow relational theorists are confused.
2. The standard way to manage text is via a full-text index, designed like this: For hundreds of thousands of words, the index maintains a list of which documents the word appears in, and at what positions in the document it appears. This is a columnar, memory-centric approach, that doesn’t work well with the architecture of mainstream relational products. Oracle pulled off a decent single-server integration nonetheless, although performance concerns linger to this day. Others, like Sybase, which attempted a Verity integration, couldn’t make it work reasonably at all. Microsoft, which started from the Sybase architecture, didn’t even try, or if they tried it wasn’t for long; Microsoft’s text search strategy has been multi-server more or less from the getgo.
3. Notwithstanding point #2, Oracle, IBM, Microsoft, and others have SQL DBMS extended to handle text via the SQL3 (or SQL/MM ) standard. (Truth be told, I get the names and sequencing of the SQL standard versions mixed up.) From this standpoint, the full text of a document is in a single column, and one can write WHERE clauses on that column using a rich set of text search operators.
But while such SQL statements formally fit into the relational predicate logic model, the fit is pretty awkward. Text search functions aren’t two-valued binary yes/no types of things; rather, they give scores, e.g. with 101 possible values (the integers from 0 – 100). Compounding them into a two-valued function typically throws away information, especially since that compounding isn’t well understood (which is why it’s so hard to usefully federate text searches across different corpuses).
4. Something even trickier is going on. Text search can be carried out against many different kinds of things. One increasingly useful target is the tables of a relational database. Where a standard SQL query might have trouble finding all the references in a whole database to a particular customer organization or product line or whatever, a text search can do a better job. This kind of use is becoming increasingly frequent. And while it works OK against relational products, it doesn’t fit into the formal relational model at all (at least not without a tremendous amount of contortion).
5. Relational DBMS typically manage the data they index. Text search systems often don’t. But that difference is almost a small one compared with some of the others mentioned above, especially since it’s a checkmark item for leading RDBMS to have some sort of formal federation capability.