What needs to be updated anyway?
Shayne Nelson is posting some pretty wild ideas on data architecture and redundancy. In the process of doing so, he’s reopening an old discussion topic:
Why would data ever need to be erased?
and the natural follow-on
If it doesn’t need to be erased, what exactly do we have to update?
Here are some quick cuts at answering the second question:
- “Primary” data usually doesn’t really need to be updated, exactly. But it does need to be stored in such a way that it can immediately be found again and correctly identified as the most recent information.
- Analytic data usually doesn’t need to be updated with full transactional integrity; slight, temporary errors do little harm.
- “Derived” data such as bank balances (derived from deposits and withdrawals) and inventory levels (derived from purchases and sales) commonly needs to be updated with full industrial-strength protections (see the sketch after this list).
- Certain kinds of primary transactions, such as travel reservations, need the same treatment as “derived” data. When the item sold is unique, the primary/derived distinction largely goes away.
- Notwithstanding the foregoing, it must be possible to update anything for error-correction purposes (something Nelson seems to have glossed over to date).
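To make the “derived” data point concrete, here is a minimal sketch in Python with SQLite (purely illustrative; the table names, account, and amounts are invented). The primary record of a withdrawal is simply appended, but the derived balance really does get updated, and the two changes need to succeed or fail together:

```python
import sqlite3

# Hypothetical schema: an append-only "transactions" log plus a derived "balances" table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (account TEXT, amount REAL);
    CREATE TABLE balances (account TEXT PRIMARY KEY, balance REAL);
    INSERT INTO balances VALUES ('acct-1', 100.0);
""")

def post_withdrawal(account, amount):
    # The primary record is appended; the derived balance is what actually
    # gets updated. Both must change atomically.
    with conn:  # one transaction: both statements commit, or neither does
        conn.execute("INSERT INTO transactions VALUES (?, ?)", (account, -amount))
        conn.execute("UPDATE balances SET balance = balance - ? WHERE account = ?",
                     (amount, account))

post_withdrawal("acct-1", 25.0)
print(conn.execute("SELECT balance FROM balances WHERE account = 'acct-1'").fetchone())  # (75.0,)
```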
Respondents to Nelson’s blog generally argue that it’s better to store data once and have redundant subcopies of it in the form of indexes. I haven’t yet seen any holes in those arguments. Still, it’s a discussion worth looking at and noodling over.
Categories: Theory and architecture | 2 Comments |
Solid state (Flash) memory vs. RAM vs. disks
I just wrote a column and a blog post on the potential for diskless PCs based on flash drives. It was a fun exercise, and I think I kept it general enough that my lack of knowledge about hardware technology details didn’t lead me into significant error.
The first vendor response I got was from Bit Micro Networks, who seem to sell such drives for PCs and enterprise storage alike. One of their press releases touts an Oracle implementation. Interesting idea. It’s far from a substitute for full memory-centric data management, but it’s kind of an intermediate way of getting some of the benefits without altering your traditional software setup much at all.
Categories: Memory-centric data management, Oracle, Solid-state memory | 1 Comment |
Application logic in the database
I’m highly in favor of modularity in application development, but suspicious of folks who promote it to extremes as a panacea. (Perhaps another legacy of my exaggerated infatuation with LISP in the 1980s?) Thus, I was one of the chief drumbeaters for OO programming before Java made it de rigueur, but I also was one of the chief mockers of Philippe Kahn’s claims that Borland would outdevelop Microsoft in office productivity tools just because it used OO tools. (Analyst Michelle Preston bought that pitch lock, stock, and barrel, and basically was never heard from again.)
I’ve held similar views on stored procedures. A transactional DBMS without stored procedures is for many purposes not a serious product. CASE tools that use stored procedures to declaratively implement integrity constraints have been highly valuable for a decade. But more general use of stored procedures has been very problematic, due to the lack of development support for writing and maintaining them in any comprehensive way. Basically, stored procedures have been database-resident spaghetti.
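To illustrate the distinction, here is a minimal sketch in Python with SQLite (the table, trigger, and business limit are all invented). The CHECK clause is the kind of declarative integrity constraint CASE tools have generated for a decade; the trigger body is procedural logic living in the database, which is exactly the sort of code that turns into spaghetti without decent development tooling:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Declarative integrity constraint: easy to state, easy for tools to generate.
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        quantity INTEGER NOT NULL CHECK (quantity > 0)
    )
""")

# Procedural, database-resident logic: code that lives in the database,
# largely invisible to source control and IDEs without special support.
conn.execute("""
    CREATE TRIGGER cap_quantity
    BEFORE INSERT ON orders
    WHEN NEW.quantity > 1000
    BEGIN
        SELECT RAISE(ABORT, 'quantity exceeds hypothetical business limit');
    END
""")

try:
    conn.execute("INSERT INTO orders (quantity) VALUES (5000)")
except sqlite3.DatabaseError as e:
    print("rejected by database-resident logic:", e)
```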
Microsoft claims to have changed all this with the relationship between the new releases of SQL Server and Visual Studio, and has touted this as one of the few “game changers” in SQL Server 2005. I haven’t actually looked at their offering, but I’m inclined to give them the benefit of the doubt — i.e., absent verification I tentatively believe they are making it almost as practical from a team development standpoint to implement code in the database as it is on the middle tier.
Between the Microsoft announcement and the ongoing rumblings of the business rules folks, there’s considerable discussion of putting application logic in the database, including by the usual suspects over on Alf Pedersen’s blog. (Eric’s response in that thread is particularly good.) Here are some of my thoughts:
1. As noted above, putting logic in the database, to the extent the tools are good, has been a good thing. If the tools are indeed better now, it may become a better thing.
2. The myth that an application is just database-logic-plus-the-obvious-UI has been with us for a LONG time. It’s indeed a myth, for several reasons. There’s business process, for one thing. For another, UIs aren’t as trivial as that story would make them sound. (I keep promising to write on the UI point and never get around to it. I will. Stay tuned. For one thing, I have a white paper in the works on portals. For another, I’m not writing enough about analytics, and UI is one of the most interesting things going in analytics these days.) Plus there are many apps for which a straightforward relational/tabular database design doesn’t make sense anyway. (That’s a primary theme of this blog.)
3. It’s really regrettable that the term “business rules” is used so carelessly. It conflates integrity constraints and general application logic. Within application logic, it conflates those which are well served by a development and/or implementation paradigm along the line of a rules engine, and those for which a rules engine would make little sense. It’s just bad semantics.
4. Besides everything else, I mainly agree with SAP’s belief that the DBMS is the wrong place to look for module interfaces.
Categories: Microsoft and SQL*Server, Theory and architecture | 2 Comments |
Reasons to use native XML
From a DevX article on Microsoft’s SQL Server 2005:
Depending on your situation, XML can also be the best choice for storing even highly structured data. Here are a few practical reasons to consider storing data in a field of type XML:
* Repeated shredding or publishing—On-demand transformations carry a performance penalty. If you have to shred or publish the same document over and over again, consider storing it natively as XML. You can always expose it to relational consumers with an XML view.
* Rapidly changing data structures—When modeled correctly, XML lives up to its name: It’s extensible. Developers can add new pieces of data—even new hierarchies—to a schema without compromising existing software. Extensibility is an extra advantage when prototyping, or when working with rapidly changing problem domains such as bioinformatics.
* Atomic data—Sometimes, you’ll have XML data that’s never consumed except as a whole. Think of this as logical atomicity—if you never access the parts individually, you might as well store it in one big chunk.
* Debugging—Especially for new releases, it can be a good idea to tuck away a copy of your XML imports. The data may be redundant, but keeping the original makes tracking down problems a whole lot easier.
Nothing there to disagree with too heavily, although I can think of some other reasons that might rank higher yet.
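To picture what the “repeated shredding” bullet is talking about, here is a toy sketch in Python (the order document and schema are invented). Shredding means parsing the document and flattening it into relational rows; doing that transformation on demand, over and over, is the performance penalty the article says native XML storage lets you avoid:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical order document; with a native XML column you would store this string as-is.
doc = """
<order id="42">
  <line sku="A-100" qty="2"/>
  <line sku="B-200" qty="1"/>
</order>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_lines (order_id INTEGER, sku TEXT, qty INTEGER)")

# "Shredding": parse the XML and flatten it into relational rows.
root = ET.fromstring(doc)
order_id = int(root.get("id"))
for line in root.findall("line"):
    conn.execute("INSERT INTO order_lines VALUES (?, ?, ?)",
                 (order_id, line.get("sku"), int(line.get("qty"))))

print(conn.execute("SELECT * FROM order_lines").fetchall())
# [(42, 'A-100', 2), (42, 'B-200', 1)]
```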
Categories: Microsoft and SQL*Server, Structured documents | 4 Comments |
Two kinds of DBMS extensibility
Microsoft took slight exception to my claim that they lack fully general DBMS extensibility. The claim is actually correct, but perhaps it could lead to confusion. And anyhow there’s a distinction here worth drawing, namely:
There are two different kinds of DBMS extensibility.
The first one, which Microsoft has introduced in SQL Server 2005 (but which other vendors have had for many years), is UDTs (User-Defined Types), sometimes called user-defined functions in other systems. These are in essence datatypes that are calculated functions of existing datatypes. You could use a UDT, for example, to make the NULLs in SQL go away, if you hate them, or to calculate bond interest according to the industry-standard “360 day year.” Columns of these datatypes can be treated just like other columns — one can use them in joins, one can index on them, the optimizer can be aware of them, etc.
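To show the flavor of UDT-style extensibility, here is a minimal sketch in Python with SQLite (the function, rate, and dates are invented, and the day-count logic is oversimplified relative to real bond conventions). Registering a scalar function with the engine is loosely the idea: queries can then compute with it as if it were built in:

```python
import sqlite3
from datetime import date

def interest_360(principal, annual_rate, start, end):
    # Simplified accrual on a 360-day year; real day-count conventions
    # (30/360, Actual/360, etc.) have more special cases than this.
    days = (date.fromisoformat(end) - date.fromisoformat(start)).days
    return principal * annual_rate * days / 360.0

conn = sqlite3.connect(":memory:")
# Registering the function makes it callable from SQL, which is roughly how a
# user-defined function/type extends what queries can work with.
conn.create_function("interest_360", 4, interest_360)
row = conn.execute(
    "SELECT interest_360(1000.0, 0.05, '2006-01-01', '2006-07-01')"
).fetchone()
print(row)  # roughly (25.1,) for the 181 days in that span
```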
The second one, commonly known by the horrible name of abstract datatypes (ADTs), is found mainly in Oracle, DB2, and previously the Informix/Illustra products. Also, if my memory is accurate, Ingres has a very partial capability along those lines, and PostgreSQL is said to be implementing them too. ADTs offer a way to add totally new datatypes into a relational system, with their own data access methods (e.g., index structures). That’s how a DBMS can incorporate a full-text index, or a geospatial datatype. It can also be a way to more efficiently implement something that would also work as a UDT.
In theory, Oracle et al. expose the capability to users to create ADTs. In practice, you need to be a professional DBMS developer to write them, and they are written either by the DBMS vendors themselves or by specialist DBMS companies. E.g., much geospatial data today is stored in ESRI add-ons to Oracle; ESRI of course offered a specialty geospatial DBMS before ADTs were on the market.
Basically, implementing a general ADT capability is a form of modularity that lets new datatypes be added more easily than if you don’t have it. But it’s not a total requirement for new datatypes. E.g., I was wrong about Microsoft’s native XML implementation; XML is actually managed in the relational system. (More on that in a subsequent post.)
Categories: Actian and Ingres, Data types, IBM and DB2, Microsoft and SQL*Server, Open source, Oracle, Theory and architecture | 3 Comments |
SAP’s version of DBMS2
I just spent a couple of days at SAP’s analyst meeting, and realized something I’d somewhat forgotten – much of the DBMS2 concept was inspired by SAP’s technical strategy. That’s not to say that SAP’s techies necessarily agree with me on every last point. But I do think it is interesting to review SAP’s version of DBMS2, to the extent I understand it.
1. SAP’s Enterprise Services Architecture (ESA) is meant to be, among other things, an abstraction layer over relational DBMS. The mantra is that they’re moving to a “message-based architecture” as opposed to a “database architecture.” These messages are in the context of a standards-based SOA, with a strong commitment to remaining open and standards-based, at least on the data and messaging levels. (The main limitation on openness that I’ve detected is that they don’t think much of standards such as BPEL in the business process definition area, which aren’t powerful enough for them.)
2. One big benefit they see to this strategy is that it reduces the need to have grand integrated databases. If one application manages data for an entity that is also important to another application, the two applications can exchange messages about the entity. Anyhow, many of their comments make it clear that, between partner company databases (a bit of a future) and legacy app databases (a very big factor in the present day), SAP is constantly aware of situations in which a single integrated database is infeasible.
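As a toy sketch of that message-based pattern as I understand it (Python; the bus, event names, and applications are invented stand-ins for whatever middleware an ESA landscape really uses), each application keeps its own data and learns about changes from messages, rather than reading and writing one grand shared schema:

```python
import json
import queue

bus = queue.Queue()  # stand-in for real SOA messaging middleware

def billing_app_change_address(customer_id, new_address):
    # The billing application owns its own customer record and publishes
    # a message instead of writing to a shared, integrated database.
    bus.put(json.dumps({
        "event": "customer_address_changed",
        "customer_id": customer_id,
        "address": new_address,
    }))

def shipping_app_consume():
    # The shipping application keeps its own copy of the data it cares
    # about, updated from messages rather than from a common schema.
    event = json.loads(bus.get())
    print("shipping app applies:", event)

billing_app_change_address("cust-17", "1 Main St")
shipping_app_consume()
```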
3. SAP is still deeply suspicious of redundant transactional data. They feel that with redundant data you can’t have a really clean model – unless, of course, you code up really rigorous synchronization. However, if for some reason synchronization is preferred – e.g., for performance reasons — it can be hidden from users and most developers.
4. One area where SAP definitely favors redundancy and synchronization is data warehousing. Indeed, they have an ever more elaborate staging system to move data from operational to analytic systems.
5. In general, they are far from being relational purists. For example, Shai Agassi referred to doing things that you can’t do in a pure relational approach. And Peter Zencke reminded me that this attitude is nothing new. SAP has long had complex business objects, and even done some of its own memory management to make them performant, when they were structured in a manner that RDBMS weren’t well suited for. (I presume he was referring largely to BAPI.)
6. That said, they’re of course using relational data stores today for most things. One exception is text/content, which they prefer to store in their own text indexing/management system TREX. Another example is their historical support for MOLAP, although they seem to be edging as far away from that as they can without offending the MOLAP-loving part of their customer base.
Incidentally, the whole TREX strategy is subject to considerable doubt too. It’s not a state-of-the-art product, and they currently don’t plan to make it into one. In particular, they have a prejudice against semi-automated ontology creation, and that has clearly become a requirement for top-tier text technologies.
7. One thing that Peter said which confused me a bit came up when we were talking about nonrelational data retrieval. The example he used was retrieving information on all of a specific sales rep’s customers, or perhaps on several sales reps’ customers. I got the feeling he was talking about the ability to text search on multiple columns and/or multiple tables/objects/whatever at once, but I can’t honestly claim that I connected all the dots.
And of course, the memory-centric ROLAP tool BI Accelerator — technology that’s based on TREX — is just another example of how SAP is willing to go beyond passively connecting to a single RDBMS. And while their sponsorship of MaxDB isn’t really an example of that, it is another example of how SAP’s strategy is not one to gladden the hearts of the top-tier DBMS vendors.
Categories: EAI, EII, ETL, ELT, ETLT, Memory-centric data management, MOLAP, OLTP, SAP AG, Theory and architecture | 9 Comments |
Relational DBMS versus text data
There seems to be tremendous confusion about “search,” “meaning,” “semantics,” the suitability of relational DBMS to manage text data, and similar subjects. Here are some observations that may help sort some of that out.
1. Relational database theorists like to talk about the “meaning” or “semantics” of data as being in the database (specifically its metadata, and more specifically its constraints). This is at best a very limited use of the words “meaning” or “semantics,” and has little to do with understanding the meaning of plain English (or other language) phrases, sentences, paragraphs, etc. that may be stored in the database. Hugh Darwen is right and his fellow relational theorists are confused.
2. The standard way to manage text is via a full-text index, designed like this: for hundreds of thousands of words, the index maintains a list of which documents each word appears in, and at what positions in the document it appears. This is a columnar, memory-centric approach that doesn’t work well with the architecture of mainstream relational products. Oracle pulled off a decent single-server integration nonetheless, although performance concerns linger to this day. Others, like Sybase, which attempted a Verity integration, couldn’t make it work reasonably at all. Microsoft, which started from the Sybase architecture, didn’t even try, or if they tried it wasn’t for long; Microsoft’s text search strategy has been multi-server more or less from the get-go.
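For concreteness, here is a toy version of that structure in Python (illustrative only; real engines add stemming, compression, ranking, and much more): a positional inverted index mapping each word to the documents and positions where it occurs:

```python
from collections import defaultdict

# word -> {doc_id: [positions]}
index = defaultdict(lambda: defaultdict(list))

def add_document(doc_id, text):
    for position, word in enumerate(text.lower().split()):
        index[word][doc_id].append(position)

def lookup(word):
    # Which documents contain the word, and where in each document.
    return {d: p[:] for d, p in index.get(word.lower(), {}).items()}

add_document(1, "relational databases manage structured data")
add_document(2, "text search engines index words and positions")
print(lookup("index"))  # {2: [3]}
```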
3. Notwithstanding point #2, Oracle, IBM, Microsoft, and others have extended their SQL DBMS to handle text via the SQL3 (or SQL/MM) standard. (Truth be told, I get the names and sequencing of the SQL standard versions mixed up.) From this standpoint, the full text of a document is in a single column, and one can write WHERE clauses on that column using a rich set of text search operators.
But while such SQL statements formally fit into the relational predicate logic model, the fit is pretty awkward. Text search functions aren’t two-valued binary yes/no types of things; rather, they give scores, e.g. with 101 possible values (the integers from 0 – 100). Compounding them into a two-valued function typically throws away information, especially since that compounding isn’t well understood (which is why it’s so hard to usefully federate text searches across different corpuses).
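A toy illustration of that information loss (Python; the scores are invented): pushing graded relevance scores through a yes/no threshold, WHERE-clause style, keeps the matches but forgets the ranking:

```python
# Hypothetical relevance scores (0-100) from a text search engine.
scores = {"doc_a": 92, "doc_b": 55, "doc_c": 54, "doc_d": 3}

THRESHOLD = 50  # an arbitrary cutoff, as a two-valued predicate would impose
matches = {doc for doc, score in scores.items() if score >= THRESHOLD}

print(matches)  # {'doc_a', 'doc_b', 'doc_c'}, with no hint that doc_a was far better
print(sorted(scores, key=scores.get, reverse=True))  # the ranking the predicate threw away
```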
4. Something even trickier is going on. Text search can be carried out against many different kinds of things. One increasingly useful target is the tables of a relational database. Where a standard SQL query might have trouble finding all the references in a whole database to a particular customer organization or product line or whatever, a text search can do a better job. This kind of use is becoming increasingly frequent. And while it works OK against relational products, it doesn’t fit into the formal relational model at all (at least not without a tremendous amount of contortion).
5. Relational DBMS typically manage the data they index. Text search systems often don’t. But that difference is a relatively small one compared with some of the others mentioned above, especially since it’s a checkmark item for leading RDBMS to have some sort of formal federation capability.
Categories: Data types, Database diversity, Memory-centric data management, Text, Theory and architecture | 13 Comments |
More flame war stupidity
Robert Seiner (publisher of TDAN) and Fabian Pascal are now claiming that Computerworld approached Bob and asked him to do something about the false charge that I personally engaged in censorship. To the best of my knowledge, they’re both lying. It was just me, and me alone, who approached Bob, which is exactly what one would think, if for some odd reason one cared about the matter at all. I don’t have the faintest idea why they fabricated this story, or what they think it demonstrates — but they did.
Seiner also picked a title for an article of mine he published, then published one by Fabian attacking me for the title. Classy.
Bob also made two promises in the matter which he didn’t keep. Nor did he have the courtesy to inform me that he’d changed his mind, nor did he really address it when I called him on it.
I wondered why Seiner kept on publishing Pascal’s stuff, even for free, when most of Fabian’s other publishers have dropped him. Now I have a better idea. They’re soulmates.
A pity. Partway through our discussions, Bob sounded eminently reasonable. That’s why I jumped at his suggestion I write an article for him. Oh well; live and learn.
And for the record — no, I won’t respond to Pascal’s critiques point by point. He typically attacks straw men, rather than restricting his barbs to my actual opinions. In those areas where we do actually disagree, I haven’t hesitated to publish follow-on arguments, repeatedly and at length, here and elsewhere. I’ve given that relative nonentity much more attention than he deserves.
Also for the record — even though I don’t respond to every nasty shot Pascal and his associates take at me, I’m of course not conceding that his other libels and opinions are actually correct. I just think that by and large he’s a waste of bandwidth, because even his coherent ideas are quickly sidetracked by highly illogical fulminations. Even in articles where he’s otherwise making enough sense to respond to, he usually goes off on some extremist semantics-related kick that doesn’t mesh well with his own imperfect command of the English language.
(I really want to respond to his film contracts example from a three-year-old anti-XML diatribe. But the article gets bogged down with various “definitions” that are not easily reconciled to normal usage of the words, and it’s too much trouble to sort through them all. Maybe I’ll respond to the idea without linking to the article itself, when I get around to it.)
Exception to the above slam at Pascal — he recently posted a good interchange he had with Hugh Darwen, which I’m referencing in another post in this blog. His side was wrong, but both sides were well-presented.
Categories: About this blog, Data models and architecture | 2 Comments |
Some Moore’s Law data points
I’m not a hardware guy, but here are some data points around the subject of Moore’s Law, quasi-Moore laws, and their bearing on random access times to disk and RAM. This line of inquiry is central to my argument favoring memory-centric data management.
Human-Computer Interaction cites 10 nanoseconds for RAM access time and 7 milliseconds for disk, and those figures are a couple of years old. That works out to 700,000:1, which is pretty supportive of my figure of 1,000,000:1.
Don Burleson asserts that RAM speed has been 50 nanoseconds for decades. Hmm. I’m not sure what that means, since I’d think that RAM access speeds are bounded by clock speed. Of course, 20 megahertz is 50 nanoseconds per cycle, so some multiple of 20 megahertz would suffice to allow true 50 nanosecond access.
EDIT: I looked again, and he says that the 50 nanosecond limit is based on “speed of light and Proximity to CPU.” I’m even more confused than before, since light travels about 30 centimeters per nanosecond (at least in a vacuum).
A summary of a February, 2000 Jim Gray article makes some interesting points and claims in the same general subject area. One that stands out:
Storage capacity improves 100x / decade, while storage device throughput increases 10x / decade. At the same time the ratio between disk capacity and disk accesses/second is increasing more than 10x / decade. Consequently, disk accesses become more precious and disk data becomes colder with time, at a rate of 10x / decade.
Also, the disk capacity analog to Moore’s Law is sometimes named after Mark Kryder, and the network capacity version is attributed to George Gilder.
And finally — the fastest disks now made seem to spin at 15000 RPM. Those would take 2 milliseconds to spin halfway around, for the most naive estimate of their average random access time. And the naive estimate seems not to be too bad — depending on the exact model, they’re actually advertised with 3.3-3.9 millisecond seek times.
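For what it’s worth, here is the arithmetic behind that naive estimate, spelled out in Python:

```python
rpm = 15000
ms_per_revolution = 60.0 / rpm * 1000      # 4.0 ms per full rotation
avg_rotational_latency_ms = ms_per_revolution / 2
print(avg_rotational_latency_ms)           # 2.0 ms, i.e., half a rotation on average
```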
Categories: Memory-centric data management | Leave a Comment |