Data types
Analysis of data management technology optimized for specific datatypes, such as text, geospatial, object, RDF, or XML. Related subjects include:
- Any subcategory
- Database diversity
Big scientific databases need to be stored somehow
A year ago, Mike Stonebraker observed that conventional DBMS don’t necessarily do a great job on scientific data, and further pointed out that different kinds of science might call for different data access methods. Even so, some of the largest databases around are scientific ones, and they have to be managed somehow. For example:
- Microsoft just put out an overwrought press release. The substance seems to be that Pan-STARRS — a Jim Gray legacy also discussed in an August, 2008 Computerworld article — is adding 1.4 terabytes of image data per night, and one not so new database adds 15 terabytes per year of some kind of computer simulation output used to analyze protein folding. Both run on SQL Server, of course.
- Kognitio has an astronomical database too, at Cambridge University, adding 1/2 a terabyte of data per night.
- Oracle is used for a McGill University proteonomics database called CellMapBase. A figure of 50 terabytes of “mass storage” is included, which doesn’t include tape backup and so on.
- The Large Hadron Collider, once it actually starts functioning, is projected to generate 15 petabytes of data annually, which will be initially stored on tape and then distributed to various computing centers around the world.
- Netezza is proud of its ability to serve images and the like quickly, although off the top of my head I’m not thinking of a major customer it has in that area. (But then, if you just sell software, your academic discount can approach 100%; but if like Netezza you have an actual cost of goods sold, that’s not as appealing an option.)
Long-term, I imagine that the most suitable DBMS for these purposes will be MPP systems with strong datatype extensibility — e.g., DB2, PostgreSQL-based Greenplum, PostgreSQL-based Aster nCluster, or maybe Oracle.
| Categories: Aster Data, Data types, Greenplum, IBM and DB2, Kognitio, Microsoft and SQL*Server, Netezza, Oracle, Parallelization, PostgreSQL, Scientific research | 1 Comment |
Teradata Geospatial, and datatype extensibility in general
As part of it’s 13.0 release this week, Teradata is productizing its geospatial datatype, which previously was just a downloadable library. (Edit: More precisely, Teradata announced 13.0, which will actually be shipped some time in 2009.) What Teradata Geospatial now amounts to is:
- User-defined functions (UDF) written by Teradata (this is the part that existed before).
- (Possibly new) Enhanced implementations of the Teradata geospatial UDFs, for better performance.
- (Definitely new) Optimizer awareness of the Teradata geospatial UDFs.
Teradata also intends in the future to implement actual geospatial indexing; candidates include r-trees and tesselation.
Hearing this was a good wake-up call for me, because in the past I’ve conflated two issues on datatype extensibility, namely:
- Whether the query executer uses a special access method (i.e., index type) for the datatype
- Whether the optimizer is aware of the datatypes.
But as Teradata just pointed out, those two issues can indeed be separated from each other.
| Categories: Data types, Data warehousing, GIS and geospatial, Teradata | 1 Comment |
Schema flexibility and XML data management
Conor O’Mahony, marketing manager for IBM’s DB2 pureXML talks a lot about one of my favorite hobbyhorses — schema flexibility — as a reason to use an XML data model. In a number of industries he sees use cases based around ongoing change in the information being managed:
- Tax authorities change their rules and forms every year, but don’t want to do total rewrites of their electronic submission and processing software.
- The financial services industry keeps inventing new products, which don’t just have different terms and conditions, but may also have different kinds of terms and conditions.
- The same, to some extent, goes for the travel industry, which also keeps adding different and different kinds of destinations.
- The energy industry keeps adding new kinds of highly complex equipment it has to manage.
Conor also thinks market evidence shows that XML’s schema flexibility is important for data interchange.
| Categories: Data models and architecture, EAI, EII, ETL, ELT, ETLT, IBM and DB2, Native XML, pureXML | 3 Comments |
Vertical market XML standards
Tracking the alphabet soup of vertical market XML standards is hard. So as a starting point, I’m splitting a list I got from IBM into a standalone post.
Among the most important or successful IBM pureXML-supported standards, in terms of downloads and other evidence of customer interest, are:
| Categories: Application areas, EAI, EII, ETL, ELT, ETLT, IBM and DB2, Native XML, pureXML | 2 Comments |
Overview of IBM DB2 pureXML
On August 29, I had a great call with IBM about DB2 pureXML (most of the IBM side of the talking was done by Conor O’Mahony and Qi Jin). I’m finally getting around to writing it up now. (The world of tabular data warehousing has kept me just a wee bit busy …)
As I write it, I see there are a considerable number of holes, but that’s the way it seems to go when researching XML storage. I’m also writing up a September call from which I finally figured out (I think) the essence of how MarkLogic Server works – but only after five months of trying. It turns out that MarkLogic works rather differently from DB2 pureXML. Not coincidentally, IBM and Mark Logic focus on rather different use cases for native XML storage.
What I understand so far about the basic DB2 pureXML architecture goes like this:
| Categories: EAI, EII, ETL, ELT, ETLT, IBM and DB2, Native XML, pureXML | 4 Comments |
MarkLogic architecture deep dive
While I previously posted in great detail about how MarkLogic Server is an ACID-compliant XML-oriented DBMS with integrated text search that indexes everything in real time and executes range queries fairly quickly, I didn’t have a good feel for how all those apparently contradictory characteristics fit into a single product. But I finally had a call with Mark Logic Director of Engineering Ron Avnur, and think I have a better grasp of the MarkLogic architecture and story.
Ron described MarkLogic Server as a DBMS for trees.
| Categories: Mark Logic, Native XML, Text | 3 Comments |
Netezza and Teradata on analytic geospatial data management
Geospatial data management is one of the flavors of the month:
- Last week, Teradata claimed it has the most sophisticated analytic geospatial data management capability.
- Also last week, Netezza’s newly acquired Netezza Spatial technology attracted a lot of attention.
- This week, Oracle called attention to its geospatial capabilities.
So I asked Netezza and Teradata what this geospatial analytics stuff is all about.
| Categories: Analytic technologies, Data warehousing, GIS and geospatial, Netezza, Teradata | 3 Comments |
Oracle spotlights its datatype support
Oracle put out a flurry of press releases today in conjunction with Oracle OpenWorld. One, which was simply positioned as a report on some “mission-critical” customer apps, caught my eye because all four detailed examples involved nonstandard datatypes:
- Two Oracle Spatial
- One “semantic,” which in Oracle lingo seems to mean — you guessed it — RDF
- One DICOM, which seems to be a medical imaging datatype.
| Categories: Data types, GIS and geospatial, Oracle, RDF and graphs | 3 Comments |
Peter Batty on Netezza Spatial
As previously noted, I’m not up to speed on Netezza Spatial. Phil Francisco of Netezza has promised we’ll fix that ASAP. In the mean time, I found a blog by a guy named Peter Batty, who evidently:
- Knows a lot about geospatial data and its uses
- Is consulting to Netezza
- Is smart
Batty offers a lot of detail in two recent posts, intermixed with some gollygeewhiz about Netezza in general. If you’re interested in this stuff, Batty’s blog is well worth checking out. Read more
| Categories: Analytic technologies, Data warehousing, GIS and geospatial, Netezza, Telecommunications | 2 Comments |
Teradata decides to compete head-on as a data warehouse appliance vendor
In a press release today that is surely timed to impinge on the Netezza user conference news cycle, Teradata has come out swinging. Highlights include:
- Teradata, which long avoided the “appliance” term, now says it sells both “data warehouse appliances” and “data mart appliances.” Indeed, it claims to have “invented the original appliance” — which is pretty close to being true.*
- Teradata claims its “new appliance easily delivers up to 5 to 10 times performance improvement over competitors’ appliances,” at $119,000 per terabyte US list price.
- Teradata claims a 150% faster “scan rate” than competitors. Teradata is surely thinking of Netezza when saying that.
- Teradata claims 10X performance improvement on “selected queries” vs. the “competition.”
- Teradata thinks its geospatial data management capability is better than competitors’, and that this is an important indicator of Teradata’s general overall greater sophistication.
| Categories: Analytic technologies, Data warehouse appliances, Data warehousing, GIS and geospatial, Netezza, Teradata | 3 Comments |
Known applications of MapReduce
Most of the actual MapReduce applications I’ve heard of fall into a few areas:
- Text tokenization, indexing, and search
- Creation of other kinds of data structures (e.g., graphs)
- Data mining and machine learning
That covers all MapReduce apps I recall hearing about via commercial companies and users, and also includes most of what’s in the two big sources I found online.
| Categories: MapReduce, RDF and graphs, Text | 11 Comments |
Who is doing what in XML data management these days?
A comment thread to a post on a different subject has opened up a discussion of XML storage. Frankly, I haven’t kept up with my briefings on the subject, in part because XML support hasn’t proved to be very important yet to the big DBMS vendors, somewhat to my surprise. When last I looked, the situation wasn’t much different from what it was back in November, 2005. Unless I’ve missed something (and please tell me if I have!), here’s what’s going on: Read more
| Categories: IBM and DB2, Intersystems and Cache', Mark Logic, Microsoft and SQL*Server, Native XML, Oracle | 7 Comments |
Detailed analysis of Perst and other in-memory object-oriented DBMS
Dan Weinreb — inspired by but not linking to my recent short post on McObject’s object-oriented in-memory DBMS Perst — has posted a detailed discussion of Perst on his own blog. For context, he compares it briefly to analogous products, most especially Progress’s — which used to be ObjectStore, of which Dan was the chief architect.
This was based on documentation and general sleuthing (Dan figured out who McObject got Perst from), rather than hands-on experience, so performance figures and the like aren’t validated. Still, if you’re interested in such technology, it’s a fascinating post.
| Categories: In-memory DBMS, McObject, Memory-centric data management, Object | Leave a Comment |
Open source in-memory DBMS
I’ve gotten email about two different open source in-memory DBMS products/projects. I don’t know much about either, but in case you care, here are some pointers to more info.
First, the McObject guys — who also sell a relational in-memory product — have an object-oriented, apparently Java-centric product called Perst. They’ve sent over various press releases about same, the details of which didn’t make much of an impression on me. (Upon review, I see that one of the main improvements they cite in Perst 3.0 is that they added 38 pages of documentation.)
Second, I just got email about something called CSQL Cache. You can read more about CSQL Cache here, if you’re willing to navigate some fractured English. CSQL’s SourceForge page is here. My impression is that CSQL Cache is an in-memory DBMS focused on, you guessed it, caching. It definitely seems to talk SQL, but possibly its native data model is of some other kind (there are references both to “file-based” and “network”.)
| Categories: Cache, DBMS product categories, In-memory DBMS, McObject, Memory-centric data management, OLTP, Object, Open source | 5 Comments |
McObject eXtremeDB — a solidDB alternative
McObject — vendor of memory-centric DBMS eXtremeDB — is a tiny, tiny company, without a development team of the size one would think needed to turn out one or more highly-reliable DBMS. So I haven’t spent a lot of time thinking about whether it’s a serious alternative to solidDB for embedded DBMS, e.g. in telecom equipment. However:
- IBM’s acquisition of Solid seems to suggest a focus on DB2 caching rather than the embedded market
- McObject actually has built up something of a customer list, as per the boilerplate on any of its press releases.
And they do seem to have some nice features, including Patricia tries (like solidDB), R-trees (for geospatial), and some kind of hybrid disk-centric/memory-centric operation.
| Categories: GIS and geospatial, In-memory DBMS, McObject, Memory-centric data management, solidDB | 5 Comments |
Database blades are not what they used to be
In which we bring you another instantiation of Monash’s First Law of Commercial Semantics: Bad jargon drives out good.
When Enterprise DB announced a partnership with Truviso for a “blade,” I naturally assumed they were using the term in a more-or-less standard way, and hence believed that it was more than a “Barney” press release.* Silly me. Rather than referring to something closely akin to “datablade,” EnterpriseDB’s “blade” program turns out to just to be a catchall set of partnerships.
*A “Barney” announcement is one whose entire content boils down to “I love you; you love me.”
According to EnterpriseDB CTO Bob Zurek, the main features of the “blade” program include:
| Categories: Data types, Emulation, transparency, portability, EnterpriseDB and Postgres Plus, Open source, PostgreSQL | 3 Comments |
Truviso and EnterpriseDB blend event processing with ordinary database management
Truviso and EnterpriseDB announced today that there’s a Truviso “blade” for Postgres Plus. By email, EnterpriseDB Bob Zurek endorsed my tentative summary of what this means technically, namely:
There’s data being managed transactionally by EnterpriseDB.
Truviso’s DML has all along included ways to talk to a persistent Postgres data store.
If, in addition, one wants to do stream processing things on the same data, that’s now possible, using Truviso’s usual DML.
The Mark Logic story in XML database management
Mark Logic* has an interesting, complex story. They sell a technology stack based on an XML DBMS with text search designed in from the get go. They usually want to be known as a “content” technology provider rather than a DBMS vendor, but not quite always.
*Note: Product name = MarkLogic, company name = Mark Logic.
I’ve agreed to do a white paper and webcast for Mark Logic (sponsored, of course). But before I start serious work on those, I want to blog based on what I know. As always, feedback is warmly encouraged.
Some of the big differences between MarkLogic and other DBMS are:
-
MarkLogic’s primary DML/DDL (Data Manipulation/Description Language) is XQuery. Indeed, Mark Logic is in many ways the chief standard-bearer for pure XQuery, as opposed to SQL/XQuery hybrids.
-
MarkLogic’s XML processing is much faster than many alternatives. A client told me last year that – in an application that had nothing to do with MarkLogic’s traditional strength of text search – MarkLogic’s performance beat IBM DB2/Viper’s by “an order of magnitude.” And I think they were using the phrase correctly (i.e., 10X or so).
-
MarkLogic indexes all kinds of entities and facts, automagically, without any schema-prebuilding. (Nor, I gather, do they depend on individual documents carrying proper DTDs.) So there actually isn’t a lot of DDL. (Mark Logic claims in one test MarkLogic had more or less 0 DDL, vs. 20,000 lines in DB2/Viper.) What MarkLogic indexes includes, as Mark Logic puts it:
- Every word
- Every piece of structure
- Every parent-child relationship
- Every value.
-
As opposed to most extended-relational DBMS, MarkLogic indexes all kinds of information in a single, tightly integrated index. Mark Logic claims this is part of the reason for MarkLogic’s good performance, and asserts that competitors’ lack of full integration often causes overhead and/or gets in the way of optimal query plans. (For example, Mark Logic claims that Microsoft SQL Server’s optimizer is so FUBARed that it always does the text part of a search first.) Interestingly, Intersystems’ object-oriented Cache’ does pretty much the same thing.
-
MarkLogic is proud of its text search extensions to XQuery. I’ve neglected to ask how that relates to the XQuery standards process. (For example, text search wasn’t integrated into the SQL standard until SQL3.)
Other architectural highlights include:
| Categories: Data types, IBM and DB2, Mark Logic, Native XML | 2 Comments |
XML versus sparse columns in variable schemas
Simon Sabin makes an interesting point: If you can have 30,000 columns in a table without sparsity management blowing up, you can handle entities with lots of different kinds of attributes. (And in SQL Server you can now do just that.) The example he uses is products — different products can have different sets of possible colors, different kinds of sizes, and so on. An example I’ve used in the past is marketing information — different prospects can reveal different kinds of information, which may have been gathered via non-comparable marketing programs.
I’ve suggested this kind of variability as a reason to actually go XML — you’re constantly adding not just new information, but new kinds of information, so your fixed schema is never up to date. But I haven’t detected many actual application designers who agree with me …
| Categories: MySQL, Native XML, Theory and architecture | 3 Comments |
Microsoft SQL Server Data Services
As usual, Microsoft forgot to brief me, but Mary Jo Foley reports on Microsoft SQL Server Data Services. A look at the official site clarifies that this database-in-a-cloud offering uses “Microsoft SQL Server as a data storage node.” However, there seems to be a software layer on top of SQL Server providing scale-out and appropriate management.
In addition to the more-than-SQL-Server layer, there seems to be a less-than-SQL-Server aspect as well. In a particular, Microsoft SQL Server Data Services boasts “Support for simple types: string, numeric, datetime, boolean.” XML is the “primary wire format,” and hints dropped about the schema philosophy sound XMLish too.
Interestingly, Foley reports that Microsoft plans to offer an on-premises version of Microsoft SQL Server Data Services as well.
| Categories: Cloud computing, Microsoft and SQL*Server, Native XML | Leave a Comment |
Mike Stonebraker’s DBMS taxonomy
In a response to my recent five-part series on DBMS diversity, Mike Stonebraker has proposed his own taxonomy of data management technologies over on Vertica’s Database Column blog.
- OLTP DBMSs focused on fast, reliable transaction processing
- Analytic/Data Warehouse DBMSs focused on efficient load and ad-hoc query performance
- Science DBMSs — after all MatLab does not scale to disk-sized arrays
- RDF stores focused on efficiently storing semi-structured data in this format
- XML stores focused on semi-structured data in this format
- Search engines — the big players all use proprietary engines in this area
- Stream Processing Engines focused on real-time StreamSQL
- “Lean and Mean,” less-than-a-database engines focused on doing a small number of things very well (embedded databases are probably in this category)
- MapReduce and Hadoop — after all Google has enough “throw weight” to define a category
He goes on to say that each will be architected differently, except that — as he already convinced me back in July — RDF will be well-managed by specialty data warehouse DBMS. Read more
| Categories: Data types, Database diversity, Michael Stonebraker, Mid-range, OLTP, RDF and graphs, Theory and architecture | 2 Comments |
Database management system choices – beyond relational
This is the fifth of a five-part series on database management system choices. For the first post in the series, please click here.
Relational database management systems have three essential elements:
- Rows and columns. Theoretically, rows and columns may be inessential to the relational model. But in reality, they are built into the design of every real-world relational product. If you don’t have rows and columns, you’re not using the product to do what it was well-designed for.
- Predicate logic. Theoretically, everything can be fitted into a predicate Procrustean bed. But if you’re looking for relevancy rankings on a text search, binary logic is a highly convoluted way to get them.
- Fixed schemas. Database theorists commonly assume that databases have fixed schemas. If this means that 90%+ of all information is null or missing, they have elegant ways of dealing with that. Even so, as computing gets ever more concerned with individuals — each with his/her/its unique “profile(s)” — fixed schemas get ever harder to maintain.
If any of these three elements is missing or inappropriate, then a traditional relational database management system may not be the best choice.
| Categories: Data types, Database diversity, Theory and architecture | 1 Comment |
Dan Weinreb on ObjectStore
Dan Weinreb was one of the key techies at Object Design, the company that made the object-oriented database management system ObjectStore. (Object Design later merger into Excelon, which was eventually sold to Progress, which has deemphasized but still supports ObjectStore.) Recently he wrote a pair of long and fascinating articles about Object Design, ObjectStore, and OODBMS, the first of which makes the case that “object-oriented database management systems succeeded.” Read more
| Categories: Object, Progress, Apama, and DataDirect | Leave a Comment |
CouchDB — lazy database design taken to excess?
I’ve run into a research/alpha/whatever project called CouchDB a couple of times now. It’s yet another “Who needs relational databases? Who needs schemas?” kind of idea. Rather, CouchDB is for taking random documents and banging them into databases, then calculating views on the fly as needed. It’s REST-friendly. Lucene and a web server are built in.
Damien Katz seems to be the driving force behind CouchDB, and his discussion of document-oriented development seems to be a good starting point. Read more
| Categories: CouchDB, Data models and architecture, Database diversity, Native XML, Theory and architecture | 4 Comments |
5 kinds of data structure and 16 kinds of data access method
My recent post about datatype extensibility zoomed over at least one head, as per the comment thread. Since then I’ve googled, and come to suspect that part of what I was assuming as common knowledge may not be so common after all. So I’m going to back up and explain a bit about data access methods, as well as the sub-topic of data structures. If you take nothing else away from this post, I hope it will at least remind of you of the sheer variety of ways data can be stored on disk or in RAM.
First, let’s define the concept of data access method in three steps:
| Categories: Data types | 2 Comments |
