Another category of derived data
Six months ago, I argued the importance of derived analytic data, saying
… there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted data for analytic data processing. The five areas I have in mind are, loosely speaking:
- Aggregates, when they are maintained, generally for reasons of performance or response time.
- Calculated scores, commonly based on data mining/predictive analytics.
- Text analytics.
- The kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for.
- Adjusted data, especially in scientific contexts.
Probably there are yet more examples that I am at the moment overlooking.
Well, I did overlook at least one category. 🙂
A surprisingly important kind of derived data is metadata, especially for large, poly-structured data sets. For example, CERN has vast quantities of experiment sensor data, stored as files; the metadata alone fills over 10 terabytes in an Oracle database. MarkLogic is big on storing derived metadata, on both the publishing/media and intelligence sides of the business.
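To make the idea concrete, here is a minimal Python sketch of metadata as derived data: a catalog built over a pile of files, so the files themselves can stay wherever they are. All names are invented, and this reflects nothing about CERN’s or MarkLogic’s actual systems.

```python
import hashlib
import os
import sqlite3

# A toy metadata catalog: the raw files stay put, while the derived
# metadata describing them lands in a small relational store.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE file_metadata (
    path TEXT PRIMARY KEY, bytes INTEGER, mtime REAL, sha1 TEXT)""")

def catalog(root):
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            st = os.stat(path)
            con.execute("INSERT OR REPLACE INTO file_metadata VALUES (?, ?, ?, ?)",
                        (path, st.st_size, st.st_mtime, digest))

catalog(".")
# The catalog can now be queried relationally -- e.g., for total size --
# even though the underlying data is just files.
print(con.execute("SELECT COUNT(*), SUM(bytes) FROM file_metadata").fetchone())
```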
Categories: Data models and architecture, Derived data, Hadoop, MarkLogic
When it’s still best to use a relational DBMS
There are plenty of viable alternatives to relational database management systems. For short-request processing, both document stores and fully object-oriented DBMS can make sense. Text search engines have an important role to play. E. F. “Ted” Codd himself once suggested that relational DBMS weren’t best for analytics.* Analysis of machine-generated log data doesn’t always have a naturally relational aspect. And I could go on with more examples yet.
*Actually, he didn’t admit that what he was advocating was a different kind of DBMS, namely a MOLAP one — but he was. And he was wrong anyway about the necessity for MOLAP. But let’s overlook those details. 🙂
Nonetheless, relational DBMS dominate the market. As I see it, the reasons for relational dominance cluster into four areas (which of course overlap):
- Data re-use. Ted Codd’s famed original paper referred to shared data banks for a reason.
- The benefits of normalization (illustrated in the sketch after this list), which include:
  - You only have to do the programming work of writing something once …
  - … and you don’t have to do the programming work of keeping multiple versions of the information consistent.
  - You only have to do the processing work of writing something once.
  - You only have to buy storage to hold each fact once.
- Separation of concerns.
  - Different people can worry about programming and “database stuff.”
  - Indeed, even performance optimization can sometimes be separated from programming (i.e., when all you have to do to get speed is implement the correct indexes).
- Maturity and momentum, as reflected in the availability of:
  - People.
  - A broad variety of mature relational DBMS.
  - Vast amounts of packaged software that “talks” SQL.
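To make the normalization and indexing points concrete, here is a minimal sqlite3 sketch in Python (table and column names invented). Each customer’s name is written and stored exactly once, and the speedup lever, an index, is added without touching the query:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Normalized: each customer's name is stored exactly once ...
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount REAL)""")
con.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
con.executemany("INSERT INTO orders VALUES (?, 1, ?)", [(1, 99.0), (2, 250.0)])

# ... so correcting the name is one write, with no risk of leaving
# inconsistent copies scattered across order rows.
con.execute("UPDATE customers SET name = 'Acme Corporation' WHERE id = 1")

# Separation of concerns: the query below never changes; someone other
# than the programmer can add the index and speed it up.
con.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
print(con.execute("""SELECT c.name, SUM(o.amount)
                     FROM customers c JOIN orders o ON o.customer_id = c.id
                     GROUP BY c.name""").fetchall())
# [('Acme Corporation', 349.0)]
```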
Generally speaking, I find the reasons for sticking with relational technology compelling in cases such as: Read more
Categories: Analytic technologies, Data models and architecture, Database diversity, MOLAP, NoSQL, Object, Theory and architecture
Slashdot venting thread about Oracle/Sun hardware
Slashdot has what amounts to a venting thread about Oracle/Sun hardware. The one consistent favorable theme is that Sun hardware is good stuff if you want to run Oracle. Otherwise, comments repeatedly say:
- Product discounts are down, effectively creating a price increase.
- Service prices are way up for some customers, because cheaper service options have been eliminated.
- Service quality is unsatisfactory.
- Oracle is difficult to do business with.
So far, I haven’t seen any comments to the effect “I don’t know what you guys are talking about; we’re perfectly happy with Sun”, but surely those will come too.
Categories: Oracle, Pricing
Quick thoughts on Oracle-on-Amazon
Amazon has a page up for what it calls Amazon RDS for Oracle Database. You can rent Amazon instances suitable for running Oracle, and bring your own license (BYOL), or you can rent a “License Included” instance that includes Oracle Standard Edition One (a cheap version of Oracle that is limited to two sockets).
My quick thoughts start:
- Mainly, this isn’t for production usage. But exceptions might arise when:
  - An application, from creation to abandonment, is only expected to have a short lifespan, in support of a specific project.
  - There is an extreme internal-politics bias toward operating expenses over capital expenses, or something like that, forcing a user department into cloud production deployment even when that doesn’t make much rational sense.
  - An application is small enough, or the situation is sufficiently desperate, that any inefficiencies are outweighed by convenience.
- There is non-production appeal. In particular:
  - Spinning up a quick cloud instance can make a lot of sense for a developer.
  - The same goes if you want to sell an Oracle-based application and need to offer demo/test capabilities.
  - The same might go for off-site replication/disaster recovery.
Of course, those are all standard observations every time something that’s basically on-premises software is offered in the cloud. They’re only reinforced by the fact that the only Oracle software Amazon can actually license you is a particularly low-end edition.
And Oracle is indeed on-premises software. In particular, Oracle is hard enough to manage when it’s on your premises, with a known hardware configuration; who would want to try to manage a production instance of Oracle in the cloud?
Categories: Amazon and its cloud, Cloud computing, Oracle
Notes from the Fusion-io S-1 filing
Fusion-io has filed for an initial public offering. With public offerings go S-1 filings which, along with 10-Ks, are the kinds of SEC filing that typically contain a few nuggets of business information. Notes from Fusion-io’s S-1 include:
- Fusion-io is growing very, very fast, doubling or better in revenue every 6 months.
- Fusion-io’s marketing message revolves around “data decentralization”; Fusion-io is competing against storage-area networks and storage arrays.
- Fusion-io’s list of application types includes “… systems dedicated to decision support, high performance financial analysis, web search, content delivery and enterprise resource planning.”
- Fusion-io says it has shipped over 20 petabytes of storage.
- Fusion-io has a shifting array of big customers, including OEMs: Read more
Categories: Analytic technologies, Data warehousing, Facebook, Solid-state memory, Storage
Traditional databases will eventually wind up in RAM
In January, 2010, I posited that it might be helpful to view data as being divided into three categories:
- Human/Tabular data — i.e., human-generated data that fits well into relational tables or arrays.
- Human/Nontabular data — i.e., all other data generated by humans.
- Machine-Generated data.
I won’t now stand by every nuance in that post, which may differ slightly from those in my more recent posts about machine-generated data and poly-structured databases. But one general idea is hard to dispute:
Traditional database data — records of human transactional activity, referred to above as “Human/Tabular data” — will not grow as fast as Moore’s Law makes computer chips cheaper.
And that point has a straightforward corollary, namely:
It will become ever more affordable to put traditional database data entirely into RAM. Read more
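A quick back-of-envelope check on that corollary; the growth and price-decline rates below are invented for illustration, not measurements:

```python
# If traditional database data grows more slowly than RAM gets cheap,
# the cost of an all-in-RAM copy keeps falling. Illustrative numbers only.
db_size_tb = 10.0          # assumed size of a transactional database today
db_growth = 1.10           # assume such data grows ~10%/year
ram_price_per_tb = 5000.0  # assumed price today, in dollars
ram_decline = 0.71         # ~halving every two years (0.71 ** 2 is about 0.5)

for year in range(0, 11, 2):
    size = db_size_tb * db_growth ** year
    cost = size * ram_price_per_tb * ram_decline ** year
    print(f"year {year:2d}: {size:5.1f} TB of data, ${cost:9,.0f} to hold in RAM")
# The dollar figure falls every year, because 1.10 * 0.71 < 1.
```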
Object-oriented database management systems (OODBMS)
There seems to be a fair amount of confusion about object-oriented database management systems (OODBMS). Let’s start with a working definition:
An object-oriented database management system (OODBMS, but sometimes just called “object database”) is a DBMS that stores data in a logical model that is closely aligned with an application program’s object model. Of course, an OODBMS will have a physical data model optimized for the kinds of logical data model it expects.
If you’re guessing from that definition that there can be difficulties drawing boundaries between the application, the application programming language, the data manipulation language, and/or the DBMS — you’re right. Those difficulties have been a big factor in relegating OODBMS to being a relatively niche technology to date.
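To see how closely a stored model can track a program’s object model, here is a toy in Python, with the standard library’s shelve module standing in for the general idea; it is not the API of any actual OODBMS product:

```python
import shelve

class Customer:
    def __init__(self, name, orders):
        self.name = name
        self.orders = orders  # a plain Python list; no mapping to tables

# Store the object more or less as the program sees it.
with shelve.open("objstore") as db:
    db["cust:1"] = Customer("Acme Corp", [99.0, 250.0])

# Retrieve it later: it comes back as a Customer, and the "query
# language" is just ordinary attribute access.
with shelve.open("objstore") as db:
    c = db["cust:1"]
    print(c.name, sum(c.orders))  # Acme Corp 349.0
```

Note how the boundary questions show up immediately, even in the toy: it is hard to say where the application ends and the “database stuff” begins.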
Examples of what I would call OODBMS include: Read more
Categories: Cache, In-memory DBMS, Intersystems and Cache', Memory-centric data management, Objectivity and Infinite Graph, OLTP, Software as a Service (SaaS), Starcounter
Starcounter high-speed memory-centric object-oriented DBMS, coming soon
Since posting recently about Starcounter, I’ve had the chance to actually talk with the company (twice). Hence I know more than before. 🙂 Starcounter:
- Has been around as a company since 2006.
- Has developed memory-centric object-oriented DBMS technology that has been OEMed by a few application software companies (especially in bricks-and-mortar retailing and in online advertising).
- Is planning to actually launch an OODBMS product sometime this summer.
- Has 14 employees (most or all of whom are in Sweden, which is also where I think Starcounter’s current customers are centered).
- Is planning to shift emphasis soon to the US market.
Starcounter’s value propositions are programming ease (no object/relational impedance mismatch) and performance. Starcounter believes its DBMS has 100X the performance of conventional DBMS at short-request transaction processing, and 10X the performance of other memory-centric and/or object-oriented DBMS (e.g., Oracle TimesTen or Versant). That said, Starcounter has not yet tested against VoltDB. Starcounter does not claim performance much beyond that of disk-based DBMS on analytic tasks such as aggregations.
The key technical aspect to Starcounter is integration between the DBMS and the virtual machine, so that the same copy of the data is accessed by both the DBMS and the application program, without any movement or transformation being needed. (Starcounter isn’t aware of any other object-oriented DBMS that works this way.) Transient and persistent data are handled in the same way, seamlessly.
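Here is a caricature, in Python, of the difference being claimed. It is emphatically not Starcounter’s actual mechanism, just the general contrast between copying data across a boundary and sharing one copy in a single heap:

```python
import pickle

class Account:
    def __init__(self, owner, balance):
        self.owner, self.balance = owner, balance

# Conventional path: data crosses an app/DBMS boundary, so it gets
# transformed and copied on the way out and again on the way back in.
acct = Account("alice", 100)
stored = pickle.dumps(acct)   # serialize out of the application's heap
acct2 = pickle.loads(stored)  # deserialize back in -- a distinct object
assert acct2 is not acct

# Integrated path (the idea, caricatured): the "database" holds references
# into the same heap the application uses, so there is one copy of the data.
db = {"acct:alice": acct}
db["acct:alice"].balance += 50  # the DBMS's copy IS the application's copy
assert acct.balance == 150
```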
Other Starcounter technical highlights include: Read more
Categories: Data models and architecture, In-memory DBMS, Memory-centric data management, Object, OLTP, Starcounter, Theory and architecture
Terminology: poly-structured data, databases, and DBMS
My recent argument that the common terms “unstructured data” and “semi-structured data” are misnomers, and that a word like “multi-” or “poly-structured”* would be better, seems to have been well-received. But which is it — “multi-” or “poly-“?
*Everybody seems to like “poly-structured” better when it has a hyphen in it — including me. 🙂
The big difference between the two is that “multi-” just means there are multiple structures, while “poly-” further means that the structures are subject to change. Upon reflection, I think the “subject to change” part is essential, so poly-structured it is.
The definitions I’m proposing are:
- A database is poly-structured to the extent that its structure is apt to be changed in the ordinary course of query, update, or programming.
- Data is poly-structured to the extent that it is best represented in a poly-structured database.
- A DBMS is poly-structured to the extent that it is oriented to managing poly-structured databases.
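A toy illustration of the first definition, with Python dicts standing in for records; the point is that an ordinary update is allowed to change a record’s structure, not just its values:

```python
# A poly-structured "database" of records with no schema declared in advance.
db = [
    {"id": 1, "name": "Acme Corp"},
    {"id": 2, "name": "Initech", "city": "Austin"},  # an extra field already
]

# An ordinary update changes record 1's structure, not just its values:
db[0]["employees"] = 250

# Each record now has its own implicit structure:
for rec in db:
    print(rec["id"], sorted(rec))
# 1 ['employees', 'id', 'name']
# 2 ['city', 'id', 'name']
```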
Categories: Object, Structured documents, Text, Theory and architecture
What to do about “unstructured data”
We hear much these days about unstructured or semi-structured (as opposed to structured) data. Those are misnomers, however, for at least two reasons. First, it’s not really the data that people think is un-, semi-, or fully structured; it’s databases.* Relational databases are highly structured, but the data within them is unstructured — just lists of numbers or character strings, whose only significance derives from the structure that the database imposes.
*Here I’m using the term “database” literally, rather than as a concise synonym for “database management system”. But see below.
Second, a more accurate distinction is not whether a database has one structure or none — it’s whether a database has one structure or many. The easiest way to see this is for databases that have clearly-defined schemas. A relational database has one schema (even if it is just the union of various unrelated sub-schemas); an XML database, however, can have as many schemas as it contains documents.
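A quick way to see the one-schema/many-schemas contrast is to count distinct implicit schemas, sketched here in Python with key-sets standing in for document schemas:

```python
# Every row of a relational table has the same shape: one schema.
table_rows = [(1, "Acme", 250), (2, "Initech", 40)]

# Documents in one collection can each have their own implicit schema.
documents = [
    {"title": "Q1 report", "author": "alice"},
    {"title": "memo", "to": ["bob", "carol"], "body": "..."},
    {"title": "invoice", "amount": 99.0, "author": "alice"},
]
schemas = {frozenset(doc) for doc in documents}  # one key-set per shape
print(len(schemas))  # 3 -- three documents, three implicit schemas
```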
One small terminological problem is easily handled, namely that people don’t talk about true databases very often, at least when they’re discussing generalities; rather, they talk about data and DBMS.* So let’s talk of DBMS being “structured” singly or multiply or whatever, just as the databases they’re designed to manage are.
*And they refer to the DBMS as “databases,” because they don’t have much other use for the word.
All that said — I think that single vs. multiple database structures isn’t a bright-line binary distinction; rather, it’s a spectrum. For example: Read more
Categories: Cassandra, Couchbase, Data models and architecture, HBase, IBM and DB2, MarkLogic, MongoDB, NoSQL, Splunk, Theory and architecture