MOLAP – DBMS 2 : DataBase Management System Services

Readings in Database Systems

Curt Monash — Thu, 10 Dec 2015 12:26:40 +0000

Mike Stonebraker and Larry Ellison have numerous things in common. If nothing else:

They’re both titanic figures in the database industry.
They both gave me testimonials on the home page of my business website.
They both have been known to use the present tense when the future tense would be more accurate.

I mention the latter because there’s a new edition of Readings in Database Systems, aka the Red Book, available online, courtesy of Mike, Joe Hellerstein and Peter Bailis. Besides the recommended-reading academic papers themselves, there are 12 survey articles by the editors, and an occasional response where, for example, editors disagree. Whether or not one chooses to tackle the papers themselves — and I in fact have not dived into them — the commentary is of great interest.

But I would not take every word as the gospel truth, especially when academics describe what they see as commercial market realities. In particular, as per my quip in the first paragraph, the data warehouse market has not yet gone to the extremes that Mike suggests,* if indeed it ever will. And while Joe is close to correct when he says that the company Essbase was acquired by Oracle, what actually happened is that Arbor Software, which made Essbase, merged with Hyperion Software, and the latter was eventually indeed bought by the giant of Redwood Shores.**

*When it comes to data warehouse market assessment, Mike seems to often be ahead of the trend.

**Let me interrupt my tweaking of very smart people to confess that my own commentary on the Oracle/Hyperion deal was not, in retrospect, especially prescient.

Mike pretty much opened the discussion with a blistering attack against hierarchical data models such as JSON or XML. To a first approximation, his views might be summarized as:

Logical hierarchical models can be OK in certain cases. In particular, JSON could be a somewhat useful datatype in an RDBMS.
Physical hierarchical models are horrible.
Rather, you should implement the logical hierarchical model over a columnar RDBMS.

My responses start:

Nested data structures are more important than Mike’s discussion seems to suggest.
Native XML and JSON stores are apt to have an index on every field. If you squint, that index looks a lot like a column store.
Even NoSQL stores should and I think in most cases will have some kind of SQL-like DML (Data Manipulation Language). In particular, there should be some ability to do joins, because total denormalization is not always a good choice.
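To make the "index on every field looks like a column store" remark concrete, here is a minimal sketch. It is not MarkLogic's or any other product's actual design; the function names `iter_paths` and `build_field_index` and the sample documents are invented for illustration. It flattens nested JSON-like documents into one index entry per field path.

```python
from collections import defaultdict


def iter_paths(value, prefix=""):
    """Yield (path, leaf_value) pairs for a nested JSON-like structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from iter_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for child in value:
            # List elements share the parent's path, so repeated values
            # land under the same field.
            yield from iter_paths(child, prefix)
    else:
        yield prefix, value


def build_field_index(docs):
    """Map each field path to {doc_id: [values]} (hypothetical layout)."""
    index = defaultdict(dict)
    for doc_id, doc in docs.items():
        for path, leaf in iter_paths(doc):
            index[path].setdefault(doc_id, []).append(leaf)
    return dict(index)


docs = {
    1: {"name": "Ann", "address": {"city": "Boston"}, "tags": ["a", "b"]},
    2: {"name": "Bob", "address": {"city": "Austin"}},
}
field_index = build_field_index(docs)
```

Each key of `field_index`, such as `"address.city"`, holds every value of that field across all documents, which is roughly what a column holds in a columnar store.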

In no particular order, here are some other thoughts about or inspired by the survey articles in Readings in Database Systems, 5th Edition.

I agree that OLTP (OnLine Transaction Processing) is transitioning to main memory.
I agree with the emphasis on “data in motion”.
While I needle him for overstating the speed of the transition, Mike is right that columnar architectures are winning for analytics. (Or you could say they’ve won, if you recognize that mop-up from the victory will still take 1 or 2 decades.)
The guys seem to really hate MapReduce, which is an old story for Mike, but a bit of a reversal for Joe.
MapReduce is many things, but it’s not a data model, and it’s also not something that Hadoop 1.0 was an alternative to. Saying each of those things was sloppy writing.
The guys characterize consistency/transaction isolation as a rather ghastly mess. That part was an eye-opener.
Mike is a big fan of arrays. I suspect he’s right in general, although I also suspect he’s overrating SciDB. I also think he’s somewhat overrating the market penetration of cube stores, aka MOLAP.
The point about Hadoop (in particular) and modern technologies in general showing the way to modularization of DBMS is an excellent one.
Joe and Mike disagreed about analytics; Joe’s approach rang truer for me. My own opinion is:
- Business intelligence has been important for quite a while, and won’t stop.
- Machine learning is becoming ever more important.
- It’s still early days for the integration of the two areas, but much more will come.
The challenge of whether anybody wants to do machine learning (or other advanced analytics) over a DBMS is sidestepped in part by the previously mentioned point about the modularization of a DBMS. Hadoop, for example, can be both an OK analytic DBMS (although not fully competitive with mature, dedicated products) and of course also an advanced analytics framework.
Similarly, except in the short term I’m not worried about the limitations of Spark’s persistence mechanisms. Almost every commercial distribution of Spark I can think of is part of a package that also contains a more mature data store.
Versatile DBMS and analytic frameworks suffer strategic contention for memory, with different parts of the system wanting to use it in different ways. Raising that as a concern about the integration of analytic DBMS with advanced analytic frameworks is valid.
I used to overrate the importance of abstract datatypes, in large part due to Mike’s influence. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. A big part of the problem is what I mentioned in the previous point — different parts of a versatile DBMS would prefer to do different things with memory.
I used to overrate the importance of user-defined functions in an analytic RDBMS. Mike had nothing to do with my error. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. Looser coupling between analytics and data management seems more flexible.
Excellent points are made about the difficulties of “First we build the perfect schema” data warehouse projects and, similarly, MDM (Master Data Management).
There’s an interesting discussion that helps explain why optimizer progress is so slow (both for the industry in general and for each individual product).

Related links

I did a deep dive into MarkLogic’s indexing strategy in 2008, which informed my comment about XML/JSON stores above.
Again with MarkLogic as the focus, in 2010 I was skeptical about document stores not offering joins. MarkLogic has since capitulated.
I’m not current on SciDB, but I did write a bit about it in 2010.
I’m surprised that I can’t find a post to point to about modularization of DBMS. I’ll leave this here as a placeholder until I can.
Edit: As promised, I’ve now posted about the object-relational/abstract datatype boom of the 1990s.

Multi-model database managers

Curt Monash — Mon, 24 Aug 2015 08:07:00 +0000

I’d say:

Multi-model database management has been around for decades. Marketers who say otherwise are being ridiculous.
Thus, “multi-model”-centric marketing is the last refuge of the incompetent. Vendors who say “We have a great DBMS, and by the way it’s multi-model (now/too)” are being smart. Vendors who say “You need a multi-model DBMS, and that’s the reason you should buy from us” are being pathetic.
Multi-logical-model data management and multi-latency-assumption data management are greatly intertwined.

Before supporting my claims directly, let me note that this is one of those posts that grew out of a Twitter conversation. The first round went:

Merv Adrian: 2 kinds of multimodel from DBMS vendors: multi-model DBMSs and multimodel portfolios. The latter create more complexity, not less.

Me: “Owned by the same vendor” does not imply “well integrated”. Indeed, not a single example is coming to mind.

Merv: We are clearly in violent agreement on that one.

Around the same time I suggested that InterSystems Caché was the last significant object-oriented DBMS, only to get the pushback that they were “multi-model” as well. That led to some reasonable-sounding justification — although the buzzwords of course aren’t from me — namely:

Caché supports #SQL, #NoSQL. Interchange across tables, hierarchical, document storage.

Along the way, I was reminded that some of the marketing claims around “multi-model” are absurd. For example, at the time I am writing this, the Wikipedia article on “multi-model database” claims that “The first multi-model database was OrientDB, created in 2010…” In fact, however, by the definitions used in that article, multi-model DBMS date back to the 1980s, when relational functionality was grafted onto pre-relational systems such as TOTAL and IDMS.

What’s more, since the 1990s, multi-model functionality has been downright common, specifically in major products such as Oracle, DB2 and Informix, not to mention PostgreSQL. (But not so much Microsoft or Sybase.) Indeed, there was significant SQL standards work done around datatype extensions, especially in the contexts of SQL/MM and SQL3.

I tackled this all in 2013, when I argued:

One database to rule them all systems aren’t very realistic, but even so, …
… single-model systems will become increasingly obsolete.

Developments since then have been in line with my thoughts. For example, Spark added DataFrames, which promise substantial data model flexibility for Spark use cases, but more mature products have progressed in a more deliberate way.

What’s new in all this is a growing desire to re-integrate short-request and analytic processing — hence Gartner’s new-ish buzzword of HTAP (Hybrid Transactional/Analytic Processing). The more sensible reasons for this trend are:

Operational applications have always needed to accept immediate writes. (Losing data is bad.)
Operational applications have always needed to serve small query result sets based on the freshest data. (If you write something into a database, you might need to immediately retrieve it to finish the business operation.)
It is increasingly common for predictive decisions to be made at similar speeds. (That’s what recommenders and personalizers do.) Ideally, such decisions can be based on fresh and historical data alike.
The long-standing desire for business intelligence to operate on super-fresh data is, increasingly, making sense, as we get ever more stuff to monitor. However …
… most such analysis should look at historical data as well.
Streaming technology is supplying ever more fresh data.

But here’s the catch — the best models for writing data are the worst for reading it, and vice-versa, because you want to write data as a lightly-structured document or log, but read it from a Ted-Codd-approved RDBMS or MOLAP system. And if you don’t have the time to move data among multiple stores, then you want one store to do a decent job of imitating both kinds of architecture. The interesting new developments in multi-model data management will largely be focused on that need.
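The write-one-way, read-another tension can be sketched in a few lines. This is a toy under stated assumptions, not a description of any product: the names `write_event` and `materialize_cube`, the in-memory `log` list, and the sample events are all hypothetical.

```python
import json
from collections import defaultdict

# Append-only list of JSON strings, standing in for a log or document store.
log = []


def write_event(event):
    """Write side: accept the event immediately, with no schema enforcement."""
    log.append(json.dumps(event))


def materialize_cube(lines, dimensions, measure):
    """Read side: sum `measure` grouped by the given dimensions."""
    cube = defaultdict(float)
    for line in lines:
        event = json.loads(line)
        if measure not in event:
            continue  # lightly structured input: skip events lacking the measure
        key = tuple(event.get(dim, "unknown") for dim in dimensions)
        cube[key] += event[measure]
    return dict(cube)


write_event({"region": "east", "product": "widget", "amount": 10.0})
write_event({"region": "east", "product": "widget", "amount": 5.0, "coupon": "X1"})
write_event({"region": "west", "amount": 7.5})
write_event({"region": "west", "note": "page view only"})

sales_cube = materialize_cube(log, ["region", "product"], "amount")
```

Writes never fail on an unexpected or missing field, while the read side gets a regular structure keyed by `(region, product)`. The cost is the separate materialization step, which is exactly what a single multi-model store would have to hide.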

Related links

The two-policemen joke seems ever more relevant.
My April, 2015 post on indexing technology reminds us that one DBMS can do multiple things.
Back in 2009 integrating OLTP and data warehousing was clearly a bad idea.

A new logical data layer?

Curt Monash — Mon, 23 Mar 2015 05:36:44 +0000

I’m skeptical of data federation. I’m skeptical of all-things-to-all-people claims about logical data layers, and in particular of Gartner’s years-premature “Logical Data Warehouse” buzzphrase. Still, a reasonable number of my clients are stealthily trying to do some kind of data layer middleware, as are other vendors more openly, and I don’t think they’re all crazy.

Here are some thoughts as to why, and also as to challenges that need to be overcome.

There are many things a logical data layer might be trying to facilitate — writing, querying, batch data integration, real-time data integration and more. That said:

When you’re writing data, you want it to be banged into a sufficiently-durable-to-acknowledge condition fast. If acknowledgements are slow, performance nightmares can ensue. So writing is the last place you want an extra layer, perhaps unless you’re content with the durability provided by an in-memory data grid.
Queries are important. Also, they formally are present in other tasks, such as data transformation and movement. That’s why data manipulation packages (originally Pig, now Hive and fuller SQL) are so central to Hadoop.

Trivial query routing or federation is … trivial.

Databases have or can be given some kind of data catalog interface. Of course, this is easier for databases that are tabular, whether relational or MOLAP (Multidimensional OnLine Analytic Processing), but to some extent it can be done for anything.
Combining the catalogs can be straightforward. So can routing queries through the system to the underlying data stores.

In fact, what I just described is Business Objects’ original innovation — the semantic layer — two decades ago.

Careless query routing or federation can be a performance nightmare. Do a full scan. Move all the data to some intermediate server that lacks capacity or optimization to process it quickly. Wait. Wait. Wait. Wait … hmmm, maybe this wasn’t the best data-architecture strategy.
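The difference between trivial and careless federation can be illustrated with a toy sketch. Every name in it (`Store`, `Federator`, `query_pushdown`, `query_careless`) is made up for this example, and `rows_shipped` is only a crude stand-in for network and intermediate-server cost.

```python
class Store:
    """A toy data store that counts how many rows leave it."""

    def __init__(self, tables):
        self.tables = tables  # table name -> list of row dicts
        self.rows_shipped = 0

    def catalog(self):
        """Expose table names and column names, as a data catalog might."""
        return {name: sorted(rows[0]) if rows else []
                for name, rows in self.tables.items()}

    def scan(self, table, predicate=None):
        rows = self.tables[table]
        if predicate is not None:
            rows = [row for row in rows if predicate(row)]
        self.rows_shipped += len(rows)
        return rows


class Federator:
    def __init__(self, stores):
        # Combined catalog: table name -> the store that owns it.
        self.routes = {table: store for store in stores for table in store.catalog()}

    def query_pushdown(self, table, predicate):
        """Route the query and let the owning store apply the filter."""
        return self.routes[table].scan(table, predicate)

    def query_careless(self, table, predicate):
        """Pull the full table to the middle tier, then filter there."""
        all_rows = self.routes[table].scan(table)
        return [row for row in all_rows if predicate(row)]


def is_big_order(row):
    return row["amount"] >= 9950


orders = Store({"orders": [{"id": i, "amount": i * 10} for i in range(1000)]})
customers = Store({"customers": [{"id": 1, "name": "Ann"}]})
fed = Federator([orders, customers])

pushed = fed.query_pushdown("orders", is_big_order)
shipped_pushdown = orders.rows_shipped

orders.rows_shipped = 0
careless = fed.query_careless("orders", is_big_order)
shipped_careless = orders.rows_shipped
```

Both paths return the same five rows, but the careless one ships all 1,000 rows to the middle tier first. With real data volumes, that is where the waiting comes from.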

Streaming goes well with federation. Some data just arrived, and you want to analyze it before it ever gets persisted. You want to analyze it in conjunction with data that’s been around longer. That’s a form of federation right there.
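As a rough illustration of streaming-as-federation, the sketch below runs one aggregate over persisted rows and rows that have arrived but not yet been stored. The function name `average_by_sensor` and the sensor data are invented for the example.

```python
from itertools import chain

# Rows that have already been persisted.
historical = [
    {"sensor": "a", "reading": 20.0},
    {"sensor": "a", "reading": 22.0},
    {"sensor": "b", "reading": 5.0},
]


def average_by_sensor(persisted, in_flight):
    """Average readings per sensor across persisted and not-yet-persisted rows."""
    totals = {}
    for row in chain(persisted, in_flight):
        total, count = totals.get(row["sensor"], (0.0, 0))
        totals[row["sensor"]] = (total + row["reading"], count + 1)
    return {sensor: total / count for sensor, (total, count) in totals.items()}


# An iterator stands in for a stream; these rows are never written anywhere.
arriving = iter([{"sensor": "a", "reading": 24.0}, {"sensor": "b", "reading": 7.0}])
averages = average_by_sensor(historical, arriving)
```

The query logic does not care which source a row came from, which is the sense in which this is a small federation problem.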

There are ways to navigate schema messes. Sometimes they work.

Polishing one neat relational schema for all your data is exactly what people didn’t want to do when they decided to store a lot of the data non-relationally instead. Still, memorializing some schema after that fact may not be terribly painful.
Even so, text search can help you navigate the data wilds. So can collaboration tools. Neither helps all the time, however.

Neither extreme view here — “It’s easy!” or “It will never work!” — seems right. Rather, I think there’s room for a lot of effort and differentiation in exposing cross-database schema information.

I’m leaving out one part of the story on purpose — how these data layers are going to be packaged, and specifically what other functionality they will be bundled with. Confidentiality would screw up that part of the discussion; so also would my doubts as to whether some of those plans are fully baked yet. That said, there’s an aspect of logical data layer to CDAP, and to Kiji as well. And of course it’s central to BI (Business Intelligence) and ETL (Extract/Transform/Load) alike.

One way or another, I don’t think the subject of logical data layers is going away any time soon.

Related link

Implicit in this post is the belief that enterprises should and do use many different data stores (June, 2014)

Notes and comments, May 6, 2014

Curt Monash — Tue, 06 May 2014 13:46:54 +0000

After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.

My claim that Spark will replace Hadoop MapReduce got much Twitter attention — including some high-profile endorsements — and also some responses here.
My MemSQL post led to a vigorous comparison of MemSQL vs. VoltDB.
My post on hardware and storage spawned a lively discussion of Hadoop hardware pricing; even Cloudera wound up disagreeing with what I reported Cloudera as having said. Sadly, there was less response to the part about the partial (!) end of Moore’s Law.
My Cloudera/SQL/Impala/Hive post apparently was well-balanced, in that it got attacked from multiple sides via Twitter & email. Apparently, I was too hard on Impala, I was too hard on Hive, and I was too hard on boxes full of cardboard file cards as well.
My post on the Intel/Cloudera deal garnered a comment reminding us Dell had pushed the Intel distro.
My CitusDB post picked up a few clarifying comments.

Here is a catch-all post to complete the set.

1. The recently-announced Cloudera/MongoDB relationship* is still at the Barney stage. That said, I’m optimistic that their stated intention to add substance to the relationship will eventually come to fruition. If nothing else, the two companies have high regard for each other, at least at the Mike Olson/Max Schireson level.

*That’s one of numerous deals with my fingerprints on it, but in this case only lightly. It was probably on track to happen even without my nudges.

2. Most of what I talked about when I visited MongoDB is confidential; the public stuff was mainly in my recent MongoDB technology post. But in one exception, I asked Max for an update as to MongoDB enterprise use cases. He reported a cluster in data combination, especially but not only in use cases which have both a high-volume part and dynamic-schema aspects. Specific examples Max cited included:

Tracking financial holdings from a variety of asset classes — especially if derivatives are involved, because they have a dynamic-schema aspect.
Product catalogs, including for use on web sites.
Customer information.
Patient information.

3. I didn’t ask everybody I saw in California about business trends, and much of what we did discuss was confidential. That said:

MapR was proud of its numbers.
So was DataStax.
ClearStory has a bunch of Very Big Enterprises as customers, mainly but not only in consumer sectors (e.g. retail, packaged goods).

4. Platfora is focusing a bit, starting with clickstream and security — i.e., event series stuff. And by the way, they report that the term “event series” is working well for them.

5. I gather from a variety of comments and conversations that Amazon Redshift has achieved considerable traction.

6. Something I can’t find evidence of having posted before: I think multiple businesses monitor online sales or similar business successes as a guide to network problems. eBay did this via a custom in-memory MOLAP (Multidimensional Online Analytic Processing) system years ago. Best evidence that this is hardly restricted to eBay: all the “me-too” responses I get from telling that story.

7. Citus Data tells me that as of PostgreSQL 9.4, Postgres will be able to return just the part of a JSON column needed for a query. This is as opposed to storing the whole thing as text and only retrieving it in its entirety.

8. In the comments to my “Spark on fire” post, Patrick McFadin pointed out that Mahout is transitioning from MapReduce to Spark. (All new work will be on Spark, although old MapReduce-based routines will continue to be supported.) It turns out that Derrick Harris wrote about that over a month ago, and I just missed the news.

9. Also in predictive analytics — there are rumblings that R could eventually be supplanted by Julia, although R’s massive libraries of algorithms still give it the advantage now.

10. Multiple vendors, fed up with the intermittent slowdowns from garbage collection, are moving some processing off the Java heap. Unfortunately, I neglected to ask any of them what the remaining differences then were between Java and C++ programming.

11. And to finish on a light note: BDAS — the project of which Spark is only a part — is pronounced “bad-ass”, something I first heard from Dave Patterson.

The two sides of BI

Curt Monash — Wed, 14 Aug 2013 05:29:26 +0000

As is the case for most important categories of technology, discussions of BI can get confused. I’ve remarked in the past that there are numerous kinds of BI, and that the very origin of the term “business intelligence” can’t even be pinned down to the nearest century. But the most fundamental confusion of all is that business intelligence technology really is two different things, which in simplest terms may be categorized as user interface (UI) and platform* technology. And so:

The UI aspect is why BI tends to be sold to business departments; the platform aspect is why it also makes sense to sell BI to IT shops attempting to establish enterprise standards.
The UI aspect is why it makes sense to sell and market BI much as one would applications; the platform aspect is why it makes sense to sell and market BI much as one would database technology.
The UI aspect is why vendors want to integrate BI with transaction-processing applications; the platform aspect is, I suppose, why they have so much trouble making the integration work.
The UI aspect is why BI is judged on … well, on snazzy UIs and demos. The platform aspect is a big reason why the snazziest UI doesn’t always win.

*I wanted to say “server” or “server-side” instead of “platform”, as I dislike the latter word. But it’s too inaccurate, for example in the case of the original Cognos PowerPlay, and also in various thin-client scenarios.

Key aspects of BI platform technology can include:

Query and data management. That’s the area I most commonly write about, for example in the cases of Platfora, QlikView, or Metamarkets. It goes back to the 1990s — notably the Business Objects semantic layer and Cognos PowerPlay MOLAP (MultiDimensional OnLine Analytic Processing) engine — and indeed before that to the report writers and fourth-generation languages of the 1970s. This overlaps somewhat with …
… data integration and metadata management. Business Objects, Qlik, and other BI vendors have bought data integration vendors. Arguably, there was a period when Information Builders’ main business was data connectivity and integration. And sometimes the main value proposition for a BI deal is “We need some way to get at all that data and bring it together.”
Security and access control — authentication, authorization, and all the additional As.
Scheduling and delivery. When 10s of 1000s of desktops are being served, these aren’t entirely trivial. Ditto when dealing with occasionally-connected mobile devices.

The set of business intelligence vendors that have prospered without noteworthy platform technology is approximately {Tableau}. Candidates I omitted from that set include Spotfire (didn’t get far before being acquired by TIBCO, and perhaps not afterwards either) and Xcelsius (I’m not sure how far it got before being acquired by Business Objects).

Just as platform technology has been essential to BI innovation in the past, I think the same will remain true in the future. For example:

ClearStory is throwing a lot of platform-side tech at what amounts to BI for third-party data.
Alerting and metrics management is a long-standing opportunity, and definitely calls for platform-side effort.
I think the same is true of BI integration with predictive modeling as well.

Bottom line: BI innovation usually depends upon serious platform technology.

Related links

I observed in January, 2012 that analytic technologies tend to be adopted departmentally. I reiterated that for the specific case of BI in my “Things I keep needing to say” post this week.
Endeca was another BI vendor whose UI differentiation was based on a proprietary DBMS-like engine.

It’s hard to make data easy to analyze

Curt Monash — Thu, 14 Feb 2013 04:05:00 +0000

It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.

Many different technologies purport to make data easy, or easier, to analyze; so many, in fact, that cataloguing them all is forbiddingly hard. Major claims, and some technologies that make them, include:

“We get data into a form in which it can be analyzed.” This is the story behind, among others:
- Most of the data integration and ETL (Extract/Transform/Load) industries, software vendors and consulting firms alike.
- Many things that purport to be “analytic applications” or data warehouse “quick starts”.
- “Data reduction” use cases in event processing.*
- Text analytics tools.
- Splunk.
“Forget all that transformation foofarah — just load (or write) data into our thing and start analyzing it immediately.” This at various times has been much of the story behind:
- Relational DBMS, according to their inventor E. F. Codd.
- MOLAP (Multidimensional OnLine Analytic Processing), also according to RDBMS inventor E. F. Codd.
- Any kind of analytic DBMS, or general purpose DBMS used for data warehousing.
- Newer kinds of analytic DBMS that are faster than older kinds.
- The “data mart spin-out” feature of certain analytic DBMS.
- In-memory analytic data stores.
- Hadoop.
- NoSQL DBMS that have a few analytic features.
- TokuDB, similarly.
- Electronic spreadsheets, from VisiCalc to Datameer.
- Splunk.
“Our tools help you with specific kinds of analyses or analytic displays.” This is the story underlying, among others:
- The business intelligence industry.
- The predictive analytics industry.
- Algorithmic trading use cases in complex event processing.*
- Some analytic applications.
- Splunk.

*Complex event/stream processing terminology is always problematic.

My thoughts on all this start:

There are many possibilities for the “right” way to manage analytic data. Generally, these are not the same as the “right” way to write the data, as that choice needs to be optimized for user experience (including performance), reliability, and of course cost.
I.e., it is usually best to move data from where you write it to where you (at least in part) analyze it.
Vendors who suggest they have a complete solution for getting data ready to be analyzed are … optimists.
This specifically includes “magic data stores”, such as fast analytic RDBMS (on which I’m very bullish) or in-memory analytic DBMS (about which I’m more skeptical). They’re great starting points, but they’re not the whole enchilada.
There are many ways to help with preparing data for analysis. Some of them are well-served by the industry. Some, however, are not.

Further:

1. There are many terms for all this. I once titled a post “Data that is derived, augmented, enhanced, adjusted, or cooked”. “Data munging” and “data wrangling” are in the mix too. And I’ve heard the term data preparation used several different ways.

2. Microsoft told me last week that the leading paid-for data products in their data-for-sale business are for data cleaning. (I.e., authoritative data to help with the matching/cleaning of both physical and email addresses.) Salesforce.com/data.com told me something similar a while back. This underscores the importance of data cleaning/data quality, and more generally of master data management.

Yes, I just said that data cleaning is part of master data management. Not coincidentally, I buy into the view that MDM is an attitude and a process, not just a specific technology.

3. Everybody knows that Hadoop usage involves long-ish workflows, in which data keeps getting massaged and written back to the data store. But that point is not as central to how people think about Hadoop as it probably should be.

4. One thing people have no trouble recalling is that Hadoop is a great place to dump stuff and get it out later. Depending on exactly what you have in mind, there are various metaphors for this, most of which have something to do with liquids. Most famous is “big bit bucket”, but also used have been “data refinery”, “data lake”, and “data reservoir”.

5. For years, DBMS and Hadoop vendors have bundled low-end text analytics capabilities rather than costlier state-of-the-art ones. I think that may be changing, however, mainly in the form of Attensity partnerships.

Truth be told, I’m not wholly current on text mining vendors — but when I last was, Attensity was indeed the best choice for such partnerships. And I’m not aware of any subsequent developments that would change that conclusion.

Related links:

Merv Adrian’s contrast between Hadoop and data integration tours some of the components of ETL suites. (February, 2013)
The issues discussed in this post are part of why analytic applications are usually incomplete.
De-anonymization is an important — albeit privacy-threatening — way of making data more analyzable. (January, 2011)
I updated my thoughts on Gartner’s Logical Data Warehouse concept earlier this month.

Do you need an analytic RDBMS?

Curt Monash — Mon, 05 Nov 2012 18:24:27 +0000

I can think of seven major reasons not to use an analytic RDBMS. One is good; but the other six seem pretty questionable, niche circumstances excepted, especially at this time.

The good reason to not have an analytic RDBMS is that most organizations can run perfectly well on some combination of:

SaaS (Software as a Service).
A low-volume static website.
A network focused on office software.
A single cheap server, likely running a single instance of a general-purpose RDBMS.

Those enterprises, however, are generally not who I write for or about.

The six bad reasons to not have an analytic RDBMS all take the form “Can’t some other technology do the job better?”, namely:

A data warehouse that’s just another instance of your OLTP (OnLine Transaction Processing) RDBMS. If your problem is that big, it’s likely that a specialized analytic RDBMS will be more cost-effective and generally easier to deal with.
MOLAP (Multi-Dimensional OnLine Analytic Processing). That ship has sailed … and foundered … and been towed to drydock.
In-memory BI. QlikView, SAP HANA, Oracle Exalytics, and Platfora are just four examples of many. But few enterprises will want to confine their analytics to such data as fits affordably in RAM.
Non-tabular* approaches to investigative analytics. There are many examples in the Hadoop world — including the recent wave of SQL add-ons to Hadoop — and some in the graph area as well. But those choices will rarely suffice for the whole job, as most enterprises will want better analytic SQL performance for (big) parts of their workloads.
Tighter integration of analytics and OLTP (OnLine Transaction Processing). Workday worklets illustrate that business intelligence/OLTP integration is a really good idea. And it’s an idea that Oracle and SAP can be expected to push heavily, when they finally get their product acts together. But again, that’s hardly all the analytics you’re going to want to do.
Tighter integration of analytics and other short-request processing. An example would be maintaining a casual game’s leaderboard via a NoSQL write-optimized database. Yet again, that’s hardly all the analytics a typical enterprise will want to do.

*I’ve long used “tabular” to cover both relational and MOLAP structures, the point being that in both cases you have a neat and regular schema, well-represented as a set of arrays.
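A small sketch of what "tabular" means in that footnote: the same regular data can be held as relational-style columns or as a dense MOLAP-style cube, and both are just arrays. The `to_cube` helper and the sample sales figures are hypothetical.

```python
# Relational view: one tuple per fact, (region, quarter, sales).
rows = [
    ("east", "q1", 100),
    ("east", "q2", 120),
    ("west", "q1", 80),
    ("west", "q2", 90),
]


def to_cube(rows):
    """Pivot (region, quarter, sales) rows into a dense 2-D array plus axis labels."""
    regions = sorted({region for region, _, _ in rows})
    quarters = sorted({quarter for _, quarter, _ in rows})
    cube = [[0] * len(quarters) for _ in regions]
    for region, quarter, sales in rows:
        cube[regions.index(region)][quarters.index(quarter)] += sales
    return regions, quarters, cube


regions, quarters, cube = to_cube(rows)

# The relational data is also a set of arrays: one array per column.
columns = {name: [row[i] for row in rows]
           for i, name in enumerate(("region", "quarter", "sales"))}
```

Here `cube[0][1]` is east/q2 sales, and `columns["sales"]` is the same measure laid out one value per row. What the two have in common is the neat and regular schema; a nested document with optional fields would fit neither layout without extra work.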

What could change this picture would be a future in which:

All your tabular business data fits into RAM.
Also, the OLTP/analytic DBMS distinction becomes less important.

In that case, it might be reasonable to get by with:

A single in-memory relational DBMS, handling OLTP and some analytics alike.
Whichever additional short-request systems you need (mainly for internet-heavy uses).
A Hadoop-based analytic data store.

I’m on record as suggesting that traditional databases will indeed wind up in RAM. But I’m more doubtful that a single in-memory DBMS will suffice for OLTP and analytics alike.

What are some key aspects of a specialized analytic RDBMS? My partial and overlapping list starts:

Fast, high-volume analytic I/O.
Smart query planning.
Smart, high-volume internal data movement.
Smart workload management.
Good data compression, including in cache and during query execution.
Strong analytic platform capabilities.
Fast execution of analytic requests — standard SQL, advanced SQL, or other.

An analytic RDBMS typically:

Is optimized for reads, which are often large, and perhaps for large temporary writes as well.
Reduces I/O bottlenecks via, for example, compression, columnar storage, and/or scale-out.

If all the data is in RAM, these problems are indeed lessened. Also, Oracle Exadata is dedicated to the premise that, even using conventional computer parts, I/O bottlenecks can be reduced with enough hardware — and price aside, it seems to work. Still, if you talk with analytic RDBMS designers, you repeatedly hear that it’s not that simple — even getting data efficiently out of RAM is different in the analytic and OLTP cases.
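Two of the I/O-reduction techniques named above, columnar storage and compression, can be shown in miniature. This is only a sketch: run-length encoding is one simple compression scheme among many, and the helper names and sample rows are invented here.

```python
from itertools import groupby

rows = [
    {"country": country, "amount": amount}
    for country, amount in [("CA", 5), ("CA", 7), ("US", 3), ("US", 4), ("US", 9), ("US", 1)]
]


def to_columns(rows):
    """Turn a list of row dicts into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


columns = to_columns(rows)

# A query that only sums `amount` touches one column, not whole rows.
total_amount = sum(columns["amount"])

# A sorted, low-cardinality column collapses into a few runs.
country_runs = rle_encode(columns["country"])
```

The six-value `country` column shrinks to two runs, and the sum never reads `country` at all. Real engines go much further, for example by operating directly on compressed data, which this sketch does not attempt.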

Query planning/execution, data movement, and workload management go together — they’re all about getting the most work done with the least machine effort, and they all depend on determining which specific execution choices might be synergistic or anti-synergistic with each other. Taken together, they form a very tough optimization challenge, which is different in the OLTP and analytic cases. Adding in analytic platform capabilities adds yet more difficulty to the optimization problem. And so:

A fast analytic database manager is a hard thing to build; expecting it to be fast at OLTP as well may be too much to ask for.

Given that, the discussion pivots to:

OK, but can we overprovision the RAM by so much that suboptimal performance doesn’t matter?

My guess is “Not any time soon” — because efficiency is always a good thing, databases will always grow, and RAM will never be free.

Bottom line: Analytic RDBMS will likely be needed for a long time.

Related link

Integrating short-request and analytic processing (March, 2011)

The Ted Codd guarantee

Curt Monash — Sun, 31 Jul 2011 22:44:21 +0000

I write a lot about whether or not to use relational DBMS. For example:

In May I surveyed relational vs. non-relational pros and cons at some length.
Last November I mused about when it might be OK to do without joins.
The question is implicit in a variety of posts about, say, document-oriented or object-oriented DBMS.

Before going further in that vein, I’d like to do a quick review of what E. F. “Ted” Codd was getting at with the relational model in the first place.

The first sentence of Codd’s famous 1970 paper introducing the relational database concept reads:

Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).

In modern terms, that means “all you have to know to use the database is its logical schema; you don’t need to know anything about its physical representation.”

Over the next 15 years, Codd’s thinking — and his employer IBM’s technology — evolved to the point that Codd proposed 12 rules for a relational DBMS, the three most fundamental of which are:

Foundation Rule
A relational database management system must manage its stored data using only its relational capabilities.

Information Rule
All information in the database should be represented in one and only one way — as values in a table.

Guaranteed Access Rule
Each and every datum (atomic value) is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.

I.e., Codd was positively asserting that a database should have a fixed logical schema, in a tabular form. The clear implication was that programmers could or should be able to write anything they wanted to against that schema, without database performance being unduly compromised.
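As a rough illustration of the Guaranteed Access Rule, here is a small Python sketch using the standard-library sqlite3 module. The `customer` table, its contents, and the `fetch_datum` helper are all hypothetical. The point is only that the caller names a table, a primary key value, and a column, and says nothing about how the data is physically stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Grace", "New York")],
)

def fetch_datum(conn, table, key_column, key_value, column):
    """Return one atomic value addressed by table name, primary key value, and column name."""
    # Identifiers cannot be bound as SQL parameters, so check them against the catalog.
    tables = {r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in tables:
        raise ValueError(f"unknown table: {table}")
    known_columns = {r[1] for r in conn.execute(f'PRAGMA table_info("{table}")')}
    if column not in known_columns or key_column not in known_columns:
        raise ValueError("unknown column")
    row = conn.execute(
        f'SELECT "{column}" FROM "{table}" WHERE "{key_column}" = ?', (key_value,)
    ).fetchone()
    return None if row is None else row[0]
```

For example, `fetch_datum(conn, "customer", "id", 2, "city")` addresses a single value purely through the logical schema. Whether there is an index, how pages are laid out, or where the row lives is the DBMS's business, which is the protection Codd's opening sentence asks for.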

Of course, things never quite worked out that way. For most of the history of tabular DBMS, the best-performing short-request and analytic DBMS have been designed quite differently from each other.* Non-relational systems — from IBM’s own IMS to various object-oriented DBMS — outperformed relational DBMS on particular applications. Designers of high-performance applications were sensitive to the database’s physical design, sometimes even going to the extreme of non-transparent sharding. But on the whole, it was generally agreed that programming against a fixed logical schema is a good thing.

*Codd acknowledged this himself by promoting multidimensional OLAP over traditional RDBMS. (I regard the multidimensional/relational divide as a distinction without significant difference; it’s all just fixed-logical-schema tabular processing with different data manipulation languages.)

In my next post, I’ll return to the subject of why fixed schemas might not always be such a good idea after all.

Eight kinds of analytic database (Part 2)

Curt Monash — Tue, 05 Jul 2011 08:18:18 +0000

In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I’ll cover four more kinds of analytic database — even newer, for the most part, with a use case/product short list match that is even less clear.

Bit bucket

Kinds of data likely to be included: Logs, other technical/external
Likely use styles: Staging/ETL, investigative
Canonical example: Log files in a Hadoop cluster
Stresses: TCO, scale-out, transform/big-query performance, ETL functionality

With the explosion of machine-generated data has come the need for a place to put it all, sometimes called the big bit bucket. This is like the investigative data mart for big databases, but more poly-structured. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.

The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.

Archival data store

Kinds of data likely to be included: Operational, CDR (call detail record), security log
Likely use styles: Archival, reporting (for compliance), possibly also investigative
Examples: Any long-term detailed historical store
Stresses: TCO, compression, scale-out, performance (if multi-use)

Analytic DBMS vendors have been insulting each other with the claim “that’s just an archival data store,” dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only Rainstor truly embraces the archival positioning, and I’ve become pretty dubious about their technical claims and their company alike.

Still, there’s a legitimate need for data stores, especially relational analytic DBMS, that:

Store data cheaply, with high rates of compression.
Have decent performance if you do want to query the data.
May have archiving/compliance-specific features as well.

Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to have an archive-oriented product version in their lineups.

Outsourced data mart

Kinds of data likely to be included: All
Likely use styles: Traditional BI, investigative analytics, staging/ETL
Examples: Advertising tracking, SaaS CRM
Stresses: Performance, TCO, reliability, concurrency

Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I’ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and others yet who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that’s an analytic data management challenge. The possibilities expand from there.

Data outsourcers are in the IT business, and so their IT development is — hopefully! — more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.* Multitenancy is commonly an issue, as is running in the cloud.

*Even so, there’s often That Guy who doesn’t want to migrate away from Oracle, no matter what.

Vertica gets the nod in a number of these cases; it’s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you’re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.

Operational analytic(s) server

Kinds of data likely to be included: Customer-centric, log, financial trade
Likely use styles: Advanced operational analytics
Examples:
- Lower latency: Web or call-center personalization, anti-fraud
- Higher latency: Customer profiling, Basel 3 risk analysis
Stresses: Performance, reliability, analytic functionality, perhaps concurrency

Even with eight different choices, I need a “catch-all” category; this is it.

Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in integrating short-request and analytic processing. There are multiple ways to tackle it, embodying different trade-offs in cost, convenience, or analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you’re set. Otherwise, you may want to pipe derived data into a more “industrial-strength” DBMS, ideally the one that runs your operational apps anyway.

Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you’re starting out with the data in a convenient bit bucket.

Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.

So did I get them all? Or are there yet other analytic data management use cases that don’t fit into my eight categories?

Eight kinds of analytic database (Part 1)

Curt Monash — Tue, 05 Jul 2011 08:17:44 +0000

Analytic data management technology has blossomed, leading to many questions along the lines of “So which products should I use for which category of problem?” The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for “big data” is little help.

Let’s try eight categories instead. While no categorization is ever perfect, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need — and in most cases you’ll need several — is a great early step in your analytic technology planning.

Enterprise data warehouse (Full or partial)

Kinds of data likely to be included: All, but especially operational
Likely use styles: All
Canonical example: Central EDW for a big enterprise
Stresses: Concurrency, reliability, workload management

The enterprise data warehouse (EDW) ideal says that you copy all your data into one place, and drive all decision-making from there. Full EDWs are pipedreams. Still, a partial EDW makes sense for most large enterprises, and many indeed already have one. The first product lines to consider for classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server, especially if you’re going to stress concurrency and/or operational use cases.

Traditional data mart

Kinds of data likely to be included: All
Likely use styles: Business intelligence, budgeting/consolidation, investigative
Examples: Reporting servers, planning/consolidation servers, anything MOLAP, etc.
Stresses: Performance, concurrency, TCO

Whether or not you have something like an enterprise data warehouse, it’s common to have lighter-weight data marts as well. A traditional data mart might drive reports and dashboards. Or it might be specialized for budgeting, planning, and/or consolidation. Some investigative analytics may be in the mix as well.

Any DBMS that can support an EDW can also support a data mart, but it may not be the most cost-effective way to do so. Columnar DBMS might have more attractive performance and TCO (Total Cost of Ownership); the same goes for Netezza. Some of them — e.g. Sybase IQ and Vertica — have excellent track records in concurrent usage as well. Ted Codd pushed what amounts to MOLAP (Multidimensional OnLine Analytic Processing) systems for these use cases. But relational DBMS commonly do a better job, which is one reason most major MOLAP products have wound up at RDBMS companies.
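For readers who want the MOLAP/relational contrast spelled out, here is a minimal Python sketch under heavy simplifying assumptions: two dimensions, four invented fact rows, and a dictionary standing in for a cube. It is not how any particular MOLAP or relational product is implemented. The MOLAP-style half precomputes every rollup; the relational half keeps detail rows in sqlite3 and aggregates at query time.

```python
import sqlite3
from itertools import product

# Hypothetical fact rows: (region, product, sales). All names and numbers are illustrative.
facts = [
    ("east", "widget", 100),
    ("east", "gadget", 50),
    ("west", "widget", 70),
    ("west", "gadget", 30),
]

ALL = "*"  # marker meaning "rolled up across this dimension"

def build_cube(rows):
    """MOLAP style: precompute every aggregate over the two dimensions."""
    cube = {}
    for region, prod, sales in rows:
        # Each fact contributes to its own cell and to every rollup that covers it.
        for r, p in product((region, ALL), (prod, ALL)):
            cube[(r, p)] = cube.get((r, p), 0) + sales
    return cube

cube = build_cube(facts)

# Relational style: keep the detail rows and aggregate when the query arrives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (region TEXT, product TEXT, sales INTEGER)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", facts)

def sales_by_region(conn):
    query = "SELECT region, SUM(sales) FROM sales_fact GROUP BY region"
    return dict(conn.execute(query).fetchall())
```

Here `cube[("east", ALL)]` and `sales_by_region(conn)["east"]` give the same answer. The cube pays up front in build time and in cells that grow with every added dimension; the relational version pays at query time, which is exactly the cost that faster analytic RDBMS keep shrinking.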

Investigative data mart — agile

Kinds of data likely to be included: All, especially customer-centric
Likely use styles: Investigative
Canonical example: A few analysts getting a few TB to examine
Stresses: Ease of setup/load, ease of admin, price/performance

Besides the traditional data mart, there are at least two other kinds. Both are focused on investigative analytics, but they’re differentiated by database size.

If you have just a few analysts,* looking at no more than a few terabytes of data (perhaps even just some gigabytes), and if that data is “single-subject” and fairly homogeneous, your watchwords should be “cheap”, “easy”, and “fast”. You don’t need to invest in much hardware, in expensive software, in much administrative effort (the analysts can be their own DBAs), nor should you endure much set-up time. Just grab a product, grab some data, and start running queries (or extracts into the statistical tool of your choice).

*If you have dozens or even hundreds of analysts hitting the same database, you’re probably back to the more concurrency-oriented scenarios outlined above.

Infobright is often cost-effective among columnar analytic DBMS. Other vendors might cut you a price break as well. If you have multiple terabytes of data, don’t rule out Netezza’s lowest-end products (even if they’d really rather sell you something bigger). Or, if you’re in the sub-terabyte range, maybe you can get by with an in-memory BI tool such as QlikView, and not do anything special on the DBMS side at all.

Investigative data mart — big

Kinds of data likely to be included: All, especially customer-centric, logs, financial trade, scientific
Likely use styles: Investigative
Canonical example: Single-subject 20 TB – 20 PB relational database
Stresses: Performance, scale-out, analytic functionality

But if you’re looking at tens of terabytes of relational data, or even more, you really do have a “big data” problem. Performance and scalability are major challenges, usually best addressed by MPP (Massively Parallel Processing) systems, such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum. Performance POCs (Proofs Of Concept) are a big part of the buying process. Vendor price negotiations are crucial too.
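The basic MPP idea can be sketched in a few lines of Python. This is a toy under stated assumptions: threads stand in for separate machines, the node count and row data are invented, and the modulo placement is a stand-in for a real hash distribution. Each "node" aggregates only its own shard, and only the small partial results are merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 4  # hypothetical cluster size

def node_for(key, num_nodes=NUM_NODES):
    # Toy placement on an integer key; real systems hash the distribution key.
    return key % num_nodes

def distribute(rows, num_nodes=NUM_NODES):
    shards = [[] for _ in range(num_nodes)]
    for row in rows:
        shards[node_for(row[0], num_nodes)].append(row)
    return shards

def local_aggregate(shard):
    """Runs on each node: partial SUM(amount) GROUP BY region over local rows only."""
    partial = Counter()
    for _customer_id, region, amount in shard:
        partial[region] += amount
    return partial

def parallel_sum_by_region(shards):
    """Scatter the aggregation to every shard, then merge the small partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(local_aggregate, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts rather than replacing them
    return dict(total)

# Hypothetical rows: (customer_id, region, amount).
rows = [(i, "east" if i % 3 == 0 else "west", i * 10) for i in range(1, 13)]
shards = distribute(rows)
```

Calling `parallel_sum_by_region(shards)` moves one tiny Counter per node instead of every row. Queries that need data to meet across nodes, such as joins on a column other than the distribution key, are where the harder data movement problems show up, and where a performance POC earns its keep.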

Actually, in the low tens of terabytes you might be able to get away with a shared-disk system that has excellent compression — e.g., columnar products like Sybase IQ, Infobright, or SAND, rather than just Vertica and ParAccel.

Assuming you have affordable, scalable query performance, the competitive differentiator can switch to additional analytic functionality. Aster, Netezza, ParAccel, Vertica, and Greenplum either offer full analytic platforms, or seem to be on the path to doing so. Teradata, which now owns Aster Data, offers substantial built-in analytic capability in its traditional products as well, and the same goes for Sybase IQ.

Continued in Part 2, where we cover some of the more difficult use cases.