Greenplum – DBMS 2 : DataBase Management System Services

Are analytic RDBMS and data warehouse appliances obsolete?

Curt Monash — Mon, 29 Aug 2016 01:28:31 +0000

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.

Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”

If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.

Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.

And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.

My technical observations around these trends include:

Advanced analytics commonly require flexible, iterative processing.
Spark is much better at such processing than earlier Hadoop …
… which in turn is better than anything that’s been built into an analytic RDBMS.
Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.

And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.

They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.

But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer the trade-offs:

They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (data base administration can be just-in-time) or to OLTP integration (less or no re-loading).
Other integrations, as noted above, can also be weak.

Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.

Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.

Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.

Data messes

Curt Monash — Mon, 03 Aug 2015 09:58:23 +0000

A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.

To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example — — mine.)

Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:

Inconsistent, in which case humans might not know how to look it up and database JOINs might fail.
Unintegrated, in which case one application might not be able to use data that another happily maintains. (This is the classic data silo problem.)

Inconsistency can take multiple forms, including:

Variant names.
Variant spellings.
Variant data structures (not to mention datatypes, formats, etc.).

Addressing the first two is the province of master data management (MDM), and also of the same data cleaning technologies that might help with outright errors. Addressing the third is the province of other data integration technology, which also may be what’s needed to break down the barriers between data silos.

So far I’ve been assuming that data is neatly arranged in fields in some kind of database. But suppose it’s in documents or videos or something? Well, then there’s a needed step of data enhancement; even when that’s done, further data integration issues are likely to be present.

All of the above issues occur with analytic data too. In some cases it probably makes sense not to fix them until the data is shipped over for analysis. In other cases, it should be fixed earlier, but isn’t. And in hybrid cases, data is explicitly shipped to an operational data warehouse where the problems are presumably fixed.

Further, some problems are much greater in their analytic guise. Harmonization and integration among data silos are likely to be much more intense. (What is one table for analytic purposes might be many different ones operationally, for reasons that might span geography, time period, or application legacy.) Addressing those issues is the province of data integration technologies old and new. Also, data transformation and enhancement are likely to be much bigger deals in the analytic sphere, in part because of poly-structured internet data. Many Hadoop and now Spark use cases address exactly those needs.

Let’s now consider missing data. In operational cases, there are three main kinds of missing data:

Missing values, as a special case of inaccuracy.
Data that was only collected over certain time periods, as a special case of changing data structure.
Data that hasn’t been derived yet, as the main case of a need for data enhancement.

All of those cases can ripple through to cause analytic headaches. But for certain inherently analytic data sets — e.g. a weblog or similar stream — the problem can be even worse. The data source might stop functioning, or might change the format in which it transmits; but with no immediate operations compromised, it might take a while to even notice. I don’t know of any technology that does a good, simple job of addressing these problems, but I am advising one startup that plans to try.

Further analytics-mainly data messes can be found in three broad areas:

Problems caused by new or changing data sources hit much faster in analytics than in operations, because analytics draws on a greater variety of data.
Event recognition, in which most of a super-high-volume stream is discarded while the “good stuff” is kept, is more commonly a problem in analytics than in pure operations. (That said, it may arise on the boundary of operations and analytics, namely in “real-time” monitoring.
Analytics has major problems with data scavenger hunts, in which business analysts and data scientists don’t know what data is available for them to examine.

That last area is the domain of a lot of analytics innovation. In particular:

It’s central to the dubious Gartner concept of a Logical Data Warehouse, and to the more modest logical data layers I advocate as alternative.
It’s been part of BI since the introduction of Business Objects’ “semantic layer”. (See, for example, my recent post on Zoomdata.)
It’s a big part of the story of startups such as Alation or Tamr.
In a failed effort, it was part of Greenplum’s pitch some years back, as an aspect of the “enterprise data cloud”.
It led to some of the earliest differentiated features at Gooddata.
It’s implicit in the some BI collaboration stories, in some BI/search integration, and in ClearStory’s “Data You May Like”.

Finally, suppose we return to the case of operational data, assumed to be accurately stored in fielded databases, with sufficient data integration technologies in place. There’s still a whole other kind of possible mess than those I cited above — applications may not be doing a good job of understanding and using it. I could write a whole series of posts on that subject alone … but it’s going slowly. So I’ll leave that subject area for another time.

Greenplum is being open sourced

Curt Monash — Wed, 18 Feb 2015 21:51:39 +0000

While I don’t find the Open Data Platform thing very significant, an associated piece of news seems cooler — Pivotal is open sourcing a bunch of software, with Greenplum as the crown jewel. Notes on that start:

Greenplum has been an on-again/off-again low-cost player since before its acquisition by EMC, but open source is basically a commitment to having low license cost be permanently on.
In most regards, “free like beer” is what’s important here, not “free like speech”. I doubt non-Pivotal employees are going to do much hacking on the long-closed Greenplum code base.
That said, Greenplum forked PostgreSQL a long time ago, and the general PostgreSQL community might gain ideas from some of the work Greenplum has done.
The only other bit of newly open-sourced stuff I find interesting is HAWQ. Redis was already open source, and I’ve never been persuaded to care about GemFire.

Greenplum, let us recall, is a pretty decent MPP (Massively Parallel Processing) analytic RDBMS. Various aspects of it were oversold at various times, and I’ve never heard that they actually licked concurrency. But Greenplum has long had good SQL coverage and petabyte-scale deployments and a columnar option and some in-database analytics and so on; i.e., it’s legit. When somebody asks me about open source analytic RDBMS to consider, I expect Greenplum to consistently be on the short list.

Further, the low-cost alternatives for analytic RDBMS are adding up.

Amazon Redshift has considerable traction.
Hadoop (even just with Hive) has offloaded a lot of ELT (Extract/Load/Transform) from analytic RDBMS such as Teradata.
Now Greenplum is in the mix as well.

For many analytic RDBMS use cases, at least one of those three will be an appealing possibility.

By no means do I want to suggest those are the only alternatives.

Smaller-vendor offerings, such as CitusDB or Infobright, may well be competitive too.
Larger vendors can always slash price in specific deals.
MonetDB is still around.

But the three possibilities I cited first should suffice as proof for almost all enterprises that, for most use cases not requiring high concurrency, analytic RDBMS need not cost an arm and a leg.

Related link

Greenplum revenue at EMC was problematic from the get-go.

Hadoop: And then there were three

Curt Monash — Wed, 18 Feb 2015 21:50:37 +0000

Hortonworks, IBM, EMC Pivotal and others have announced a project called “Open Data Platform” to do … well, I’m not exactly sure what. Mainly, it sounds like:

An attempt to minimize the importance of any technical advantages Cloudera or MapR might have.
A face-saving way to admit that IBM’s and Pivotal’s insistence on having their own Hadoop distributions has been silly.
An excuse for press releases.
A source of an extra logo graphic to put on marketing slides.

Edit: Now there’s a press report saying explicitly that Hortonworks is taking over Pivotal’s Hadoop distro customers (which basically would mean taking over the support contracts and then working to migrate them to Hortonworks’ distro).

The claim is being made that this announcement solves some kind of problem about developing to multiple versions of the Hadoop platform, but to my knowledge that’s a problem rarely encountered in real life. When you already have a multi-enterprise open source community agreeing on APIs (Application Programming interfaces), what API inconsistency remains for a vendor consortium to painstakingly resolve?

Anyhow, it now seems clear that if you want to use a Hadoop distribution, there are three main choices:

Cloudera’s flavor, whether as software (from Cloudera) or in an appliance (e.g. from Oracle).
MapR’s flavor, as software from MapR.
Hortonworks’ flavor, from a number of vendors, including Hortonworks, IBM, Pivotal, Teradata et al.

In saying that, I’m glossing over a few points, such as:

There are various remote services that run Hadoop, most famously Amazon’s Elastic MapReduce.
You could get Apache Hadoop directly, rather than using the free or paid versions of a vendor distro. But why would you make that choice, unless you’re an internet bad-ass on the level of Facebook, or at least think that you are?
There will surely always be some proprietary stuff mixed into, for example, IBM’s BigInsights, so as to preserve at least the perception of all-important vendor lock-in.

But the main point stands — big computer companies, such as IBM, EMC (Pivotal) and previously Intel, are figuring out that they can’t bigfoot something that started out as an elephant — stuffed or otherwise — in the first place.

If you think I’m not taking this whole ODP thing very seriously, you’re right.

Related links

It’s a bit eyebrow-raising to see Mike Olson take a “more open source than thou” stance about something, but basically his post about this news is spot-on.
My take on Hadoop distributions two years ago might offer context. Trivia question: What’s the connection between the song that begins that post and the joke that ends it?

Notes and links, December 12, 2014

Curt Monash — Fri, 12 Dec 2014 11:05:15 +0000

1. A couple years ago I wrote skeptically about integrating predictive modeling and business intelligence. I’m less skeptical now.

For starters:

The predictive experimentation I wrote about over Thanksgiving calls naturally for some BI/dashboarding to monitor how it’s going.
If you think about Nutonian’s pitch, it can be approximated as “Root-cause analysis so easy a business analyst can do it.” That could be interesting to jump to after BI has turned up anomalies. And it should be pretty easy to whip up a UI for choosing a data set and objective function to model on, since those are both things that the BI tool would know how to get to anyway.

I’ve also heard a couple of ideas about how predictive modeling can support BI. One is via my client Omer Trajman, whose startup ScalingData is still semi-stealthy, but says they’re “working at the intersection of big data and IT operations”. The idea goes something like this:

Suppose we have lots of logs about lots of things.* Machine learning can help:
- Notice what’s an anomaly.
- Group* together things that seem to be experiencing similar anomalies.
That can inform a BI-plus interface for a human to figure out what is happening.

Makes sense to me. (Edit: ScalingData subsequently launched, under the name Rocana.)

* The word “cluster” could have been used here in a couple of different ways, so I decided to avoid it altogether.

Finally, I’m hearing a variety of “smart ETL/data preparation” and “we recommend what columns you should join” stories. I don’t know how much machine learning there’s been in those to date, but it’s usually at least on the roadmap to make the systems (yet) smarter in the future. The end benefit is usually to facilitate BI.

2. Discussion of graph DBMS can get confusing. For example:

Use cases run the gamut from short-request to highly analytic; no graph DBMS is well-suited for all graph use cases.
Graph DBMS have huge problems scaling, because graphs are very hard to partition usefully; hence some of the more analytic use cases may not benefit from a graph DBMS at all.
The term “graph” has meanings in computer science that have little to do with the problems graph DBMS try to solve, notably directed acyclic graphs for program execution, which famously are at the heart of both Spark and Tez.
My clients at Neo Technology/Neo4j call one of their major use cases MDM (Master Data Management), without getting much acknowledgement of that from the mainstream MDM community.

I mention this in part because that “MDM” use case actually has some merit. The idea is that hierarchies such as organization charts, product hierarchies and so on often aren’t actually strict hierarchies. And even when they are, they’re usually strict only at specific points in time; if you care about their past state as well as their present one, a hierarchical model might have trouble describing them. Thus, LDAP (Lightweight Directory Access Protocol) engines may not be an ideal way to manage and reference such “hierarchies:; a graph DBMS might do better.

3. There is a surprising degree of controversy among predictive modelers as to whether more data yields better results. Besides, the most common predictive modeling stacks have difficulty scaling. And so it is common to model against samples of a data set rather than the whole thing.*

*Strictly speaking, almost the whole thing — you’ll often want to hold at least a sample of the data back for model testing.

Well, WibiData’s couple of Very Famous Department Store customers have tested WibiData’s ability to model against an entire database vs. their alternative predictive modeling stacks’ need to sample data. WibiData says that both report significantly better results from training over the whole data set than from using just samples.

4. Scaling Data is on the bandwagon for Spark Streaming and Kafka.

5. Derrick Harris and Pivotal turn out to have been earlier than me in posting about Tachyon bullishness.

6. With the Hortonworks deal now officially priced, Derrick was also free to post more about/from Hortonworks’ pitch. Of course, Hortonworks is saying Hadoop will be Big Big Big, and suggesting we should thus not be dismayed by Hortonworks’ financial performance so far. However, Derrick did not cite Hortonworks actually giving any reasons why its competitive position among Hadoop distribution vendors should improve.

Beyond that, Hortonworks says YARN is a big deal, but doesn’t seem to like Spark Streaming.

21st Century DBMS success and failure

Curt Monash — Mon, 14 Jul 2014 06:37:31 +0000

As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.

DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.

In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.

Buyer inertia is a greater concern.

A significant minority of enterprises are highly committed to their enterprise DBMS standards.
Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.

A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.

First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
Meanwhile, Hadoop has emerged as serious competitor for at least some analytic data management, especially but not only at internet companies.

Otherwise I’d say:

At large enterprises, their internet operations perhaps excepted:
- Short-request/general-purpose SQL alternatives to the behemoths — e.g. MySQL, PostgreSQL, NewSQL — have had tremendous difficulty getting established. The last big success was the rise of Microsoft SQL Server in the 1990s. That’s why I haven’t mentioned the term mid-range DBMS in years.
- NoSQL/non-SQL has penetrated large enterprises mainly for a few specific use cases, for example the lists I posted for MongoDB or graph databases.
Internet-only companies have few inertia issues when it comes to database managers. They’ll consider anything they regard as being in their price ballpark (which is however often restricted to open source). I think part of the reason is that as quickly as they rewrite their applications, DBMS are vastly less “strategic” to them than they are to most larger enterprises.
The internet operations of large companies — especially large retailers — in many cases behave like internet-only companies, but in many other cases behave like the rest of the enterprise.

The major reasons for DBMS categories to get established in the first place are:

Performance and/or scalability (many examples).
Developer features (for example dynamic schema).
License/maintenance cost (for example several open source categories).
Ease of installation and administration (for example open source again, and also data warehouse appliances).

Those same characteristics are major bases for competition among members of a new category, although as noted above behemoth-loyalty can also come into play.

Cool-vs.-weird tradeoffs are somewhat secondary among SQL DBMS.

There’s not much of a “cool” factor, because new products aren’t that different in what they do vs. older ones.
There’s not a terrible “weird” factor either, but of course any smaller offering faces FUD, and also …
… appliances are anti-strategic for many buyers, especially ones who demand a smooth path to the cloud.)

They’re huge, however, in the non-SQL world. Most non-SQL data managers have a major “weird” factor. Fortunately, NoSQL and Hadoop both have huge “cool” cred to offset it. XML/XQuery unfortunately did not.

Finally, in most DBMS categories there are massive issues with product completeness, more in the area of maturity than that of whole product. The biggest whole product issues are concentrated on the matter of interoperating with other software — business intelligence tools, packaged applications (if relevant to the category), etc. Most notably, the handful of DBMS that are certified to run SAP share a huge market that other DBMS can’t touch. But BI tools are less of a differentiator — I yawn when vendors tell me they are certified for/partnered with MicroStrategy, Tableau, Pentaho and Jaspersoft, and I’m surprised at any product that isn’t.

DBMS maturity has a lot of aspects, but the toughest challenges are concentrated in two main areas:

Reliability, especially but not only in short-request use cases.
Performance across a great variety of use cases. I observe frequently that performance in best-case scenarios, performance in the lab and performance in real-world environments are much further apart than vendors like to think.

In particular:

Maturity demands seem to be much higher for SQL DBMS than for NoSQL.
- I think this is one of several reasons NoSQL has been much more successful than NewSQL.
- It’s why I think MarkLogic’s “Enterprise NoSQL” positioning is a mistake.
As for MySQL:
- MySQL wasn’t close to reliable enough for enterprises to trust it until InnoDB became the default storage engine.
- MySQL 5 point releases have added major features, or decent performance for major features. I’ll confess to having lost track of what’s been fixed and what’s still missing.
- In saying all that I’m holding MySQL to a much higher maturity standard than I’m holding NoSQL — because that’s what I think enterprise customers do.
PostgreSQL “should” be doing a lot better than it is. I have an extremely low opinion of its promoters, and not just for personal reasons. (That said, the personal reasons don’t just apply to EnterpriseDB anymore. I’ve also run out of patience waiting for Josh Berkus to retract untruths he posted about me years ago.)
SAP HANA checks boxes for performance (In-memory rah rah rah!!) and whole product (Runs SAP!!). That puts it well ahead of most other newish SQL DBMS, purely analytic ones perhaps excepted.
Any other new short-request SQL DBMS that sounds like is has traction is also memory-centric.
Analytic RDBMS are in most respects held to lower maturity standards than DBMS used for write-intensive workloads. Even so, products in the category are still frequently tripped up by considerations of concurrent performance and mixed workload management.

Related links

There have been 1,470 previous posts in the 9-year history of this blog, many of which could serve as background material for this one. A couple that seem particularly germane and didn’t get already get linked above are:

The drive for uninterrupted DBMS operation.
Short-request DBMS trade-offs and alternatives.

Using multiple data stores

Curt Monash — Wed, 18 Jun 2014 16:03:10 +0000

I’m commonly asked to assess vendor claims of the kind:

“Our system lets you do multiple kinds of processing against one database.”
“Otherwise you’d need two or more data managers to get the job done, which would be a catastrophe of unthinkable proportion.”

So I thought it might be useful to quickly review some of the many ways organizations put multiple data stores to work. As usual, my bottom line is:

The most extreme vendor marketing claims are false.
There are many different choices that make sense in at least some use cases each.

Horses for courses

It’s now widely accepted that different data managers are better for different use cases, based on distinctions such as:

Short-request vs. analytic.
SQL vs. non-SQL (NoSQL or otherwise).
Expensive/heavy-duty vs. cheap/easy-to-support.

Vendors are part of this consensus; already in 2005 I observed

For all practical purposes, there are no DBMS vendors left advocating single-server strategies.

Vendor agreement has become even stronger in the interim, as evidenced by Oracle/MySQL, IBM/Netezza, Oracle’s NoSQL dabblings, and various companies’ Hadoop offerings.

Multiple data stores for a single application

We commonly think of one data manager managing one or more databases, each in support of one or more applications. But the other way around works too; it’s normal for a single application to invoke multiple data stores. Indeed, all but the strictest relational bigots would likely agree:

It’s common and sensible to manage authentication and authorization data in its own data store. Commonly, the data format is LDAP (Lightweight Directory Access Protocol).
It’s common and sensible to manage the “content” and “e-commerce transaction records” aspects of websites separately.
Even beyond that case, there are often performance reasons to manage BLOBs (Binary Large OBjects) outside your relational database.
Internet “interaction” data is also often best managed outside an RDBMS, in part because of its very non-tabular data structures.

The spectacular 2010 JP Morgan Chase outage was largely caused, I believe, by disregard of these precepts.

There also are cases in which applications dutifully get all their data via SQL queries, but send those queries to two or more DBMS. Teradata is proud that its systems can support rather transactional queries (for example in call-center use cases), but the same application may read from and write to a true OTLP database as well.

Further, many OLTP (OnLine Transaction Processing) applications do some fraction of their work via inbound or outbound messaging. Many buzzwords can come into play here, including but not limited to:

SOA (Service-Oriented Architecture). This is the most current and flexible one.
EAI (Enterprise Application Integration). This was a hot concept in the late 1990s, but was generally implemented with difficulties that SOA was later designed to alleviate.
Message-oriented middleware (MOM) and Publish/Subscribe. These are even older, and overlap greatly.

Finally, every dashboard that combines information from different data stores could be assigned to this category as well.

Multiple storage approaches in a single DBMS

In theory, a single DBMS could operate like two or more different ones glued together. A few functions should or must be centralized, such as administration, and communication with the outside world (connection handling, parsing, etc.). But data storage, query execution and so on could for the most part be performed by rather loosely coupled subsystems. And so you might have the best of both worlds — something that’s multiple data stores in the ways you want that diversity, but a single system in how it fits into your environment.

I discussed this idea last year with cautious optimism, writing:

So will these trends succeed? The forgoing caveats notwithstanding, my answers are more Yes than No.

… multi-purpose DBMS will likely always have performance penalties, but over time the penalties should become small enough to be affordable in most cases.

…

Machine-generated data and “content” both call for multi-datatype DBMS. And taken together, those are a large fraction of the future of computing. Consequently …

… strong support for multiple datatypes and DMLs is a must for “general-purpose” RDBMS. Oracle and IBM [have] been working on that for 20 years already, with mixed success. I doubt they’ll get much further without a thorough rewrite, but rewrites happen; one of these decades they’re apt to get it right.

In 2005 I had been more ambivalent, in part because my model was a full 1990s-dream “universal” DBMS:

IBM, Oracle, and Microsoft have all worked out ways to have integrated query parsing and query optimization, while letting storage be more or less separate. More precisely, Oracle actually still sticks everything into one data store (hence the lack of native XML support), but allows near-infinite flexibility in how it is accessed. Microsoft has already had separate servers for tabular data, text, and MOLAP, although like Sybase, it doesn’t have general datatype extensibility that it can expose to customers, or exploit itself to provide a great variety of datatypes. IBM has had Oracle-like extensibility all along, although it hasn’t been quite as aggressive at exploiting it; now it’s introduced a separate-server option for XML.

That covers most of the waterfront, but I’d like to more explicitly acknowledge three trends:

Among other things, Hadoop is a collection of DBMS (HBase, Impala, et al.) that in some cases are very loosely coupled to each other. The question is less how well the various data stores work together, and more how mature any one of them is on its own.
The multiple-data-models idea has been extended into schema-on-need, which is sometimes but not always housed in Hadoop.
Even on the relational side, multiple storage capabilities exist in one product.
- Vertica was designed that way from the get-go. (Like the old joke about police duos, one is to read and one is to write.)
- IBM, Microsoft and Oracle have all recently added some kind of in-memory columnar capability.
- Teradata, Aster (before Teradata bought them), Greenplum and Vertica all added some variant on row/column dual stores.

Related links

SQL vs. NoSQL, legacy vs. clean-up. (March, 2014)
The difficulty of DBMS development, including Hadoop-based ones (March, 2013)

Introduction to CitusDB

Curt Monash — Fri, 02 May 2014 08:00:08 +0000

One of my lesser-known clients is Citus Data, a largely Turkish company that is however headquartered in San Francisco. They make CitusDB, which puts a scale-out layer over a collection of fully-functional PostgreSQL nodes, much like Greenplum and Aster Data before it. However, in contrast to those and other Postgres-based analytic MPP (Massively Parallel Processing) DBMS:

CitusDB does not permanently fork PostgreSQL; Citus Data has committed to always working with the latest PostgreSQL release, or at least with one that’s less than a year old.
Citus Data never made the “fat head” mistake — if a join can’t be executed directly on the CitusDB data-storing nodes, it can’t be executed in CitusDB at all.
CitusDB follows the modern best-practice of having many virtual nodes on each physical node. Default size of a virtual node is one gigabyte. Each virtual node is technically its own PostgreSQL table.*
Citus Data has already introduced an open source column-store option for PostgreSQL, which CitusDB of course exploits.

*One benefit to this strategy, besides the usual elasticity and recovery stuff, is that while PostgreSQL may be single-core for any given query, a CitusDB query can use multiple cores by virtue of hitting multiple PostgreSQL tables on each node.

Citus has thrown a few things against the wall; for example, there are two versions of its product, one which involves HDFS (Hadoop Distributed File System) and one of which doesn’t. But I think Citus’ focus will be scale-out PostgreSQL for at least the medium-term future. Citus does have actual customers, and they weren’t all PostgreSQL users previously. Still, the main hope — at least until the product is more built-out — is that existing PostgreSQL users will find CitusDB easy to adopt, in technology and price alike.

Notwithstanding what I said about “fat heads”, CitusDB does have a concept of Master nodes. These:

Also use single-node copies of PostgreSQL.
Are blessedly able to scale out, although their underlying databases are entirely replicated.
Store no actual data, but do store metadata about each virtual node, including:
- Structural metadata.
- Location.
- Min/max column values (for data skipping).
- But not (yet) stats to help with query optimization.
Do some query planning and rewriting.
Handle administration, some of which is nicely parallelized/centralized. (E.g., an index choice can be made once and automatically propagated across all the relevant virtual nodes.)

CitusDB is definitely in its early days. For example:

If I understand correctly, the recent CitusDB 3.0 release is the first one on which data is redistributed among shards. Before that, you could only join tables that were either sharded on the same key, or else small enough to be broadcast-replicated across the whole cluster.
SQL coverage isn’t great. (E.g., no Windowing.)
Some hard-to-parallelize things aren’t implemented yet, e.g. exact median or generally-usable COUNT DISTINCT.
ACID is still lacking. Writes are batch-only, micro-batch or otherwise as the case may be.
CitusDB’s backup story is primitive, with the main options being:
- You can rely on having replicas on multiple nodes, even — if you like — in different data centers.
- You can backup each of the PostgreSQL nodes separately; CitusDB doesn’t yet offer automation for that.
CitusDB’s query optimization sounds pretty primitive.
I don’t recall Citus telling me of serious workload management.
CitusDB compression is block-level only. (PostgreSQL’s version of Lempel-Ziv.)

Still, the Citus Data folks seem to have good ideas, including some — as yet undisclosed — plans going forward. So if it sounds as if CitusDB might fit your needs better than more established scale-out RDBMS do, I’d encourage you to take a look at what Citus offers.

How Revolution Analytics parallelizes R

Curt Monash — Tue, 19 Nov 2013 05:18:41 +0000

I talked tonight with Lee Edlefsen, Chief Scientist of Revolution Analytics, and now think I understand Revolution’s parallel R much better than I did before.

There are four primary ways that people try to parallelize predictive modeling:

They can run the same algorithm on different parts of a dataset on different nodes, then return all the results, and claim they’ve parallelized. This is trivial and not really a solution. It is also the last-ditch fallback position for those who parallelize more seriously.
They can generate intermediate results from different parts of a dataset on different nodes, then generate and return a single final result. This is what Revolution does.
They can parallelize the linear algebra that underlies so many algorithms. Netezza and Greenplum tried this, but I don’t think it worked out very well in either case. Lee cited a saying in statistical computing “If you’re using matrices, you’re doing it wrong”; he thinks shortcuts and workarounds are almost always the better way to go.
They can jack up the speed of inter-node communication, perhaps via MPI (Messaging Passing Interface), so that full parallelization isn’t needed. That’s SAS’ main approach.

One confusing aspect of this discussion is that it could reference several heavily-overlapping but not identical categories of algorithms, including:

External memory algorithms, which operates on datasets too big to fit in main memory, by — for starters — reading in and working on a part of the data at a time. Lee observes that these are almost always parallelizable.
What Revolution markets as External Memory Algorithms, which are those external memory algorithms it has gotten around to implementing so far. These are all parallelized. They are also all in the category of …
… algorithms that can be parallelized by:
- Operating on data in parts.
- Getting intermediate results.
- Combining them in some way for a final result.
Algorithms of the previous category, where the way of combining them specifically is in the form of summation, such as those discussed in the famous paper Map-Reduce for Machine Learning on Multicore. Not all of Revolution’s current parallel algorithms fall into this group.

To be clear, all Revolution’s parallel algorithms are in Category #2 by definition and Category #3 in practice. However, they aren’t all in Category #4.

The canonical example of how to parallelize an algorithm via intermediate results is taking the mean of a large set of numbers. Specifically:

For each subset of data, you both count the entries and sum the values.
Then to combine those intermediate results:
- You sum the sums.
- You also sum the counts.
- You divide the former result by the latter.

Unfortunately, it’s hard to clearly articulate a precise description of these parallelizable algorithms. That said:

What you want is for the end result to be identical irrespective of how the data is split up. (Duh!)
Lee suggested that it is sufficient but not necessary that the way of combining the intermediate results be both commutative and associative.
To date, all of Revolution’s algorithms are — you guessed it! — commutative and associative.

I asked Lee about algorithms that were inherently difficult to parallelize in this style, and he expressed optimism that some other approach would usually work in practice. In particular, we had a lively discussion about finding the exact median, or more generally finding n-tiles and the whole “empirical distribution”. Lee said that, for example, it is extremely fast to bin billions of values into 10,000 buckets. Further, he suggested it is very fast in general to do the operation for integer values, and hence also for any values with a reasonably limited number of significant digits.

As should be clear from this discussion, Revolution’s parallel algorithms are indeed parallel for any reasonable kind of distribution of work. Although they were shipped first for multi-core single-server and MPI environments, the recent ports to Teradata and generic Hadoop MapReduce seem to have been fairly straightforward. Revolution seems to have good modularity between the algorithms themselves, the intermediate data passing, and the original algorithm launch, and hence makes strong claims of R code portability — but the list of exceptions in “portable except for ____” did seem to lengthen a bit each time we returned to the subject.

Finally, notes on Revolution’s Teradata implementation include:

There’s a master process (external stored procedure) which then generates SQL and invokes table operators.
The whole thing runs in protected mode (i.e. out-of-process). Lee thinks that there’s only a small performance penalty vs. in-process.
(For some reason I found this amusing) When you send an R job to Teradata, the R code itself is shipped via ODBC.

while notes on Revolution’s initial Hadoop implementation start:

One way it talks to data in HDFS (Hadoop Distributed File System) is through LibHDFS. The other, when available, is ODBC.
It uses generic MapReduce. Faster alternatives may be implemented down the road.

Related link

Teradata is seeing interest in in-database R. (September, 2013)

RDBMS and their bundle-mates

Curt Monash — Sun, 10 Nov 2013 19:22:48 +0000

Relational DBMS used to be fairly straightforward product suites, which boiled down to:

A big SQL interpreter.
A bunch of administrative and operational tools.
Some very optional add-ons, often including an application development tool.

Now, however, most RDBMS are sold as part of something bigger.

Oracle has hugely thickened its stack, as part of an Innovator’s Solution strategy — hardware, middleware, applications, business intelligence, and more.
IBM has moved aggressively to a bundled “appliance” strategy. Even before that, IBM DB2 long sold much better to committed IBM accounts than as a software-only offering.
Microsoft SQL Server is part of a stack, starting with the Windows operating system.
Sybase was an exception to this rule, with thin(ner) stacks for both Adaptive Server Enterprise and Sybase IQ. But Sybase is now owned by SAP, and increasingly integrated as a business with …
… SAP HANA, which is closely associated with SAP’s applications.
Teradata has always been a hardware/software vendor. The most successful of its analytic DBMS rivals, in some order, are:
- Netezza, a pure appliance vendor, now part of IBM.
- Greenplum, an appliance-mainly vendor for most (not all) of its existence, and in particular now as a part of EMC Pivotal.
- Vertica, more of a software-only vendor than the others, but now owned by and increasingly mainstreamed into hardware vendor HP.
MySQL’s glory years were as part of the “LAMP” stack.
Various thin-stack RDBMS that once were or could have been important market players … aren’t. Examples include Progress OpenEdge, IBM Informix, and the various strays adopted by Actian.

This phenomenon is, I think, much more driven by vendors than users. Most of the examples I listed work or could work perfectly well on their own.* But relational database management systems are seen as “strategic” products, which means in particular:

They’re often expensive to adopt (software, hardware, people costs).
They’re also often expensive to switch away from.

And strategic products, high price tags, and thick product stacks commonly go together.

*Netezza is an exception. But Exadata is not; while Oracle data warehousing was in a bad technical place before Exadata, Exadata software is what cleaned the problem up.

Also relevant is that I took those examples from relatively mature RDBMS market segments — high-end OLTP/general purpose (OnLine Transaction processing), mid-range OLTP/general-purpose, and analytic. Products in those sectors have had enough time to be built out. They also tend to have fairly close competitors, as the most important product features (e.g. columnar storage in analytic RDBMS, or online backup across the board) have been imitated numerous times each.

NewSQL, by way of contrast, is just as thin-stack as NoSQL is. Products in those sectors are immature; vendors are completing them first before wedding them to other technology layers. They’re also strongly differentiated; if you tell me what topology you need and which style(s) of API or DML (Data Manipulation Language) you prefer, the list of product candidates I give you may be short indeed.

HBase is the obvious exception to my “NoSQL products stand alone” generalization, but its market position is a matter of debate.

I have mixed feelings about this trend. For starters, I’m grudgingly becoming more sympathetic to DBMS/hardware bundles, notwithstanding their role as a way to gouge more money from customers than the hardware is actually worth. Why? Because of my opinion that there’s a general move toward appliances, clusters and clouds. In particular:

As DBMS become better at straddling and melding RAM, flash and disk, legitimate reasons to optimize hardware/software integration will increase.
Microsoft (with Parallel Data Warehouse) and SAP (with HANA) induce customers to adopt hardware “appliances” even though they don’t sell and profit from the hardware themselves. This shoots down the argument that appliances are only a vendor trick to squeeze out more profits.
Netezza’s super-easy installation was a really nice feature.

When it comes to RDBMS/business intelligence bundles, my thoughts start:

As a general rule, a benefit of BI is that it can get at data from lots of different sources. This speaks against tying it to a specific DBMS.
The vendor-specific evidence is mixed:
- IBM has never explained any user advantages to including Cognos in its analytic “appliance” product lines.
- Teradata did some special optimizations for MicroStrategy. This suggests that, conversely, MicroStrategy could benefit from DBMS-specific features.
- QlikView built a custom in-memory data store.
- Specialized business intelligence stacks are on the rise, although generally with a beyond-just-relational flavor.

And so I’m skeptical about RDBMS/BI integration, but willing to be persuaded otherwise.

The integration of advanced analytics with RDBMS leaves me perplexed. Gains in performance, scalability and/or development ease would seem, in many cases, too great to pass up. (E.g.. the Teradata Aster 6 story, analytic libraries and all.) And indeed most analytic platform vendors report some level of adoption. But the whole thing is moving more slowly than I expected. Meanwhile in the Hadoop world, a much lesser SQL capability — Hive — seems to be integrated into other analytic processing with enthusiasm. Perhaps the problem is that enterprises have to figure out which analytic techniques to use in the first place, before they worry too much about making them efficient.

And finally, when it comes to bundling of packaged applications with RDBMS — that depends on the class of application.

At the high end, it’s almost purely a pricing ploy, as those apps are usually written for lowest-common-denominator SQL functionality, so as to preserve portability.
A lot of mid-range apps are written against a specific DBMS, which is then resold along with the app. What’s more …
… most of those apps will migrate over time to a SaaS (Software as a Service) delivery model, which allows for a wholly integrated stack. And as the Workday example teaches us, database choices for SaaS apps can be pretty imaginative.

Related links

The refactoring of everything (July, 2013)
Comments about Gartner’s comments about a bunch of DBMS products (November, 2013)
The cardinal rules of DBMS development (March, 2013)