I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:
- Many of the independent vendors were swooped up by acquisition.
- None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
- Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
- Impala and other Hadoop-based alternatives are technology options.
- Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.
Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.
My answer, in a nutshell, is:
Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud – are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.
To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”
- If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
- If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
- If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.
Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.
And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.
My technical observations around these trends include:
- Advanced analytics commonly require flexible, iterative processing.
- Spark is much better at such processing than earlier Hadoop …
- … which in turn is better than anything that’s been built into an analytic RDBMS.
- Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
- Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.
And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.
- They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
- Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.
But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer the trade-offs:
- They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
- One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (data base administration can be just-in-time) or to OLTP integration (less or no re-loading).
- Other integrations, as noted above, can also be weak.
Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.
Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.
Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.