ParAccel – DBMS 2 : DataBase Management System Services

Are analytic RDBMS and data warehouse appliances obsolete?

Curt Monash — Mon, 29 Aug 2016 01:28:31 +0000

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.

Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”

If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.

Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.
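To make the ETL vs. ELT distinction concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for an analytic RDBMS. The table names, the sample rows and the cleanup rule are all illustrative assumptions, not taken from any real deployment.

```python
import sqlite3

# Hypothetical raw input: (customer, amount-as-text), as it might arrive from a source system.
RAW_ROWS = [("alice", " 10.50"), ("bob", "3"), ("alice", "4.25 ")]

TOTALS_SQL = "SELECT customer, SUM(amount) FROM {table} GROUP BY customer ORDER BY customer"


def etl(conn):
    """ETL: transform outside the database, then load only the cleaned rows."""
    cleaned = [(name, float(amount.strip())) for name, amount in RAW_ROWS]
    conn.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)
    return conn.execute(TOTALS_SQL.format(table="sales_etl")).fetchall()


def elt(conn):
    """ELT: load the raw rows untouched, then transform inside the database with SQL."""
    conn.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", RAW_ROWS)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT customer, CAST(TRIM(amount) AS REAL) AS amount FROM sales_raw"
    )
    return conn.execute(TOTALS_SQL.format(table="sales_elt")).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(etl(conn))
    print(elt(conn))
```

Both paths produce the same totals; the only difference is where the cleanup runs. In the ELT version the transformation work lands on the database engine, which is the part of the workload that, per the paragraph above, Hadoop often takes over in bigger installations.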

And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.

My technical observations around these trends include:

Advanced analytics commonly require flexible, iterative processing.
Spark is much better at such processing than earlier Hadoop …
… which in turn is better than anything that’s been built into an analytic RDBMS.
Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.

And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.

They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.

But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer these trade-offs:

They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (data base administration can be just-in-time) or to OLTP integration (less or no re-loading).
Other integrations, as noted above, can also be weak.

Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.

Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.

Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.

Actian Vector Hadoop Edition

Curt Monash — Thu, 07 Aug 2014 11:12:35 +0000

I have a small blacklist of companies I won’t talk with because of their particularly unethical past behavior. Actian is one such; they evidently made stuff up about me that Josh Berkus gullibly posted for them, and I don’t want to have conversations that could be dishonestly used against me.

That said, Peter Boncz isn’t exactly an Actian employee. Rather, he’s the professor who supervised Marcin Zukowski’s PhD thesis that became Vectorwise, and I chatted with Peter by Skype while he was at home in Amsterdam. I believe his assurances that no Actian personnel sat in on the call.

In other news, Peter is currently working on and optimistic about HyPer. But we literally spent less than a minute talking about that.

Before I get to the substance, there’s been a lot of renaming at Actian. To quote Andrew Brust,

… the ParAccel, Pervasive and Vectorwise technologies are being unified under the Actian Analytics Platform brand. Specifically, the ParAccel technology … is being re-branded Actian Matrix; Pervasive’s technologies are rechristened Actian DataFlow and Actian DataConnect; and Vectorwise becomes Actian Vector.

and

Actian … is now “one company, with one voice and one platform” according to its John Santaferraro

The bolded part of the latter quote is untrue — at least in the ordinary sense of the word “one” — but the rest can presumably be taken as company gospel.

All this is by way of preamble to saying that Peter reached out to me about Actian’s new Vector Hadoop Edition when he blogged about it last June, and we finally talked this week. Highlights include:

Vectorwise, while being proudly multi-core, was previously single-server. The new Vector Hadoop Edition is the first version with node parallelism.
Actian’s Vector Hadoop edition uses HDFS (Hadoop Distributed File System) and YARN to manage an Actian-proprietary file format. There is currently no interoperability whereby Hadoop jobs can read these files. However …
… Actian’s Vector Hadoop edition relies on Hadoop for cluster management, workload management and so on.
Peter thinks there are two paying customers, both too recent to be in production, who between them paid what I’d call a remarkable amount of money.*
Roadmap futures* include:
- Being able to update and indeed trickle-update data. Peter is very proud of Vectorwise’s Positional Delta Tree updating.
- Some elasticity they’re proud of, both in terms of nodes (generally limited to the replication factor of 3) and cores (not so limited).
- Better interoperability with Hadoop.

Actian actually bundles Vector Hadoop Edition with DataFlow (the old Pervasive DataRush) into what it calls “Actian Analytics Platform – Hadoop SQL Edition”. Judging by a visit with my then-clients at Pervasive in December 2012, DataFlow/DataRush has been working over Hadoop since the latter part of that year.

*Peter gave me details about revenue, pipeline, roadmap timetables etc. that I’m redacting in case Actian wouldn’t like them shared. I should say that the timetable for some — not all — of the roadmap items was quite near-term; however, pay no attention to any phrasing in Peter’s blog post that suggests the roadmap features are already shipping.

The Actian Vector Hadoop Edition optimizer and query-planning story goes something like this:

Vectorwise started with the open-source Ingres optimizer. After a query is optimized, it is rewritten to reflect Vectorwise’s columnar architecture. Peter notes that these rewrites rarely change operator ordering; they just add column-specific optimizations, whatever that means.
Now there are rewrites for parallelism as well.
These rewrites all seem to be heuristic/rule-based rather than cost-based.
Once Vectorwise became part of the Ingres company (later renamed to Actian), they had help from Ingres engineers, who helped them modify the base optimizer so that it wasn’t just the “stock” Ingres one.

As with most modern MPP (Massively Parallel Processing) analytic RDBMS, there doesn’t seem to be any concept of a head-node to which intermediate results need to be shipped. This is good, because head nodes in early MPP analytic RDBMS were dreadful bottlenecks.

Peter and I also talked a bit about SQL-oriented HDFS file formats, such as Parquet and ORC. He doesn’t like their lack of support for columnar compression. Further, in Parquet there seems to be a requirement to read the whole file, to an extent that interferes with Vectorwise’s form of data skipping, which it calls “min-max indexing”.
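For readers unfamiliar with data skipping, the general idea behind min-max indexing can be sketched as follows. This is a generic illustration of the technique, not Vectorwise's actual implementation; the block size, the `Block` class and the function names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    values: List[int]
    lo: int  # smallest value stored in this block
    hi: int  # largest value stored in this block


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return the values in [low, high] and the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        # The min/max metadata proves this block cannot contain a match, so skip it.
        if block.hi < low or block.lo > high:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, blocks_read


if __name__ == "__main__":
    blocks = build_blocks(list(range(100)), block_size=10)
    values, blocks_read = scan_range(blocks, 25, 34)
    print(values, blocks_read, "of", len(blocks), "blocks read")
```

The payoff depends on being able to read a block's metadata without reading the block itself, which is why a file format that forces reading the whole file would undercut it.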

Frankly, I don’t think the architectural choice “uses Hadoop for workload management and administration” provides a lot of customer benefit in this case. Given that, I don’t know that the world needs another immature MPP analytic RDBMS. I also note with concern that Actian has two different MPP analytic RDBMS products. Still, Vectorwise and indeed all the stuff that comes out of Martin Kersten and Peter’s group in Amsterdam has always been interesting technology. So the Actian Vector Hadoop Edition might be worth taking a look at before you redirect your attention to products with more convincing track records and futures.

Things I keep needing to say

Curt Monash — Mon, 12 Aug 2013 06:45:54 +0000

Some subjects just keep coming up. And so I keep saying things like:

Most generalizations about “Big Data” are false. “Big Data” is a horrific catch-all term, with many different meanings.

Most generalizations about Hadoop are false. Reasons include:

Hadoop is a collection of disparate things, most particularly data storage and application execution systems.
The transition from Hadoop 1 to Hadoop 2 will be drastic.
For key aspects of Hadoop — especially file format and execution engine — there are or will be widely varied options.

Hadoop won’t soon replace relational data warehouses, if indeed it ever does. SQL-on-Hadoop is still very immature. And you can’t replace data warehouses unless you have the power of SQL.

Note: SQL isn’t the only way to provide “the power of SQL”, but alternative approaches are just as immature.

Most generalizations about NoSQL are false. Different NoSQL products are … different. It’s not even accurate to say that all NoSQL systems lack SQL interfaces. (For example, SQL-on-Hadoop often includes SQL-on-HBase.)

“Big Data” doesn’t create rapid IT growth. If we only had traditional kinds of data, IT growth would be drastically negative, since Moore’s Law swamps traditional data growth. Whole new categories of data are always needed to fill the gap. And these days, they’re all categorized as “Big Data”.

The single central database is a myth. Things are never that simple, at least at large enterprises. Hence, in particular, the ideal EDW (Enterprise Data Warehouse) is a myth.

Analytic RDBMS and appliances aren’t necessarily expensive. Deals can be had. Yes, most vendors want at least a few hundred thousand dollars for most sales, but there are plenty of exceptions even to that rule. And at either large or small scales, things get very cheap, for example:

Various vendors’ free/“community” editions.
The $2 million/petabyte hardware+software price I published for Vertica.

And Infobright is typically an economical option in between those extremes, if you’re cool with its focus on machine-generated data.

Columnar relational DBMS are relational. Examples include Sybase IQ, Vertica, ParAccel, Infobright and numerous others.

Yes, that’s a tautology. Even so, distressingly many people forget it, columnar RDBMS vendor employees not excepted.
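A toy illustration of the point: the same relation can be laid out row-wise or column-wise, and a relational query gives the same answer either way. The table contents and function names below are made up for the example; the query is the equivalent of SELECT SUM(units) FROM sales WHERE region = ?.

```python
# Hypothetical three-column relation: (region, product, units).
COLUMN_NAMES = ("region", "product", "units")
ROWS = [
    ("east", "widget", 3),
    ("west", "widget", 5),
    ("east", "gadget", 7),
]


def to_columns(rows):
    """Re-store the same relation column-wise: one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMN_NAMES)}


def units_for_region_rows(rows, region):
    """Row-store evaluation: every row is visited in full."""
    return sum(units for r, _product, units in rows if r == region)


def units_for_region_columns(columns, region):
    """Column-store evaluation: only the two referenced columns are touched."""
    return sum(u for r, u in zip(columns["region"], columns["units"]) if r == region)


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(units_for_region_rows(ROWS, "east"), units_for_region_columns(columns, "east"))
```

The storage layout changes which bytes get read, not what the data model or the query language is.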

Amazon Redshift proves very little about ParAccel. Amazon bought some stock in ParAccel, and got a cheap license to a subset of ParAccel’s code, perhaps in the same deal. Big whoop. Yes,

It is claimed that there are a lot of Redshift users, I presume low-end ones.
ParAccel is fast.*

But none of that speaks to some profound, ongoing Amazon/ParAccel/Actian relationship.

*I hear that ParAccel is usually faster than Vertica and other alternatives in POCs (Proofs of Concept) and benchmarks. But I also hear that ParAccel’s installation complexity continues to be a POC problem.

New technology in old categories of application will only be adopted as quickly as firms replace their apps. Yes, that’s a tautology too. Even so, it puts an upper bound on, for example, the speed with which on-premises applications will be replaced by cloud alternatives.

SAP HANA is not yet a serious OLTP (OnLine Transaction Processing) DBMS. Yes,

HANA has in some form been under development for a long time; its major antecedent is BI Accelerator, which shipped back in 2006.
RAM-centric processing makes sense.
HANA has a cool-sounding feature list.
SAP claims lots of HANA sales, and not just in conjunction with a few new SAP apps that require HANA to run.

But the stories of HANA sales and deployment momentum sure seem concentrated on analytic use cases. And by the way — even among analytic DBMS vendors, I don’t hear much emphasis on competing vs. HANA.

Current BI trends reflect 1990s deja vu. The hottest business intelligence products and vendors are adopted by departments, on the strength of their snazzy interfaces and short adoption cycles.* That’s exactly how BI spread in the 1990s, only now the word “visualization” gets used more.

*A common phrase for that is land-and-expand.

And finally,

I’m not impressed that your future products will in some small ways be superior to what your competitors have had in production for over a year.

More on Actian/ParAccel/VectorWise/Versant/etc.

Curt Monash — Mon, 29 Apr 2013 11:50:52 +0000

My quick reaction to the Actian/ParAccel deal was negative. A few challenges to my views then emerged. They didn’t really change my mind.

Amazon Redshift

Amazon did a deal with ParAccel that amounted to:

Amazon got a very cheap license to a limited subset of ParAccel’s product …
… so that it could launch a service called Amazon Redshift.
Amazon also invested in ParAccel.

Some argue that this is great for ParAccel’s future prospects. I’m not convinced.

No doubt there are and will be Redshift users, evidently including Infor. But so far as I can tell, Redshift uses very standard SQL, so it doesn’t seed a ParAccel market in terms of developer habits. The administration/operation story is similar. So outside of general validation/bragging rights, Redshift is not a big deal for ParAccel.

OEMs and bragging rights

It’s not just Amazon and Infor; there’s also a MicroStrategy deal to OEM ParAccel — I think it’s the real ParAccel software in that case — for a particular service, MicroStrategy Wisdom. But unless I’m terribly mistaken, HP Vertica, Sybase IQ and even Infobright each have a lot more OEMs than ParAccel, just as they have a lot more customers than ParAccel overall.

This OEM success is a great validation for the idea of columnar analytic RDBMS in general, but I don’t see where it’s an advantage for ParAccel vs. the columnar leaders.

Concurrency

As I admitted in the comment thread to my first Actian/ParAccel post, I’m confused about what kind of concurrent usage ParAccel can really support. The data I have, e.g. in the link immediately above, is not conclusive. Googling suggests that VectorWise was at one user per core a couple of years ago, supportive of my hypothesis that it doesn’t have some big concurrency edge on ParAccel. But to repeat — I don’t really know.

DBMS acquisitions in the past

My history blog on DBMS acquisitions yielded more favorable examples than I was expecting. (Of course, I omitted a lot of small and boring failures.) And DBMS conglomerates are the rule more than the exception, with IBM, Sybase, Teradata and Oracle all adopting acquisition-aided multi-DBMS strategies, at least to some extent.

That said, Sybase is the main example of a vendor of a slow-growth DBMS (Adaptive Server Enterprise) doing well with a faster-growing one (Sybase IQ). Perhaps not coincidentally, Actian’s latest management team draws significantly on Sybase. So yes; ParAccel is now owned by a company run by guys who know something about selling columnar DBMS.

But the whole thing would be more convincing if Ingres had shown more life under Actian’s ownership, or indeed at any point in the past 20 years. My bottom line is that Actian was floundering badly in the DBMS market 1 1/2 years ago, and not a lot of favorable news has emerged in the interim — except, quite arguably, for the management changes and acquisitions themselves.

Goodbye VectorWise, farewell ParAccel?

Curt Monash — Thu, 25 Apr 2013 23:59:20 +0000

Actian, which already owns VectorWise, is also buying ParAccel. The argument for why this kills VectorWise is simple. ParAccel does most things VectorWise does, more or less as well. It also does a lot more:

ParAccel scales out.
ParAccel has added analytic platform capabilities.
I don’t know for sure, but I’d guess ParAccel has more mature management/plumbing capabilities as well.

One might conjecture that ParAccel is bad at highly concurrent, single-node use cases, and VectorWise is better at them — but at the link above, ParAccel bragged of supporting 5,000 concurrent connections. Besides, if one is just looking for a high-use reporting server, why not get Sybase IQ?? Anyhow, Actian hasn’t been investing enough in VectorWise to make it a major market player, and they’re unlikely to start now that they own ParAccel as well.

But I expect ParAccel to fail too. Reasons include:

ParAccel’s small market share and traction.
The disruption of any acquisition like this one.
My general view of Actian as a company.

2 years after being acquired, Vertica — which conceptually has always been ParAccel’s closest competitor — has finally taken major hits on engineering staffing. Even so, I expect HP Vertica to reopen what was once a large technology and momentum gap vs. ParAccel.

My views on Actian start:

Actian is attempting to build a database software conglomerate on the cheap, starting with Ingres, ParAccel, VectorWise, Pervasive (itself a small conglomerate) and Versant.
Actian hasn’t accomplished much with Ingres, its original acquisition.
Actian hasn’t accomplished much with VectorWise.
Actian’s brief, embarrassing pivot away from database software was a joke. (The comments at that link also show VectorWise’s positioning as very different in September, 2011 than it is now.)
I’ve had some very bad experiences with Actian management, although it seems to have largely turned over since then.
I can’t identify the folks to make this work at the acquired pieces either (even though I think well of a few of them, e.g. Mike Hoskins and Rick Glick).

I.e., building a database conglomerate is hard, and Actian isn’t up to the challenge.

Actian has three main paths it can follow for synergy:

Acquire a lot of pieces and flip the whole thing for more money to a foolish buyer. This strategy worked splendidly for Autonomy, and to some extent for Sybase as well. But it’s a longshot, and not necessarily a win for customers even if investors do well.
Sell a bunch of disparate products through the same sales force. Tough to execute. And at best it raises sales coverage up to the level of that for the most successful product — and Actian doesn’t really have successful new products.
Integrate the technologies. Blech. You don’t integrate DBMS with wildly different architectures, as Informix died trying in the 1990s.

I don’t see enough opportunity there for the whole thing to work out, with sales synergy being the best opportunity to prove me wrong.

Related links

Doug Henschen and Derrick Harris offer quotes and numbers about the deal.
VectorWise’s academic founders Peter Boncz and Marcin Zukowski seem to have left the company.

Comments on Gartner’s 2012 Magic Quadrant for Data Warehouse Database Management Systems — evaluations

Curt Monash — Tue, 05 Feb 2013 13:25:15 +0000

To my taste, the most glaring mis-rankings in the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management are that it is too positive on Kognitio and too negative on Infobright. Secondarily, it is too negative on HP Vertica, and too positive on ParAccel and Actian/VectorWise. So let’s consider those vendors first.

Gartner seems confused about Kognitio’s products and history alike.

Gartner calls Kognitio an “in-memory” DBMS, which is not accurate.
Gartner doesn’t remark on Kognitio’s worst-in-class* compression.
Gartner gives Kognitio oddly high marks for a late, me-too Hadoop integration strategy.
Gartner writes as if Kognitio’s next attempt at the US market will be the first one, which is not the case.
Gartner says that Kognitio pioneered data warehouse SaaS (Software as a Service), which actually has existed since the pre-relational 1970s.

Gartner is correct, however, to note that Kognitio doesn’t sell much stuff overall.

* non-existent

In the cases of HP Vertica, Infobright, ParAccel, and Actian/VectorWise, the 2012 Gartner Magic Quadrant for Data Warehouse Database Management’s facts are fairly accurate, but I dispute Gartner’s evaluation. When it comes to Vertica:

I think HP’s troubles are less relevant to HP Vertica than Gartner does.
In particular, Vertica’s lack of integration with Autonomy isn’t a big deal. Many relational DBMS vendors don’t even own a text search engine to not-integrate with, and the number of vendors with seriously effective analytic RDBMS/text search integration strategies is zero.
Gartner is correct to note that Vertica’s integration with the rest of HP, for example the hardware side, has been slow — but again, so what?
Gartner correctly praises Vertica’s analytic platform capabilities, but then seems to criticize Vertica’s capabilities in user-defined functions — notwithstanding that Vertica’s analytic platform capabilities are implemented via UDFs.
Gartner seems to criticize Vertica’s “volume credentials”, even though Vertica’s number of petabyte-scale analytic RDBMS customers may be second only to Teradata’s.

That said, I defer to Gartner’s opinion that HP Vertica’s sales momentum has disappointed, even if against higher expectations than one might have for vendors with 1/10 of Vertica’s installed base.

2 years ago, I simply said “What Gartner said in connection with Ingres is too inaccurate to deserve detailed attention.” This year’s Gartner Magic Quadrant for Data Warehouse Database Management isn’t that bad on the subject of Actian,* but it’s not great either. Writing mainly about Actian’s VectorWise, Gartner dings it for both features and bugginess, and correctly notes that VectorWise is only suitable for fairly small data warehouses. But Gartner gives VectorWise higher marks than Exasol even so. Gartner also writes that VectorWise has a “long tradition of having loyal supporters”, notwithstanding that VectorWise’s initial release was less than 3 years ago.

*Ingres’ new name, in honor of a 2011 pivot that seems to already have been deprecated

What the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management says about ParAccel isn’t too different on the facts from what I wrote in December, but Gartner is more enthused than I am. Basically:

ParAccel is playing catch-up in features and company stability, and Gartner somehow sees that as a strength.
Gartner dings ParAccel for a variety of product weaknesses.
Gartner is breathless about ParAccel being used on a MicroStrategy SaaS site. (Hint: Deals like that go to vendors willing to accept very low prices.)
Gartner is also impressed with ParAccel’s Amazon involvement. (Deals like that aren’t lucrative either, except insofar as Amazon bought some ParAccel stock.)

That Gartner ranks ParAccel ahead of HP Vertica baffles me. Perhaps Gartner views benchmarks as more significant than I do, or is otherwise judging ParAccel to have an important advantage in performance.

I also am in substantial agreement with the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management about Infobright facts, but this time I’m the one with the more favorable interpretation. I agree that Infobright is a bit limited in features, in the areas Gartner cites and in analytic platform capabilities as well. Even so, Infobright is far ahead of VectorWise in — also low-priced — sales, and ahead in product stability and features too. Yet Gartner gives Infobright vastly lower marks than Actian. I suspect that the essence of our disagreement is that Gartner sees Infobright’s focus on machine-generated data as something that “limits market expansion”, while I see machine-generated data as something that is by every measure* growing to be a majority of the whole.

*By raw volume that’s been true for a while. But if we adjust for value and so on, the crossover is arguably still a way off.

I’ll run through the other vendors cited in the 2012 Gartner Magic Quadrant for Data Warehouse Database Management more quickly, in approximate declining order of Gartner’s rankings.

Gartner loves Teradata, but has some concerns over TCO (Total Cost of Ownership). Makes sense to me.
Gartner is more impressed with Oracle’s technology than I am. I find it hard to take seriously a data warehouse RDBMS vendor that can’t deliver a true columnar storage option.
Gartner’s write-up of IBM gets lost in IBM’s vast sea of products. I sympathize.
Gartner’s discussion of SAP/Sybase IQ was overly brief. But given my own difficulties staying up to speed on my clients over there, I sympathize. Please stay tuned.
Gartner’s discussion of Microsoft gets overly caught up in “logical data warehouse” foofarah, but basically it makes sense. Pending some promised briefings, I’m more optimistic about Microsoft’s analytic DBMS offerings than I’ve been for a long time. Please stay tuned.
Gartner notes difficulties validating EMC Greenplum’s customer claims. I sympathize. Gartner also notes a bunch of product issues that make me wonder why EMC Greenplum’s overall rating isn’t even lower.
Gartner’s view of Exasol seems similar to mine.
I’d evaluate 1010data on the basis of its spreadsheet-like analytic tools, not its DBMS technology.
Gartner seems to have difficulty finding non-trivial “Strengths” for Calpont. I sympathize.
Gartner notes difficulties contacting SAND. I sympathize, since SAND’s senior management resigned en masse during the July, 2012 quarter. (Page 30 of that link.)

Amazon Redshift and its implications

Curt Monash — Sun, 09 Dec 2012 16:59:10 +0000

Merv Adrian and Doug Henschen both reported more details about Amazon Redshift than I intend to; see also the comments on Doug’s article. I did talk with Rick Glick of ParAccel a bit about the project, and he noted:

Amazon Redshift is missing parts of ParAccel, notably the extensibility framework.
ParAccel did some engineering to make its DBMS run better in the cloud.
Amazon did some engineering in the areas it knows better than ParAccel — cloud provisioning, cloud billing, and so on.

“We didn’t want to do the deal on those terms” comments from other companies suggest ParAccel’s main financial take from the deal is an already-reported venture investment.

The cloud-related engineering was mainly around communications, e.g. strengthening error detection/correction to make up for the lack of dedicated switches. In general, Rick seemed more positive on running in the (Amazon) cloud than analytic RDBMS vendors have been in the past.

So who should and will use Amazon Redshift? For starters, I’d say:

If data isn’t already in the Amazon cloud, getting it there remains a pain. Locating your analytic RDBMS on the same premises where the data is created makes life simpler.
Over 3 years ago, $20,000/terabyte was a great list price for purchasing a data warehouse appliance that required little administration. Imagine negotiated discounts and further declines from there. Even so, Amazon’s <$1K/terabyte/year is a low figure.
Amazon’s marketing suggests companies should put their whole data warehousing on Redshift. But in fact, that almost never happens even with ParAccel.
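A back-of-the-envelope way to compare those two price points is to ask how many years of rental add up to the purchase price. The sketch below uses only the two figures quoted above; the `discount` parameter and the 50% value are hypothetical, and the calculation deliberately ignores administration, power, data transfer and financing costs.

```python
LIST_PRICE_PER_TB = 20_000     # one-time appliance list price quoted above, in dollars
REDSHIFT_PER_TB_YEAR = 1_000   # upper bound on Amazon's quoted price ("<$1K"), in dollars


def breakeven_years(purchase_per_tb, rental_per_tb_year, discount=0.0):
    """Years of rental that add up to the (optionally discounted) purchase price."""
    if rental_per_tb_year <= 0:
        raise ValueError("rental price must be positive")
    return purchase_per_tb * (1.0 - discount) / rental_per_tb_year


if __name__ == "__main__":
    # At list price, and treating $1K as the rental price rather than an upper bound.
    print(breakeven_years(LIST_PRICE_PER_TB, REDSHIFT_PER_TB_YEAR))
    # With a hypothetical 50% negotiated discount on the appliance.
    print(breakeven_years(LIST_PRICE_PER_TB, REDSHIFT_PER_TB_YEAR, discount=0.5))
```

Since the actual Redshift figure is below $1K, the real break-even points would be longer than the ones this prints.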

Also — if Amazon Redshift is your analytic RDBMS, what’s the rest of your analytic environment? I can think of three possibilities that could work pretty straightforwardly:

Business intelligence and just BI.
Statistics and just statistics.
Hadoop (i.e. Elastic MapReduce) plus a lot of hand-coding.

Anything else would seem hard to stitch together at this time.

Putting that together, I see three kinds of users for whom Amazon Redshift might make sense:

Web startups, whose data is all in the Amazon cloud anyway, and who need better analytic SQL performance than they can get from Hadoop.
Data mart outsourcers/data sellers, again probably startups, whose whole business is in the cloud.
Individual analysts with small budgets, or very small analytic groups within enterprises or other organizations.

All three of those are “traditional” markets for new-generation analytic DBMS and data warehouse appliances, except that those DBMS are rarely put into production in the cloud. But for the most part, vendors have moved upscale — enterprise users, analytic platform features, etc. So the biggest threat from Amazon Redshift is to markets that other vendors have somewhat left behind.

So how should and will the analytic RDBMS industry respond? My thoughts on that begin:

Doing nothing would be a poor choice.
They’re already open to having cheap or free low-end offerings — Vertica Community Edition, open-source Infobright, and so on.
Tweaking their systems to work well in the cloud becomes easier all the time, as cloud platforms mature.
A natural solution would be something like a Starter/Standard/Enterprise Edition split, with at least the Starter and Standard Editions being cloud-friendly.

ParAccel update

Curt Monash — Sun, 09 Dec 2012 16:58:16 +0000

In connection with Amazon’s Redshift announcement, ParAccel reached out, and so I talked with them for the first time in a long while. At the highest level:

ParAccel now has 60+ customers, up from 30+ two years ago and 40ish soon thereafter.
ParAccel is now focusing its development and marketing on analytic platform capabilities more than raw database performance.
ParAccel is focusing on working alongside other analytic data stores — relational or Hadoop — rather than supplanting them.

There wasn’t time for a lot of technical detail, but I gather that the bit about working alongside other data stores:

Is relatively new.
Works via SELECT statements that reach out to the other data stores.
Is called “on-demand integration”.
Is built in ParAccel’s extensibility/analytic platform framework.
Uses HCatalog when reaching into Hadoop.
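
To make the "SELECT statements that reach out" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is only an analogy, not ParAccel's actual mechanism: the attached database stands in for an external data store, and the names external_store, customers, and events are hypothetical.

```python
import sqlite3

# The main connection plays the role of the analytic DBMS.
conn = sqlite3.connect(":memory:")

# A second, separate in-memory database stands in for the "other data store".
# (Assumption for illustration; a real system would reach a remote store.)
conn.execute("ATTACH DATABASE ':memory:' AS external_store")

conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute(
    "CREATE TABLE external_store.events (customer_id INTEGER, clicks INTEGER)"
)

conn.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "acme"), (2, "globex")]
)
conn.executemany(
    "INSERT INTO external_store.events VALUES (?, ?)", [(1, 5), (1, 7), (2, 3)]
)

# One SELECT joins local data with data that lives in the other store.
result = conn.execute(
    """
    SELECT c.name, SUM(e.clicks)
    FROM customers AS c
    JOIN external_store.events AS e ON e.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```

The point of the sketch is that the integration happens at query time, inside an ordinary SELECT, rather than through a separate load step.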

Also, it seems that ParAccel:

Is in the early stages of writing its own analytic functions.
Bundles Fuzzy Logix and actually has some users for that.

Various of my questions were answered more fully in email than on the phone, and I got permission to quote and lightly edit those emails. (Text-bolding is by me.)

ParAccel’s customer stats turn out to be:

We currently have won more than 60 paying customers, since opening our doors in 2005. In the last year, we have added 16 additional customers. It’s also important to note that our existing customers continue to expand their use of ParAccel. We have closed 11 new deals with existing customers in the last year. These are all paying customers with ParAccel in production or in development and moving to production.

On the phone, ParAccel cited Evernote and Home Depot as recent new customer wins.

In response to a question about concurrent usage, ParAccel wrote:

We support up to 5,000 connections on a single ParAccel instance. Our emphasis is on analytic performance, and in Proof of Value projects, we encourage testing of our advanced analytic capabilities in real-world scenarios – the more complex the analytics, the better ParAccel performs against the competition. For example, one of our retail customers has an algorithm running within ParAccel with 10,000 lines of code. Another financial services customer runs a 25,000 line SQL query for dynamic risk stress testing.

Here are some of the projects with the largest number of users accessing a ParAccel-based system:

MicroStrategy Wisdom has more than 38,000 users of the system.

We have some large retailers with 1000 to 2000 stores regularly accessing store-specific analytics.

Regarding large databases, ParAccel wrote:

Here are some large current implementations, all expected to grow significantly in the next year:

Merkle: 50 TB

Large internet company: 40 TB

Large financial institution: 20 TB

We also know that Amazon is running a very large, PB+ implementation on their platform based on components licensed from ParAccel.

I don’t know what happened to the two 100+ terabyte customers ParAccel cited in 2010.

Notes on some basic database terminology

Curt Monash — Tue, 07 Aug 2012 10:25:42 +0000

In a call Monday with a prominent company, I was told:

Teradata, Netezza, Greenplum and Vertica aren’t relational.
Teradata, Netezza, Greenplum and Vertica are all data warehouse appliances.

That, to put it mildly, is not accurate. So I shall try, yet again, to set the record straight.

In an industry where people often call a DBMS just a “database” — so that a database is something that manages a database! — one may wonder why I bother. Anyhow …

1. The products commonly known as Oracle, Exadata, DB2, Sybase, SQL Server, Teradata, Sybase IQ, Netezza, Vertica, Greenplum, Aster, Infobright, SAND, ParAccel, Exasol, Kognitio et al. all either are or incorporate relational database management systems, aka RDBMS or relational DBMS.

2. In principle, there can be difficulties in judging whether or not a DBMS is “relational”. In practice, those difficulties don’t arise — yet. Every significant DBMS still falls into one of two categories:

Relational:
- Was designed to do relational stuff* from the get-go, even if it now does other things too.
- Supports a lot of SQL.
Non-relational:
- Was designed primarily to do non-relational things.*
- Doesn’t support all that much SQL.

*I expect the distinction to get more confusing soon, at which point I’ll adopt terms more precise than “relational things” and “relational stuff”.

3. There are two chief kinds of relational DBMS:

RDBMS that are designed for, among other things, online transaction processing (OLTP). Examples include Oracle, DB2, SQL Server, Sybase ASE, PostgreSQL, and MySQL. It is reasonable to refer to these as general-purpose or OLTP RDBMS.*
RDBMS that are designed strictly for analytic uses. Examples include Sybase IQ, Vertica, Greenplum, Aster, Infobright, SAND, ParAccel, Exasol, Kognitio and the DBMS software inside systems from Teradata and Netezza. It is most accurate to refer to these as analytic RDBMS or just analytic DBMS (sometimes abbreviated ADBMS).

* “General-purpose” is usually a better term than “OLTP”; most OLTP DBMS can handle at least basic reporting, and the leading ones go well beyond that.

4. Some analytic RDBMS were designed to be columnar. Some were designed to be row-based. Multiple systems from both groups now offer both column- and row-based storage options. But they’re all equally relational.

And once again, I remind you that columnar storage and columnar compression are not the same thing.
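
A small sketch may help show why the two are different things. In the Python below, the table contents and the function names are hypothetical, and run-length encoding is just one simple example of a compression scheme that works on a column at a time.

```python
from itertools import groupby

# Hypothetical three-row table: (region, year, units).
rows = [
    ("east", 2012, 10),
    ("east", 2012, 12),
    ("west", 2012, 7),
]


def to_columns(table_rows):
    """Pivot a row-based layout into a column-based one (columnar storage)."""
    return [list(column) for column in zip(*table_rows)]


def run_length_encode(values):
    """Compress one column into (value, run_length) pairs (columnar compression)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Step 1: change the storage layout. Nothing is compressed yet.
columns = to_columns(rows)
region_column = columns[0]

# Step 2: optionally compress a column. This is a separate decision.
encoded_region = run_length_encode(region_column)
```

A system can do the first step without the second, which is the sense in which storage layout and compression are independent choices.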

5. An appliance can include a DBMS, and indeed exist for no purpose other than to run a DBMS; but a DBMS is not an appliance. At a minimum, a data warehouse appliance is a computing system (hardware, storage, operating system, etc.) with an analytic RDBMS preinstalled.

Occasionally somebody suggests that a “virtual appliance” doesn’t have to have hardware included, but they usually draw little attention.

However, reasonable people can disagree about pickier questions, such as:

Does appliance hardware have to be in any way purpose-built? I lean to a No — but I prefer those “appliance” stories that include an actual hardware advantage.
Does appliance hardware have to have custom silicon, or at least FPGAs (Field-Programmable Gate Arrays)? My answer is an emphatic No.
Does an appliance have to be super-easy to install and administer? I lean to a No — but two of the top appliance benefits are ease of deployment and administration.

For example, I think:

All hardware systems Teradata makes are appliances, even the ones it thinks aren’t.
Similarly, Oracle Exadata systems are appliances.
IBM Netezza is the classic line of data warehouse appliances.
IBM’s “Smart Analytic Systems” can justifiably be called appliances if IBM wishes — but IBM would be wise to save that word for its Netezza line.

Again, reasonable people can disagree — just so long as they don’t slap the label “appliance” onto software-only analytic RDBMS.

Kognitio’s story today

Curt Monash — Wed, 23 May 2012 01:36:17 +0000

I had dinner tonight with the Kognitio folks. So far as I can tell:

Branding has been mercifully simplified. Everything is now called “Kognitio” (as opposed to, for example, “WX2”).
Notwithstanding its long history of selling disk-based DBMS and denigrating memory-only configurations, Kognitio now says that in fact it’s always been an in-memory DBMS vendor.
Notwithstanding its long history of selling (or attempting to sell) analytic DBMS, Kognitio wants to be viewed as an accelerator to your existing DBMS. This is apparently inspired in part by SAP HANA, notwithstanding that HANA’s direction is to evolve into a hybrid OLTP/analytic general-purpose DBMS.
Notwithstanding its lack of analytic platform features, Kognitio wants to be viewed as selling an analytic platform.
Notwithstanding its memory-centric focus, Kognitio doesn’t want to compress data. Kognitio’s opinion — which to my knowledge is shared by few people outside Kognitio — seems to be that the CPU cost of compression/decompression isn’t justified by the RAM savings from compression.
Kognitio still is pushing a cloud/SaaS (Software as a Service) story. Even if you want to use Kognitio (the product) on-premises, Kognitio (the company) calls that “private cloud” and offers to let you pay annually.

Kognitio believes that this story is appealing, especially to smaller venture-capital-backed companies, and backs that up with some frieNDA pipeline figures.

Between that success claim and SAP’s HANA figures, it seems that the idea of using an in-memory DBMS to accelerate analytics has legs. This makes sense, as the BI vendors — Qlik Tech excepted — don’t seem to be accomplishing much with their proprietary in-memory alternatives. But I’m not sure that Kognitio would be my first choice to fill that role. Rather, if I wanted to buy an unsuccessful analytic RDBMS to use as an in-memory accelerator, I might consider ParAccel, which is columnar, has an associated compression story, has always had a hybrid memory-centric flavor much as Kognitio has, and is well ahead of Kognitio in the analytic platform derby. That said, I’ll confess to not having talked with or heard much about ParAccel for a while, so I don’t know if they’ve been able to maintain technical momentum any more than Kognitio has.