VectorWise – DBMS 2 : DataBase Management System Services

Are analytic RDBMS and data warehouse appliances obsolete?

Curt Monash — Mon, 29 Aug 2016 01:28:31 +0000

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.

Simply reciting all that, however, raises the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”

If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.

Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.
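To make the ELT idea concrete, here is a minimal sketch in Python. It assumes SQLite as a stand-in for the analytic RDBMS, and the table names, column names and sample rows are all invented for illustration. The point is only the ordering: the raw data is loaded first, and the transformation then runs as SQL inside the database rather than in a separate ETL engine.

```python
import sqlite3

# Hypothetical raw feed; column names and values are invented for illustration.
raw_rows = [
    ("2016-08-01", "widget", "19.99"),
    ("2016-08-01", "gadget", "5.00"),
    ("2016-08-02", "widget", "19.99"),
]

db = sqlite3.connect(":memory:")  # stand-in for the analytic RDBMS

# Extract + Load: land the data untransformed, with every field as text.
db.execute("CREATE TABLE staging_sales (sale_date TEXT, product TEXT, amount TEXT)")
db.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)", raw_rows)

# Transform: done afterwards, in SQL, by the database engine itself.
db.execute("""
    CREATE TABLE daily_sales AS
    SELECT sale_date, product, SUM(CAST(amount AS REAL)) AS revenue
    FROM staging_sales
    GROUP BY sale_date, product
""")

daily = db.execute(
    "SELECT sale_date, product, revenue FROM daily_sales ORDER BY sale_date, product"
).fetchall()
for row in daily:
    print(row)
```

In a traditional ETL flow, the casting and aggregation would instead happen before the insert, outside the database.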

And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.

My technical observations around these trends include:

Advanced analytics commonly require flexible, iterative processing.
Spark is much better at such processing than earlier Hadoop …
… which in turn is better than anything that’s been built into an analytic RDBMS.
Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.

And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.

They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.

But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer the trade-offs:

They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (data base administration can be just-in-time) or to OLTP integration (less or no re-loading).
Other integrations, as noted above, can also be weak.

Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.

Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.

Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.

Actian Vector Hadoop Edition

Curt Monash — Thu, 07 Aug 2014 11:12:35 +0000

I have a small blacklist of companies I won’t talk with because of their particularly unethical past behavior. Actian is one such; they evidently made stuff up about me that Josh Berkus gullibly posted for them, and I don’t want to have conversations that could be dishonestly used against me.

That said, Peter Boncz isn’t exactly an Actian employee. Rather, he’s the professor who supervised Marcin Zukowski’s PhD thesis that became Vectorwise, and I chatted with Peter by Skype while he was at home in Amsterdam. I believe his assurances that no Actian personnel sat in on the call.

In other news, Peter is currently working on and optimistic about HyPer. But we literally spent less than a minute talking about that.

Before I get to the substance, there’s been a lot of renaming at Actian. To quote Andrew Brust,

… the ParAccel, Pervasive and Vectorwise technologies are being unified under the Actian Analytics Platform brand. Specifically, the ParAccel technology … is being re-branded Actian Matrix; Pervasive’s technologies are rechristened Actian DataFlow and Actian DataConnect; and Vectorwise becomes Actian Vector.

and

Actian … is now “one company, with one voice and one platform” according to its John Santaferraro

The bolded part of the latter quote is untrue — at least in the ordinary sense of the word “one” — but the rest can presumably be taken as company gospel.

All this is by way of preamble to saying that Peter reached out to me about Actian’s new Vector Hadoop Edition when he blogged about it last June, and we finally talked this week. Highlights include:

Vectorwise, while being proudly multi-core, was previously single-server. The new Vector Hadoop Edition is the first version with node parallelism.
Actian’s Vector Hadoop edition uses HDFS (Hadoop Distributed File System) and YARN to manage an Actian-proprietary file format. There is currently no interoperability whereby Hadoop jobs can read these files. However …
… Actian’s Vector Hadoop edition relies on Hadoop for cluster management, workload management and so on.
Peter thinks there are two paying customers, both too recent to be in production, who between them paid what I’d call a remarkable amount of money.*
Roadmap futures* include:
- Being able to update and indeed trickle-update data. Peter is very proud of Vectorwise’s Positional Delta Tree updating.
- Some elasticity they’re proud of, both in terms of nodes (generally limited to the replication factor of 3) and cores (not so limited).
- Better interoperability with Hadoop.

Actian actually bundles Vector Hadoop Edition with DataFlow — the old Pervasive DataRush — into what it calls “Actian Analytics Platform – Hadoop SQL Edition”. DataFlow/DataRush has been working over Hadoop since the latter part of 2012, based on a visit with my then clients at Pervasive that December.

*Peter gave me details about revenue, pipeline, roadmap timetables etc. that I’m redacting in case Actian wouldn’t like them shared. I should say that the timetable for some — not all — of the roadmap items was quite near-term; however, pay no attention to any phrasing in Peter’s blog post that suggests the roadmap features are already shipping.

The Actian Vector Hadoop Edition optimizer and query-planning story goes something like this:

Vectorwise started with the open-source Ingres optimizer. After a query is optimized, it is rewritten to reflect Vectorwise’s columnar architecture. Peter notes that these rewrites rarely change operator ordering; they just add column-specific optimizations, whatever that means.
Now there are rewrites for parallelism as well.
These rewrites all seem to be heuristic/rule-based rather than cost-based.
Once Vectorwise became part of the Ingres company (later renamed to Actian), they had help from Ingres engineers, who helped them modify the base optimizer so that it wasn’t just the “stock” Ingres one.

As with most modern MPP (Massively Parallel Processing) analytic RDBMS, there doesn’t seem to be any concept of a head-node to which intermediate results need to be shipped. This is good, because head nodes in early MPP analytic RDBMS were dreadful bottlenecks.

Peter and I also talked a bit about SQL-oriented HDFS file formats, such as Parquet and ORC. He doesn’t like their lack of support for columnar compression. Further, in Parquet there seems to be a requirement to read the whole file, to an extent that interferes with Vectorwise’s form of data skipping, which it calls “min-max indexing”.
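For readers unfamiliar with data skipping, the general idea behind a min-max index can be sketched as follows. This is a generic illustration in Python, not Vectorwise's actual implementation; the block size, class name and function names are assumptions made up for the example. Each block of a column records its smallest and largest value, so a range scan can skip any block whose range cannot overlap the predicate.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    values: List[int]
    lo: int  # smallest value stored in the block
    hi: int  # largest value stored in the block


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record each block's min and max."""
    blocks = []
    for i in range(0, len(column), block_size):
        chunk = column[i:i + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return the values in [low, high] plus the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for b in blocks:
        if b.hi < low or b.lo > high:
            continue  # the block's range cannot overlap the predicate: skip it
        blocks_read += 1
        matches.extend(v for v in b.values if low <= v <= high)
    return matches, blocks_read


column = list(range(100))          # sorted data makes skipping very effective
blocks = build_blocks(column, 10)  # ten blocks of ten values each
matches, blocks_read = scan_range(blocks, 42, 47)
print(matches, blocks_read)
```

The benefit depends on being able to read individual blocks independently, which is why a format that forces reading the whole file would interfere with it.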

Frankly, I don’t think the architectural choice “uses Hadoop for workload management and administration” provides a lot of customer benefit in this case. Given that, I don’t know that the world needs another immature MPP analytic RDBMS. I also note with concern that Actian has two different MPP analytic RDBMS products. Still, Vectorwise and indeed all the stuff that comes out of Martin Kersten and Peter’s group in Amsterdam has always been interesting technology. So the Actian Vector Hadoop Edition might be worth taking a look at before you redirect your attention to products with more convincing track records and futures.

More on Actian/ParAccel/VectorWise/Versant/etc.

Curt Monash — Mon, 29 Apr 2013 11:50:52 +0000

My quick reaction to the Actian/ParAccel deal was negative. A few challenges to my views then emerged. They didn’t really change my mind.

Amazon Redshift

Amazon did a deal with ParAccel that amounted to:

Amazon got a very cheap license to a limited subset of ParAccel’s product …
… so that it could launch a service called Amazon Redshift.
Amazon also invested in ParAccel.

Some argue that this is great for ParAccel’s future prospects. I’m not convinced.

No doubt there are and will be Redshift users, evidently including Infor. But so far as I can tell, Redshift uses very standard SQL, so it doesn’t seed a ParAccel market in terms of developer habits. The administration/operation story is similar. So outside of general validation/bragging rights, Redshift is not a big deal for ParAccel.

OEMs and bragging rights

It’s not just Amazon and Infor; there’s also a MicroStrategy deal to OEM ParAccel — I think it’s the real ParAccel software in that case — for a particular service, MicroStrategy Wisdom. But unless I’m terribly mistaken, HP Vertica, Sybase IQ and even Infobright each have a lot more OEMs than ParAccel, just as they have a lot more customers than ParAccel overall.

This OEM success is a great validation for the idea of columnar analytic RDBMS in general, but I don’t see where it’s an advantage for ParAccel vs. the columnar leaders.

Concurrency

As I admitted in the comment thread to my first Actian/ParAccel post, I’m confused about what kind of concurrent usage ParAccel can really support. The data I have, e.g. in the link immediately above, is not conclusive. Googling suggests that VectorWise was at one user per core a couple of years ago, supportive of my hypothesis that it doesn’t have some big concurrency edge on ParAccel. But to repeat — I don’t really know.

DBMS acquisitions in the past

My history blog on DBMS acquisitions yielded more favorable examples than I was expecting. (Of course, I omitted a lot of small and boring failures.) And DBMS conglomerates are the rule more than the exception, with IBM, Sybase, Teradata and Oracle all adopting acquisition-aided multi-DBMS strategies, at least to some extent.

That said, Sybase is the main example of a vendor of a slow-growth DBMS (Adaptive Server Enterprise) doing well with a faster-growing one (Sybase IQ). Perhaps not coincidentally, Actian’s latest management team draws significantly on Sybase. So yes; ParAccel is now owned by a company run by guys who know something about selling columnar DBMS.

But the whole thing would be more convincing if Ingres had shown more life under Actian’s ownership, or indeed at any point in the past 20 years. My bottom line is that Actian was floundering badly in the DBMS market 1 1/2 years ago, and not a lot of favorable news has emerged in the interim — except, quite arguably, for the management changes and acquisitions themselves.

Goodbye VectorWise, farewell ParAccel?

Curt Monash — Thu, 25 Apr 2013 23:59:20 +0000

Actian, which already owns VectorWise, is also buying ParAccel. The argument for why this kills VectorWise is simple. ParAccel does most things VectorWise does, more or less as well. It also does a lot more:

ParAccel scales out.
ParAccel has added analytic platform capabilities.
I don’t know for sure, but I’d guess ParAccel has more mature management/plumbing capabilities as well.

One might conjecture that ParAccel is bad at highly concurrent, single-node use cases, and VectorWise is better at them — but at the link above, ParAccel bragged of supporting 5,000 concurrent connections. Besides, if one is just looking for a high-use reporting server, why not get Sybase IQ?? Anyhow, Actian hasn’t been investing enough in VectorWise to make it a major market player, and they’re unlikely to start now that they own ParAccel as well.

But I expect ParAccel to fail too. Reasons include:

ParAccel’s small market share and traction.
The disruption of any acquisition like this one.
My general view of Actian as a company.

2 years after being acquired, Vertica — which conceptually has always been ParAccel’s closest competitor — has finally taken major hits on engineering staffing. Even so, I expect HP Vertica to reopen what was once a large technology and momentum gap vs. ParAccel.

My views on Actian start:

Actian is attempting to build a database software conglomerate on the cheap, starting with Ingres, ParAccel, VectorWise, Pervasive (itself a small conglomerate) and Versant.
Actian hasn’t accomplished much with Ingres, its original acquisition.
Actian hasn’t accomplished much with VectorWise.
Actian’s brief, embarrassing pivot away from database software was a joke. (The comments at that link also show VectorWise’s positioning as very different in September, 2011 than it is now.)
I’ve had some very bad experiences with Actian management, although it seems to have largely turned over since then.
I can’t identify the folks to make this work at the acquired pieces either (even though I think well of a few of them, e.g. Mike Hoskins and Rick Glick).

I.e., building a database conglomerate is hard, and Actian isn’t up to the challenge.

Actian has three main paths it can follow for synergy:

Acquire a lot of pieces and flip the whole thing for more money to a foolish buyer. This strategy worked splendidly for Autonomy, and to some extent for Sybase as well. But it’s a longshot, and not necessarily a win for customers even if investors do well.
Sell a bunch of disparate products through the same sales force. Tough to execute. And at best it raises sales coverage up to the level of that for the most successful product — and Actian doesn’t really have successful new products.
Integrate the technologies. Blech. You don’t integrate DBMS with wildly different architectures, as Informix died trying in the 1990s.

I don’t see enough opportunity there for the whole thing to work out, with sales synergy being the best opportunity to prove me wrong.

Related links

Doug Henschen and Derrick Harris offer quotes and numbers about the deal.
VectorWise’s academic founders Peter Boncz and Marcin Zukowski seem to have left the company.

Comments on Gartner’s 2012 Magic Quadrant for Data Warehouse Database Management Systems — evaluations

Curt Monash — Tue, 05 Feb 2013 13:25:15 +0000

To my taste, the most glaring mis-rankings in the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management are that it is too positive on Kognitio and too negative on Infobright. Secondarily, it is too negative on HP Vertica, and too positive on ParAccel and Actian/VectorWise. So let’s consider those vendors first.

Gartner seems confused about Kognitio’s products and history alike.

Gartner calls Kognitio an “in-memory” DBMS, which is not accurate.
Gartner doesn’t remark on Kognitio’s worst-in-class* compression.
Gartner gives Kognitio oddly high marks for a late, me-too Hadoop integration strategy.
Gartner writes as if Kognitio’s next attempt at the US market will be the first one, which is not the case.
Gartner says that Kognitio pioneered data warehouse SaaS (Software as a Service), which actually has existed since the pre-relational 1970s.

Gartner is correct, however, to note that Kognitio doesn’t sell much stuff overall.

* non-existent

In the cases of HP Vertica, Infobright, ParAccel, and Actian/VectorWise, the 2012 Gartner Magic Quadrant for Data Warehouse Database Management’s facts are fairly accurate, but I dispute Gartner’s evaluation. When it comes to Vertica:

I think HP’s troubles are less relevant to HP Vertica than Gartner does.
In particular, Vertica’s lack of integration with Autonomy isn’t a big deal. Many relational DBMS vendors don’t even own a text search engine to not-integrate with, and the number of vendors with seriously effective analytic RDBMS/text search integration strategies is zero.
Gartner is correct to note that Vertica’s integration with the rest of HP, for example the hardware side, has been slow — but again, so what?
Gartner correctly praises Vertica’s analytic platform capabilities, but then seems to criticize Vertica’s capabilities in user-defined functions — notwithstanding that Vertica’s analytic platform capabilities are implemented via UDFs.
Gartner seems to criticize Vertica’s “volume credentials”, even though Vertica’s number of petabyte-scale analytic RDBMS customers may be second only to Teradata’s.

That said, I defer to Gartner’s opinion that HP Vertica’s sales momentum has disappointed, even if against higher expectations than one might have for vendors with 1/10 of Vertica’s installed base.

2 years ago, I simply said “What Gartner said in connection with Ingres is too inaccurate to deserve detailed attention.” This year’s Gartner Magic Quadrant for Data Warehouse Database Management isn’t that bad on the subject of Actian,* but it’s not great either. Writing mainly about Actian’s VectorWise, Gartner dings it for both features and bugginess, and correctly notes that VectorWise is only suitable for fairly small data warehouses. But Gartner gives VectorWise higher marks than Exasol even so. Gartner also writes that VectorWise has a “long tradition of having loyal supporters”, notwithstanding that VectorWise’s initial release was less than 3 years ago.

*Ingres’ new name, in honor of a 2011 pivot that seems to already have been deprecated

What the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management says about ParAccel isn’t too different on the facts from what I wrote in December, but Gartner is more enthused than I am. Basically:

ParAccel is playing catch-up in features and company stability, and Gartner somehow sees that as a strength.
Gartner dings ParAccel for a variety of product weaknesses.
Gartner is breathless about ParAccel being used on a MicroStrategy SaaS site. (Hint: Deals like that go to vendors willing to accept very low prices.)
Gartner is also impressed with ParAccel’s Amazon involvement. (Deals like that aren’t lucrative either, except insofar as Amazon bought some ParAccel stock.)

That Gartner ranks ParAccel ahead of HP Vertica baffles me. Perhaps Gartner views benchmarks as more significant than I do, or is otherwise judging ParAccel to have an important advantage in performance.

I also am in substantial agreement with the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management about Infobright facts, but this time I’m the one with the more favorable interpretation. I agree that Infobright is a bit limited in features, in the areas Gartner cites and in analytic platform capabilities as well. Even so, Infobright is far ahead of VectorWise in — also low-priced — sales, and ahead in product stability and features too. Yet Gartner gives Infobright vastly lower marks than Actian. I suspect that the essence of our disagreement is that Gartner sees Infobright’s focus on machine-generated data as something that “limits market expansion”, while I see machine-generated data as something that is by every measure* growing to be a majority of the whole.

*By raw volume that’s been true for a while. But if we adjust for value and so on, the crossover is arguably still a way off.

I’ll run through the other vendors cited in the 2012 Gartner Magic Quadrant for Data Warehouse Database Management more quickly, in approximate declining order of Gartner’s rankings.

Gartner loves Teradata, but has some concerns over TCO (Total Cost of Ownership). Makes sense to me.
Gartner is more impressed with Oracle’s technology than I am. I find it hard to take seriously a data warehouse RDBMS vendor that can’t deliver a true columnar storage option.
Gartner’s write-up of IBM gets lost in IBM’s vast sea of products. I sympathize.
Gartner’s discussion of SAP/Sybase IQ was overly brief. But given my own difficulties staying up to speed on my clients over there, I sympathize. Please stay tuned.
Gartner’s discussion of Microsoft gets overly caught up in “logical data warehouse” foofarah, but basically it makes sense. Pending some promised briefings, I’m more optimistic about Microsoft’s analytic DBMS offerings than I’ve been for a long time. Please stay tuned.
Gartner notes difficulties validating EMC Greenplum’s customer claims. I sympathize. Gartner also notes a bunch of product issues that make me wonder why EMC Greenplum’s overall rating isn’t even lower.
Gartner’s view of Exasol seems similar to mine.
I’d evaluate 1010data on the basis of its spreadsheet-like analytic tools, not its DBMS technology.
Gartner seems to have difficulty finding non-trivial “Strengths” for Calpont. I sympathize.
Gartner notes difficulties contacting SAND. I sympathize, since SAND’s senior management resigned en masse during the July, 2012 quarter. (Page 30 of that link.)

Highlights of a busy news week

Curt Monash — Mon, 26 Sep 2011 05:50:35 +0000

I put up 14 posts over the past week, so perhaps you haven’t had a chance yet to read them all. Highlights included:

My most important post of the week was a general guide to IT vendor strategy. That one has already spawned discussion at many companies, from the tiny to the multi-billion-dollar.
The best comment thread of the week was probably on my post about scale-out relational OLTP choices, in which people discussed the merits of various particular alternatives.
I recommended that people strongly consider attending XLDB 5 in Menlo Park on October 18-19.

Most of the posts, however, were reactions to news events. In particular:

Teradata announced that Teradata 14 will be hybrid-columnar, more in Vertica’s way than in Greenplum’s or Aster Data’s. (Pay no attention to the Wall Street Journal’s apparent belief that no other analytic DBMS is hybrid-columnar at all.)
Aster announced the unsurprising news that there will be a Teradata Aster appliance. Also, Aster talked about greater analytic flexibility in the forthcoming Aster 5.0.
With Oracle OpenWorld coming up, Oracle decided to get some of its announcing out of the way early. In particular, it announced the Oracle Database Appliance, which is small-business-friendly hardware for running the Oracle DBMS. However, the Oracle Database Appliance doesn’t seem to do much about the complexity of running the Oracle DBMS software.
In a catch-all Hadoop post, I noted that:
- Oracle has now clearly said it has a Hadoop appliance coming, no doubt next week at OpenWorld.
- I still can’t see why Hadoop appliances would succeed, but a lot of smart folks seem to disagree with me.
- Greenplum announced what looks like a nice but unimportant little product upgrade.
- It’s a really good thing that previously reported plans to revamp Hadoop are underway.
DataStax announced that it really is a Cassandra company after all. Pay no attention to previous marketing that seemed to put DataStax in the same Hadoop-alternative category as, say, MapR.
Ingres has changed its name to Actian. The announcement seems like a confession that Ingres and VectorWise are going nowhere.

Ingres deemphasized, company now named Actian

Curt Monash — Sun, 25 Sep 2011 11:48:18 +0000

Ingres, the company, is:

Changing its name to Actian.
Deemphasizing Ingres, the product.
Emphasizing a set of products that don’t exist yet (or at least aren’t shipping), namely lightweight mobile apps that are business-intelligence-plus-an-action, and technology for building them. These are called “Action Apps”, and are discussed on the Actian company blog.
Positioning all this as something to do with “big data” (what a shock).

It turns out that Actian was the name of an ancient athletic competition commemorating Augustus’ defeat of Antony at Actium, a battle that was more recently memorialized in the movie Cleopatra. Frankly, I think Cleopatra Software might have been a more interesting company name, although that could mean execs would have to arrive at sales calls rolled up in a carpet.

One article said:

Greg Wood, chief financial officer for Actian, told V3 that while the firm would continue to develop and maintain the Ingres database platform, it would be placing the spotlight on its Cloud Action Platform and its line of Action Apps.

“The Ingres database is well-recognised and we will continue to support it, but at the same time that brand was more associated with an older-generation technology,” Wood said.

“We think Actian better reflects where we are going as a company, particularly the application strategy.”

Wood explained that the platform would look to expand on the emerging field of big data applications by adding functionality for end users. The small, specialised applications would link up with data analytics tools, providing alerts and actions when various conditions are spotted within a database.

So what about VectorWise? Notwithstanding Actian’s stated focus on “big data”, I think VectorWise’s chances for market success are slim.* Reasons include:

The market for shared-disk columnar analytic DBMS is crowded (Sybase IQ, Infobright, SAND). Those vendors also have to compete with MPP columnar analytic DBMS offerings from Vertica and ParAccel.
I’ve never heard anything to make me believe VectorWise is getting significant market traction.
Indeed, Daniel Abadi’s well-known flirtation with the idea of using VectorWise in HadoopDB/Hadapt excepted, I don’t recall any marketplace mention of VectorWise at all.

*The possibility of some kind of Action App synergy leads me to elevate them to “slim” from “none”.

The Action App idea actually sounds cool, but it’s quite a change from Ingres’ previous positioning and technology, and I have no basis for judging it as likely to succeed. On the other hand, companies have occasionally made successful transitions into business intelligence from relatively unrelated businesses before, most notably Cognos in the mid-1990s.

Hadapt update

Curt Monash — Wed, 06 Jul 2011 23:43:49 +0000

I met with the Hadapt guys today. I think I can be a bit crisper than before in positioning Hadapt and its use cases, namely:

Hadapt is additional software on a cluster that also runs fully functional Hadoop/HDFS. (Cloudera Hadoop more than straight-from-Apache Hadoop to date, but that’s not a requirement.)
The cluster also runs a DBMS on every node, such as PostgreSQL or one of Infobright/Vectorwise.
Hadapt’s software manages parallel SQL queries by distributing them to the DBMS living on each node. Hadapt says that the resulting query performance far outshines Hive’s.
Hadapt further says that, by exploiting the partner DBMS, its SQL functionality outpaces Hive’s as well.
Target Hadapt use cases are centered around keeping machine-generated or other poly-structured data in Hadoop, and extracting, enhancing, or otherwise deriving some of it to live in the relational store.
In particular, Hadapt seems like an interesting choice when you want to use that relational data as you work on other data that’s still in HDFS, or if you want to keep using the relational data in other kinds of MapReduce jobs.
That all fits well with my thoughts about the importance of derived data.
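The "DBMS on every node" pattern described in the list above can be sketched roughly like this. It is only an illustration under heavy assumptions: in-memory SQLite databases stand in for the per-node PostgreSQL or VectorWise instances, the nodes are queried one after another rather than in parallel, and the table, column and function names are all hypothetical. Hadapt's real coordination layer is not shown in the source and is not modeled here.

```python
import sqlite3


def make_node(rows):
    """Create one 'node': a local DBMS holding its own partition of the data."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (host TEXT, nbytes INTEGER)")
    db.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return db


# Three hypothetical nodes, each with a different slice of the table.
nodes = [
    make_node([("a", 10), ("b", 5)]),
    make_node([("a", 7), ("c", 1)]),
    make_node([("b", 3), ("c", 4)]),
]

# The same SQL is pushed down to every node's local DBMS.
LOCAL_SQL = "SELECT host, SUM(nbytes) FROM events GROUP BY host"


def distributed_sum(nodes):
    """Run the local query on each node, then merge the partial aggregates."""
    totals = {}
    for db in nodes:
        for host, partial in db.execute(LOCAL_SQL):
            totals[host] = totals.get(host, 0) + partial
    return totals


totals = distributed_sum(nodes)
print(totals)
```

Each local engine does the heavy lifting on its own partition, so a faster per-node DBMS translates fairly directly into faster overall queries.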

Other evolution from what I wrote about Hadapt a few months ago includes:

Hadapt is in beta now.
Hadapt has added adult supervision in the form of Philip Wickline, late of Endeca.

In other news, Hadapt is our newest client.

Hadapt (commercialized HadoopDB)

Curt Monash — Wed, 23 Mar 2011 12:35:52 +0000

The HadoopDB company Hadapt is finally launching, based on the HadoopDB project, albeit with code rewritten from scratch. As you may recall, the core idea of HadoopDB is to put a DBMS on every node, and use MapReduce to talk to the whole database. The idea is to get the same SQL/MapReduce integration as you get if you use Hive, but with much better performance* and perhaps somewhat better SQL functionality.** Advantages vs. a DBMS-based analytic platform that includes MapReduce — e.g. Aster Data — are less clear.

*At least if the underlying DBMS is a fast one. Hadapt likes VectorWise for that purpose, and is showing performance comparisons that assume VectorWise is underneath.

**It seems that Hadapt in the future is assured of having more SQL coverage than Hive does today.

It’s still early days for the Hadapt company. Funding is on the angel level. There seem to be six employees — Yale professor Daniel Abadi, CEO Justin Borgman, Chief Scientist Kamil Bajda-Pawlikowski,* and three other coders. The Hadapt product will go into beta at an unspecified future time; there currently are a couple of alpha users/design partners. The Hadapt company, a Yale spin-off, obviously needs to move from Connecticut soon. I wasn’t able to detect any particular outside experience in the form of directors or advisors. And Hadapt’s marketing efforts are still somewhat ragged. So basically, the reasons for believing in Hadapt pretty much boil down to:

Daniel Abadi is a star.**
Hadapt’s own tests show that Hadapt is a whole lot faster than Hive.

*Bajda-Pawlikowski is one of the two Abadi students who did the HadoopDB work. It turns out he had numerous years of coding experience before entering graduate school. (The other student, Azza Abouzied, is pursuing an academic career.)

**Vertica was built around Daniel’s C-Store Ph.D. thesis. He was involved in H-Store as well. He has a really good blog. He’s a really nice guy. Etc.

As you might have guessed from the name, the Hadapt guys are proud that their technology is “adaptive,” which communicates their fond belief that Hadapt’s query optimization and planning are more modern and cool than other folks’ query planning and optimization. In particular, Daniel suggested that Hadapt is more thoughtful than most DBMS are about looking at the size of intermediate result sets and then replanning queries accordingly.

However, the really cool adaptivity point is that Hadapt watches the performance of individual nodes, and takes that into account in query replanning. Daniel asserts, credibly, that this is a Really Good Feature to have in cloud and/or virtualized environments, where Hadapt might not have full control and use of its nodes. I’d add that it could also give Hadapt a lot of flexibility to be run on clusters of non-identical machines.
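
Hadapt didn't describe its actual replanning algorithm to me, so the following is only a sketch of the general idea of taking observed node speed into account. It assumes work arrives as splits of known size and that each node's throughput has been measured; the function names, node names, and numbers are all hypothetical.

```python
def assign_splits(split_sizes, node_speeds):
    """Greedily give each split to the node expected to finish it soonest.

    split_sizes: rows in each pending split (hypothetical unit of work)
    node_speeds: observed rows/second per node (hypothetical measurement)
    Returns (plan, makespan): split indexes per node, and estimated finish time.
    """
    free_at = {name: 0.0 for name in node_speeds}  # when each node is next idle
    plan = {name: [] for name in node_speeds}
    for i, size in enumerate(split_sizes):
        best = min(node_speeds, key=lambda n: free_at[n] + size / node_speeds[n])
        free_at[best] += size / node_speeds[best]
        plan[best].append(i)
    return plan, max(free_at.values())

def round_robin_makespan(split_sizes, node_speeds):
    """Baseline that ignores node speed: deal the splits out evenly."""
    names = list(node_speeds)
    busy = {name: 0.0 for name in names}
    for i, size in enumerate(split_sizes):
        name = names[i % len(names)]
        busy[name] += size / node_speeds[name]
    return max(busy.values())

splits = [100] * 9                       # nine equal splits of 100 rows
speeds = {"a": 100, "b": 100, "c": 25}   # node "c" is a slow or shared machine
plan, adaptive_time = assign_splits(splits, speeds)
naive_time = round_robin_makespan(splits, speeds)
print(plan, adaptive_time, naive_time)
```

In this toy setup the slow node ends up with a single split, and the estimated finish time is a third of what an even distribution would give. That is the kind of benefit one would hope for on clusters of non-identical or virtualized machines.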

On the negative side, Hadapt will not at first have any awareness of how its underlying DBMS are optimized; it will plan for VectorWise the same way it does for PostgreSQL. In that regard, this is a DATAllegro 1.0 story. If I understood correctly, Hadapt has specific connectors for a couple of DBMS (probably exactly those two), and can also talk JDBC to anything. PostgreSQL was apparently 5X faster than MySQL when tested (with either MyISAM or InnoDB); Daniel snorted about, for example, MySQL’s apparent fondness for nested-loop joins over hybrid hash. On the other hand, he was more circumspect about his reasons for favoring VectorWise over, to name another open source columnar DBMS, Infobright.
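
For readers wondering why the choice of join algorithm matters so much, here is a rough sketch comparing a nested-loop join with a hash join. Note that this is a plain in-memory hash join, not a hybrid hash join, which additionally partitions its inputs and spills to disk when the build side doesn't fit in memory. The tables and names are invented, and the "cost" counted is just comparisons or probes, not anything either DBMS actually measures.

```python
from collections import defaultdict

def nested_loop_join(left, right):
    """Compare every left row with every right row: O(len(left) * len(right))."""
    out, comparisons = [], 0
    for lk, lv in left:
        for rk, rv in right:
            comparisons += 1
            if lk == rk:
                out.append((lk, lv, rv))
    return out, comparisons

def hash_join(left, right):
    """Build a hash table on one input, then probe it once per row of the other."""
    table = defaultdict(list)
    for rk, rv in right:              # build phase
        table[rk].append(rv)
    out, probes = [], 0
    for lk, lv in left:               # probe phase
        probes += 1
        for rv in table.get(lk, ()):
            out.append((lk, lv, rv))
    return out, probes

# Hypothetical data: 1,000 orders referencing 100 customers.
orders = [(i % 100, f"order{i}") for i in range(1000)]
customers = [(i, f"cust{i}") for i in range(100)]

nl_rows, nl_cost = nested_loop_join(orders, customers)
hj_rows, hj_cost = hash_join(orders, customers)
print(nl_cost, hj_cost)
```

Both functions return the same rows, but the nested loop does 100,000 comparisons where the hash join does 1,000 probes, and the gap widens as the tables grow.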

And finally, a couple of other points:

Hadapt will be closed source, although it will of course rely on large amounts of other people’s open source software. Pay no attention to the importance Daniel previously ascribed to HadoopDB’s open source nature.
Hadapt decompresses data before moving it from node to node, and also before doing non-SQL MapReduce operations on it. Pay no attention to the years Daniel spent insisting columnar DBMS absolutely must operate on data in compressed form.

Ingres VectorWise technical highlights

Curt Monash — Fri, 11 Jun 2010 11:28:18 +0000

After working through problems w/ travel, cell phones, and so on, Peter Boncz of VectorWise finally caught up with me for a regrettably brief call. Peter gave me the strong impression that what I’d written in the past about VectorWise had been and remained accurate, so I focused on filling in the gaps. Highlights included:

VectorWise is indeed a shared-everything analytic DBMS.
The VectorWise front-end is Ingres. Ingres VectorWise supports almost all SQL that Ingres does (there are a few edge-case exceptions).
Conversely, Ingres VectorWise doesn’t support any SQL Ingres doesn’t, most notably SQL-99 Analytics. Naturally, SQL-99 Analytics is a roadmap item for Ingres/VectorWise.
Ingres VectorWise 1.0 is pretty purely columnar. There’s a bit of PAX, but it’s mainly automagic/under the covers. The one user-controlled exception I understood was that one can ensure that composite keys are stored together.
The main Ingres VectorWise performance secret sauce ingredients we touched on were:
- Vectorization of operations (hence VectorWise’s name).
- Compression that is tuned for speed rather than to minimize storage utilization.
We unfortunately didn’t have time to revisit the other big part of the Ingres VectorWise performance story, namely clever design for modern microprocessor architectures. High-level generalities about that do pervade the Ingres VectorWise press release, but – well, they’re very high level.
Unlike Vertica but like most other columnar DBMS vendors, Ingres VectorWise wants you to store your data once. You can index-organize the data. You can also organize multiple tables in the same order, to make joins among them fast.
Support for actual join indexes is an Ingres VectorWise roadmap item.
As do ever more analytic DBMS, Ingres VectorWise has something akin to Netezza zone maps.
When I asked Peter what had changed most from the initial VectorWise development plan, other than the above, he basically said that their performance priorities had shifted a bit. Specifically, he said:
- They had originally been “blinded” (his word) by the TPC-H benchmark, but figured out that they were overly focused on it. (Well, duh.)
- They learned about the importance of other things such as data loading speeds.
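
Since "vectorization of operations" can sound abstract, here is a small sketch of the general idea as it is commonly described: each operator processes a batch (vector) of column values per call, rather than one row at a time. This is not VectorWise code. The vector size, primitive names, and data are assumptions made up for illustration, and pure Python shows only the shape of the technique, not the CPU-level gains.

```python
VECTOR_SIZE = 1024  # assumed batch size, chosen only for illustration

def select_less_than(values, bound):
    """Primitive: return the positions in this vector where value < bound."""
    return [i for i, v in enumerate(values) if v < bound]

def multiply_selected(a, b, sel):
    """Primitive: multiply two vectors at the selected positions only."""
    return [a[i] * b[i] for i in sel]

def vectorized_revenue(price, qty, max_qty):
    """SELECT SUM(price * qty) WHERE qty < max_qty, one vector at a time."""
    total = 0
    for start in range(0, len(price), VECTOR_SIZE):
        p = price[start:start + VECTOR_SIZE]   # slice of the price column
        q = qty[start:start + VECTOR_SIZE]     # matching slice of the qty column
        sel = select_less_than(q, max_qty)     # selection vector for this batch
        total += sum(multiply_selected(p, q, sel))
    return total

def tuple_at_a_time_revenue(price, qty, max_qty):
    """The same query evaluated row by row, for comparison."""
    total = 0
    for p, q in zip(price, qty):
        if q < max_qty:
            total += p * q
    return total

# Hypothetical columns of 5,000 values each.
price = [(i % 50) + 1 for i in range(5000)]
qty = [i % 10 for i in range(5000)]

print(vectorized_revenue(price, qty, 5) == tuple_at_a_time_revenue(price, qty, 5))
```

In a real engine each primitive would be a tight loop over an array in compiled code, which is where per-call overhead gets amortized and where clever use of modern microprocessors comes in.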