Infobright – DBMS 2 : DataBase Management System Services

Are analytic RDBMS and data warehouse appliances obsolete?

Curt Monash — Mon, 29 Aug 2016 01:28:31 +0000

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.

Simply reciting all that, however, raises the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”

If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.

Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.
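To make the ELT vs. ETL distinction concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module purely as a stand-in for an analytic RDBMS; the table names, column names, and sample rows are all made up for illustration. The only point is where the transformation runs: in the integration layer before loading (ETL), or in SQL after the raw data has been loaded (ELT).

```python
import sqlite3

# Hypothetical raw records: (customer, amount as text, country code).
RAW_ROWS = [
    ("alice", "10.50", "us"),
    ("bob", "3.25", "de"),
    ("alice", "4.50", "us"),
]


def etl(conn, rows):
    """ETL: transform in the integration layer, then load the finished result."""
    totals = {}
    for customer, amount, country in rows:
        key = (customer, country.upper())
        totals[key] = totals.get(key, 0.0) + float(amount)
    conn.execute("CREATE TABLE sales_etl (customer TEXT, country TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO sales_etl VALUES (?, ?, ?)",
        [(c, k, t) for (c, k), t in totals.items()],
    )


def elt(conn, rows):
    """ELT: load the raw rows as-is, then transform inside the DBMS with SQL."""
    conn.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT, country TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)
    conn.execute(
        """
        CREATE TABLE sales_elt AS
        SELECT customer, UPPER(country) AS country,
               SUM(CAST(amount AS REAL)) AS total
        FROM raw_sales
        GROUP BY customer, UPPER(country)
        """
    )


def run_both():
    """Run both pipelines on the same input and return their results."""
    conn = sqlite3.connect(":memory:")
    etl(conn, RAW_ROWS)
    elt(conn, RAW_ROWS)
    query = "SELECT customer, country, total FROM {} ORDER BY customer"
    etl_result = conn.execute(query.format("sales_etl")).fetchall()
    elt_result = conn.execute(query.format("sales_elt")).fetchall()
    conn.close()
    return etl_result, elt_result
```

Both paths produce the same table. What differs is which system does the work, and that is exactly the part of the workload Hadoop tends to take over in bigger installations.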

And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.

My technical observations around these trends include:

Advanced analytics commonly require flexible, iterative processing.
Spark is much better at such processing than earlier Hadoop …
… which in turn is better than anything that’s been built into an analytic RDBMS.
Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.

And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.

They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.
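A "partly relational" task might look like the following sketch, in which ordinary code handles the semi-structured part and SQL handles the join to a reference table. Python's sqlite3 module stands in here for the SQL layer of a Hadoop or Spark deployment; the event format, table names, and data are hypothetical.

```python
import json
import sqlite3

# Hypothetical semi-structured events; fields vary from record to record.
EVENT_LINES = [
    '{"user": 1, "action": "view", "tags": ["promo", "mobile"]}',
    '{"user": 2, "action": "buy", "amount": 30}',
    '{"user": 1, "action": "buy", "amount": 12, "tags": ["promo"]}',
]

# Hypothetical reference data that fits a relational table naturally.
USERS = [(1, "EMEA"), (2, "APAC")]


def purchases_by_region(lines, users):
    # Non-relational part: parse the JSON and keep only purchase events.
    purchases = []
    for line in lines:
        event = json.loads(line)
        if event.get("action") == "buy":
            purchases.append((event["user"], event.get("amount", 0)))

    # Relational part: join the extracted rows to the reference table in SQL.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (user_id INTEGER, amount INTEGER)")
    conn.execute("CREATE TABLE users (user_id INTEGER, region TEXT)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", purchases)
    conn.executemany("INSERT INTO users VALUES (?, ?)", users)
    rows = conn.execute(
        """
        SELECT u.region, SUM(p.amount)
        FROM purchases AS p
        JOIN users AS u ON u.user_id = p.user_id
        GROUP BY u.region
        ORDER BY u.region
        """
    ).fetchall()
    conn.close()
    return rows
```

Calling purchases_by_region(EVENT_LINES, USERS) returns one total per region. The events never had to be forced into a relational schema before the first processing step.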

But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer the trade-offs:

They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (database administration can be just-in-time) or to OLTP integration (less or no re-loading).
Other integrations, as noted above, can also be weak.

Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.

Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.

Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.

Greenplum is being open sourced

Curt Monash — Wed, 18 Feb 2015 21:51:39 +0000

While I don’t find the Open Data Platform thing very significant, an associated piece of news seems cooler — Pivotal is open sourcing a bunch of software, with Greenplum as the crown jewel. Notes on that start:

Greenplum has been an on-again/off-again low-cost player since before its acquisition by EMC, but open source is basically a commitment to having low license cost be permanently on.
In most regards, “free like beer” is what’s important here, not “free like speech”. I doubt non-Pivotal employees are going to do much hacking on the long-closed Greenplum code base.
That said, Greenplum forked PostgreSQL a long time ago, and the general PostgreSQL community might gain ideas from some of the work Greenplum has done.
The only other bit of newly open-sourced stuff I find interesting is HAWQ. Redis was already open source, and I’ve never been persuaded to care about GemFire.

Greenplum, let us recall, is a pretty decent MPP (Massively Parallel Processing) analytic RDBMS. Various aspects of it were oversold at various times, and I’ve never heard that they actually licked concurrency. But Greenplum has long had good SQL coverage and petabyte-scale deployments and a columnar option and some in-database analytics and so on; i.e., it’s legit. When somebody asks me about open source analytic RDBMS to consider, I expect Greenplum to consistently be on the short list.

Further, the low-cost alternatives for analytic RDBMS are adding up.

Amazon Redshift has considerable traction.
Hadoop (even just with Hive) has offloaded a lot of ELT (Extract/Load/Transform) from analytic RDBMS such as Teradata.
Now Greenplum is in the mix as well.

For many analytic RDBMS use cases, at least one of those three will be an appealing possibility.

By no means do I want to suggest those are the only alternatives.

Smaller-vendor offerings, such as CitusDB or Infobright, may well be competitive too.
Larger vendors can always slash price in specific deals.
MonetDB is still around.

But the three possibilities I cited first should suffice as proof for almost all enterprises that, for most use cases not requiring high concurrency, analytic RDBMS need not cost an arm and a leg.

Related link

Greenplum revenue at EMC was problematic from the get-go.

Some stuff I’m thinking about (early 2014)

Curt Monash — Sun, 02 Feb 2014 18:51:49 +0000

From time to time I like to do “what I’m working on” posts. From my recent blogging, you probably already know that includes:

Hadoop (always, and please see below).
Analytic RDBMS (ditto).
NoSQL and NewSQL.
Specifically, SQL-on-Hadoop
Schema-on-need.
Spark and other memory-centric technology, including streaming.
Public policy, mainly but not only in the area of surveillance/privacy.
General strategic advice for all sizes of tech company.

Other stuff on my mind includes but is not limited to:

1. Certain categories of buying organizations are inherently leading-edge.

Internet companies have adopted Hadoop, NoSQL, NewSQL and all that en masse. Often, they won’t even look at things that are conventional or expensive.
US telecom companies have been buying 1 each of every DBMS on the market since pre-relational days.
Financial services firms — specifically algorithmic traders and broker-dealers — have been in their own technical world for decades …
… as have national-security agencies …
… as have pharmaceutical research departments.

Fine. But what really intrigues me is when more ordinary enterprises also put leading-edge technologies into production. I pester everybody for examples of that.

2. In particular, I hope to figure out where Hadoop is or soon will be getting major adoption.

Widespread Hadoop adoption at ordinary large enterprises is, I think, inevitable and imminent. But it hasn’t quite happened yet.
I think that part of the “enterprise data hub” story is a great bet to come true — Hadoop is becoming a key destination for data to land and be transformed. MapReduce was invented for data transformation; Hadoop was invented to do MapReduce; data transformation workloads have already been moving from expensive analytic RDBMS to cheaper Hadoop.
I also think Hadoop — enhanced with Spark or whatever — will win as a platform for sophisticated predictive modeling; Hadoop’s (and Spark’s) flexibility is at least as useful for the purpose as RDBMS’ SQL execution speed.
I’m still skeptical about ordinary enterprises’ adoption of Hadoop as a business intelligence platform, but it’s definitely another area to track.

3. Analytic RDBMS and data warehouse appliance pricing is always a big deal. Hadoop’s great price advantage doesn’t have to be permanent, and in fact there are a number of fairly low-cost RDBMS offerings, such as petascale Vertica, the Teradata 1000 series, or Infobright.

Speaking of that, it turns out Teradata now publishes per-terabyte pricing. Please note that those are uncompressed prices; actual prices can be assumed to be lower, at least for databases that compress well.

Analytic RDBMS prices are still shaking out.

4. As I previously noted, ensemble models have become the norm for machine learning. I want to learn more about the implications of that.

One conjecture — everything we learned in school about statistics is wrong, or at least it’s less important than we thought. Predictive modeling is not mainly about least squares, regressions, curve-fitting, etc. Rather, it’s first and foremost about data segmentation and clustering, with all the curve-fitting stuff being secondary.

Besides fitting — as it were — what I hear, this hypothesis also matches common sense. How do businesses use predictive modeling? For each customer/prospect/site-visitor/whatever, they decide which of a limited number of possible actions to take. At its core, that’s an exercise in segmentation.
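As a rough illustration of that conjecture (and only that), the sketch below chooses an action purely by segment membership: each customer goes to the nearest of a few segment centers, and each segment maps to one action. The features, centers, and action names are invented, and the features are left unscaled for brevity.

```python
import math

# Hypothetical segment centers over two features:
# (visits per month, average order value in dollars).
SEGMENT_CENTERS = {
    "browser": (12.0, 5.0),
    "big_spender": (3.0, 200.0),
    "regular": (6.0, 40.0),
}

# One action per segment; the action set is deliberately small.
SEGMENT_ACTIONS = {
    "browser": "show_discount",
    "big_spender": "offer_premium_support",
    "regular": "send_newsletter",
}


def assign_segment(customer):
    """Return the name of the segment whose center is nearest to the customer."""
    return min(
        SEGMENT_CENTERS,
        key=lambda name: math.dist(customer, SEGMENT_CENTERS[name]),
    )


def choose_action(customer):
    """Pick an action by segment membership alone, with no curve-fitting step."""
    return SEGMENT_ACTIONS[assign_segment(customer)]
```

In a real system the centers would come from a clustering step over historical data rather than being typed in by hand; that step is omitted here.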

5. I think data integration is getting a lot smarter than it was. Hadoop-based transformation is the obvious example. But there’s also ClearStory’s data intelligence pitch. (And yes, I know I need to talk with Paxata. There’s been a lot of ball-dropping on that one, including by me.)

6. There’s a meta-theme in the above — stuff that’s not exactly a DBMS or DBMS-like data store. Streaming fits into that. So does smart data integration. So, arguably, does Spark. So do data grids, another of those topics I’d like to know more about but haven’t nailed down yet.

Data management is getting ever more complex.

Thoughts on SaaS

Curt Monash — Mon, 25 Nov 2013 01:16:05 +0000

Generalizing about SaaS (Software as a Service) is hard. To prune some of the confusion, let’s start by noting:

SaaS has been around for over half a century, and at times has been the dominant mode of application delivery.
The term multi-tenancy is being used in several different ways.
Multi-tenancy, in the purest sense, is inessential to SaaS. It’s simply an implementation choice that has certain benefits for the SaaS provider. And by the way, …
… salesforce.com, the chief proponent of the theory that true multi-tenancy is the hallmark of true SaaS, abandoned that position this week.
Internet-based services are commonly, if you squint a little, SaaS. Examples include but are hardly limited to Google, Twitter, Dropbox, Intuit, Amazon Web Services, and the company that hosts this blog (KnownHost).
Some of the core arguments for SaaS’ rise, namely the various efficiencies of data center outsourcing and scale, apply equally to the public cloud, to SaaS, and to AEaaS (Anything Else as a Service).
These benefits are particularly strong for inherently networked use cases. For example, you really don’t want to be hosting your website yourself. And salesforce.com got its start supporting salespeople who worked out of remote offices.
In theory and occasionally in practice, certain SaaS benefits, namely the outsourcing of software maintenance and updates, could be enjoyed on-premises as well. Whether that could become a bigger deal going forward is something I’ll explore in future posts.

For smaller enterprises, the core outsourcing argument is compelling. How small? Well:

What’s the minimum level of IT operations headcount needed for mission-critical systems? Let’s just say “several”.
What does that cost? Fully burdened, somewhere in the six figures.
What fraction of the IT budget should such headcount be? As low a double digit percentage as possible.
What fraction of revenues should be spent on IT? Some single-digit percentage.

So except for special cases, an enterprise with less than $100 million or so in revenue may have trouble affording on-site data processing, at least at a mission-critical level of robustness. It may well be better to use NetSuite or something like that, assuming needed features are available in SaaS form.*

*Truth be told, I’m not up to speed on mid-range SaaS application suite alternatives.
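The arithmetic behind that revenue threshold can be sketched as follows. The specific inputs are my own assumptions for illustration, not figures from any survey: "several" is taken as 4 people, "six figures" as $150,000 each, "low double digit" as 10% to 20% of the IT budget, and "single-digit" as 3% to 6% of revenue.

```python
def implied_revenue(headcount, cost_per_head, ops_share_of_it, it_share_of_revenue):
    """Revenue at which ops staffing fits inside the given budget ratios."""
    ops_cost = headcount * cost_per_head
    it_budget = ops_cost / ops_share_of_it
    return it_budget / it_share_of_revenue


# Assumed inputs: 4 operations staff at $150,000 each, fully burdened.
HEADCOUNT = 4
COST_PER_HEAD = 150_000

# Ops at 20% of IT, IT at 6% of revenue: the least revenue-hungry case.
low = implied_revenue(HEADCOUNT, COST_PER_HEAD, 0.20, 0.06)
# Ops at 10% of IT, IT at 3% of revenue: the most revenue-hungry case.
high = implied_revenue(HEADCOUNT, COST_PER_HEAD, 0.10, 0.03)
```

With these inputs the implied revenue comes out at roughly $50 million to $200 million, a range that brackets the $100 million figure. Different but equally plausible assumptions would move the range considerably.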

Continuing that thought: if you’re a mid-range application software provider, you have to develop a SaaS version of your product line. That’s a very different business model than the apps + OEMed platform you’re probably providing now, but it’s the best way to serve your customers going forward. And by the way, while mid-range application software is commonly sold on a regional basis, SaaS can be sold more globally; after all, the need for onsite service is eliminated, and price points should in most cases fit with telephone sales. Yes, national language and regional data privacy rules are both concerns, but they still leave the available markets looking much bigger than regional resellers have traditionally enjoyed. So expect shake-outs in a whole lot of vertical markets, as vendors horn in on each other’s territories, and a few elephantine winners perhaps emerge.

The argument above assumes that extreme reliability is needed. So there’s nothing necessarily wrong with a small team of business analysts sticking an RDBMS appliance* in a corner and managing it themselves. If it sputters from time to time, who cares; using it still may be easier than getting that data in and out of the cloud. But eventually, if all the data is remote anyway — SaaS, website, etc. — then it may make sense to do analytics remotely as well.

*Previously, that appliance might have been from Netezza; now, my first thought is the cheaper — albeit more limited — Infobright.

The arguments that direct smaller companies toward SaaS apply to large enterprises too, but they aren’t as dispositive. Larger enterprises can actually afford to do their own IT operations if they want to. What’s more, moving away from in-house operations is harder for big firms, due to the larger and more customized portfolio of legacy systems they’re likely to have. That said:

Almost all enterprises should have their internet-facing systems offsite, even if just via co-location. The core reasons are that ingesting high-volume inbound network traffic is inherently difficult, and security issues make it much tougher yet. In addressing these challenges, specialists enjoy significant economies of scale.
Most enterprises will have plenty of SaaS silos. If nothing else:
- Complex machinery will increasingly “phone home” for help staying in good working order. That’s a form of SaaS.
- Information providers and aggregators tend to deliver via SaaS.
- Various kinds of collaboration and communication apps, from Google Mail to Dropbox, live in the cloud. Personal productivity applications, from word processing to Photoshop, may be following.
- “Rodney Dangerfield” departments — i.e., ones unhappy with the respect and attention they get from central IT — often turn to SaaS or similar outsourcing. Human resources is an obvious example, from Automatic Data Processing to Employease to, these days, Workday.

That leaves us with the questions as to when and how large enterprises should or will move their core applications to SaaS and/or the cloud. Given the length of this post, I won’t try to answer them now. But for starters:

Enterprises don’t like to rip and replace their apps, except in consolidation projects, as long as they can avoid doing so.
Cloud/remote computing economies are less convincing if you already have your computer rooms staffed and set up.
A key benefit of SaaS is that vendors control and drive the upgrade cycles. One cost of that is restrictions on customization, although you can also build apps and app extensions on PaaS/DBaaS/WaaS (Platform/DataBase/Whatever as a Service) offerings such as force.com.
Lock-in is a serious concern, for application and platform offerings alike. Not only are you betting on one vendor’s software black box, you’re also betting on its remote computing operation. If you grow dissatisfied with either, or with their pricing, you may not have much opportunity to escape.

Things I keep needing to say

Curt Monash — Mon, 12 Aug 2013 06:45:54 +0000

Some subjects just keep coming up. And so I keep saying things like:

Most generalizations about “Big Data” are false. “Big Data” is a horrific catch-all term, with many different meanings.

Most generalizations about Hadoop are false. Reasons include:

Hadoop is a collection of disparate things, most particularly data storage and application execution systems.
The transition from Hadoop 1 to Hadoop 2 will be drastic.
For key aspects of Hadoop — especially file format and execution engine — there are or will be widely varied options.

Hadoop won’t soon replace relational data warehouses, if indeed it ever does. SQL-on-Hadoop is still very immature. And you can’t replace data warehouses unless you have the power of SQL.

Note: SQL isn’t the only way to provide “the power of SQL”, but alternative approaches are just as immature.

Most generalizations about NoSQL are false. Different NoSQL products are … different. It’s not even accurate to say that all NoSQL systems lack SQL interfaces. (For example, SQL-on-Hadoop often includes SQL-on-HBase.)

“Big Data” doesn’t create rapid IT growth. If we only had traditional kinds of data, IT growth would be drastically negative, since Moore’s Law swamps traditional data growth. Whole new categories of data are always needed to fill the gap. And these days, they’re all categorized as “Big Data”.
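The reasoning can be checked with a back-of-the-envelope calculation. The numbers are assumptions chosen only to illustrate the shape of the argument: traditional data growing 25% a year, and the cost of a unit of capacity halving every two years.

```python
def relative_spend(years, data_growth_per_year, cost_halving_years):
    """Spend needed after `years`, relative to today's spend of 1.0."""
    data_volume = (1.0 + data_growth_per_year) ** years
    unit_cost = 0.5 ** (years / cost_halving_years)
    return data_volume * unit_cost


# Assumed: traditional data grows 25% a year; unit cost halves every 2 years.
after_ten_years = relative_spend(10, 0.25, 2.0)
```

Under those assumptions, spending after ten years is a bit under 30% of today's level, even though the data volume has grown more than ninefold. Spending only holds steady if data volume doubles about as fast as unit cost halves.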

The single central database is a myth. Things are never that simple, at least at large enterprises. Hence, in particular, the ideal EDW (Enterprise Data Warehouse) is a myth.

Analytic RDBMS and appliances aren’t necessarily expensive. Deals can be had. Yes, most vendors want at least a few hundred thousand dollars for most sales, but there are plenty of exceptions even to that rule. And at either large or small scales, things get very cheap, for example:

Various vendors’ free/“community” editions.
The $2 million/petabyte hardware+software price I published for Vertica.

And Infobright is typically an economical option in between those extremes, if you’re cool with its focus on machine-generated data.

Columnar relational DBMS are relational. Examples include Sybase IQ, Vertica, ParAccel, Infobright and numerous others.

Yes, that’s a tautology. Even so, distressingly many people forget it, columnar RDBMS vendor employees not excepted.

Amazon Redshift proves very little about ParAccel. Amazon bought some stock in ParAccel, and got a cheap license to a subset of ParAccel’s code, perhaps in the same deal. Big whoop. Yes,

It is claimed that there are a lot of Redshift users, I presume low-end ones.
ParAccel is fast.*

But none of that speaks to some profound, ongoing Amazon/ParAccel/Actian relationship.

*I hear that ParAccel is usually faster than Vertica and other alternatives in POCs (Proofs of Concept)/benchmarks. But I also hear that ParAccel’s installation complexity continues to be a POC problem.

New technology in old categories of application will only be adopted as quickly as firms replace their apps. Yes, that’s a tautology too. Even so, it puts an upper bound on, for example, the speed with which on-premises applications will be replaced by cloud alternatives.

SAP HANA is not yet a serious OLTP (OnLine Transaction Processing) DBMS. Yes,

HANA has in some form been under development for a long time; its major antecedent is BI Accelerator, which shipped back in 2006.
RAM-centric processing makes sense.
HANA has a cool-sounding feature list.
SAP claims lots of HANA sales, and not just in conjunction with a few new SAP apps that require HANA to run.

But the stories of HANA sales and deployment momentum sure seem concentrated on analytic use cases. And by the way — even among analytic DBMS vendors, I don’t hear much emphasis on competing vs. HANA.

Current BI trends reflect 1990s deja vu. The hottest business intelligence products and vendors are adopted by departments, on the strength of their snazzy interfaces and short adoption cycles.* That’s exactly how BI spread in the 1990s, only now the word “visualization” gets used more.

*A common phrase for that is land-and-expand.

And finally,

I’m not impressed that your future products will in some small ways be superior to what your competitors have had in production for over a year.

Comments on Gartner’s 2012 Magic Quadrant for Data Warehouse Database Management Systems — evaluations

Curt Monash — Tue, 05 Feb 2013 13:25:15 +0000

To my taste, the most glaring mis-rankings in the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management are that it is too positive on Kognitio and too negative on Infobright. Secondarily, it is too negative on HP Vertica, and too positive on ParAccel and Actian/VectorWise. So let’s consider those vendors first.

Gartner seems confused about Kognitio’s products and history alike.

Gartner calls Kognitio an “in-memory” DBMS, which is not accurate.
Gartner doesn’t remark on Kognitio’s worst-in-class* compression.
Gartner gives Kognitio oddly high marks for a late, me-too Hadoop integration strategy.
Gartner writes as if Kognitio’s next attempt at the US market will be the first one, which is not the case.
Gartner says that Kognitio pioneered data warehouse SaaS (Software as a Service), which actually has existed since the pre-relational 1970s.

Gartner is correct, however, to note that Kognitio doesn’t sell much stuff overall.

* non-existent

In the cases of HP Vertica, Infobright, ParAccel, and Actian/VectorWise, the 2012 Gartner Magic Quadrant for Data Warehouse Database Management’s facts are fairly accurate, but I dispute Gartner’s evaluation. When it comes to Vertica:

I think HP’s troubles are less relevant to HP Vertica than Gartner does.
In particular, Vertica’s lack of integration with Autonomy isn’t a big deal. Many relational DBMS vendors don’t even own a text search engine to not-integrate with, and the number of vendors with seriously effective analytic RDBMS/text search integration strategies is zero.
Gartner is correct to note that Vertica’s integration with the rest of HP, for example the hardware side, has been slow — but again, so what?
Gartner correctly praises Vertica’s analytic platform capabilities, but then seems to criticize Vertica’s capabilities in user-defined functions — notwithstanding that Vertica’s analytic platform capabilities are implemented via UDFs.
Gartner seems to criticize Vertica’s “volume credentials”, even though Vertica’s number of petabyte-scale analytic RDBMS customers may be second only to Teradata’s.

That said, I defer to Gartner’s opinion that HP Vertica’s sales momentum has disappointed, even if against higher expectations than one might have for vendors with 1/10 of Vertica’s installed base.

2 years ago, I simply said “What Gartner said in connection with Ingres is too inaccurate to deserve detailed attention.” This year’s Gartner Magic Quadrant for Data Warehouse Database Management isn’t that bad on the subject of Actian,* but it’s not great either. Writing mainly about Actian’s VectorWise, Gartner dings it for both features and bugginess, and correctly notes that VectorWise is only suitable for fairly small data warehouses. But Gartner gives VectorWise higher marks than Exasol even so. Gartner also writes that VectorWise has a “long tradition of having loyal supporters”, notwithstanding that VectorWise’s initial release was less than 3 years ago.

*Ingres’ new name, in honor of a 2011 pivot that seems to already have been deprecated

What the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management says about ParAccel isn’t too different on the facts from what I wrote in December, but Gartner is more enthused than I am. Basically:

ParAccel is playing catch-up in features and company stability, and Gartner somehow sees that as a strength.
Gartner dings ParAccel for a variety of product weaknesses.
Gartner is breathless about ParAccel being used on a MicroStrategy SaaS site. (Hint: Deals like that go to vendors willing to accept very low prices.)
Gartner is also impressed with ParAccel’s Amazon involvement. (Deals like that aren’t lucrative either, except insofar as Amazon bought some ParAccel stock.)

That Gartner ranks ParAccel ahead of HP Vertica baffles me. Perhaps Gartner views benchmarks as more significant than I do, or is otherwise judging ParAccel to have an important advantage in performance.

I also am in substantial agreement with the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management about Infobright facts, but this time I’m the one with the more favorable interpretation. I agree that Infobright is a bit limited in features, in the areas Gartner cites and in analytic platform capabilities as well. Even so, Infobright is far ahead of VectorWise in — also low-priced — sales, and ahead in product stability and features too. Yet Gartner gives Infobright vastly lower marks than Actian. I suspect that the essence of our disagreement is that Gartner sees Infobright’s focus on machine-generated data as something that “limits market expansion”, while I see machine-generated data as something that is by every measure* growing to be a majority of the whole.

*By raw volume that’s been true for a while. But if we adjust for value and so on, the crossover is arguably still a way off.

I’ll run through the other vendors cited in the 2012 Gartner Magic Quadrant for Data Warehouse Database Management more quickly, in approximate declining order of Gartner’s rankings.

Gartner loves Teradata, but has some concerns over TCO (Total Cost of Ownership). Makes sense to me.
Gartner is more impressed with Oracle’s technology than I am. I find it hard to take seriously a data warehouse RDBMS vendor that can’t deliver a true columnar storage option.
Gartner’s write-up of IBM gets lost in IBM’s vast sea of products. I sympathize.
Gartner’s discussion of SAP/Sybase IQ was overly brief. But given my own difficulties staying up to speed on my clients over there, I sympathize. Please stay tuned.
Gartner’s discussion of Microsoft gets overly caught up in “logical data warehouse” foofarah, but basically it makes sense. Pending some promised briefings, I’m more optimistic about Microsoft’s analytic DBMS offerings than I’ve been for a long time. Please stay tuned.
Gartner notes difficulties validating EMC Greenplum’s customer claims. I sympathize. Gartner also notes a bunch of product issues that make me wonder why EMC Greenplum’s overall rating isn’t even lower.
Gartner’s view of Exasol seems similar to mine.
I’d evaluate 1010data on the basis of its spreadsheet-like analytic tools, not its DBMS technology.
Gartner seems to have difficulty finding non-trivial “Strengths” for Calpont. I sympathize.
Gartner notes difficulties contacting SAND. I sympathize, since SAND’s senior management resigned en masse during the July, 2012 quarter. (Page 30 of that link.)

Amazon Redshift and its implications

Curt Monash — Sun, 09 Dec 2012 16:59:10 +0000

Merv Adrian and Doug Henschen both reported more details about Amazon Redshift than I intend to; see also the comments on Doug’s article. I did talk with Rick Glick of ParAccel a bit about the project, and he noted:

Amazon Redshift is missing parts of ParAccel, notably the extensibility framework.
ParAccel did some engineering to make its DBMS run better in the cloud.
Amazon did some engineering in the areas it knows better than ParAccel — cloud provisioning, cloud billing, and so on.

“We didn’t want to do the deal on those terms” comments from other companies suggest ParAccel’s main financial take from the deal is an already-reported venture investment.

The cloud-related engineering was mainly around communications, e.g. strengthening error detection/correction to make up for the lack of dedicated switches. In general, Rick seemed more positive on running in the (Amazon) cloud than analytic RDBMS vendors have been in the past.

So who should and will use Amazon Redshift? For starters, I’d say:

If data isn’t already in the Amazon cloud, getting it there remains a pain. Locating your analytic RDBMS on the same premises where the data is created makes life simpler.
Over 3 years ago, $20,000/terabyte was a great list price for purchasing a data warehouse appliance that required little administration. Imagine negotiated discounts and further declines from there. Even so, Amazon’s <$1K/terabyte/year is a low figure.
Amazon’s marketing suggests companies should put all of their data warehousing on Redshift. But in fact, that almost never happens even with ParAccel.
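To put the two price points on the same footing, here is a small calculation of how many years of rental add up to a one-time purchase. It treats $1,000/terabyte/year as an upper bound on the Amazon figure and ignores administration, hardware refresh, and data transfer costs; the 75% discount is a hypothetical value, not a reported one.

```python
def breakeven_years(purchase_per_tb, rental_per_tb_year, discount=0.0):
    """Years of rental that add up to the (discounted) one-time purchase price."""
    return purchase_per_tb * (1.0 - discount) / rental_per_tb_year


# Figures from the text: $20,000/TB list price, at most $1,000/TB/year rental.
at_list = breakeven_years(20_000, 1_000)
# Hypothetical 75% negotiated discount on the appliance.
discounted = breakeven_years(20_000, 1_000, discount=0.75)
```

On these assumptions the rental takes 20 years to reach the old list price, and still 5 years to reach a steeply discounted one.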

Also — if Amazon Redshift is your analytic RDBMS, what’s the rest of your analytic environment? I can think of three possibilities that could work pretty straightforwardly:

Business intelligence and just BI.
Statistics and just statistics.
Hadoop (i.e. Elastic MapReduce) plus a lot of hand-coding.

Anything else would seem hard to stitch together at this time.

Putting that together, I see three kinds of users for whom Amazon Redshift might make sense:

Web startups, whose data is all in the Amazon cloud anyway, and who need better analytic SQL performance than they can get from Hadoop.
Data mart outsourcers/data sellers, again probably startups, whose whole business is in the cloud.
Individual analysts with small budgets, or very small analytic groups within enterprises or other organizations.

All three of those are “traditional” markets for new-generation analytic DBMS and data warehouse appliances, except that those DBMS are rarely put into production in the cloud. But for the most part, vendors have moved upscale — enterprise users, analytic platform features, etc. So the biggest threat from Amazon Redshift is to markets that other vendors have somewhat left behind.

So how should and will the analytic RDBMS industry respond? My thoughts on that begin:

Doing nothing would be a poor choice.
They’re already open to having cheap or free low-end offerings — Vertica Community Edition, open-source Infobright, and so on.
Tweaking their systems to work well in the cloud becomes easier all the time, as cloud platforms mature.
A natural solution would be something like a Starter/Standard/Enterprise Edition split, with at least the Starter and Standard Editions being cloud-friendly.

Notes on some basic database terminology

Curt Monash — Tue, 07 Aug 2012 10:25:42 +0000

In a call Monday with a prominent company, I was told:

Teradata, Netezza, Greenplum and Vertica aren’t relational.
Teradata, Netezza, Greenplum and Vertica are all data warehouse appliances.

That, to put it mildly, is not accurate. So I shall try, yet again, to set the record straight.

In an industry where people often call a DBMS just a “database” — so that a database is something that manages a database! — one may wonder why I bother. Anyhow …

1. The products commonly known as Oracle, Exadata, DB2, Sybase, SQL Server, Teradata, Sybase IQ, Netezza, Vertica, Greenplum, Aster, Infobright, SAND, ParAccel, Exasol, Kognitio et al. all either are or incorporate relational database management systems, aka RDBMS or relational DBMS.

2. In principle, there can be difficulties in judging whether or not a DBMS is “relational”. In practice, those difficulties don’t arise — yet. Every significant DBMS still falls into one of two categories:

Relational:
- Was designed to do relational stuff* from the get-go, even if it now does other things too.
- Supports a lot of SQL.
Non-relational:
- Was designed primarily to do non-relational things.*
- Doesn’t support all that much SQL.

*I expect the distinction to get more confusing soon, at which point I’ll adopt terms more precise than “relational things” and “relational stuff”.

3. There are two chief kinds of relational DBMS:

RDBMS that are designed for, among other things, online transaction processing (OLTP). Examples include Oracle, DB2, SQL Server, Sybase ASE, PostgreSQL, and MySQL. It is reasonable to refer to these as general-purpose or OLTP RDBMS.*
RDBMS that are designed strictly for analytic uses. Examples include Sybase IQ, Vertica, Greenplum, Aster, Infobright, SAND, ParAccel, Exasol, Kognitio and the DBMS software inside systems from Teradata and Netezza. It is most accurate to refer to these as analytic RDBMS or just analytic DBMS (sometimes abbreviated ADBMS).

* “General-purpose” is usually a better term than “OLTP”; most OLTP DBMS can handle at least basic reporting, and the leading ones go well beyond that.

4. Some analytic RDBMS were designed to be columnar. Some were designed to be row-based. Multiple systems from both groups now offer both column- and row-based storage options. But they’re all equally relational.

And once again, I remind you that columnar storage and columnar compression are not the same thing.

5. An appliance can include a DBMS, and indeed exist for no purpose other than to run a DBMS; but a DBMS is not an appliance. At a minimum, a data warehouse appliance is a computing system (hardware, storage, operating system, etc.) with an analytic RDBMS preinstalled.

Occasionally somebody suggests that a “virtual appliance” doesn’t have to have hardware included, but they usually draw little attention.

However, reasonable people can disagree about pickier questions, such as:

Does appliance hardware have to be in any way purpose-built? I lean to a No, but I prefer those “appliance” stories that include an actual hardware advantage.
Does appliance hardware have to have custom silicon, or at least FPGAs (Field-Programmable Gate Arrays)? My answer is an emphatic No.
Does an appliance have to be super-easy to install and administer? I lean to a No — but two of the top appliance benefits are ease of deployment and administration.

For example, I think:

All hardware systems Teradata makes are appliances, even the ones it thinks aren’t.
Similarly, Oracle Exadata systems are appliances.
IBM Netezza is the classic line of data warehouse appliances.
IBM’s “Smart Analytic Systems” can justifiably be called appliances if IBM wishes — but IBM would be wise to save that word for its Netezza line.

Again, reasonable people can disagree — just so long as they don’t slap the label “appliance” onto software-only analytic RDBMS.

Approximate query results

Curt Monash — Thu, 12 Jul 2012 17:30:16 +0000

In theory:

A database query is a predicate.
A DBMS matches the data it manages against the predicate and sends back those records for which the predicate is true.
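
The "query as predicate" view can be sketched in a few lines. This is a toy illustration, not how any real DBMS executes queries; the `people` records and the `run_query` helper are made up for the example.

```python
from typing import Callable, Dict, Iterable, List

def run_query(records: Iterable[Dict], predicate: Callable[[Dict], bool]) -> List[Dict]:
    """Return exactly those records for which the predicate is true."""
    return [r for r in records if predicate(r)]

# Hypothetical data, for illustration only.
people = [
    {"name": "Ann", "city": "Boston", "age": 34},
    {"name": "Bob", "city": "Acton", "age": 51},
    {"name": "Cy", "city": "Boston", "age": 29},
]

# Roughly: SELECT * FROM people WHERE city = 'Boston' AND age > 30
result = run_query(people, lambda r: r["city"] == "Boston" and r["age"] > 30)
print(result)
```

Every record is either in the result or not, which is why exactness looks like the default.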

And so it would seem that query results always have to be exact. Even so, there are at least four different practical scenarios in which query results can reasonably be regarded as approximate, each associated with query languages that can supersede standard set-theoretic SQL.

Actually, there’s a fifth, and it’s a huge one — some fraction of your data is just plain wrong. But that’s not what this post is about.

First, some queries don’t have binary results, even in principle. Notably, text queries are answered via relevancy rankings, which fit badly into the relational model.

Second — and this can be combined with the first — you might want to generalize the query to look for partial matches. For example, Yarcdata suggested to me a scenario in which:

You do a SPARQL query.
You modify the query to accept results higher up in the taxonomy. (Which is likely to be possible, because where there’s SPARQL, there’s apt to be a taxonomy as well.) For example, if you really want to query on two people living in the house, you might extend the query to cover two people connected by any kind of address or building.

Similarly, if you’re looking for geographic proximity, it’s common to extend the allowed radius to fish for more results. Or one can walk up the hierarchy in a dimensional model.
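
A minimal sketch of the radius-widening idea, assuming great-circle (haversine) distance and an arbitrary ladder of radii. The function names, the radii and the sample coordinates are all illustrative assumptions, not taken from any product.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    earth_radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def search_with_widening(points, center, min_results, radii_km=(1, 5, 25, 100)):
    """Try each radius in turn and stop at the first one that yields enough
    matches. If none does, the matches for the largest radius are returned."""
    radius, matches = None, []
    for radius in radii_km:
        matches = [
            p for p in points
            if haversine_km(center[0], center[1], p["lat"], p["lon"]) <= radius
        ]
        if len(matches) >= min_results:
            break
    return radius, matches

# Approximate coordinates, for illustration only.
boston = (42.3601, -71.0589)
places = [
    {"name": "Cambridge", "lat": 42.3736, "lon": -71.1097},
    {"name": "Burlington", "lat": 42.5048, "lon": -71.1956},
    {"name": "Worcester", "lat": 42.2626, "lon": -71.8023},
]

radius, found = search_with_widening(places, boston, min_results=2)
print(radius, [p["name"] for p in found])
```

The caller gets back the radius that was actually used, so the approximation is at least visible rather than silent.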

Third, sometimes you just don’t have the data for any kind of precise answer at all. One adaptation I’ve mentioned before is to interpolate time series with synthetic data, and send back “precise” results based on that. In the same post I mentioned the Vertica “range join”, wherein users deliberately throw away part of their data — only storing the range it was in — and then join accordingly.
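
To make the range-join idea concrete, here is a generic sketch in which the stored side keeps only half-open ranges and exact values are matched to the range that contains them. It is not Vertica's implementation or syntax; the `range_join` helper and the sample ranges are assumptions for illustration.

```python
from bisect import bisect_right

def range_join(values, ranges):
    """Pair each value with the half-open [lo, hi) range containing it.

    `ranges` is a list of (lo, hi, payload) tuples, assumed non-overlapping.
    Values that fall in no range are dropped, as in an inner join.
    """
    ordered = sorted(ranges, key=lambda r: r[0])
    lows = [r[0] for r in ordered]
    joined = []
    for v in values:
        i = bisect_right(lows, v) - 1   # last range whose lower bound is <= v
        if i >= 0:
            lo, hi, payload = ordered[i]
            if lo <= v < hi:
                joined.append((v, payload))
    return joined

# Stored side: only the range each reading fell in, not the reading itself.
stored = [(0, 10, "low"), (10, 100, "medium"), (100, 1000, "high")]
joined = range_join([5, 10, 250, 5000], stored)
print(joined)
```

The answer is "precise" only down to the width of the range that was kept.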

As Donald Rumsfeld might have said — and would have done well to reflect upon — you go into decision-making with the data you have, not the data you wish you had.

Finally, sometimes there’s a precise answer in principle, but for performance reasons you accept an approximate one, at least to start with. Numerous companies have told me stories around this, including:

Infobright, whose “Rough Query” gives fast approximate results to a broad range of queries.
Metamarkets, which does fast cardinality estimates via HyperLogLog.
Aster Data, which was the first company to point out to me that median, decile, quintile, and so on calculations are a lot faster in a shared-nothing setting if you’re willing to settle for approximate results.
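
One way to see why approximate medians are cheap in a shared-nothing setting: each node can ship a small histogram instead of its raw values, and a coordinator merges the counts. The sketch below assumes fixed-width buckets over a known value range; it is a generic technique, not a description of how Aster Data or any other vendor does it.

```python
def local_histogram(values, lo, hi, buckets):
    """Per-node step: count values into fixed-width buckets over [lo, hi].
    Assumes every value lies within [lo, hi]."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        idx = min(int((v - lo) / width), buckets - 1)  # clamp v == hi into last bucket
        counts[idx] += 1
    return counts

def approx_quantile(histograms, lo, hi, q):
    """Coordinator step: merge per-node histograms and interpolate inside
    the bucket where the cumulative count reaches q * total."""
    buckets = len(histograms[0])
    merged = [sum(h[i] for h in histograms) for i in range(buckets)]
    target = q * sum(merged)
    width = (hi - lo) / buckets
    running = 0
    for i, count in enumerate(merged):
        if count and running + count >= target:
            frac = (target - running) / count
            return lo + (i + frac) * width
        running += count
    return hi

# Three hypothetical nodes, each holding a slice of the data.
nodes = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]
hists = [local_histogram(n, 0, 300, 30) for n in nodes]
median_estimate = approx_quantile(hists, 0, 300, 0.5)
print(median_estimate)
```

Only 30 counts per node cross the network here, and the error is bounded by one bucket width. An exact median would instead require moving or repeatedly scanning the underlying values.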

The latter two categories led me to ask vendors how customers actually make use of their exotic SQL capabilities. Answers boiled down to:

(Always) Well, there’s a lot of custom coding.
(Sometimes) We’re working with partner BI vendors to make direct use of the capabilities, but that’s not done yet, so it’s too early to talk about any details.

Perhaps the answers will never get much better; it’s tough to get packaged software vendors to support vendor-specific SQL, unless the vendor is Oracle. Even so, we’re seeing ever more ways in which conventional SQL DBMS are being superseded by data management and analytic alternatives.

Our clients, and where they are located

Curt Monash — Sat, 31 Mar 2012 20:36:32 +0000

From time to time, I disclose our vendor client lists. Another iteration is below, the first since a little over a year ago. To be clear:

This is a list of Monash Advantage members.
All our vendor clients are Monash Advantage members, unless …
… we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
We do not usually disclose our user clients.
We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
Excluded from this round of disclosure is one vendor I have never written about.
Included in this round of disclosure is one client paying for services partly in stock. All our other clients are cash-only.

For reasons explained below, I’ll group the clients geographically. Obviously, companies often have multiple locations, but this is approximately how it works from the standpoint of their interactions with me.

City of San Francisco

KXEN
Metamarkets
PivotLink
salesforce.com
Splunk
WibiData

Other San Francisco area

10gen
ClearStory
Cloudera
Couchbase
DataStax
Hortonworks
MarketShare
MarkLogic
Schooner
Sybase, an SAP company
VMware
Yarcdata, a division of Cray

Boston and Cambridge

Akiban
Cloudant
Hadapt
Vertica, an HP company

Other Boston area

Netezza, an IBM company
StreamBase

Everywhere else

CodeFutures
Infobright
SAND
Syncsort
Tableau
Teradata

For most of the companies listed above, you can find coverage here, and specifically a blog category in the list on the right. The exceptions, for now, are:

Cloudant
MarketShare
Metamarkets
PivotLink
VMware
Yarcdata

The main reason I threw in the geographical notes is to support the idea that there’s a real suburb-to-urban shift in the startup tech industry. Mike Arrington made the point recently about the San Francisco area, primarily with respect to the mass/consumer tech areas he focuses on, and of course it’s echoed by the rise of the New York City tech sector. My point is to add that it’s also true for system and enterprise technology, at least in the areas I cover.

In particular, the re-urbanization of the Boston-area software industry is striking:

Akiban and Cloudant are in the same office complex in the city of Boston. I was surprised to find even one tech startup in the city of Boston proper, but it seems there is a two-building complex packed with them.
Hadapt moved from Connecticut to the Kendall Square area of Cambridge. Edit: Actually, see the comments below.
Vertica moved from Burlington to the Alewife area of Cambridge. That’s (evidently deliberately) at the boundary of what might be regarded as the “urban” and “suburban” parts of metro Boston.

The cluster in the city of San Francisco — which also half-includes Cloudera — is relatively new as well.