Analytic technologies
Discussion of technologies related to information query and analysis. Related subjects include:
- Business intelligence
- Data warehousing
- (in Text Technologies) Text mining
- (in The Monash Report) Data mining
- (in The Monash Report) General issues in analytic technology
Three happy 100 terabyte-plus customers for DATAllegro
Over on my Network World blog, I asked the question “So who are DATAllegro’s actual current customers?” As regular readers know, that’s a fairly hard question to answer. TEOCO is widely known as DATAllegro’s flagship reference, but after that the list gets thin in a hurry.
As a by-the-by to other discussions, DATAllegro CEO Stuart Frost undertook to respond in part himself. Specifically, he gave me the names of two other happy customers that are or imminently will be running DATAllegro against 100+ terabytes of user data. Read more
| Categories: DATAllegro, DBMS product categories, Data warehouse appliances, Data warehousing | Leave a Comment |
Exasol technical briefing
It took 5 ½ months after my non-technical introduction, but I finally got a briefing from Exasol’s technical folks (specifically, the very helpful Mathias Golombek and Carsten Weidmann). Here are some highlights: Read more
| Categories: Analytic technologies, Columnar database management, Data warehousing, Exasol, In-memory DBMS, Memory-centric data management | Leave a Comment |
Patent nonsense in the data warehouse DBMS market
There are two recent patent lawsuits in the data warehouse DBMS market. In one, Sybase is suing Vertica. In the other, an individual named Cary Jardin (techie founder of XPrime, a sort of predecessor company to ParAccel) is suing DATAllegro. Naturally, there’s press coverage of the DATAllegro case, due in part to its surely non-coincidental timing right after the Microsoft acquisition was announced and in part to a vigorous PR campaign around it. And the Sybase case so excited a troll who calls himself Bill Walters that he posted identical references to it on about 12 different threads in this blog, as well as to a variety of Vertica-related articles in the online trade press. But I think it’s very unlikely that either of these cases turns out to matter much. Read more
| Categories: Columnar database management, DATAllegro, Data warehousing, Database compression, Sybase, Vertica Systems | 4 Comments |
Compare/contrast of Vertica, ParAccel, and Exasol
I talked with Exasol today – at 5:00 am! — and of course want to blog about it. For clarity, I’d like to start by comparing/contrasting the fundamental data structures at Vertica, ParAccel, and Exasol. And it feels like that should be a separate post. So here goes.
- Exasol, Vertica, and ParAccel all store data in columnar formats.
- Exasol, Vertica, and ParAccel all compress data heavily.
- Exasol, Vertica, and ParAccel all (perhaps to varying extents) operate on in-memory data in compressed formats.
- ParAccel and Exasol write data to what amounts to the in-memory part of their basic data structures; the data then gets persisted to disk. Vertica, however, has a separate in-memory data structure to accept data and write it to disk.
- Vertica is a disk-centric system that doesn’t rely on there being a lot of RAM.
- ParAccel can be described that way too; however, in some cases (including on the TPC-H benchmarks), ParAccel recommends loading all your data into RAM for maximum performance.
- Exasol is totally optimized for the assumption that queries will be run against data that had already been previously loaded into RAM.
Beyond the above, I plan to discuss in a separate post how Exasol does MPP shared-nothing software-only columnar data warehouse database management differently than Vertica and ParAccel do shared-nothing software-only columnar data warehouse database management.
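To make "operating on in-memory data in compressed formats" concrete, here is a minimal sketch. It assumes run-length encoding purely for illustration; nothing above says which encodings Exasol, Vertica, or ParAccel actually use, and the column, function names, and data are hypothetical. The point is only that a predicate can be answered from the compressed form without expanding it.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def count_equal(runs, target):
    """Count matching rows from run lengths alone, never expanding the column."""
    return sum(length for value, length in runs if value == target)


# Hypothetical sorted "region" column, stored by itself as a column store would.
region = ["EU"] * 4 + ["US"] * 3 + ["APAC"] * 2
runs = rle_encode(region)

print(runs)
print(count_equal(runs, "US"))
```

The count touches three pairs rather than nine values; on a real sorted column with long runs, the gap would be much larger.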
| Categories: Columnar database management, Data warehousing, Database compression, Exasol, ParAccel, Vertica Systems | 9 Comments |
Netezza update
In my usual dual role, I called Phil Francisco of Netezza to lay some post-Microsoft/DATAllegro consulting on him late on a Friday night, and then took the opportunity of being on the phone with him to get a general Netezza update. Netezza’s July quarter just ended, so they’re still in their quiet period, and I didn’t press him for a lot of numerical detail. More generally, I didn’t find out a lot that wasn’t already covered in my May Netezza update. But notwithstanding all those disclaimers, it was still a pretty interesting chat.
My strongest takeaway was that Netezza sees concurrency as a significant competitive advantage. This is reflected in POCs, where Netezza guides prospects to simulate real-life mixed workloads. It also reflects the Netezza customer base. Phil says Netezza has “busy” warehouses with up to 80 terabytes of user data, with lots of busy ones in the single-digit to 20ish terabyte range. Multiple Netezza references have 100s of concurrent users, and the 1000 mark has been crossed.
Speaking of concurrency, Phil had a clear opinion of the typical Sybase IQ installation — a small reporting mart, supporting hundreds or thousands of users, but probably not a lot of ad hoc query. On the other hand, he recalls outright competing against Sybase only twice in the past year.
The vendor Netezza does see the most is, no surprise, Oracle. He put Oracle at 60ish percent, with most of the rest divided among Teradata and DB2 (only a few Microsoft SQL Server). Among the other new data warehouse specialists, Greenplum comes up the most often. (There was some confusion between “competitor” and “incumbent” in our discussion, and the sample sizes are small anyway, so fine levels of detail shouldn’t be taken too seriously.)
On the advanced analytics side, it sounds as if SAS integration akin to Teradata’s will happen sooner than any significant integration of Netezza’s own NuTech acquisition.
| Categories: Data warehouse appliances, Data warehousing, Greenplum, Netezza, Sybase | 2 Comments |
Database compression coming to the fore
I’ve posted extensively about data-warehouse-focused DBMS’ compression, which can be a major part of their value proposition. Most notable, perhaps, is a short paper Mike Stonebraker wrote for this blog — before he and his fellow researchers started their own blog — on column-stores’ advantages in compression over row stores. Compression has long been a big part of the DATAllegro story, while Netezza got into the compression game just recently. Part of Teradata’s pricing disadvantage may stem from weak compression results. And so on.
Well, the general-purpose DBMS vendors are working busily at compression too. Microsoft SQL Server 2008 exploits compression in several ways (basic data storage, replication/log shipping, backup). And Oracle offers compression too, as per this extensive writeup by Don Burleson.
If I had to sum up what we do and don’t know about database compression, I guess I’d start with this:
- Columnar DBMS really do get substantially better compression than row-based database systems. The most likely reasons are:
- More elements of a column fit into a single block, so all compression schemes work better.
- More compression schemes wind up getting used (e.g., delta compression as well as the token/dictionary compression that row-based systems use too).
- Data-warehouse-based row stores seem to do better at compression than general-purpose DBMS. The reasons most likely are some combination of:
- They’re trying harder.
- They use larger block sizes.
- Notwithstanding these reasonable-sounding generalities, there’s a lot of variation in compression success among otherwise comparable products.
Compression is one of the most important features a database management system can have, since it creates large savings in storage and sometimes non-trivial gains in performance as well. Hence, it should be a key item in any DBMS purchase decision.
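The two schemes named in the list above, token/dictionary compression and delta compression, can be sketched as follows. This is a toy illustration under stated assumptions, not any vendor's implementation: the function names and sample columns are hypothetical, and real systems pack the resulting small integers into far fewer bits than Python lists do.

```python
def dictionary_encode(values):
    """Token/dictionary compression: replace each value with a small integer code."""
    dictionary = {}
    codes = []
    for value in values:
        # A value seen for the first time gets the next unused code.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes


def delta_encode(values):
    """Delta compression: keep the first value, then only successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values by running a cumulative sum over the deltas."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Hypothetical columns: a low-cardinality status and a slowly increasing timestamp.
status = ["shipped", "shipped", "pending", "shipped"]
timestamps = [1000, 1003, 1004, 1010]

dictionary, codes = dictionary_encode(status)
deltas = delta_encode(timestamps)
print(dictionary, codes)
print(deltas)
```

Both schemes benefit from seeing many values of one column together, which is one way to read the "more elements of a column fit into a single block" point above.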
Column stores vs. vertically-partitioned row stores
Daniel Abadi and Sam Madden followed up their post on column stores vs. fully-indexed row stores with one about column stores vs. vertically-partitioned row stores. Once again, the apparently useful way to set up the row-store database backfired badly.* Read more
Extensive QlikView coverage from a big fan and reseller
David Raab is a reseller and great fan of QlikTech’s QlikView. His recent lengthy post about the product (I hesitate to call it “detailed” only because he rightly complains that QlikTech is in fact stingy with technical detail) is positive enough to have been recommended by the company itself. Specifically, it was cited in the comment thread to my recent post on QlikTech, where David himself also addressed some of my questions.
But of course, no technology is perfect, not even one as great as David thinks QlikView is. Read more
QlikTech/QlikView update
I talked with Anthony Deighton of memory-centric BI vendor QlikTech for an hour and a half this afternoon. QlikTech is quite the success story, with disclosed 2007 revenue of $80 million, up 80% year over year, and confidential year-to-date 2008 figures that do not disappoint as a follow-on. And a look at QlikTech’s QlikView product makes it easy to understand how this success might have come about.
Let me start by reviewing QlikTech’s technology, as best I understand it.
| Categories: Analytic technologies, Business intelligence, Columnar database management, Database compression, Memory-centric data management, QlikTech and QlikView | 14 Comments |
Further thoughts on DATAllegro/Microsoft
My first, biggest thought about DATAllegro’s acquisition by Microsoft is “Why the ____ did it have to happen while I was trying to relax on my annual Cayman vacation???” Not coincidentally, I don’t plan to neatly cross-link all my posts and so on about DATAllegro/Microsoft until I get back to Acton this weekend.
One linking screwup is that I previously forgot to mention that — in addition to the numerous posts here — I also made several DATAllegro/Microsoft-related posts on my Network World blog A World of Bytes. They include: Read more
| Categories: Analytic technologies, DATAllegro, Data warehousing, Microsoft and SQL*Server | 7 Comments |
Other early coverage of Microsoft/DATAllegro
- Here’s the official press release on DATAllegro’s site, and Microsoft’s.
- Doug Henschen of Intelligent Enterprise has a good article. He got quotes from Microsoft claiming that SQL Server on its own would be able to handle 10s of terabytes of data in the next release, but DATAllegro was needed to get up to the 100s of terabytes. That said, the quotes don’t say whether that’s user data or total disk usage — the latter frankly seems more plausible.
- James Kobielus of Forrester has a long post on the Microsoft/DATAllegro deal, emphasizing product packaging issues and glossing over technological differentiators. (Edit: The post seems down as of Friday midday.)
- This is a few weeks old, but Kevin Closson is extremely skeptical of some of DATAllegro’s technical claims. (Not that it matters much if he’s right — more nodes = more throughput, no matter how much Oracle folks rant.)
- Eric Lai of Computerworld gets it right.
- Larry Dignan thinks the acquisition is part of an overall strong Microsoft enterprise push.
- William McKnight thinks Microsoft usually does a good job of integrating acquisitions.
- DATAllegro CEO Stuart Frost is happy.
- David Hunter thinks Microsoft will blithely continue with DATAllegro’s limited-hardware-support strategy. He’s almost certainly wrong.
- Philip Howard says almost nothing I agree with, although I can’t argue with the part:
Conversely, it’s bad news for Ingres, bad news for Oracle, bad news for IBM, bad news for Teradata and bad news for HP, all for obvious reasons. As for the other appliance vendors: they will not be too happy either. In particular, we now have to consider who can survive on their own, who might be acquired, who might do the acquiring, and who is going to disappear.
| Categories: DATAllegro, Data warehousing, Microsoft and SQL*Server | 14 Comments |
DATAllegro could provide Microsoft with a true enterprise data warehouse sooner than you think
Jim Ericson of DM Review emailed the excellent questions:
Does DATAllegro give MSFT full-service high end data warehousing capability? If not, what is missing?
My quick answers are:
- No.
- Two things:
- Hard-core multi-user concurrency.
- Support for more esoteric analytic tools and functionality
Both are largely a matter of product maturity, and as a young company DATAllegro isn’t quite there yet.
That said, integration with Microsoft SQL Server is apt to be a big help in addressing both issues.
The data warehouse DBMS consolidation has begun
There are, or soon will be, a number of strong players in the market for data warehouse specialty DBMS.
- Teradata continues to prosper, whatever one may think of its price points.
- Netezza is growing healthily.
- Microsoft is buying DATAllegro.
- Oracle needs to buy somebody in response.
- DB2 is a significant player too, although perhaps not quite as big as one might think.
- Sybase IQ can’t be counted out either.
That doesn’t leave a lot of room for other players.
| Categories: Data warehousing | 6 Comments |
How will Oracle save its data warehouse business?
By acquiring DATAllegro, Microsoft has seriously leapfrogged Oracle in data warehouse technology. All doubts about maturity and versatility notwithstanding, DATAllegro has a 10X or better size advantage (actually, I think it’s more like 20-40X) versus Oracle in warehouses its technology can straightforwardly handle. Oracle cannot afford to let this move go unanswered.
It’s of course possible that Oracle has been successfully developing comparable data warehouse technology internally. But it’s unlikely. Oracle hasn’t done anything that radical, internally and successfully, for about 15 years, RAC (Real Application Clusters) excepted. (I.e., since the object/relational extensibility framework started in Release 7.) So in all likelihood, the answer will come via acquisition. I think there are four candidates that make the most sense: Teradata, Vertica, ParAccel, and Greenplum. Kognitio (controlled by former Oracle honcho Geoff Squire) might be in the mix as well. Netezza is probably a non-starter because of its hardware-centric strategy.
Here’s why I’m emphasizing Teradata, Vertica, ParAccel, and Greenplum:
| Categories: Analytic technologies, DATAllegro, Data warehouse appliances, Data warehousing, Greenplum, Microsoft and SQL*Server, Oracle, ParAccel, Teradata, Vertica Systems | 11 Comments |
Microsoft is buying DATAllegro
I’ve long argued that:
- Oracle and Microsoft are doomed in the data warehouse market unless they acquire MPP/shared-nothing data warehouse DBMS and/or data warehouse appliances.
- DATAllegro is the ideal acquisition for either of them.
Microsoft has now validated my claim by agreeing to buy DATAllegro. As you probably know, we’ve been covering DATAllegro extensively, as per the links listed below.
Basic deal highlights include:
Long, confused overview of data warehouse DBMS vendors
Steven Swoyer has an article for Enterprise Systems that covers a lot of issues in data warehouse technology. Unfortunately, however, it doesn’t always cover them correctly. E.g., he seems to imply that columnar architectures aren’t relational. (Oops.) I wouldn’t put too much credence in the other market segmentations he posits either.
Some of his theses, however, are basically correct. E.g., he points out that demand for fast, cost-effective, (almost) unconstrained ad hoc queries keeps growing, and that much of the recent innovation is concerned with supplying them.
| Categories: Data warehousing | 1 Comment |
Another Cognos scandal in Massachusetts
I already posted about the Boston Globe’s reporting on a deal, since investigated and rescinded, to supply the whole Massachusetts state government with Cognos software.
The Globe now reports that a multimillion dollar deal the prior year with the Massachusetts Department of Education was equally dubious. Lowlights include: Read more
| Categories: Business intelligence, Cognos | Leave a Comment |
Declaration of Data Independence (humor)
The data warehouse appliance industry has a well-developed funny bone. Dataupia’s contribution is a Declaration of Data Independence, which begins:
When in the Course of an increasingly competitive global economy it becomes necessary for one data set to dissolve its connections to a constraining environment, the separate but inherently unequal station to which the Laws of Whose budget is larger prevails.
Related links:
- Cartoons from DATAllegro
- April Fool press release from Netezza
| Categories: Analytic technologies, Data warehouse appliances, Data warehousing, Dataupia | Leave a Comment |
Three cartoons from DATAllegro



Related links:
- Humor from Netezza
- Another gerbil-based solution
| Categories: Analytic technologies, DATAllegro, Data warehousing, Humor | 1 Comment |
The IRS data warehouse
According to a recent Eric Lai Computerworld story and a 2006 Sybase.com success story,
- The IRS has a data warehouse running on Sybase IQ, with 500 named users, called the CDW (Compliance Data Warehouse). (Computerworld)
- By some metric, it’s a 150 TB warehouse. (Computerworld)
- By some metric, they add 15-20 TB/year, with a 4 hour load time. (Computerworld)
- As of 2006, there were 20-25 TB of “input data”, with a “70% compression rate”. (Sybase)
I can’t entirely reconcile those numbers, but in any case the database sounds plenty big.
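As a rough check on the 2006 figures, the arithmetic can be written out. A "70% compression rate" is ambiguous, so this sketch computes both readings: 70% of the input saved, or 70% of the input remaining. The function name and the two interpretations are assumptions made here for illustration; neither source says which is meant.

```python
def stored_tb(input_tb, rate, rate_is_savings):
    """Stored size under one of two readings of an ambiguous 'compression rate'."""
    if rate_is_savings:
        return input_tb * (1 - rate)  # the rate is the fraction removed
    return input_tb * rate            # the rate is the fraction left over


for input_tb in (20, 25):
    saved = stored_tb(input_tb, 0.70, rate_is_savings=True)
    kept = stored_tb(input_tb, 0.70, rate_is_savings=False)
    print(f"{input_tb} TB input: {saved:.1f} TB if 70% is saved, "
          f"{kept:.1f} TB if 70% remains")
```

Under either reading the 2006 stored size comes out far below 150 TB, so the 150 TB figure presumably uses a different metric, reflects later growth, or both.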
Computerworld also said:
the research division also uses Microsoft Corp.’s SQL Server to store all of the metadata for the data warehouse and the rest of the agency. Managing and cleaning all of that metadata — 10,000 labels for 150 databases — is a huge task in itself,
| Categories: Analytic technologies, Data warehousing, Specific users, Sybase | 2 Comments |
Jerry Held on cloud data warehousing and how business intelligence will be transformed by it
Vertica Chairman Jerry Held has a pair of blog posts on analytics and data warehousing in the cloud. The first lays out a number of potential benefits and consequences of cloud data warehousing, under the heading of “Transforming BI”: Read more
| Categories: Analytic technologies, Business intelligence, Cloud computing, Data mart outsourcing, Data warehousing, Software as a Service (SaaS), Vertica Systems | 4 Comments |
Cognos/State of Massachusetts scandal
I assumed this had been reported widely outside of Massachusetts, but a web search suggests otherwise.
The story is this: Cognos sold 20,000 seats of software to Massachusetts for $13 million. There were technical violations of purchase procedures, and other aspects of the deal that didn’t pass the smell test. After IBM bought Cognos, the deal was rescinded, and is being rebid. Read more
| Categories: Analytic technologies, Business intelligence, Cognos, Pricing | 2 Comments |
Response to Rita Sallam of Oracle
In a comment thread on Seth Grimes’ blog, Rita Sallam of Oracle engaged in a passionate defense of her data warehousing software. I’d like to take it upon myself to respond to a few of her points here. Read more
| Categories: Data warehousing, Oracle | 6 Comments |
Oracle Optimized Warehouse Initiative
Oracle’s response to data warehouse appliances — and to IBM’s BCUs (Balanced Configuration Units) — so far is the Oracle Optimized Warehouse Initiative (OOW, not to be confused with Oracle Open World). A small amount of information about Oracle Optimized Warehouse can be found on Oracle’s website. Another small amount can be found in this recent long and breathless TDWI article, full of such brilliancies as attributing to the data warehouse appliance vendors the “claim that relational databases simply aren’t cut out for analytic workloads.” (Uh, what does he think they’re running — CODASYL DBMS?)
So far as I can tell, what Oracle Optimized Warehouse — much like IBM’s BCU — boils down to is the same old Oracle DBMS, but with recommended hardware configuration and tuning parameters. Thus, a lot of the hassle is taken out of ordering and installing an Oracle data warehouse, which is surely a good thing. But I doubt it does much to solve Oracle’s problems with price, price/performance, or the inevitable DBA hassles derived from a poorly-performing DBMS.
| Categories: Data warehouse appliances, Data warehousing, Oracle | 2 Comments |
Yahoo scales its web analytics database to petabyte range
Information Week has an article with details on what sounds like Yahoo’s core web analytics database. Highlights include:
- The Yahoo web analytics database is over 1 petabyte. They claim it will be in the 10s of petabytes by 2009.
- The Yahoo web analytics database is based on PostgreSQL. So much for MySQL fanboys’ claims of Yahoo validation for their beloved toy … uh, let me rephrase that. The highly-regarded MySQL, although doing a great job for some demanding and impressive applications at Yahoo, evidently wasn’t selected for this one in particular. OK. That’s much better now.
- But the Yahoo web analytics database doesn’t actually use PostgreSQL’s storage engine. Rather, Yahoo wrote something custom and columnar.
- Yahoo is processing 24 billion “events” per day. The article doesn’t clarify whether these are sent straight to the analytics store, or whether there’s an intermediate storage engine. Most likely the system fills blocks in RAM and then just appends them to the single persistent store. If commodity boxes occasionally crash and lose a few megs of data — well, in this application, that’s not a big deal at all.
- Yahoo thinks commercial column stores aren’t ready yet for more than 100 terabytes of data.
- Yahoo says it got great performance advantages from a custom system by optimizing for its specific application. I don’t know exactly what that would be, but I do know that database architectures for high-volume web analytics are still in pretty bad shape. In particular, there’s no good way yet to analyze the specific, variable-length paths users take through websites.
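The guess above, that the system fills blocks in RAM and then appends them to a single persistent store, can be sketched as below. This is a hypothetical illustration of that guess, not Yahoo's design: the class name, block size, and event format are all invented, and an in-memory `io.BytesIO` stands in for an append-only file.

```python
import io


class BlockAppender:
    """Buffer events in RAM and append them to a store one full block at a time."""

    def __init__(self, store, block_size):
        self.store = store            # any binary file-like object opened for appending
        self.block_size = block_size  # hypothetical block size in bytes
        self.buffer = bytearray()

    def add(self, event: bytes):
        self.buffer += event + b"\n"
        if len(self.buffer) >= self.block_size:
            self.flush()

    def flush(self):
        """Write the buffered block to the store and start a new one."""
        if self.buffer:
            self.store.write(bytes(self.buffer))
            self.buffer.clear()


store = io.BytesIO()  # stands in for the single persistent store
appender = BlockAppender(store, block_size=16)
for i in range(5):
    appender.add(f"event-{i}".encode())

# Whatever is still in the buffer is what a crash at this moment would lose.
unflushed = bytes(appender.buffer)
print(store.getvalue())
print(unflushed)
```

The trade-off is the one described above: appends are cheap and sequential, at the price of losing the current partial block if a box crashes.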
