Data warehousing
Analysis of issues in data warehousing, with extensive coverage of database management systems and data warehouse appliances that are optimized to query large volumes of data.
Data warehouse appliance hardware strategies
Recently, I’ve done extensive research into the hardware strategies of computing appliance vendors, across multiple functional areas. Data warehousing, firewall/unified threat management, antispam, data integration – you name it, I talked to them. Of course, each vendor has a unique twist. But some architectural groupings definitely emerged.
The most common approaches seem to be:
Type 1: Custom assembly from off-the-shelf parts. In this model, the only unusual (but still off-the-shelf) parts are usually in the area of network acceleration (or occasionally encryption). Also, the box may be balanced differently than standard systems, in terms of compute power and/or reliability.
Type 2 (Virtual): We don’t need no stinkin’ custom hardware. In this model, the only “appliancy” features are in the areas of easy deployment, custom operating systems, and/or preconfigured hardware.
And of course there are also appliances of Type 0: Custom hardware including proprietary ASICs or FPGAs.
Different markets had different emphases; e.g., firewall appliances are typically Type 1, while antispam devices cluster in Type 2. But the data warehouse appliance market is highly diverse, which maybe shouldn’t be a surprise. After all, the revenue market leader is non-appliance software vendor Oracle, while noisy upstart Netezza is famous for its FPGA.
Categories: Data warehouse appliances, Data warehousing, DATAllegro, Greenplum, IBM and DB2, Kognitio, Netezza, Teradata | 8 Comments |
And then there were two: DATAllegro seems to be going with standard hardware
A while ago – for example, in a comment dated July 9, 2006 – CEO Stuart Frost of DATAllegro hinted that the company might port its software to commodity hardware before long. If this user story is to be believed, that has now happened. (Specific quote: “the Datallegro system is based on Dell and EMC hardware …”) Officially, the company is doing a Sgt. Schultz on the subject. But the evidence is pretty clear.
Categories: Data warehouse appliances, Data warehousing, DATAllegro | 3 Comments |
Arguments AGAINST data warehouse appliances
Data warehouse appliance opponents like to argue that history is conclusively on their side. Database machine maker Britton-Lee, eventually bought by Teradata, fizzled. LISP machines were a spectacular failure. Rational Software’s origins as a special-purpose Ada machine maker had to be renounced before the company could succeed.
But the true story is more mixed. Teradata continues to this day as a major data warehouse technology player, and as far as I’m concerned Teradata indeed makes appliances. If we look further than the applications stack, we find that appliances actually occupy a large and growing share of the computing market. So a persuasive anti-appliance argument has to do more than just invoke the names of Britton-Lee and Symbolics.
I just ran across an article by MIT professor Samuel Madden that attempts to make such a case. And his MIT colleague Mike Stonebraker made similar arguments to me a few days ago. They are not wholly unbiased; indeed, both are involved in Vertica Systems. With that caveat, they have an interesting three-part argument:
Who’s who in columnar relational database management systems
The best known columnar RDBMS is surely Sybase’s IQ Accelerator, evolved from a product acquired in the mid-1990s. Problem – it doesn’t have a shared-nothing architecture of the sort needed to exploit grid/blade technology. Whoops. The other recognized player is SAND, but I don’t know a lot about them. Based on their website, it would seem that grids and compression play a big part in their story. Less established but pretty interesting is Kognitio, who are just beginning to make marketing noise outside the UK. SAP’s BI Accelerator is also a compressed columnar system, but operates entirely in-memory and hence is limited in possible database size. Mike Stonebraker’s startup Vertica is of course the new kid on the block, and there are other columnar startups as well whose names currently escape me.
Categories: Data warehousing, Investment research and trading, Kognitio, SAP AG, TransRelational | 3 Comments |
Are row-oriented RDBMS obsolete?
If Mike Stonebraker is to be believed, the era of columnar data stores is upon us.
Whether or not you buy completely into Mike’s claims, there certainly are cool ideas in his latest columnar offering, from startup Vertica Systems. The Vertica corporate site offers little detail, but Mike tells me that the product’s architecture closely resembles that of C-Store, which is described in this November, 2005 paper.
The core ideas behind Vertica’s product are as follows. Read more
Introduction to Kognitio WX-2
Kognitio called me for a briefing this morning on their WX-2 product. Technical highlights included:
- Their core technology is MPP/shared-nothing data warehousing.
- Unlike most other vendors (but like Greenplum), they are available software-only.
- Like DATAllegro and Netezza, they have no global indexing.
- Unlike the other MPP players, they don’t hash partition the data and lead with hash joins. Rather, they have local compressed bitmap indices on every node.
- Similarly, they distribute data utterly randomly and have no concept of range partitioning whatsoever.
- Probably for that reason, WX-2 reads data in small 32K blocks. This forfeits the benefit of sequential reads, unless David Aldridge is correct that Linux can take care of that on its own.
- They seem more chip-heavy than DATAllegro and Netezza. A dual-core Opteron blade with 16 or 32 gigabytes of RAM talks to 144, 288, or in some cases 600 gigabytes of disk (before mirroring).
- They position themselves somewhat as being a memory-centric product supplier. While I suspect this is exaggerated, it probably indicates that they’ve put some work into managing RAM as well as disk.
Much like the other “new” MPP data warehouse vendors, Kognitio claims to never have knowingly been outbenchmarked, whether on performance or on TCO factors such as ease of installation.
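To make the architectural contrast in the highlights concrete, here is a minimal Python sketch of random row distribution combined with a local bitmap index on each node. It is an illustration under stated assumptions, not Kognitio's implementation: the function names, the `region` column, and the use of plain Python integers as uncompressed bitmaps are all hypothetical, and the compression WX-2 applies to its bitmaps is omitted.

```python
import random

def distribute_randomly(rows, n_nodes, seed=0):
    """Assign each row to a node at random (no hash or range partitioning)."""
    rng = random.Random(seed)
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[rng.randrange(n_nodes)].append(row)
    return nodes

def build_bitmap_index(local_rows, column):
    """Build one bitmap per distinct value; bit i is set if local row i has that value."""
    index = {}
    for pos, row in enumerate(local_rows):
        index[row[column]] = index.get(row[column], 0) | (1 << pos)
    return index

def lookup(nodes, indexes, value):
    """Fan the lookup out to every node, since any node may hold matching rows."""
    matches = []
    for local_rows, index in zip(nodes, indexes):
        bits = index.get(value, 0)
        pos = 0
        while bits:
            if bits & 1:
                matches.append(local_rows[pos])
            bits >>= 1
            pos += 1
    return matches

# Hypothetical sample data: 12 rows cycling through three regions.
rows = [{"id": i, "region": ("EU", "US", "APAC")[i % 3]} for i in range(12)]
nodes = distribute_randomly(rows, n_nodes=3)
indexes = [build_bitmap_index(local, "region") for local in nodes]
eu_rows = lookup(nodes, indexes, "EU")
```

Because placement is random, no node can be ruled out in advance for a given value; each node's local index is what keeps the per-node work small.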
Categories: Data warehouse appliances, Data warehousing, Greenplum, Kognitio, Memory-centric data management | 11 Comments |
SAS Intelligence Storage
SAS has its own data store, called SAS Intelligence Storage. It’s a relational system running on SMP boxes, whose unique feature is that it has fixed-length records and hence is a perfect array, for speedy lookup. This is highly analogous to classical MOLAP systems. However, SAS reports that customers store up to several hundred terabytes of data in SAS Intelligence Storage, which is definitely not very analogous to what goes on in the MOLAP world.
It sounds as if the product is optimized for data mining and generic OLAP alike. Indeed, SAS Intelligence Storage is used to power both SAS’s data mining and other advanced analytics, and also its more conventional BI suite.
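The "perfect array" point about fixed-length records can be sketched in a few lines of Python. This is only an illustration of the general technique, not SAS's storage format: the record layout (an integer id, a float amount, and an 8-byte name) and the function names are assumptions made up for the example.

```python
import struct

# Hypothetical layout: 4-byte int id, 8-byte float amount, 8-byte padded name.
# "<" disables alignment padding, so every record is exactly 20 bytes.
RECORD = struct.Struct("<id8s")

def pack_table(rows):
    """Concatenate fixed-length records into one contiguous byte string."""
    return b"".join(
        RECORD.pack(rid, amount, name.encode("ascii")) for rid, amount, name in rows
    )

def fetch(table, n):
    """Return record n by computing its byte offset directly, with no scan or index."""
    rid, amount, raw_name = RECORD.unpack_from(table, n * RECORD.size)
    return rid, amount, raw_name.rstrip(b"\x00").decode("ascii")

rows = [(1, 9.5, "alice"), (2, 12.0, "bob"), (3, 7.25, "carol")]
table = pack_table(rows)
third = fetch(table, 2)
```

Since every record has the same size, the position of record n is simply n times the record size, which is the same addressing arithmetic a MOLAP array relies on.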
Categories: Data warehousing, MOLAP, SAS Institute | 4 Comments |
Data mining is driving much of data warehousing
Until I did all this recent research on data warehousing, I didn’t realize just how big a role data mining plays in driving the whole thing. Basically, there are three things you can do with a data warehouse – classical BI, “operational” BI, and data mining. If we’re talking about long-running queries, that’s not operational BI, and it’s not all of classical BI either. The rest is data mining. Indeed, if you think back to what you know of the customer bases at data warehouse appliance vendors Netezza and DATAllegro, there are a lot of credit-reporting-data types of users – i.e., data miners. And it’s hard to talk about uses for those appliances very long without SAS extracts and the like coming up.
Categories: Data warehouse appliances, Data warehousing, DATAllegro, Netezza, Oracle, Predictive modeling and advanced analytics | 8 Comments |
Philip Howard on Netezza
Philip Howard has published a write-up based on Netezza’s user conference, entertainingly mixing fantasy and reality in his usual manner. Notably, he confuses Netezza’s zone maps, which are basically a very limited form of range partitioning, with something that can substitute for real indices. And the mind boggles at his implication that Netezza has neglected the FPGA in its overall market messaging. More understandable is his regurgitation of Netezza’s claims about heat and power, but although I must confess to not having checked either side’s arithmetic, I find Stuart Frost’s rebuttal in the comments to this thread pretty interesting.
But little nits like that aside — yeah, he went to the same conference I did. 😉
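For readers wondering why a zone map is not a real index, here is a generic Python sketch of min/max block pruning. It is not Netezza's implementation; the block contents, function names, and query range are assumptions chosen for illustration.

```python
def build_zone_map(blocks):
    """Record only the minimum and maximum value stored in each block."""
    return [(min(block), max(block)) for block in blocks]

def scan_range(blocks, zone_map, lo, hi):
    """Skip blocks whose [min, max] cannot overlap [lo, hi]; fully scan the rest."""
    hits, blocks_read = [], 0
    for block, (bmin, bmax) in zip(blocks, zone_map):
        if bmax < lo or bmin > hi:
            continue  # the zone map proves this block holds no matching value
        blocks_read += 1
        hits.extend(v for v in block if lo <= v <= hi)
    return hits, blocks_read

# Three well-clustered blocks and one whose values are scattered.
blocks = [[1, 3, 5], [10, 12, 14], [20, 22, 24], [4, 21, 30]]
zone_map = build_zone_map(blocks)
hits, blocks_read = scan_range(blocks, zone_map, 11, 13)
```

In this example two of the four blocks are skipped, but the scattered block is still read in full even though it contains no match. Pruning helps only when data happens to be clustered on the queried column, and every surviving block is scanned rather than probed, which is the gap between a zone map and an index.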
Categories: Data warehouse appliances, Data warehousing, Netezza | Leave a Comment |
Vendor segmentation for data warehouse DBMS
February, 2011 edit: I’ve now commented on Gartner’s 2010 Data Warehouse Database Management System Magic Quadrant as well.
Several vendors are offering links to Gartner’s new Magic Quadrant report on data warehouse DBMS. (Edit: This is now a much better link to the 2006 MQ.) Somewhat atypically for Gartner, there’s a strict hierarchy among most of the vendors, with Teradata > IBM > Oracle > Microsoft > Sybase > Kognitio > MySQL > SAND, in each case on both axes of the matrix. The only two exceptions are Netezza and DATAllegro, which are depicted as outvisioning Microsoft somewhat even as they trail both Microsoft and Sybase in execution.
Gartner Magic Quadrants tend to annoy me, and I’m not going to critique the rankings in detail. But I do think this particular MQ is helpful in framing a vendor segmentation, namely:
- Big full-spectrum MPP/shared-nothing vendors: Teradata and IBM.
- MPP/shared-nothing appliance upstarts: Netezza and DATAllegro.
- Big SMP/shared-everything vendors who also are apt to be your OLTP incumbent, and who want to integrate your software stack soup-to-nuts: Oracle and Microsoft.
- Niche vendors: Pretty much everybody else.