May 8th, 2008 Curt Monash
Another TDWI conference approaches. Not coincidentally, I had another Vertica briefing. Primary subjects included some embargoed stuff, plus (at my instigation) outsourced data marts. But I also had the opportunity to follow up on a couple of points from February’s briefing, namely:
Vertica has about 35 paying customers. That doesn’t sound like a lot more than they had a quarter ago, but first quarters can be slow.
Vertica’s list price is $150K/terabyte of user data. That sounds very high versus the competition. On the other hand, if you do the math versus what they told me a few months ago — average initial selling price $250K or less, multi-terabyte sites — it’s obvious that discounting is rampant, so I wouldn’t actually assume that Vertica is a high-priced alternative.
Vertica does stress several reasons for thinking its TCO is competitive. First, with all that compression and performance, they think their hardware costs are very modest. Second, with the self-tuning, they think their DBA costs are modest too. Finally, they charge only for deployed data; the software that stores copies of data for development and test is free.
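The discounting inference above is easy to check with back-of-the-envelope arithmetic. The sketch below uses the $150K/TB list price and the $250K average initial selling price quoted in the post; the 3 TB deployment size is purely an assumed stand-in for "multi-terabyte," not a figure Vertica supplied.

```python
# Back-of-the-envelope check of the implied Vertica discount.
LIST_PRICE_PER_TB = 150_000   # from the post: list price per TB of user data
AVG_INITIAL_DEAL = 250_000    # from the post: average initial selling price (upper bound)
ASSUMED_TB = 3                # ASSUMPTION: stand-in for a "multi-terabyte" site


def implied_discount(deal_price: float, terabytes: float, list_per_tb: float) -> float:
    """Return the fractional discount off list price implied by a deal."""
    list_total = terabytes * list_per_tb
    return 1 - deal_price / list_total


effective_per_tb = AVG_INITIAL_DEAL / ASSUMED_TB
discount = implied_discount(AVG_INITIAL_DEAL, ASSUMED_TB, LIST_PRICE_PER_TB)
print(f"effective price: ${effective_per_tb:,.0f}/TB, implied discount: {discount:.0%}")
```

Under that assumption the effective price is roughly $83K/TB, about 44% off list. A larger assumed site implies a steeper discount.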
Posted in Analytics and analytic technologies, Columnar architectures, Data warehousing, Database compression, Vertica Systems | 4 Comments »
April 21st, 2008 Curt Monash
After months of leaks, Teradata has unveiled its new lines of data warehouse appliances, raising the total number either from 1 to 3 (my view) or 0 to 2 (what you believe if you think Teradata wasn’t previously an appliance vendor). Most significant is the new Teradata 2500 series, meant to compete directly with the smaller data warehouse specialists. Highlights include:
- An oddly precise estimated capacity of “6.12 terabytes”/node (user data). This estimate is based on 30% compression, which is low by industry standards, and surely explains part of the price umbrella the Teradata 2500 is offering other vendors.
- $125K/TB of user data. Obviously, list pricing and actual pricing aren’t the same thing, and many vendors don’t even bother to disclose official price lists. But the Teradata 2500 seems more expensive than most smaller-vendor alternatives.
- Scalability up to 24 nodes (>140 TB).
- Full Teradata application-facing functionality. Some of Teradata’s rivals are still working on getting all of their certifications with tier-1 and tier-2 business intelligence tools. Teradata has a rich application ecosystem.
- What will be controversial performance, until customer-benchmark trends clearly emerge.
The Teradata 2500 is coming out of the chute with two customers – a new-customer retailer buying a single cabinet (i.e., 6.12 TB), and an existing customer for whom fewer details seem available. So far as I can tell, the sales force has had the product since late January, although the first leaks I got incorrectly suggested the system would only scale to a limited number of nodes.
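For a sense of scale, the capacity and list-price figures quoted for the Teradata 2500 can be multiplied out. This is only a sketch of the arithmetic; it uses list prices, which (as noted) are not the same thing as actual prices, and the variable names are illustrative.

```python
# Figures from the post; everything derived below is simple multiplication.
TB_PER_NODE = 6.12           # estimated user data per node
MAX_NODES = 24               # stated scalability limit
LIST_PRICE_PER_TB = 125_000  # list price per TB of user data

max_user_tb = TB_PER_NODE * MAX_NODES
list_price_per_node = TB_PER_NODE * LIST_PRICE_PER_TB
list_price_full = max_user_tb * LIST_PRICE_PER_TB

print(f"maximum capacity:        {max_user_tb:.2f} TB")
print(f"list price, one node:    ${list_price_per_node:,.0f}")
print(f"list price, 24 nodes:    ${list_price_full:,.0f}")
```

The 24-node figure works out to about 146.9 TB, consistent with the ">140 TB" claim.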
Other products in the announcement included:
- The Teradata 5550, a routine annual upgrade to the Teradata 5500.
- The Teradata 550. This is a low-end, single-server SMP box introduced 9 or so months ago, originally meant for application development and testing. But some customers have been using it for deployment, and Teradata is now officially acknowledging that. It only scales to 2-3 TB of user data.
The Teradata 2500’s performance should be below the Teradata 5550’s for three reasons:
The same considerations apply to a comparison between the Teradata 2500 and the older Teradata 5500, but in that case they’re offset by a year of Moore’s Law benefit.
Read the rest of this entry »
Posted in Analytics and analytic technologies, Data warehouse appliances, Data warehousing, Database compression, Relational database management systems, Teradata | 1 Comment »
April 18th, 2008 Curt Monash
I chatted with Raj Cherabuddi and others on the Kickfire (formerly C2) team for over an hour on Monday, and now have a better sense of their story. There are some very basic questions I still don’t have answers to; I’ll fill those in when I can.
Highlights of what I have and haven’t figured out so far include:
- Kickfire’s technology has two main parts: a SQL co-processor chip and a MySQL storage engine.
- Kickfire makes a Type 0 appliance. If I understood correctly, it contains the chip, a couple of standard CPU cores, and 64 gigs of RAM. Or else it contains just the chip, and is meant to be hooked up to a 2U box with 64 gigs of RAM. I’m confused.
- The Kickfire box can handle up to 3 terabytes of user data. The disk required for that is 4-5 terabytes without redundancy, 2X with. Based on that formulation and other clues, I’m guessing Kickfire, unlike other appliance vendors, doesn’t build in storage itself.
- I don’t know whether the Kickfire chip is true custom silicon or an FPGA emulation.
- The essential idea of the chip is dataflow programming for SQL, with pipelining between operations. This eliminates the overhead of registers and context switching. I don’t know what the trade-offs are, if any.
- Kickfire’s database software is columnar, operating on compressed data even in RAM. In that, Kickfire’s story is most similar to Vertica’s, although I’m guessing Exasol may do something similar as well. Like Vertica, Kickfire uses multiple compression methods (they’re reluctant to give detail, but agreed it would be fair to say they use both something like dictionary/token and something like delta compression).
- Kickfire’s software is ACID-compliant. You can do incremental loads or trickle feeds. Bulk load speed is 100 GB/hour. Kickfire’s solution for the traditional problem of updating column stores is called “snapshots.” Without giving details, they position that as similar to the Vertica solution.
- Like other MySQL storage engines, Kickfire inherits whatever data connectivity, stored procedure capabilities, user-defined functions ability, etc. that MySQL has.
- Kickfire has no paying customers, but does have a slide showing many logos of “prospects and beta customers.”
- Kickfire has no MPP capabilities at this time, but says adding those is “on the roadmap” and will be “easy.”
- Kickfire submitted a 100 GB TPC-H result, in which it beat the previous leaders (Exasol, ParAccel, and Microsoft) on price-performance, and lagged only Exasol and ParAccel on absolute performance. Kickfire is extremely proud of this. Indeed, I don’t recall another vendor ascribing that much weight to a TPC result in the entire history of TPCs.* Kickfire seems unfazed by the fact that its result is for a system listed with a ship date 6 months in the future (I’m guessing that’s the latest the TPC will allow), while the other results are for systems available today.
*Somebody – perhaps adman extraordinaire Rick Bennett? — may want to check my memory on this, but I think Oracle’s famed “Gentlemen, start your snails” ad in the early 1990s was about PC World tests, not TPCs. Oracle also had an ad about WW1-style planes nosediving, but I don’t think those referenced TPCs either.
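Since Kickfire would only say it uses "something like" dictionary/token and delta compression, here is a generic sketch of those two textbook techniques. It is not Kickfire's implementation; the function names and sample data are made up for illustration.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each value with a small integer token that indexes a dictionary."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    tokens: List[int] = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        tokens.append(index[v])
    return dictionary, tokens


def delta_encode(values: List[int]) -> List[int]:
    """Store the first value, then each difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode by keeping a running sum."""
    out: List[int] = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Made-up sample columns: a low-cardinality string column and a slowly rising integer column.
states = ["MA", "NY", "MA", "MA", "CA"]
timestamps = [1000, 1003, 1004, 1010]

dictionary, tokens = dictionary_encode(states)
deltas = delta_encode(timestamps)
print(dictionary, tokens)  # ['MA', 'NY', 'CA'] [0, 1, 0, 0, 2]
print(deltas)              # [1000, 3, 1, 6]
```

Dictionary encoding pays off when a column has few distinct values; delta encoding pays off when neighboring values are close together, so the differences fit in fewer bits than the values themselves.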
Posted in Analytics and analytic technologies, Columnar architectures, Data warehouse appliances, Data warehousing, Database compression, Database theory and practice, Kickfire, Open source RDBMS, Relational database management systems | 3 Comments »
December 7th, 2007 Curt Monash
The proximate cause for today’s flurry of Netezza-related posts is that the company has finally rolled out its compression story. In a nutshell, Netezza has developed its own version of columnar delta compression, slated to ship May, 2008. It compresses 2-5X, with the factor sometimes going up into double digits. Netezza estimates this produces a 2-3X improvement in overall performance, with the core marketing claim being that performance will “double” from compression alone. Read the rest of this entry »
Posted in Analytics and analytic technologies, Data warehouse appliances, Data warehousing, Database compression, Database theory and practice, Netezza, Relational database management systems | No Comments »
October 28th, 2007 Curt Monash
An Infobright employee posted something quite reasonable-looking in response to my inaugural post about BrightHouse. Even so, Infobright asked if they could substitute something with a slightly different tone. I agreed. Here’s what they sent in.
Curt, thanks for the write-up and the opportunity to talk about our customer success stories. As you say, our customer story is definitely “more than zero.” We are addressing a number of critical customer issues with our unique approach to data warehousing.
Infobright currently has 5 customers - customers that have bucked the trend of throwing hardware at the problem. To be perfectly braggadocio about this, we have never lost a competitive proof of concept in which we’ve been engaged. This is accomplished with the horsepower of one box (though for redundancy customers may deploy multiple boxes with a load balancer).
Read the rest of this entry »
Posted in Analytics and analytic technologies, Columnar architectures, Data warehousing, Database compression, Infobright and Brighthouse, Relational database management systems | No Comments »
October 22nd, 2007 Curt Monash
To a first approximation, Infobright, maker of BrightHouse, is yet another data warehouse DBMS specialist with a columnar architecture, boasting great compression and running on commodity hardware, emphasizing easy set-up, simple administration, great price-performance, and hence generally low TCO. BrightHouse isn’t actually MPP yet, but Infobright confidently promises a generally available MPP version by the end of 2008. The company says that experience shows >10:1 compression of user data is realistic, i.e., an expansion ratio that’s fractional, and indeed below 1:10. Accordingly, despite the lack of shared-nothing parallelism, Infobright claims a sweet spot of 1-10 terabyte warehouses, and makes occasional references to figures up to 30 terabytes or so of user data.
BrightHouse is essentially a MySQL storage engine, and hence gets a lot of connectivity and BI tool support features from MySQL for “free.” Beyond that, Infobright’s core technical idea is to chop columns of data into 64K chunks, called data packs, and then store concise information about what’s in the packs. The more basic information is stored in data pack nodes,* one per data pack. If you’re familiar with Netezza zone maps, data pack nodes sound like zone maps on steroids. They store maximum values, minimum values, and (where meaningful) aggregates, and also encode information as to which intervals between the min and max values do or don’t contain actual data values. Read the rest of this entry »
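The data pack idea can be sketched in a few lines. What follows is a simplified, hypothetical model rather than Infobright's actual data structures: each pack keeps only min, max, sum, and count, and it omits the interval information described above. The pack size is shrunk from 64K to 4 so the example stays readable.

```python
from dataclasses import dataclass
from typing import List, Tuple

PACK_SIZE = 4  # tiny for illustration; the post describes 64K chunks


@dataclass
class PackNode:
    """Summary kept per pack: a simplified stand-in for a data pack node."""
    lo: int
    hi: int
    total: int
    count: int


def build_packs(column: List[int], pack_size: int = PACK_SIZE) -> Tuple[List[List[int]], List[PackNode]]:
    """Chop a column into packs and compute one summary node per pack."""
    packs = [column[i:i + pack_size] for i in range(0, len(column), pack_size)]
    nodes = [PackNode(min(p), max(p), sum(p), len(p)) for p in packs]
    return packs, nodes


def sum_where_greater(packs: List[List[int]], nodes: List[PackNode], threshold: int) -> Tuple[int, int]:
    """SUM(x) WHERE x > threshold, opening only packs the summaries cannot settle.

    Returns (sum, number of packs that had to be opened).
    """
    total, opened = 0, 0
    for pack, node in zip(packs, nodes):
        if node.hi <= threshold:
            continue                 # no value qualifies: skip the pack entirely
        if node.lo > threshold:
            total += node.total      # every value qualifies: answer from the summary
        else:
            opened += 1              # mixed pack: must look at the actual values
            total += sum(v for v in pack if v > threshold)
    return total, opened


column = [1, 2, 3, 4, 10, 20, 30, 40, 100, 200, 300, 400]
packs, nodes = build_packs(column)
result, opened = sum_where_greater(packs, nodes, threshold=15)
print(result, opened)  # 1090 1
```

Only one of the three packs is actually read; the other two are resolved from their summaries alone, which is the sense in which such summaries resemble zone maps.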
Posted in Analytics and analytic technologies, Columnar architectures, Data warehousing, Database compression, Infobright and Brighthouse, MySQL, Open source RDBMS, Relational database management systems | 1 Comment »
September 27th, 2007 Curt Monash
I’ve pointed out in the past that solid-state/Flash memory could be a good alternative to hard disks in PCs and enterprise systems alike. Well, when that happy day arrives, what will be some of the implications for database management software architecture?
- Compression will be even more important. Cost per terabyte of storage will spike up for that storage that is moved from disk to solid-state.
- The sequential-rather-than-random reading strategy of data warehouse appliance makers may become less relevant. The one way to get rid of the disk-speed bottleneck is to get rid of disks.
- DBMS will need to write data as rarely as possible. Solid-state memory tends to wear out if you keep writing over it. Assuming this problem gets better over time (if it doesn’t, this whole discussion is moot) but isn’t totally solved, architectures which have fewer writes are on the whole better.
Read the rest of this entry »
Posted in Data warehouse appliances, Data warehousing, Database compression, Database theory and practice, Netezza, Specialized data management in general | No Comments »
September 24th, 2007 Curt Monash
Pervasive Software has a long history – 25 years, in fact, as they’re emphasizing in some current marketing. Ownership and company name have changed a few times, as the company went from being an independent startup to being owned by Novell to being independent again. The original product, and still the cash cow, was a linked-list DBMS called Btrieve, eventually renamed Pervasive PSQL as it gained more and more relational functionality.
Pervasive Summit PSQL v10 has just been rolled out, and I wrote a nice little white paper to commemorate the event, describing some of the main advances over v9, primarily for the benefit of current Pervasive PSQL developers. In one major advance, Pervasive made the SQL functionality much stronger. In particular, you now can have a regular SQL data dictionary, so that the database can be used for other purposes – BI, additional apps, whatever. Apparently, that wasn’t possible before, although it had been possible in yet earlier releases. Pervasive also added view-based security permissions, which is obviously a Very Good Thing.
There also are some big performance boosts. Read the rest of this entry »
Posted in Database compression, Hierarchies, networks, graphs, and trees, Memory-centric data management, Microsoft and SQL*Server, Mid-range DBMS, OLTP database management, Pervasive Software, Portability, transparency, and plug-compatibility, Relational database management systems | No Comments »
September 18th, 2007 Curt Monash
Back in March, I suggested that compression was a central and compelling aspect of Vertica’s story. Well, in their new blog, the Vertica guys now strongly reinforce that impression.
I recommend those two Database Column posts (by Sam Madden) highly. I’ve rarely seen such a clear, detailed presentation of a company’s technical argument. My own thoughts on the subject boil down to:
- In principle, all the technology (and hence all the technological advantages) they’re talking about could be turned into features of one of the indexing options of a row-oriented RDBMS. But in practice, there’s no indication that this will happen any time soon.
- Release 1 of the Vertica product will surely have many rough edges.
- Some startups are surprisingly ignorant of the issues involved in building a successful, industrial-strength DBMS. But a company that has both Mike Stonebraker and Jerry Held seriously involved has a big advantage. They may make other kinds of errors, but they won’t make many ignorant ones.
Posted in Columnar architectures, Data warehousing, Database compression, Database theory and practice, Michael Stonebraker, Relational database management systems, Vertica Systems | 4 Comments »
August 16th, 2007 Curt Monash
In the literal sense, that is. While the details on what I wrote about this a few weeks ago* are still embargoed, I’m at liberty to drop a few more hints.
*Please also see DATAllegro CEO Stuart Frost’s two comments added today to that thread.
DATAllegro systems these days basically consist of Dell servers talking to EMC disk arrays, with Cisco Infiniband to provide fast inter-server communication without significant CPU load. Well, if you decrease the number of Dell servers per EMC box, and increase the number of disks per EMC box, you can slash your per-terabyte price (possibly at the cost of lowering performance).
Read the rest of this entry »
Posted in DATAllegro, Data warehouse appliances, Data warehousing, Database compression, Relational database management systems | No Comments »
July 25th, 2007 Curt Monash
DATAllegro CEO Stuart Frost called in for a prebriefing/feedback/consulting session. (I love advising my DBMS vendor clients on how to beat each other’s brains in. This was even more fun in the 1990s, when combat was generally more aggressive. Those were also the days when somebody would change jobs to an arch-rival and immediately explain how everything they’d told me before was utterly false …)
While I had Stuart on the phone, I did manage to extract some stuff I’m at liberty to use immediately. Here are the highlights: Read the rest of this entry »
Posted in DATAllegro, Data warehouse appliances, Data warehousing, Database compression, Greenplum, Netezza, Relational database management systems, Teradata | 4 Comments »
June 15th, 2007 Curt Monash
When Mike Stonebraker and I discussed RDF yesterday, he quickly turned to suggesting fast ways of implementing it over an RDBMS. Then, quite characteristically, he sent over a paper that allegedly covered them, but actually was about closely related schemes instead.
Edit: The paper has a new, stable URL. Hat tip to Daniel Abadi.
All minor confusion aside, here’s the story. At its core, an RDF database is one huge three-column table storing subject-property-object triples. In the naive implementation, you then have to join this table to itself repeatedly. Materialized views are a good start, but they only take you so far. Read the rest of this entry »
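To make the self-join problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample triples are all invented for illustration; this is the naive layout, not any of the faster schemes the paper covers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Made-up sample data: two subjects, three properties each.
        ("s1", "type", "Book"),
        ("s1", "author", "alice"),
        ("s1", "year", "1999"),
        ("s2", "type", "Book"),
        ("s2", "author", "bob"),
        ("s2", "year", "2005"),
    ],
)

# "Authors of books from 1999": three properties of the same subject are
# involved, so the single triples table appears three times (two self-joins).
rows = conn.execute(
    """
    SELECT a.object
    FROM triples AS t
    JOIN triples AS a ON a.subject = t.subject
    JOIN triples AS y ON y.subject = t.subject
    WHERE t.property = 'type'   AND t.object = 'Book'
      AND a.property = 'author'
      AND y.property = 'year'   AND y.object = '1999'
    """
).fetchall()
print(rows)  # [('alice',)]
```

Every additional property in the query adds another self-join against the one big table, which is why the naive implementation gets expensive quickly.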
Posted in Columnar architectures, Data warehousing, Database compression, Database theory and practice, Hierarchies, networks, graphs, and trees, RDF and graphs, Relational database management systems, Vertica Systems | No Comments »
May 29th, 2007 Curt Monash
EMC has announced a machine — a virtual tape library — that supposedly stores 1.8 petabytes of data. Even though that’s only 584 terabytes uncompressed, it shows that the 1 petabyte barrier will be broken soon no matter how unhyped the measurement.
I just recently encountered some old notes in which Sybase proudly announced a “1 gigabyte challenge.” The idea was that 1 gig was a breakthrough size for business databases.
Time flies.
Posted in Database compression, Database theory and practice, Sybase | No Comments »
March 24th, 2007 Curt Monash
I’ve recently made a lot of posts about database compression. 3X or more compression is rapidly becoming standard; 5X+ is coming soon as processor power increases; 10X or more is not unrealistic. True, this applies mainly to data warehouses, but that’s where the big database growth is happening. And new kinds of data — geospatial, telemetry, document, video, whatever — are highly compressible as well.
This trend suggests a few interesting possibilities for hardware, semiconductors, and storage.
- The growth in demand for storage might actually slow. That said, I frankly think it’s more likely that Parkinson’s Law of Data will continue to hold: Data expands to fill the space available. E.g., video and other media have near-infinite potential to consume storage; it’s just a question of resolution and fidelity.
- Solid-state (aka semiconductor or flash) persistent storage might become practical sooner than we think. If you really can fit a terabyte of data onto 100 gigs of flash, that’s a pretty affordable alternative. And by the way — if that happens, a lot of what I’ve been saying about random vs. sequential reads might be irrelevant.
- Similarly, memory-centric data management is more affordable when compression is aggressive. That’s a key point of schemes such as SAP’s or QlikTech’s. Who needs flash? Just put it in RAM, persisting it to disk just for backup.
- There’s a use for faster processors. Compression isn’t free. What you save on disk space and I/O you pay for at the CPU level. Those 5X+ compression levels do depend on faster processors, at least for the row store vendors.
Posted in Data warehousing, Database compression, Memory-centric data management, QlikTech and QlikView, SAP, BI Accelerator, and MaxDB | 6 Comments »
March 24th, 2007 Curt Monash
In my opinion, the key part of Mike Stonebraker’s fascinating note on data compression was (emphasis mine):
The standard wisdom in most row stores is to use block compression. Hence, a storage block is compressed using a single technique (say Lempel-Ziv or dictionary). The technique chosen then compresses all the attributes in all the columns which occur on the block. In contrast, Vertica compresses a storage block that only contains one attribute. Hence, it can use a different compression scheme for each attribute. Obviously a compression scheme that is type-specific will beat an implementation that is “one size fits all”.
It is possible for a row store to use a type-specific compression scheme. However, if there are 50 attributes in a record, then it must remember the state for 50 type-specific implementations, and complexity increases significantly.
In addition, all row stores we are familiar with decompress each storage block on access, so that the query executor processes uncompressed tuples. In contrast, the Vertica executor processes compressed tuples. This results in better L2 cache locality, less main memory copying and generally much better performance.
Of course, any row store implementation can rewrite their executor to run on compressed data. However, this is a rewrite – and a lot of work.
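The two claims in that passage (a per-column compression scheme, and an executor that works on compressed data) can be illustrated with run-length encoding, one scheme that suits a sorted, low-cardinality column. This is a generic sketch under that assumption, not Vertica's code; the function names and sample column are invented.

```python
from itertools import groupby
from typing import List, Tuple

Run = Tuple[str, int]  # (value, run length)


def rle_encode(column: List[str]) -> List[Run]:
    """Collapse consecutive repeats into (value, count) runs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(column)]


def count_equal(runs: List[Run], target: str) -> int:
    """COUNT(*) WHERE col = target, computed on the runs without expanding them."""
    return sum(length for value, length in runs if value == target)


# A block holding a single sorted attribute, as in a column store.
region = ["east"] * 5 + ["north"] * 3 + ["west"] * 4

runs = rle_encode(region)
print(runs)                       # [('east', 5), ('north', 3), ('west', 4)]
print(count_equal(runs, "west"))  # 4
```

Twelve values shrink to three runs, and the count touches three tuples instead of twelve. In a row-store block the neighboring values belong to different attributes, so long runs like these rarely appear.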
Read the rest of this entry »
Posted in Columnar architectures, Data warehousing, Database compression, Sybase, Vertica Systems | 6 Comments »
March 24th, 2007 Curt Monash
The following is by Mike Stonebraker, CTO of Vertica Systems, copyright 2007, as part of our ongoing discussion of data compression. My comments are in a separate post.
Row Store Compression versus Column Store Compression
I Introduction
There are three aspects of space requirements, which we discuss in this short note, namely:
- structural space requirements
- index space requirements
- attribute space requirements.
Read the rest of this entry »
Posted in Data warehousing, Database compression, Database theory and practice, Michael Stonebraker, Vertica Systems | 3 Comments »
March 21st, 2007 Curt Monash
We have lively discussions going on columnar data stores vs. vertically partitioned row stores. Part is visible in the comment thread to a recent post. Other parts come in private comments from Stuart Frost of DATAllegro and Mike Stonebraker of Vertica et al.
To me, the most interesting part of what the Vertica guys are saying is twofold. One is that data compression just works better in column stores than row stores, perhaps by a factor of 3, because “the next thing in storage is the same data type, rather than a different one.” Frankly, although Mike has said this a couple of times, I haven’t understood yet why row stores can’t be smart enough to compress just as well. Yes, it’s a little harder than it would be in a columnar system; but I don’t see why the challenge would be insuperable.
The second part is even cooler, namely the claim that column stores allow the processors to operate directly on compressed data. But once again, I don’t see why row stores can’t do that too. For example, when you join via bitmapped indices, exactly what you’re doing is operating on highly-compressed data.
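The bitmap point can be shown with a toy index. In this sketch each distinct value maps to a Python integer used as a bitset, and a two-column predicate is answered with a single bitwise AND. It shows a conjunctive filter rather than a full join, the bitmaps themselves are not compressed as production bitmap indexes usually are, and all names and data are hypothetical.

```python
from typing import Dict, List


def build_bitmap_index(column: List[str]) -> Dict[str, int]:
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index: Dict[str, int] = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap: int) -> List[int]:
    """List the row numbers whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Made-up five-row table with two columns.
color = ["red", "blue", "red", "green", "red"]
size = ["S", "S", "L", "L", "S"]

color_idx = build_bitmap_index(color)
size_idx = build_bitmap_index(size)

# WHERE color = 'red' AND size = 'S', evaluated without touching the rows.
matches = color_idx["red"] & size_idx["S"]
print(rows_of(matches))  # [0, 4]
```

The rows are never materialized; the predicate is decided entirely on the compact bit representation, which is the sense in which a row store with bitmap indexes already operates on compressed data.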
Posted in Columnar architectures, DATAllegro, Data warehouse appliances, Data warehousing, Database compression, Relational database management systems, Vertica Systems | 1 Comment »
March 16th, 2007 Curt Monash
IBM sent over a bunch of success stories recently, with DB2’s new aggressive compression prominently mentioned. Mike Stonebraker made a big point of Vertica’s compression when last we talked; other column-oriented data warehouse/mart software vendors (e.g. Kognitio, SAP, Sybase) get strong compression benefits as well. Other data warehouse/mart specialists are doing a lot with compression too, although some of that is governed by please-don’t-say-anything-good-about-us NDA agreements.
Compression is important for at least three reasons:
- It saves disk space, which is a major cost issue in data warehousing.
- It saves I/O, which is the major performance issue in data warehousing.
- In well-designed systems, it can actually make on-chip execution faster, because the gains in memory speed and movement can exceed the cost of actually packing/unpacking the data. (Or so I’m told; I haven’t aggressively investigated that claim.)
When evaluating data warehouse/mart software, take a look at the vendor’s compression story. It’s important stuff.
EDIT: DATAllegro claims in a note to me that they get 3-4x storage savings via compression. They also make the observation that fewer disks ==> fewer disk failures, and spin that (as it were) into a claim of greater reliability.
Posted in DATAllegro, Data warehouse appliances, Data warehousing, Database compression, IBM and DB2, Relational database management systems, SAP, BI Accelerator, and MaxDB, Vertica Systems | 2 Comments »
January 22nd, 2007 Curt Monash
If Mike Stonebraker is to be believed, the era of columnar data stores is upon us.
Whether or not you buy completely into Mike’s claims, there certainly are cool ideas in his latest columnar offering, from startup Vertica Systems. The Vertica corporate site offers little detail, but Mike tells me that the product’s architecture closely resembles that of C-Store, which is described in this November, 2005 paper.
The core ideas behind Vertica’s product are as follows. Read the rest of this entry »
Posted in Columnar architectures, Data warehousing, Database compression, Database theory and practice, Kognitio and WX2, Memory-centric data management, Netezza, Products and vendors, Relational database management systems, Vertica Systems | 15 Comments »
January 22nd, 2007 Curt Monash
When it comes to DBMS inventors, Mike Stonebraker is the next closest thing to Codd. And he’s become a huge non-believer in the idea that one DBMS architecture meets all needs.
Frankly, there isn’t much in that paper that hasn’t already been said in this blog, except for the part that is specifically relevant to one of his startups, StreamBase. Still, it’s nice to have the high-powered agreement.
More recently, the argument in that paper has been extended with a benchmark-filled follow-up based on another Stonebraker startup, Vertica.
Posted in Columnar architectures, Database compression, Database theory and practice, Relational database management systems, StreamBase, Vertica Systems | No Comments »
September 28th, 2006 Curt Monash
One of the least understood aspects of data warehouse technology is what may be called the
Expansion Ratio = (Total disk space used, except for mirroring) / (Size of the base database).
This is similar to the explosion ratio discussed in the OLAP Report’s justly famous discussion of database explosion, but I’m going with my own terminology because I don’t want to be tied to their precise terminology, nor to their technical focus. Expansion Ratios are hotly debated, with some figures being:
- Teradata claims an Expansion Ratio of 8-9X for Oracle, 6X for DB2 (open system version), and 2.5X for Teradata. The underlying source is data warehouses they’ve replaced, so there may be a bias toward out-of-control warehouses on the part of their competitors.
- An anonymous appliance vendor exec said to me off the top of his head that Oracle has 6-8X Expansion Ratios.
- Oracle’s TPC-H submissions in the largest size range (10 terabytes) have 9.7-10.5X Expansion Ratios, if I’m reading the TPCs correctly.
- Oracle cites a survey of 8 customers with 10-60 TB database size in which the Expansion Ratio works out to 1.6X. (More on this anomalous result below.)
I don’t have actual figures from Netezza and DATAllegro, but I imagine they’d come out lower than 2X, possibly well below.
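The Expansion Ratio defined above is a one-line calculation. In the sketch below, the function name and the sample sizes are hypothetical, and subtracting mirror copies from total disk is one reading of "except for mirroring."

```python
def expansion_ratio(total_disk_tb: float, mirrored_tb: float, base_db_tb: float) -> float:
    """(Total disk space used, except for mirroring) / (size of the base database)."""
    if base_db_tb <= 0:
        raise ValueError("base database size must be positive")
    return (total_disk_tb - mirrored_tb) / base_db_tb


# Hypothetical warehouse: 10 TB of base data occupying 50 TB of disk,
# half of which is mirror copies.
ratio = expansion_ratio(total_disk_tb=50, mirrored_tb=25, base_db_tb=10)
print(f"Expansion Ratio: {ratio:.1f}X")  # Expansion Ratio: 2.5X
```

Indexes, aggregates, temp space, and free space push the ratio above 1; aggressive compression can pull it below 1.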
Read the rest of this entry »
Posted in DATAllegro, Data warehouse appliances, Data warehousing, Database compression, IBM and DB2, Netezza, Oracle, Relational database management systems, Teradata | 4 Comments »
September 20th, 2006 Curt Monash
I wrote about SAP’s BI Accelerator quite a bit in my white paper on memory-centric data management, but otherwise I seem not to have posted much about it here. In essence, it’s a product that’s all RAM-based, and generally geared for multi-hundred-gigabyte data marts. The basic design is a compression-heavy column-based architecture, evolved from SAP’s text-indexing technology TREX. Like data warehouse appliances, it eschews indexing, relying instead on blazingly fast table scans.
I asked Lothar Schubert of SAP how BIA was doing in the market in its early going. This was his response:
Read the rest of this entry »
Posted in Analytics and analytic technologies, Business intelligence, Data warehouse appliances, Data warehousing, Database compression, Memory-centric data management, Relational database management systems, SAP, BI Accelerator, and MaxDB | 5 Comments »