Vertica Systems
Analysis of columnar data warehouse DBMS vendor Vertica Systems. Related subjects include:
Notes on columnar/TPC-H compression
I was chatting with Omer Trajman of Vertica, and he said that a 70% compression figure for ParAccel’s recent TPC-H filing sounded about right.* When I noted that seemed kind of low, Omer pointed out that TPC-H data is pseudo-random, while real-life data has much more correlation among the values in different columns. (E.g., in retail, a customer is likely to consistently shop at the same stores and to put similar items into his shopping basket.)
*Omer was involved in Vertica’s TPC-H-data-based load speed benchmark, and is Vertica’s representative to the TPC.
But why does this matter? After all, Vertica compresses one column at a time (unlike, say, Clearpace). Well, the reason is that Vertica — like other column stores — wants to store different columns in the same row order, for obvious benefits in both reading and writing. So, for example, if all the rows that include Gotham City are grouped sequentially, then all the rows mentioning Bruce Wayne are likely to be near each other as well, while none of the rows that mention Clark Kent will be mixed in.
And when a set of consecutive entries has low cardinality, it’s easier to get high levels of compression.
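To make the point concrete, here is a minimal sketch in Python. It uses run-length encoding purely as a stand-in for "compression that benefits from low-cardinality runs"; it is not a description of Vertica's actual encodings, and the rows are made up to echo the example above.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# Hypothetical (city, customer) rows, arriving in no particular order.
rows = [
    ("Gotham City", "Bruce Wayne"),
    ("Metropolis", "Clark Kent"),
    ("Gotham City", "Bruce Wayne"),
    ("Metropolis", "Clark Kent"),
    ("Gotham City", "Bruce Wayne"),
    ("Metropolis", "Lois Lane"),
]

def customer_runs(rows):
    """Run-length encode only the customer column, in the given row order."""
    return run_length_encode([customer for _, customer in rows])

# Customer column compressed in arrival order.
unsorted_runs = customer_runs(rows)

# Same column after the rows are sorted by city only. The customer column
# was never sorted on directly, yet it ends up in long runs because it is
# correlated with city.
sorted_runs = customer_runs(sorted(rows, key=lambda row: row[0]))

print(len(unsorted_runs), "runs unsorted;", len(sorted_runs), "runs sorted by city")
```

With pseudo-random data such as TPC-H, sorting on one column does little for the others, which is consistent with the lower compression figure Omer described.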
| Categories: Benchmarks and POCs, Columnar database management, Data warehousing, Database compression, Vertica Systems | Leave a Comment |
My current customer list among the analytic DBMS specialists
(This is an updated version of an August, 2008 post.)
One of my favorite pages on the Monash Research website is the list of many current and a few notable past customers. (Another favorite page is the one for testimonials.) For a variety of reasons, I won’t undertake to be more precise about my current customer list than that. But I don’t think it would hurt anything to list the analytic/data warehouse DBMS/appliance specialists in the group. They are:
- Aster Data
- Greenplum
- Infobright
- Kickfire
- Kognitio
- Microsoft
- Netezza (my biggest client this year, probably, because of all the Enzee Universe appearances)
- Sybase
- Teradata
- Vertica
- Attivio, which may or may not be construed as being in the analytic DBMS business
- Clearpace, ditto
All of those are Monash Advantage members.
If you care about all this, you may also be interested in the rest of my standards and disclosures.
| Categories: About this blog, Aster Data, Data warehousing, Greenplum, Infobright, Kickfire, Microsoft and SQL*Server, Netezza, Sybase, Teradata, Vertica Systems | 2 Comments |
H-Store is now VoltDB
I’ve always honored more of an NDA about the H-Store project and its commercialization than I really felt obligated to, given how freely information was being bandied about to others. I’m still doing so.
But I think I’ll at least say that the H-Store project is now named VoltDB. The VoltDB website names two individuals — Mike Stonebraker and Andy Palmer — both of whom are founders of Vertica. Job listings on the site are for field engineer and trainer, but not developer, so that suggests something about the project’s/product’s maturity level.
If you have an extreme OLTP need, you should talk to VoltDB. If you don’t have access to Mike or Andy directly, I can hook you up with a key VoltDB marketing/outreach guy. Price may not be as much of a barrier as you’d initially fear.
If anybody from VoltDB wants to be less cloak-and-daggery and say more in the comment thread, I’d be pleased.
And yes — an open-secret working name for H-Store/VoltDB was, for a while, “Horizontica.”
| Categories: In-memory DBMS, Memory-centric data management, OLTP, Vertica Systems, VoltDB and H-Store | 8 Comments |
Per-terabyte pricing
Software-only DBMS vendors sometimes price per terabyte of user data. Vertica’s list price is $100K/TB. Greenplum’s list price is $70K/TB. In practice, both offer substantial discounts, especially at higher volumes. In both cases, “user data” means raw data, uncompressed, without counting indexes or temp space.
Client experience teaches me that this definition is easy to forget, so let me reemphasize the key point:
Per-terabyte pricing is based on a calculated figure. Per-terabyte pricing is not based on the current disk space used by your database when managed by the DBMS you are replacing.
There’s at least one important difference in how Vertica and Greenplum calculate database size. No matter how many times you copy the data, Vertica only charges you for it once.* But if you spin out data marts and recopy data into them — as Greenplum rightly encourages you to do — Greenplum wants to be paid for each copy. Similarly, Vertica charges only for deployment, and not for test or development; I didn’t remember to ask what Greenplum’s policies are in those regards. (Edit: Greenplum says in a comment below that it doesn’t charge for test or development data either.)
*That policy is a great fit with Vertica’s performance recommendation that you should store columns in different sort orders, perhaps an average of two copies per column.
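As a rough illustration of how the copy-counting policy changes the bill, here is a small Python sketch. The 10 TB database and the two copies are hypothetical, the prices are the list prices quoted above before any discount, and the function names are mine. It also glosses over the fact that "copies" means different things for the two vendors (extra sort orders versus spun-out data marts).

```python
def licensed_terabytes(raw_user_tb, copies, charge_per_copy):
    """Terabytes that count toward the license.

    raw_user_tb is the calculated raw, uncompressed user data,
    not the disk space used by the DBMS being replaced.
    """
    return raw_user_tb * copies if charge_per_copy else raw_user_tb

def list_price(raw_user_tb, copies, price_per_tb, charge_per_copy):
    """List price in dollars, ignoring negotiated or volume discounts."""
    return licensed_terabytes(raw_user_tb, copies, charge_per_copy) * price_per_tb

# Hypothetical: 10 TB of raw user data, stored twice.
charged_once = list_price(10, 2, 100_000, charge_per_copy=False)  # $100K/TB, copies free
charged_each = list_price(10, 2, 70_000, charge_per_copy=True)    # $70K/TB, every copy billed

print(f"charged once: ${charged_once:,}  charged per copy: ${charged_each:,}")
```

In this made-up case the lower per-terabyte price ends up costing more at list, which is why the definition of database size deserves as much attention as the headline price.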
| Categories: Columnar database management, Data warehousing, Greenplum, Pricing, Vertica Systems | 4 Comments |
Vertica pricing and customer metrics
Since last fall, Vertica’s stated pricing has been “$100K per terabyte of user data.” Vertica hastens to point out that unlike, for example, appliance vendors or Sybase, it only charges for deployment licenses; development and test are free (although of course you have to Bring Your Own hardware). Over the past few weeks, I’ve gotten other pricing comments from Vertica to the effect that:
- Of course, Vertica offers substantial negotiated quantity discounts. (Specifics that Vertica told me are confidential.)
- Actually, Vertica’s official price list (unpublished but apparently freely available to prospects) contains quantity discounts too.
- Finally, Vertica told me that its actual average price is around $25K/terabyte, and gave me permission to publish same.
I didn’t press my luck and ask exactly what “average” means in this context.
As for customers, metrics I got include:
| Categories: Data warehousing, Market share, Pricing, Vertica Systems | 2 Comments |
There always seems to be a fire drill around MapReduce news
Last August I flew out to see my new clients at Greenplum. They told me they planned to roll out MapReduce in a few weeks, and asked for my help in publicizing it. From their offices I went to dinner with non-clients Aster Data, who told me they’d gotten wind of a Greenplum MapReduce announcement and planned to come out ahead of it. A couple of hours later, Aster signed up as a client. In something of a pickle — but not one of my own making — I knocked heads, and persuaded both vendors to announce MapReduce at the same time, namely the following Monday. Lots of publicity ensued for both vendors, and everybody was reasonably satisfied.
| Categories: Analytic technologies, Aster Data, Greenplum, MapReduce, Michael Stonebraker, Vertica Systems | Leave a Comment |
Stonebraker, DeWitt, et al. compare MapReduce to DBMS
Along with five other coauthors — the lead author seems to be Andy Pavlo — famous MapReduce non-fans Mike Stonebraker and David DeWitt have posted a SIGMOD 2009 paper called “A Comparison of Approaches to Large-Scale Data Analysis.” The heart of the paper is benchmarks of Hadoop, Vertica, and “DBMS-X” on identical clusters of 100 low-end nodes, across a series of tests including (if I understood correctly):
- A couple of different flavors of a Grep task originally proposed in a Google MapReduce paper.
- A database query on simulated clickstream data.
- A join on the same clickstream data.
- Two aggregations on the clickstream data.
| Categories: Analytic technologies, Hadoop, MapReduce, Michael Stonebraker, Parallelization, Vertica Systems | 6 Comments |
The questionable benefits of terabyte-scale data warehouse virtualization
Vertica is virtualizing via VMware, and has suggested a few operational benefits to doing so that might or might not offset VMware’s computational overhead. But on the whole, it seems virtualization’s major benefits don’t apply to large-database MPP data warehousing.
| Categories: Columnar database management, Data warehousing, Database compression, Theory and architecture, Vertica Systems | 2 Comments |
Vertica Virtualizes Via VMware
(In other news, the sixth sick sheik’s sixth sheep is sick … but I digress.)
It seems that every analytic DBMS vendor feels compelled to issue at least one press release the week of winter TDWI. Vertica’s grand revelation this year is that you can use Vertica with VMware.* Of course, VMware working the way it does, you in fact have always been able to use Vertica with VMware. But now things are slightly improved, because Vertica has built install packages you can download, and has been working out recommended configuration settings as well.
| Categories: Data warehousing, Vertica Systems | 2 Comments |
Draft slides on how to select an analytic DBMS
I need to finalize an already-too-long slide deck on how to select an analytic DBMS by late Thursday night. Anybody see something I’m overlooking, or just plain got wrong?
Edit: The slides have now been finalized.
One vendor’s trash is another’s treasure
A few months ago, CEO Mayank Bawa of Aster Data commented to me on his surprise at how “profound” the relationship was between design choices in one aspect of a data warehouse DBMS and choices in other parts. The word choice in that was all Mayank, but the underlying thought is one I’ve long shared, and that I’m certain architects of many analytic DBMS share as well.
For that matter, the observation is no doubt true in many other product categories as well. But in the analytic database management arena, where there are literally 10-20+ competitors with different, non-stupid approaches, it seems most particularly valid. Here are some examples of what I mean.
| Categories: Aster Data, Data warehousing, Exadata, Kognitio, Oracle, Theory and architecture, Vertica Systems | 22 Comments |
New England Database Day this Friday January 30
Dan Weinreb, to whose opinions I usually give great weight, spoke very favorably of last year’s New England Database Day conference. Well, this year’s is taking place on Friday. It’s at MIT and it’s free, with easy registration. A list of papers is here.
It’s pretty obvious who’s running the show. Sam Madden’s name is given as a contact; elsewhere it’s referred to as being organized by Madden and Mike Stonebraker. Of the six identified papers, 2-3 look like the subjects or people could be taken straight from Vertica’s Database Column blog. But that hardly means the event will be one long Vertica commercial. For example, the other papers include one from Netezza and one on Flash memory data access methods.
I really doubt I’ll make it to Cambridge in time for the 9:00 am opening remarks ;), but I’ll try to swing by later on.
| Categories: Michael Stonebraker, Theory and architecture, Vertica Systems | 2 Comments |
Database SaaS gains a little visibility
Way back in the 1970s, a huge fraction of analytic database management was done via timesharing, specifically in connection with the RAMIS and FOCUS business-intelligence-precursor fourth-generation languages. (Both were written by Gerry Cohen, who built his company Information Builders around the latter one.) The market for remote-computing business intelligence has never wholly gone away since. Indeed, it’s being revived now, via everything from the analytics part of Salesforce.com to the service category I call data mart outsourcing.
Less successful to date are efforts in the area of pure database software-as-a-service. It seems that if somebody is going for SaaS anyway, they usually want a more complete, integrated offering. The most noteworthy exceptions I can think of to this general rule are Kognitio and Vertica, and they only have a handful of database SaaS customers each. To wit:
Gartner’s 2008 data warehouse database management system Magic Quadrant is out
Gartner’s annual Magic Quadrant for data warehouse DBMS is out. Thankfully, vendors don’t seem to be taking it as seriously as usual, so I didn’t immediately hear about it. (I finally noticed it in a Greenplum pay-per-click ad.) Links to Gartner MQs tend to come and go, but as of now here are two working links to the 2008 Gartner Data Warehouse Database Management System MQ. My posts on the 2007 and 2006 MQs have also been updated with working links.
More from Vertica on data warehouse load speeds
Last month, when Vertica released its “benchmark” of data warehouse load speeds, I didn’t realize it had previously released some actual customer-experience load rates as well. In a July, 2008 white paper that seems thankfully free of any registration requirements, Vertica cited four examples:
- (Comcast) Trickle loads 48MB/minute – SNMP data generated by devices in the Comcast cable network is trickle loaded on a 24×7 basis at rates as high as 135,000 rows/second. The system runs on 5 HP ProLiant DL 380 servers.
- (Verizon) Bulk loads to memory 300MB/minute – 50MB to 300MB of call detail records (1K record size—150 columns per row) are loaded every 10 minutes. Vertica runs on 6 HP ProLiant DL380 servers.
- (Level 3 Communications) Bulk loads to disk 5GB/minute - The loading and enrichment (i.e., summary table creation) of 1.5TB of call detail records formerly took 5 days in a row-oriented data warehouse database. Vertica required 5 hours to load the same data.
- (“Global investment firm”) Trickle loads 2.6GB/minute - Historic financial trade and quote (TaQ) data was bulk loaded into the database at a rate of 125GB/hour. New TaQ data was trickled into the database at rates as high as 90,000 rows per second (480 bytes per row).
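The figures in that list mix rows per second, gigabytes per hour, and elapsed time, so a quick unit conversion helps when comparing them. This is a sketch under two assumptions of mine: decimal units (1 GB = 10^9 bytes, 1 TB = 1000 GB), and that the quoted per-row size for the investment firm means 480 bytes.

```python
def gb_per_minute_from_rows(rows_per_second, bytes_per_row):
    """Convert a row rate into GB/minute, assuming 1 GB = 1e9 bytes."""
    return rows_per_second * bytes_per_row * 60 / 1e9

def gb_per_minute_from_total(total_gb, hours):
    """Average GB/minute for a load of total_gb finished in the given hours."""
    return total_gb / (hours * 60)

# "Global investment firm": 90,000 rows/second at an assumed 480 bytes/row.
trickle_rate = gb_per_minute_from_rows(90_000, 480)

# Same firm, historic bulk load quoted at 125 GB/hour.
bulk_rate = gb_per_minute_from_total(125, 1)

# Level 3: 1.5 TB loaded in 5 hours.
level3_rate = gb_per_minute_from_total(1.5 * 1000, 5)

for label, rate in [("trickle", trickle_rate), ("bulk", bulk_rate), ("Level 3", level3_rate)]:
    print(f"{label}: {rate:.2f} GB/minute")
```

On these assumptions the 2.6GB/minute and 5GB/minute headline figures are consistent with the details given alongside them.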
| Categories: Vertica Systems | Leave a Comment |
More grist for the column vs. row mill
Daniel Abadi and Sam Madden are at it again, following up on their blog posts of six months ago arguing for the general superiority of column stores over row stores (for analytic query processing). The gist is to recite a number of bases for superiority, beyond the two standard ones of less I/O and better compression, and seems to be based largely on Section 5 of a SIGMOD paper they wrote with Nabil Hachem.
A big part of their argument is that if you carry the processing of columnar and/or compressed data all the way through in memory, you get lots of advantages, especially because everything’s smaller and hence fits better into Level 2 cache. There also is some kind of join algorithm enhancement, which seems to be based on noticing when the result wound up falling into a range according to some dimension, and perhaps using dictionary encoding in a way that will help induce such an outcome.
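For readers who want a feel for what "carrying compressed data all the way through" can mean, here is a toy Python sketch of evaluating a predicate directly on dictionary-encoded values. It is my own illustration of the general idea, not the algorithm from the paper, and the column contents are invented.

```python
def dictionary_encode(values):
    """Return (sorted dictionary, integer codes) for a column of values."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]

# Hypothetical country column.
column = ["DE", "US", "FR", "US", "DE", "US"]
dictionary, codes = dictionary_encode(column)

# Equality predicate: translate the constant to its code once, then compare
# small integers instead of decoding every row back to a string.
target = dictionary.index("US")
matching_rows = [row for row, code in enumerate(codes) if code == target]

# Because the dictionary is sorted, code order matches value order, so a
# range predicate such as country < "FR" can also run on the codes alone.
upper = dictionary.index("FR")
rows_before_fr = [row for row, code in enumerate(codes) if code < upper]

print(matching_rows, rows_before_fr)
```

The integer codes are smaller than the original strings, which is the "fits better into cache" part of the argument; the decode step is deferred until results are actually returned.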
The main enemy here is row-store vendors who say, in effect, “Oh, it’s easy to shoehorn almost all the benefits of a column-store into a row-based system.” They also take a swipe — for being insufficiently purely columnar — at unnamed columnar Vertica competitors, described in terms that seemingly apply directly to ParAccel.
| Categories: Columnar database management, Data warehousing, Database compression, ParAccel, Vertica Systems | 2 Comments |
Data warehouse load speeds in the spotlight
Syncsort and Vertica combined to devise and run a benchmark in which a data warehouse got loaded at 5 ½ terabytes per hour, which is several times faster than the figures used in any other vendors’ similar press releases in the past. Takeaways include:
- Syncsort isn’t just a mainframe sort utility company, but also does data integration. Who knew?
- Vertica’s design to overcome the traditional slow load speed of columnar DBMS works.
The latter is unsurprising. Back in February, I wrote at length about how Vertica makes rapid columnar updates. I don’t have a lot of subsequent new detail, but it made sense then and now.
Silly website tricks
Vertica’s marketing is usually good-to-outstanding, but they made a funny misstep this time. If you go to the Vertica home page, you’ll see seasonal art suggesting that their product is a turkey and/or that it’s terrified it’s about to get the ax.
Live by the pun, die by the pun.
| Categories: Humor, Vertica Systems | 6 Comments |
Vertica offers some more numbers
Eric Lai interviewed Dave Menninger of Vertica. Highlights included:
- $20 million in trailing revenue. Removing a single multi-million-dollar deal from the list, that’s a few hundred thousand dollars each for 50ish customers. At $100K or so per terabyte, that’s an average of several terabytes of user data each, or more depending on what you assume about discounting.
- Dave used a figure of $100K per terabyte of user data, down from the $150K Vertica has previously used.
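The back-of-the-envelope arithmetic in the first bullet can be spelled out as follows. The size of the single large deal was not given, so the $3 million used here is purely an assumption of mine, as is the 50% discount in the second calculation.

```python
trailing_revenue = 20_000_000   # from the interview
customers = 50                  # "50ish"
big_deal = 3_000_000            # ASSUMPTION: the one multi-million-dollar deal
list_price_per_tb = 100_000     # figure Dave used

# Average revenue per customer once the outlier deal is set aside.
per_customer = (trailing_revenue - big_deal) / (customers - 1)

# Implied terabytes of user data per customer at list price ...
tb_at_list = per_customer / list_price_per_tb

# ... and if the effective price were half of list (hypothetical discount).
tb_at_half_price = per_customer / (list_price_per_tb * 0.5)

print(f"${per_customer:,.0f} per customer, "
      f"{tb_at_list:.1f} TB at list, {tb_at_half_price:.1f} TB at half price")
```

This treats all revenue as license fees on user data, which is a simplification, but it shows why heavier discounting implies larger average databases.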
| Categories: Data warehousing, Market share, Pricing, Vertica Systems | 10 Comments |
Vertica finally spells out its compression claims
Omer Trajman of Vertica put up a must-read blog post spelling out detailed compression numbers, based on actual field experience (which I’d guess is from a combination of production systems and POCs):
- CDR - 8:1 (87%)
- Consumer Data - 30:1 (96%)
- Marketing Analytics - 20:1 (95%)
- Network logging - 60:1 (98%)
- Switch Level SNMP - 20:1 (95%)
- Trade and Quote Exchange - 5:1 (80%)
- Trade Execution Auditing Trails - 10:1 (90%)
- Weblog and Click-stream - 10:1 (90%)
It’s clear what Omer means by most of those categories from reading the post, but I’m a little fuzzy on what “Consumer Data” or “Marketing Analytics” comprise in his taxonomy. Anyhow, Omer’s post is a huge improvement over my recent one — based on a conversation with Omer — which featured some far less accurate or complete compression numbers.
Omer goes on to claim that trickle-feed data is harder for rival systems to compress than it is for Vertica, and generally to claim that Vertica’s compression is typically severalfold better than that of competitive row-based systems.
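Since the list above quotes each figure both as a ratio and as a percentage, here is the conversion between the two, as a short Python sketch. The formula is just space saved = 1 - 1/ratio; the dictionary below copies the ratios from the list.

```python
def space_saved(ratio):
    """Fraction of space saved by an N:1 compression ratio."""
    return 1 - 1 / ratio

ratios = {
    "CDR": 8,
    "Consumer Data": 30,
    "Marketing Analytics": 20,
    "Network logging": 60,
    "Switch Level SNMP": 20,
    "Trade and Quote Exchange": 5,
    "Trade Execution Auditing Trails": 10,
    "Weblog and Click-stream": 10,
}

for category, ratio in ratios.items():
    print(f"{category}: {ratio}:1 saves {space_saved(ratio):.1%}")

# Doubling the ratio from 30:1 to 60:1 moves the percentage by less than
# two points, so percentages understate differences at the high end.
gain_30_to_60 = space_saved(60) - space_saved(30)
```

That last point is why ratios are the more useful way to compare high-compression systems: 96% versus 98% sounds close, but it is a twofold difference in disk space.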
| Categories: Database compression, Vertica Systems, Web analytics | 3 Comments |
Database compression is heavily affected by the kind of data
I’ve written often of how different kinds or brands of data warehouse DBMS get very different compression figures. But I haven’t focused enough on how much compression figures can vary among different kinds of data. This was really brought home to me when Vertica told me that web analytics/clickstream data can often be compressed 60X in Vertica, while at the other extreme — some kind of floating point data, whose details I forget for now — they could only do 2.5X. Edit: Vertica has now posted much more accurate versions of those numbers. Infobright’s 30X compression reference at TradeDoubler seems to be for a clickstream-type app. Greenplum’s customer getting 7.5X — high for a row-based system — is managing clickstream data and related stuff. Bottom line:
When evaluating compression ratios — especially large ones — it is wise to inquire about the nature of the data.
| Categories: Data warehousing, Database compression, Greenplum, Infobright, Vertica Systems, Web analytics | 4 Comments |
Web analytics — clickstream and network event data
It should surprise nobody that web analytics – and specifically clickstream data — is one of the biggest areas for high-end data warehousing. For example:
- I believe that both of the previously mentioned petabyte+ databases on Greenplum will feature clickstream data.
- Aster Data’s largest disclosed database, by almost two orders of magnitude, is at MySpace.
- Clickstream analytics is a big application area for Vertica Systems.
- Clickstream analytics is a big application area for Netezza.
- Infobright’s customer success stories appear to be concentrated in clickstream analytics.
- Coral8 tells me that CEP is also being used for clickstream data, although I suspect that a lot of Coral8’s evidence in that regard comes from a single flagship account. Edit: Actually, Coral8 has a bunch of clickstream customers.
| Categories: Aleri and Coral8, Aster Data, Complex event processing (CEP), Greenplum, Infobright, Netezza, Vertica Systems, Web analytics | 2 Comments |
SANs vs. DAS in MPP data warehousing
Generally speaking:
- SANs (Storage Area Networks) are pulling ahead of DAS (Direct Attached Storage).
- Much of the growth in storage is due to data warehousing.
- MPP (Massively Parallel Processing) is pulling ahead of SMP (Symmetric MultiProcessing) for high-end data warehousing.
- MPP architectures are commonly shared-nothing.
- Shared-nothing entails DAS.
But if you think about it, those facts don’t exactly add up.
| Categories: Calpont, Parallelization, Storage, Vertica Systems | 19 Comments |
Dividing the data warehousing work among MPP nodes
I talk with lots of vendors of MPP data warehouse DBMS. I’ve now heard enough different approaches to MPP architecture that I think it might be interesting to contrast some of the alternatives.
| Categories: Aster Data, Calpont, Exasol, Greenplum, Parallelization, Theory and architecture, Vertica Systems | 21 Comments |
Vertica’s paying customer count
In a recent Computerworld article, Andy Ellicott of Vertica was cited as saying Vertica has 50 paying customers total. That’s very much on par with Greenplum’s figure, leaving aside any questions of deal size. (Greenplum runs a number of databases much larger than Vertica’s biggest. However, I believe Greenplum also charges a lot less per terabyte of user data.)
Previous Vertica paying customer count figures include:
| Categories: Data warehousing, Greenplum, Vertica Systems | 8 Comments |
