Analytic technologies
Discussion of technologies related to information query and analysis. Related subjects include:
- Business intelligence
- Data warehousing
- (in Text Technologies) Text mining
- (in The Monash Report) Data mining
- (in The Monash Report) General issues in analytic technology
Yahoo scales its web analytics database to petabyte range
InformationWeek has an article with details on what sounds like Yahoo’s core web analytics database. Highlights include:
- The Yahoo web analytics database is over 1 petabyte in size. Yahoo claims it will be in the tens of petabytes by 2009.
- The Yahoo web analytics database is based on PostgreSQL. So much for MySQL fanboys’ claims of Yahoo validation for their beloved toy … uh, let me rephrase that. The highly-regarded MySQL, although doing a great job for some demanding and impressive applications at Yahoo, evidently wasn’t selected for this one in particular. OK. That’s much better now.
- But the Yahoo web analytics database doesn’t actually use PostgreSQL’s storage engine. Rather, Yahoo wrote something custom and columnar.
- Yahoo is processing 24 billion “events” per day. The article doesn’t clarify whether these are sent straight to the analytics store, or whether there’s an intermediate storage engine. Most likely the system fills blocks in RAM and then just appends them to the single persistent store (see the first sketch after this list). If commodity boxes occasionally crash and lose a few megs of data — well, in this application, that’s not a big deal at all.
- Yahoo thinks commercial column stores aren’t ready yet for more than 100 terabytes of data.
- Yahoo says it got great performance advantages from a custom system by optimizing for its specific application. I don’t know exactly what that optimization would be, but I do know that database architectures for high-volume web analytics are still in pretty bad shape. In particular, there’s no good way yet to analyze the specific, variable-length paths users take through websites (see the second sketch after this list).
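To make the block-appending guess concrete, here’s a minimal sketch of what crash-lossy ingestion could look like. The event format, block size, and file layout are my inventions, not Yahoo’s:

```python
import os
import struct

BLOCK_SIZE = 4 * 1024 * 1024  # hypothetical 4 MB block

class AppendOnlyEventStore:
    """Fill a block in RAM, then append it to a single persistent file.
    Events still sitting in RAM are lost if the box crashes, which is
    acceptable for this sort of web analytics workload."""

    def __init__(self, path):
        self.file = open(path, "ab")      # append-only
        self.buffer = bytearray()

    def log_event(self, event: bytes):
        # Length-prefix each event so blocks can be scanned later.
        self.buffer += struct.pack("<I", len(event)) + event
        if len(self.buffer) >= BLOCK_SIZE:
            self.flush()

    def flush(self):
        self.file.write(self.buffer)
        self.file.flush()
        os.fsync(self.file.fileno())      # a block is durable once appended
        self.buffer.clear()

store = AppendOnlyEventStore("events.dat")
store.log_event(b"user=42&page=/home&ts=1212000000")
store.flush()
```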
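As for the path-analysis complaint: sessionizing raw page views into per-user paths is easy enough in procedural code, but the resulting paths have no fixed length, which is exactly what set-oriented SQL handles awkwardly. A toy illustration (the 30-minute session cutoff and the field layout are my assumptions, not anything Yahoo disclosed):

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # assumed 30-minute session timeout, in seconds

def sessionize(events):
    """events: iterable of (user_id, timestamp, page) tuples.
    Returns a list of variable-length paths, one per session."""
    by_user = defaultdict(list)
    for user, ts, page in events:
        by_user[user].append((ts, page))
    paths = []
    for user, visits in by_user.items():
        visits.sort()
        path, last_ts = [], None
        for ts, page in visits:
            if last_ts is not None and ts - last_ts > SESSION_GAP:
                paths.append(path)   # gap too big: close the session
                path = []
            path.append(page)
            last_ts = ts
        paths.append(path)
    return paths

events = [(1, 100, "/home"), (1, 160, "/search"), (1, 99999, "/home"),
          (2, 120, "/home"), (2, 180, "/checkout")]
print(sessionize(events))
# [['/home', '/search'], ['/home'], ['/home', '/checkout']]
```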
| Categories: Analytic technologies, Columnar database management, Data warehousing, MySQL, Petabyte-scale data management, PostgreSQL, Specific users, Theory and architecture, Yahoo | 13 Comments |
DATAllegro on compression
DATAllegro CEO Stuart Frost has been blogging quite a bit recently (and not before time!). A couple of his posts have touched on compression. In one he gave actual numbers for compression, namely:
DATAllegro compresses between 2:1 and 6:1 depending on the content of the rows, whereas column-oriented systems claim 4:1 to 10:1.
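Those ratios are vendor claims, but the direction of the gap is easy to demonstrate with a generic compressor. In this toy sketch (my data, nobody’s actual algorithm), laying the same table out column by column compresses better than row by row, because values within a column resemble each other:

```python
import random
import zlib

random.seed(0)
rows = [(random.choice(["CA", "NY", "TX", "FL"]),
         random.choice(["WIDGET", "GADGET", "GIZMO"]),
         f"{random.randint(1, 500)}.99")
        for _ in range(20000)]

row_major = "\n".join(",".join(r) for r in rows).encode()
col_major = "\n".join(",".join(col) for col in zip(*rows)).encode()

for label, blob in (("row-major", row_major), ("column-major", col_major)):
    print(f"{label}: {len(blob) / len(zlib.compress(blob)):.1f}:1")
# The column-major layout typically prints the higher ratio.
```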
In another recent post, Stuart touched on architecture, saying:
Due to the way our compression code works, DATAllegro’s current products are optimized for performance under heavy concurrency. The end result is that we don’t use the full power of the platform when running one query at a time.
| Categories: Analytic technologies, Data warehouse appliances, Data warehousing, Database compression, DATAllegro | Leave a Comment |
Data warehouse appliance power user TEOCO
If you had to name super-high-end users of data warehouse technology, your list might start with a few retailers, credit data processors, and telcos, plus the US intelligence establishment. Well, it turns out that TEOCO runs outsourced data warehouses for several of the top US telcos, making it one of the top data warehouse technology users around.
A few weeks ago, I had a fascinating chat with John Devolites of TEOCO. Highlights included:
- TEOCO runs a >200 TB DATAllegro warehouse for a major US telco. (When we hear about a big DATAllegro telco site that’s been in production for a while, that’s surely the one they’re talking about.)
- TEOCO runs around 450 TB total of DATAllegro databases across its various customers. (When Stuart Frost blogs of >400 TB “systems,” that may be what he’s talking about.)
- TEOCO likes DATAllegro better than Netezza, although the margin is now small. This is mainly for financial reasons, specifically price-per-terabyte. When TEOCO spends its own money without customer direction as to appliance brand, it buys DATAllegro.
- TEOCO runs at least one 50 TB Netezza system — originally due to an acquisition of a Netezza user — with more coming. There also is more DATAllegro coming.
- TEOCO feels 15-30 concurrent users is the current practical limit for both DATAllegro and Netezza. That’s greater than it used to be.
- Netezza is a little faster than DATAllegro on a few esoteric queries, but the difference is not important to TEOCO’s business.
- Official price lists notwithstanding, TEOCO sees prices as being in the $10K/TB range. DATAllegro’s price advantage has shrunk greatly, as others have come down to more or less match. However, since John stated his price preference for DATAllegro in the present tense, I presume the match isn’t perfect.
- Teradata was never a serious consideration, for price reasons.
- In the original POC a few years ago, the incumbent Oracle — even after extensive engineering — couldn’t get an important query down under 8 hours of running time. DATAllegro and Netezza both handled it in 2-3 minutes. Similarly, Oracle couldn’t get the load time for 100 million call detail records (CDRs) below 24 hours.
- Applications sound pretty standard for telecom: Lots of CDR processing — 550 million/day on the big DATAllegro system cited above. Pricing and fraud checking. Some data staging for legal reasons (giving the NSA what it subpoenas and no more).
| Categories: Analytic technologies, Data mart outsourcing, Data warehouse appliances, Data warehousing, DATAllegro, Netezza, Pricing, Specific users, Telecommunications, TEOCO | 7 Comments |
Netezza on compression
Phil Francisco put up a nice post on Netezza’s company blog about a month ago, explaining the Netezza compression story. Highlights include:
- Like other row-based vendors, Netezza compresses data on a column-by-column basis, then stores the results in rows. This is obviously something of a limitation — no run-length encoding for them — but can surely accommodate several major compression techniques.
- The Netezza “Compress Engine” compresses data on a block-by-block basis. In the area of compression, this is a disadvantage for row-based systems vs. columnar ones, because columnar systems have more values per block to play with, and that yields higher degrees of compression. And among row-based systems, typical block size is an indicator of compression success; the sketch after this list illustrates the block-size effect. Thus, DATAllegro probably does a little better at compression than Netezza, and Netezza does a lot better at compression than Teradata.
- Netezza calls its compression “compilation.” The blog post doesn’t make the reason clear. And the one reason I can recall confuses me. Netezza once said the compression extends at least somewhat to columns with calculated values. But that seems odd, as Netezza only has a very limited capability for materialized views.
- Netezza pays the processing cost of compression in the FPGA, not the microprocessor. And so Netezza spins the overhead of the Compress Engine as being zero or free. That’s actually not ridiculous, since Netezza seems to have still-unused real estate on the FPGA for new features like compression.
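The values-per-block point is easy to sanity-check with a generic compressor, nothing Netezza-specific about it. The same data compressed in larger blocks yields a better overall ratio, since each block amortizes its compression context over more values:

```python
import random
import zlib

random.seed(1)
data = ",".join(random.choice(["CA", "NY", "TX", "FL"])
                for _ in range(100_000)).encode()

for block_size in (512, 4096, 32768, len(data)):
    # Compress the same data independently, block by block.
    compressed = sum(len(zlib.compress(data[i:i + block_size]))
                     for i in range(0, len(data), block_size))
    print(f"block size {block_size:>6}: ratio {len(data) / compressed:.1f}:1")
# Ratios improve with block size (zlib's 32 KB window caps the gain here).
```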
| Categories: Analytic technologies, Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Netezza, Theory and architecture | 2 Comments |
Netezza has an EMC deal too
As befits a hardware vendor, Netezza’s EMC deal is an actual OEM relationship, in which it offers CLARiiONs built straight into NPS appliances. 5 TB of CLARiiON will be free in any Netezza system from 2 racks on upward. (A rack holds about 12.5 TB.) In addition, you’ll be able to buy 10 TB more of CLARiiON in every Netezza rack, if you want. The whole thing is supposed to ship before year-end.
| Categories: Analytic technologies, Data warehouse appliances, Data warehousing, EMC, Netezza | 5 Comments |
Top-end data warehouse sizes have grown hundreds-fold over the past 12 years
I just tripped across a link from February 1996 in which NCR/Teradata:
- Bragged that it had half a dozen customers with >1 TB of raw user data
- Showed off a “record-breaking” 11 TB simulation
That represents roughly a 60-70% annual growth rate in top-end database sizes in the intervening 12 years.
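A quick back-of-envelope check on that figure, using the numbers above:

```python
# From ~1 TB in early 1996 to hundreds of terabytes (or Yahoo's
# petabyte) in 2008, the implied compound annual growth rate is:
for factor in (200, 500, 1000):
    print(f"{factor}x over 12 years -> {factor ** (1 / 12) - 1:.0%}/year")
# 200x -> 56%/year, 500x -> 68%/year, 1000x -> 78%/year
```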
| Categories: Analytic technologies, Data warehousing, Teradata | 4 Comments |
Netezza, enterprise data warehouses, and the 100 terabyte mark
Phil Francisco of Netezza checked in tonight with some news that’s embargoed for a few more hours. While I had him on the phone anyway, I asked him about large databases and/or enterprise data warehouses. Highlights included:
- Netezza has one customer with 200 TB of user data. The name is confidential (but he told me who it was).
- Netezza has sold 15 or so of its NPS 10-800s, which are rated at 100 TB capacity.
- The second-largest database in production on Netezza is probably 80 TB or so at Catalina Marketing, which has been a Netezza early adopter all along.
- Netezza’s biggest users typically have a handful (literally — off the top of his head, Phil said “4 to 6”) of applications, each with its own primary set of fact tables.
- Each application-specific set of fact tables in such big-honking-data-mart installations usually consists either of a single table, or else of a small set sharing a common hash key (see the sketch after this list).
- Phil insists Netezza isn’t exaggerating when it claims to have true enterprise data warehouse installations. What he means by an EDW is something that is an enterprise’s primary data warehouse, is used by lots of departments, draws data from lots of sources, has loads going on at various points during the day, and has hundreds if not thousands of total users.
- Netezza’s biggest EDW has about 30 TB of user data. Phil wouldn’t tell me the name of that customer.
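To unpack the hash-key point: in a shared-nothing system, fact tables distributed on the same key put matching rows on the same node, so joins between them need no cross-node data movement. A generic sketch of the idea, not Netezza’s actual mechanism:

```python
NUM_NODES = 8  # hypothetical node count

def node_for(key):
    """Pick the node that stores a row, based on its distribution key."""
    return hash(key) % NUM_NODES

# Two fact tables sharing customer_id as their common hash key:
call_row = {"customer_id": 12345, "duration_sec": 300}
bill_row = {"customer_id": 12345, "amount": 19.99}

# Matching rows land on the same node, so a join on customer_id
# runs entirely locally, with no inter-node shuffling.
assert node_for(call_row["customer_id"]) == node_for(bill_row["customer_id"])
```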
ParAccel unveils its EMC-related appliance strategy
Embargoes are getting ever more stupid these days, wasting analysts’ and bloggers’ time in doomed attempts to micromanage the news flow. ParAccel is no exception to the rule. An announcement that’s actually been public knowledge for a couple of months was finally made official a few minutes ago. It’s an appliance, or at least an attempt to gain customers for an appliance. The core ideas include:
- ParAccel’s usual shared-nothing configuration is hooked up to SAN-based EMC storage at the back end.
- Around half of the total data is on internal (i.e., node-specific) disks, mirrored on the storage device. The rest of the data lives only on the EMC device. Logically, all this data is integrated. So hopefully you’ll be able to process more data per unit of time than you could on a standard ParAccel configuration. (A conceptual sketch of the resulting two-tier read path appears after this list.)
- Also, different parts of the EMC device are dedicated to different ParAccel nodes. So, while this isn’t a shared-nothing architecture, at least it’s shared-not-very-much. (DATAllegro does something similar, although without the mirroring on direct-attached storage.)
- Backup, snapshotting, and so on are inherited from EMC. Administration will increasingly be integrated with EMC’s.
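Here’s my conceptual guess at how such a two-tier read path works. ParAccel hasn’t published mechanics at this level of detail, so treat every name here as hypothetical:

```python
def read_block(block_id, local_disk, san_slice):
    """Hot data sits on node-local disk (and is mirrored to the SAN);
    colder data lives only on this node's dedicated SAN partition."""
    if block_id in local_disk:
        return local_disk[block_id]   # fast direct-attached path
    return san_slice[block_id]        # slower, but still node-private

local_disk = {"2008-Q2": b"recent rows"}                         # hot tier
san_slice = {"2008-Q2": b"recent rows", "2005-Q1": b"old rows"}  # full tier
print(read_block("2005-Q1", local_disk, san_slice))  # b'old rows'
```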
| Categories: Analytic technologies, Data warehouse appliances, Data warehousing, EMC, ParAccel, Parallelization | 2 Comments |
The week of trivial press releases
TDWI has several conferences per year, including one this week. And so companies active in data warehousing feel they must put out several press releases a year timed for TDWI conferences, whether or not anything newsworthy has actually happened. So far, the only one I’ve gotten of any real substance was Vertica’s (and, in compliance with Murphy’s Law, even that was glitched in its release). Most are yawnworthy product partnerships, adding some useful but me-too feature to somebody’s product line. Worst of all was the announcement that a certain vendor had established an indirect sales channel — after over a decade in the marketplace, when all its other competitors already have one.
Usually I love my work, but there are exceptions …
| Categories: Analytic technologies, Data warehousing | 1 Comment |
Vertica in the cloud
I may have gotten confused again as to an embargo date, but if so, then this time I had it late rather than early. Anyhow, the TDWI-timed news is that Vertica is now available in the Amazon cloud. Of course, the new Vertica cloud offering is:
- Super-easy to set up
- Pay-as-you-go
Slightly less obviously:
- Vertica insists its software was designed for grid computing from the ground up, and hence doesn’t need Elastra’s administrative aids for starting, stopping, and/or provisioning instances.
- This is a natural fit for new or existing Vertica customers in data mart outsourcing.