DATAllegro on compression
DATAllegro CEO Stuart Frost has been blogging quite a bit recently (and not before time!). A couple of his posts have touched on compression. In one he gave actual numbers for compression, namely:
DATAllegro compresses between 2:1 and 6:1 depending on the content of the rows, whereas column-oriented systems claim 4:1 to 10:1.
In another recent post, Stuart touched on architecture, saying:
Due to the way our compression code works, DATAllegro’s current products are optimized for performance under heavy concurrency. The end result is that we don’t use the full power of the platform when running one query at a time.
| Categories: Analytic technologies, Data warehouse appliances, Data warehousing, Database compression, DATAllegro | Leave a Comment |
Data warehouse appliance power user TEOCO
If you had to name super-high-end users of data warehouse technology, your list might start with a few retailers, credit data processors, and telcos, plus the US intelligence establishment. Well, it turns out that TEOCO runs outsourced data warehouses for several of the top US telcos, making it one of the top data warehouse technology users around.
A few weeks ago, I had a fascinating chat with John Devolites of TEOCO. Highlights included:
- TEOCO runs a >200 TB DATAllegro warehouse for a major US telco. (When we hear about a big DATAllegro telco site that’s been in production for a while, that’s surely the one they’re talking about.)
- TEOCO runs around 450 TB total of DATAllegro databases across its various customers. (When Stuart Frost blogs of >400 TB “systems,” that may be what he’s talking about.)
- TEOCO likes DATAllegro better than Netezza, although the margin is now small. This is mainly for financial reasons, specifically price-per-terabyte. When TEOCO spends its own money without customer direction as to appliance brand, it buys DATAllegro.
- TEOCO runs at least one 50 TB Netezza system — originally due to an acquisition of a Netezza user — with more coming. There also is more DATAllegro coming.
- TEOCO feels 15-30 concurrent users is the current practical limit for both DATAllegro and Netezza. That’s greater than it used to be.
- Netezza is a little faster than DATAllegro on a few esoteric queries, but the difference is not important to TEOCO’s business.
- Official price lists notwithstanding, TEOCO sees prices as being in the $10K/TB range. DATAllegro’s price advantage has shrunk greatly, as others have come down to more or less match. However, since John stated his price preference for DATAllegro as being in the present tense, I presume the price match isn’t perfect.
- Teradata was never a serious consideration, for price reasons.
- In the original POC a few years ago, the incumbent Oracle — even after extensive engineering — couldn’t get an important query down under 8 hours of running time. DATAllegro and Netezza both handled it in 2-3 minutes. Similarly, Oracle couldn’t get the load time for 100 million call detail records (CDRs) below 24 hours.
- Applications sound pretty standard for telecom: Lots of CDR processing — 550 million/day on the big DATAllegro system cited above. Pricing and fraud checking. Some data staging for legal reasons (giving the NSA what it subpoenas and no more).
| Categories: Analytic technologies, Data mart outsourcing, Data warehouse appliances, Data warehousing, DATAllegro, Netezza, Pricing, Specific users, Telecommunications, TEOCO | 7 Comments |
Netezza on compression
Phil Francisco put up a nice post on Netezza’s company blog about a month ago, explaining the Netezza compression story. Highlights include:
- Like other row-based vendors, Netezza compresses data on a column-by-column basis, then stores the results in rows. This is obviously something of a limitation — no run-length encoding for them — but can surely accommodate several major compression techniques.
- The Netezza “Compress Engine” compresses data on a block-by-block basis. This is a disadvantage for row-based systems vs. columnar ones in the area of compression, because columnar systems have more values per block to play with, and that yields higher degrees of compression. And among row-based systems, typical block size is an indicator of compression success. Thus, DATAllegro probably does a little better at compression than Netezza, and Netezza does a lot better at compression than Teradata.
- Netezza calls its compression “compilation.” The blog post doesn’t make the reason clear. And the one reason I can recall confuses me. Netezza once said the compression extends at least somewhat to columns with calculated values. But that seems odd, as Netezza only has a very limited capability for materialized views.
- Netezza pays the processing cost of compression in the FPGA, not the microprocessor. And so Netezza spins the overhead of the Compress Engine as being zero or free. That’s actually not ridiculous, since Netezza seems to have still-unused real estate on the FPGA for new features like compression. Read more
| Categories: Analytic technologies, Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Netezza, Theory and architecture | 2 Comments |
Netezza has an EMC deal too
Netezza has an EMC deal too. As befits a hardware vendor, Netezza has an actual OEM relationship with EMC, in which it is offering CLARiiONs built straight into NPS appliances. 5 TB of CLARiiON will be free in any Netezza system from 2 racks on upward. (A rack holds about 12.5 TB.) In addition, you’ll be able to buy 10 TB more of CLARiiON in every Netezza rack, if you want. The whole thing is supposed to ship before year-end. Read more
| Categories: Analytic technologies, Data warehouse appliances, Data warehousing, EMC, Netezza | 5 Comments |
Top-end data warehouse sizes have grown hundreds-fold over the past 12 years
I just tripped across a link from February, 1996 in which NCR/Teradata:
- Bragged that it had half a dozen customers with >1 TB of raw user data
- Showed off a “record-breaking” 11 TB simulation
That represents roughly a 60-70% annual growth rate in top-end database sizes in the intervening 12 years.
| Categories: Analytic technologies, Data warehousing, Teradata | 4 Comments |
Netezza, enterprise data warehouses, and the 100 terabyte mark
Phil Francisco of Netezza checked in tonight with some news that will be embargoed for a few hours. While I had him on the phone anyway, I asked him about large databases and/or enterprise data warehouses. Highlights included:
- Netezza has one customer with 200 TB of user data. The name is confidential (but he told me who it was).
- Netezza has sold 15 or so of its NPS 10-800s, which are rated at 100 TB capacity.
- The second-largest database in production on Netezza is probably 80 TB or so at Catalina Marketing, which has been a Netezza early adopter all along.
- Netezza’s biggest users typically have a handful (literally — off the top of his head, Phil said “4 to 6”) of applications, each with its own primary set of fact tables.
- Each application-specific set of fact tables in such big-honking-data-mart installations is usually either of cardinality one, or else a small set sharing a common hash key.
- Phil insists Netezza isn’t exaggerating when it claims to have true enterprise data warehouse installations. What he means by an EDW is something that is an enterprise’s primary data warehouse, is used by lots of departments, draws data from lots of sources, has loads going on at various points during the day, and has 100s if not 1000s of total users.
- Netezza’s biggest EDW has about 30 TB of user data. Phil wouldn’t tell me the name of that customer.
ParAccel unveils its EMC-related appliance strategy
Embargoes are getting ever more stupid these days, wasting analysts’ and bloggers’ time in doomed attempts to micromanage the news flow. ParAccel is no exception to the rule. An announcement that’s actually been public knowledge for a couple of months was finally made official a few minutes ago. It’s an appliance, or at least an attempt to gain customers for an appliance. The core ideas include:
- ParAccel’s usual shared-nothing configuration is hooked up to SAN-based EMC storage at the back end.
- Around half of the total data is on internal (i.e., node-specific) disks, mirrored on the storage device. The rest of the data lives only on the EMC device. Logically, all this data is integrated. So hopefully you’ll be able to process more data per unit of time than you could on a standard ParAccel configuration.
- Also, different parts of the EMC device are dedicated to different ParAccel nodes. So, while this isn’t a shared-nothing architecture, at least it’s shared-not-very-much. (DATAllegro does something similar, although without the mirroring on direct-attached storage.)
- Backup, snapshotting, and so on are inherited from EMC. Administration will increasingly be integrated with EMC’s.
| Categories: Analytic technologies, Data warehouse appliances, Data warehousing, EMC, ParAccel, Parallelization | 2 Comments |
EnterpriseDB survey on open source database adoption (participation time)
CTO Bob Zurek of EnterpriseDB asked me to pass along a link to a short survey on open source database adoption. He plans to release the results publicly after they are collected. Bob stressed to me that he used to be a Forrester analyst, his point being that he knows how to be analytically objective.
Looking over the 15 questions (14 of which are simple multiple-choice), he lived up quite well to the “unbiased” claim. E.g., the only Postgres option cited is PostgreSQL, rather than EnterpriseDB’s proprietary/value-added packagings. I do see one little screw-up: Several of the questions are worded as if the respondent is, enterprise-wide, running one and exactly one instance of open source DBMS. But otherwise it seems like a clean, tight, simple survey.
| Categories: EnterpriseDB and Postgres Plus, Open source | 1 Comment |
McObject eXtremeDB — a solidDB alternative
McObject — vendor of memory-centric DBMS eXtremeDB — is a tiny, tiny company, without a development team of the size one would think needed to turn out one or more highly-reliable DBMS. So I haven’t spent a lot of time thinking about whether it’s a serious alternative to solidDB for embedded DBMS, e.g. in telecom equipment. However:
- IBM’s acquisition of Solid seems to suggest a focus on DB2 caching rather than the embedded market
- McObject actually has built up something of a customer list, as per the boilerplate on any of its press releases.
And they do seem to have some nice features, including Patricia tries (like solidDB), R-trees (for geospatial), and some kind of hybrid disk-centric/memory-centric operation.
| Categories: GIS and geospatial, In-memory DBMS, McObject, Memory-centric data management, solidDB | 6 Comments |
The week of trivial press releases
TDWI has several conferences per year, including one this week. And so companies active in data warehousing feel they must put out several press releases a year timed for TDWI conferences, whether or not anything newsworthy has actually happened. So far, the only one I’ve gotten of any real substance was Vertica’s (and, in compliance with Murphy’s Law, even that was glitched in its release). Most are yawnworthy product partnerships, adding some useful but me-too feature to somebody’s product line. Worst of all was the announcement that a certain vendor had established an indirect sales channel — after over a decade in the marketplace, when all its other competitors already have one.
Usually I love my work, but there are exceptions …
| Categories: Analytic technologies, Data warehousing | 1 Comment |
