Data warehousing
Analysis of issues in data warehousing, with extensive coverage of database management systems and data warehouse appliances that are optimized to query large volumes of data.
Jerry Held on cloud data warehousing and how business intelligence will be transformed by it
Vertica Chairman Jerry Held has a pair of blog posts on analytics and data warehousing in the cloud. The first lays out a number of potential benefits and consequences of cloud data warehousing, under the heading of “Transforming BI”: Read more
Categories: Analytic technologies, Business intelligence, Cloud computing, Data mart outsourcing, Data warehousing, Software as a Service (SaaS), Vertica Systems | 7 Comments |
Response to Rita Sallam of Oracle
In a comment thread on Seth Grimes’ blog, Rita Sallam of Oracle engaged in a passionate defense of Oracle’s data warehousing software. I’d like to take it upon myself to respond to a few of her points here. Read more
Categories: Benchmarks and POCs, Clustering, Data warehousing, Oracle, Parallelization | 10 Comments |
Oracle Optimized Warehouse Initiative
Oracle’s response to data warehouse appliances — and to IBM’s BCUs (Balanced Configuration Units) — so far is the Oracle Optimized Warehouse Initiative (OOW, not to be confused with Oracle Open World). A small amount of information about Oracle Optimized Warehouse can be found on Oracle’s website. Another small amount can be found in this recent long and breathless TDWI article, full of such brilliancies as attributing to the data warehouse appliance vendors the “claim that relational databases simply aren’t cut out for analytic workloads.” (Uh, what does he think they’re running — CODASYL DBMS?)
So far as I can tell, what Oracle Optimized Warehouse — much like IBM’s BCU — boils down to is the same old Oracle DBMS, but with recommended hardware configurations and tuning parameters. Thus, a lot of the hassle is taken out of ordering and installing an Oracle data warehouse, which is surely a good thing. But I doubt it does much to solve Oracle’s problems with price, price/performance, or the inevitable DBA hassles that come with a poorly-performing DBMS.
Categories: Data warehouse appliances, Data warehousing, Oracle | 3 Comments |
Yahoo scales its web analytics database to petabyte range
Information Week has an article with details on what sounds like Yahoo’s core web analytics database. Highlights include:
- The Yahoo web analytics database is over 1 petabyte. They claim it will be in the tens of petabytes by 2009.
- The Yahoo web analytics database is based on PostgreSQL. So much for MySQL fanboys’ claims of Yahoo validation for their beloved toy … uh, let me rephrase that. The highly-regarded MySQL, although doing a great job for some demanding and impressive applications at Yahoo, evidently wasn’t selected for this one in particular. OK. That’s much better now.
- But the Yahoo web analytics database doesn’t actually use PostgreSQL’s storage engine. Rather, Yahoo wrote something custom and columnar.
- Yahoo is processing 24 billion “events” per day. The article doesn’t clarify whether these are sent straight to the analytics store, or whether there’s an intermediate storage engine. Most likely the system fills blocks in RAM and then just appends them to the single persistent store. If commodity boxes occasionally crash and lose a few megs of data — well, in this application, that’s not a big deal at all.
- Yahoo thinks commercial column stores aren’t ready yet for more than 100 terabytes of data.
- Yahoo says it got great performance advantages from a custom system by optimizing for its specific application. I don’t know exactly what that would be, but I do know that database architectures for high-volume web analytics are still in pretty bad shape. In particular, there’s no good way yet to analyze the specific, variable-length paths users take through websites.
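To illustrate that last point, here is a minimal Python sketch of the kind of variable-length path analysis that is awkward to express over a conventional warehouse schema: sessionize a stream of page-view events, then count the sessions that contain a particular ordered path. The event format, the 30-minute session timeout, and all the names are my illustrative assumptions, not anything Yahoo has described.

```python
from itertools import groupby

SESSION_GAP = 30 * 60  # assumed session timeout, in seconds

def sessionize(events):
    """Split one user's time-ordered (timestamp, page) events into sessions."""
    sessions, current, last_ts = [], [], None
    for ts, page in events:
        if last_ts is not None and ts - last_ts > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(page)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

def count_paths(event_log, path):
    """Count sessions containing `path` as a contiguous sequence of pages.
    Assumes event_log is a list of (user, timestamp, page) sorted by user, then time."""
    path = tuple(path)
    hits = 0
    for user, rows in groupby(event_log, key=lambda r: r[0]):
        events = [(ts, page) for _, ts, page in rows]
        for session in sessionize(events):
            for i in range(len(session) - len(path) + 1):
                if tuple(session[i:i + len(path)]) == path:
                    hits += 1
                    break
    return hits

# Toy usage: how many sessions went home -> search -> mail?
log = [
    ("u1", 0, "home"), ("u1", 10, "search"), ("u1", 20, "mail"),
    ("u2", 0, "home"), ("u2", 600, "search"),   # same session (10 minutes later)
    ("u2", 20000, "mail"),                      # new session after the gap
]
print(count_paths(log, ["home", "search", "mail"]))  # -> 1
```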
Categories: Analytic technologies, Columnar database management, Data warehousing, MySQL, Petabyte-scale data management, PostgreSQL, Specific users, Theory and architecture, Yahoo | 13 Comments |
DATAllegro on compression
DATAllegro CEO Stuart Frost has been blogging quite a bit recently (and not before time!). A couple of his posts have touched on compression. In one he gave actual numbers for compression, namely:
DATAllegro compresses between 2:1 and 6:1 depending on the content of the rows, whereas column-oriented systems claim 4:1 to 10:1.
In another recent post, Stuart touched on architecture, saying:
Due to the way our compression code works, DATAllegro’s current products are optimized for performance under heavy concurrency. The end result is that we don’t use the full power of the platform when running one query at a time.
Categories: Analytic technologies, Data warehouse appliances, Data warehousing, Database compression, DATAllegro | Leave a Comment |
Data warehouse appliance power user TEOCO
If you had to name super-high-end users of data warehouse technology, your list might start with a few retailers, credit data processors, and telcos, plus the US intelligence establishment. Well, it turns out that TEOCO runs outsourced data warehouses for several of the top US telcos, making it one of the top data warehouse technology users around.
A few weeks ago, I had a fascinating chat with John Devolites of TEOCO. Highlights included:
- TEOCO runs a >200 TB DATAllegro warehouse for a major US telco. (When we hear about a big DATAllegro telco site that’s been in production for a while, that’s surely the one they’re talking about.)
- TEOCO runs around 450 TB total of DATAllegro databases across its various customers. (When Stuart Frost blogs of >400 TB “systems,” that may be what he’s talking about.)
- TEOCO likes DATAllegro better than Netezza, although the margin is now small. This is mainly for financial reasons, specifically price-per-terabyte. When TEOCO spends its own money without customer direction as to appliance brand, it buys DATAllegro.
- TEOCO runs at least one 50 TB Netezza system — originally due to an acquisition of a Netezza user — with more coming. There also is more DATAllegro coming.
- TEOCO feels 15-30 concurrent users is the current practical limit for both DATAllegro and Netezza. That’s greater than it used to be.
- Netezza is a little faster than DATAllegro on a few esoteric queries, but the difference is not important to TEOCO’s business.
- Official price lists notwithstanding, TEOCO sees prices as being in the $10K/TB range. DATAllegro’s price advantage has shrunk greatly, as others have come down to more or less match. However, since John stated his price preference for DATAllegro in the present tense, I presume the price match isn’t perfect.
- Teradata was never a serious consideration, for price reasons.
- In the original POC a few years ago, the incumbent Oracle — even after extensive engineering — couldn’t get an important query down under 8 hours of running time. DATAllegro and Netezza both handled it in 2-3 minutes. Similarly, Oracle couldn’t get the load time for 100 million call detail records (CDRs) below 24 hours.
- Applications sound pretty standard for telecom: Lots of CDR processing — 550 million/day on the big DATAllegro system cited above. Pricing and fraud checking. Some data staging for legal reasons (giving the NSA what it subpoenas and no more).
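Some quick arithmetic on those numbers (mine, not TEOCO’s): 550 million CDRs/day is a sustained rate of roughly 6,400 records per second, while loading 100 million CDRs in 24 hours works out to only about 1,160 records per second, the rate Oracle reportedly couldn’t reach.

\[
\frac{550{,}000{,}000\ \text{CDRs}}{86{,}400\ \text{s}} \approx 6{,}400\ \text{CDRs/s},
\qquad
\frac{100{,}000{,}000\ \text{CDRs}}{86{,}400\ \text{s}} \approx 1{,}160\ \text{CDRs/s}.
\]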
Categories: Analytic technologies, Data mart outsourcing, Data warehouse appliances, Data warehousing, DATAllegro, Netezza, Pricing, Specific users, Telecommunications, TEOCO | 7 Comments |
Netezza on compression
Phil Francisco put up a nice post on Netezza’s company blog about a month ago, explaining the Netezza compression story. Highlights include:
- Like other row-based vendors, Netezza compresses data on a column-by-column basis, then stores the results in rows. This is obviously something of a limitation — no run-length encoding for them — but can surely accommodate several major compression techniques.
- The Netezza “Compress Engine” compresses data on a block-by-block basis. Compression is one area where row-based systems are at a disadvantage vs. columnar ones, because columnar systems have more values per block to play with, and that yields higher degrees of compression. Among row-based systems, typical block size is a similar indicator of compression success: the bigger the block, the more values the compressor has to work with. Thus DATAllegro, with its very large blocks, probably does a little better at compression than Netezza, and Netezza does a lot better than small-block Teradata. (See the sketch after this list.)
- Netezza calls its compression “compilation.” The blog post doesn’t make the reason for that name clear, and the one explanation I can recall confuses me: Netezza once said the compression extends at least somewhat to columns with calculated values. But that seems odd, as Netezza has only a very limited materialized view capability.
- Netezza pays the processing cost of compression in the FPGA, not the microprocessor. And so Netezza spins the overhead of the Compress Engine as being zero or free. That’s actually not ridiculous, since Netezza seems to have still-unused real estate on the FPGA for new features like compression. Read more
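To make the block-by-block, column-by-column idea concrete, here is a minimal Python sketch (my illustration, not Netezza’s or DATAllegro’s actual code; zlib stands in for whatever codecs they really use). It compresses each column of a block of rows separately, and running it with different block sizes shows why bigger blocks tend to compress better: the compressor has more values per column to exploit.

```python
import zlib

def serialize_columns(rows):
    """Pivot a block of rows into one byte string per column."""
    return ["\n".join(str(v) for v in col).encode() for col in zip(*rows)]

def compression_ratio(rows, block_size):
    """Raw bytes / compressed bytes when the table is cut into fixed-size blocks
    and each column within a block is compressed separately."""
    raw = comp = 0
    for i in range(0, len(rows), block_size):
        for col_bytes in serialize_columns(rows[i:i + block_size]):
            raw += len(col_bytes)
            comp += len(zlib.compress(col_bytes))
    return raw / comp

# Toy table: (customer_id, state, call_minutes) -- repetitive, like real CDR data
rows = [(i % 1000, "CA" if i % 3 else "NY", i % 60) for i in range(100_000)]
print(round(compression_ratio(rows, 100), 1))     # small blocks: lower ratio
print(round(compression_ratio(rows, 10_000), 1))  # big blocks: higher ratio
```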
Categories: Analytic technologies, Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Netezza, Theory and architecture | 2 Comments |
Netezza has an EMC deal too
Netezza has an EMC deal too. As befits a hardware vendor, Netezza has an actual OEM relationship with EMC, in which it is offering CLARiiONs built straight into NPS appliances. 5 TB of CLARiiON will be free in any Netezza system from 2 racks on upward. (A rack holds about 12.5 TB.) In addition, you’ll be able to buy 10 TB more of CLARiiON in every Netezza rack, if you want. The whole thing is supposed to ship before year-end. Read more
Categories: Analytic technologies, Data warehouse appliances, Data warehousing, EMC, Netezza | 5 Comments |
Top-end data warehouse sizes have grown hundreds-fold over the past 12 years
I just tripped across a link from February 1996 in which NCR/Teradata:
- Bragged that it had half a dozen customers with >1 TB of raw user data
- Showed off a “record-breaking” 11 TB simulation
That represents roughly a 60-70% annual growth rate in top-end database sizes in the intervening 12 years.
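Checking that arithmetic: a growth factor of a few hundred over 12 years does indeed imply a compound annual rate in the 60-70% range, since

\[
300^{1/12} \approx 1.61
\qquad\text{and}\qquad
600^{1/12} \approx 1.70 .
\]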
Categories: Analytic technologies, Data warehousing, Teradata | 4 Comments |
Netezza, enterprise data warehouses, and the 100 terabyte mark
Phil Francisco of Netezza checked in tonight with some news that will be embargoed for a few hours. While I had him on the phone anyway, I asked him about large databases and/or enterprise data warehouses. Highlights included:
- Netezza has one customer with 200 TB of user data. The name is confidential (but he told me who it was).
- Netezza has sold 15 or so of its NPS 10-800s, which are rated at 100 TB capacity.
- The second-largest database in production on Netezza is probably 80 TB or so at Catalina Marketing, which has been a Netezza early adopter all along.
- Netezza’s biggest users typically have a handful (literally — off the top of his head, Phil said “4 to 6”) of applications, each with its own primary set of fact tables.
- Each application-specific set of fact tables in such big-honking-data-mart installations usually consists either of a single table, or else of a small set of tables sharing a common hash (distribution) key. (See the sketch after this list.)
- Phil insists Netezza isn’t exaggerating when it claims to have true enterprise data warehouse installations. What he means by an EDW is something that is an enterprise’s primary data warehouse, is used by lots of departments, draws data from lots of sources, has loads going on at various points during the day, and has 100s if not 1000s of total users.
- Netezza’s biggest EDW has about 30 TB of user data. Phil wouldn’t tell me the name of that customer.
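As an aside on the “common hash key” point: in MPP systems like Netezza’s, fact tables distributed on the same hash key have their matching rows placed on the same node, so joins on that key need no data redistribution. Here is a minimal Python sketch of that placement rule (illustrative only, not Netezza’s actual hashing; the node count and table layouts are made up):

```python
from zlib import crc32

NUM_NODES = 8  # assumed number of worker nodes (SPUs)

def node_for(key):
    """Pick a node by hashing the distribution key."""
    return crc32(str(key).encode()) % NUM_NODES

# Two fact tables distributed on the same key (say, customer_id):
calls = [{"customer_id": c, "minutes": c % 50} for c in range(1000)]
bills = [{"customer_id": c, "amount": c * 0.1} for c in range(1000)]

# Because both tables hash the same key the same way, rows that would join
# with each other always land on the same node: a co-located join.
assert all(node_for(x["customer_id"]) == node_for(y["customer_id"])
           for x, y in zip(calls, bills))
```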