July 6, 2009

Yahoo is up to 10 petabytes now?

According to somebody (I forget who) who attended Yahoo’s SIGMOD presentation last week, the big Yahoo database is now up to 10 petabytes in size, in line with Yahoo’s predictions last year. Apparently, Yahoo also gave more details of how the technology works.

Categories: Columnar database management, Data warehousing, Web analytics, Yahoo

5 Comments

July 2, 2009

User data vs. raw disk space as a marketing metric

I tried to post a comment on Daniel Abadi’s blog, but doing so seems to require some sort of registration process, so I’m posting here instead.

In a comment to his post on node scalability, Daniel Abadi argued that disk space is a better metric to use in marketing than (presumably compressed) user data. Well, I imagine he didn’t quite mean to say that, but that’s actually what he wound up saying, starting from the accurate observation that compression ratios vary wildly from one data set to another, even more than they vary from product to product on the same data.

Nonetheless, I favor user data as a metric because:

- That’s what users care about.
- That’s how a number of analytic DBMS vendors, including Vertica, actually price.
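
To make the disagreement concrete, here is a rough sketch of the arithmetic. The compression ratios and the 100 TB figure are made-up illustrations, not numbers from any vendor, and the calculation assumes nothing else (replication, indexes, temp space) consumes disk.

```python
def user_data_tb(raw_disk_tb, compression_ratio):
    """User data (TB) that fits on raw_disk_tb of disk.

    compression_ratio is user bytes divided by stored bytes, so 4.0 means
    the data shrinks to a quarter of its original size. Simplifying
    assumption: nothing else (replication, indexes, temp space) uses disk.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return raw_disk_tb * compression_ratio


# Hypothetical ratios, chosen only to show how much the answer can move.
HYPOTHETICAL_RATIOS = {
    "data set A": 1.5,
    "data set B": 4.0,
    "data set C": 10.0,
}

RAW_DISK_TB = 100.0  # hypothetical system size

capacities = {
    label: user_data_tb(RAW_DISK_TB, ratio)
    for label, ratio in HYPOTHETICAL_RATIOS.items()
}

for label, capacity in capacities.items():
    print(f"{label}: {RAW_DISK_TB:.0f} TB of disk holds about {capacity:.0f} TB of user data")
```

The same disk holds very different amounts of user data depending on the data set. That is why a raw disk figure is easy to state, and also why it doesn't tell a buyer what he actually wants to know.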

Categories: Data warehousing, Parallelization, Pricing

3 Comments

July 2, 2009

The TPC-H schema

Would anybody recommend in real life running the TPC-H schema for that data? (I.e., fully normalized, no materialized views.) If so — why????

Categories: Benchmarks and POCs, Data warehousing

13 Comments

July 2, 2009

Notes on columnar/TPC-H compression

I was chatting with Omer Trajman of Vertica, and he said that a 70% compression figure for ParAccel’s recent TPC-H filing sounded about right.* When I noted that seemed kind of low, Omer pointed out that TPC-H data is pseudo-random, while real-life data has much more correlation among the values in different columns. (E.g., in retail, a customer is likely to consistently shop at the same stores and to put similar items into his shopping basket.)

*Omer was involved in Vertica’s TPC-H-data-based load speed benchmark, and is Vertica’s representative to the TPC.

But why does this matter? After all, Vertica compresses one column at a time (unlike, say, Clearpace). Well, the reason is that Vertica — like other column stores — wants to store different columns in the same row order, for obvious benefits in both reading and writing. So, for example, if all the rows that include Gotham City are grouped sequentially, then all the rows mentioning Bruce Wayne are likely to be near each other as well, while none of the rows that mention Clark Kent will be mixed in.

And when a set of consecutive entries has low cardinality, it’s easier to get high levels of compression.
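
Here is a toy sketch of that effect. It is not Vertica’s actual storage format. It assumes a made-up retail table in which every customer shops at exactly one store, and it uses "distinct values per block of consecutive rows" as a stand-in for how compressible a column is. All names and sizes are hypothetical.

```python
import random

# Hypothetical sizes for a toy retail table.
NUM_STORES = 10
CUSTOMERS_PER_STORE = 100
PURCHASES_PER_CUSTOMER = 20
BLOCK_ROWS = 1000  # rows per storage block in this sketch


def build_rows():
    """Return (store, customer) rows; each customer shops at one store only."""
    rows = []
    for store in range(NUM_STORES):
        for offset in range(CUSTOMERS_PER_STORE):
            customer = store * CUSTOMERS_PER_STORE + offset
            rows.extend([(store, customer)] * PURCHASES_PER_CUSTOMER)
    return rows


def distinct_per_block(column, block_rows=BLOCK_ROWS):
    """Count the distinct values in each block of consecutive entries."""
    return [
        len(set(column[start:start + block_rows]))
        for start in range(0, len(column), block_rows)
    ]


rng = random.Random(0)
rows = build_rows()
rng.shuffle(rows)  # simulate rows arriving in no particular order

# Customer column in arrival order.
shuffled_customers = [customer for _store, customer in rows]

# Customer column when the whole table is stored in store order.
# Only the store column is sorted; the customer column just comes along.
rows_by_store = sorted(rows, key=lambda row: row[0])
sorted_customers = [customer for _store, customer in rows_by_store]

shuffled_counts = distinct_per_block(shuffled_customers)
sorted_counts = distinct_per_block(sorted_customers)

print("avg distinct customers per block, arrival order:",
      sum(shuffled_counts) / len(shuffled_counts))
print("avg distinct customers per block, store order:",
      sum(sorted_counts) / len(sorted_counts))
```

Nothing was done to the customer column itself, yet each block of it ends up with far fewer distinct values once the rows are in store order. In pseudo-random data, where the two columns are uncorrelated, sorting by store would not help the customer column at all.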

Categories: Benchmarks and POCs, Columnar database management, Data warehousing, Database compression, Vertica Systems

Storage humor

A Microsoft Answers message board got the question:

I’ve noticed that as I copy data/install programs on my Laptop, the weight of the Laptop increases. I have a bad back and am medically limited on the amount of weight I can carry so I need to be very carefull not to inflict injury upon myself.

I have also noticed my XBox feels heavier as well (the more games I save or purchase from arcade). I generally don’t travel with my XBox so that is not an issue for me, but note the I am having the same results.

My ask, what is the weight/file ratio? So for example, how many GB’s = 6oz? I dread the day I need a dolly to commute to work with my Laptop.

Hilarity ensued.

Categories: Fun stuff, Humor, Storage

6 Comments

July 1, 2009

NoSQL?

Eric Lai emailed today to ask what I thought about the NoSQL folks, and especially whether I thought their ideas were useful for enterprises in general, as opposed to just Web 2.0 companies. That was the first I heard of NoSQL, which seems to be a community discussing SQL alternatives popular among the cloud/big-web-company set, such as BigTable, Hadoop, Cassandra and so on. My short answers are:

- In most cases, no.
- Most of these technologies are designed for simple, high-volume OLTP (OnLine Transaction Processing). Most large enterprises have an established way of doing OLTP, probably via relational database management systems. Why change?
- MapReduce is an exception, in that it’s designed for analytics. MapReduce may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS.
- There’s one big countervailing factor to all these generalities: schema flexibility.

As for the longer form, let me start by noting that there are two main kinds of reason for not liking SQL.

Categories: Analytic technologies, Data models and architecture, Data warehousing, Database diversity, DBMS product categories, Facebook, Fox and MySpace, Hadoop, MapReduce, Michael Stonebraker, NoSQL, OLTP, Parallelization, Streaming and complex event processing (CEP), Theory and architecture, VoltDB and H-Store

30 Comments

July 1, 2009

Correction to a recent quote

I’m quoted in a recent article around Aster’s appliance announcement as saying data warehouse appliances are more suitable for small workgroups of analysts crunching small amounts of data than they are for other uses.

But that’s not what I think at all.

I do think the ease-of-administration pitch for appliances makes them particularly well suited for users who want to scrape by without doing much database administration. This is especially appealing to departments or smaller enterprises. And the first/best scenario that comes to mind is indeed a small team of analysts, with good SQL skills but lightweight DBA experience, although Netezza has proved that many other kinds of users can find appliances appealing as well.

But that small team of analysts may maintain the largest database in the firm.

And by the way — notwithstanding the MySpace counterexample, most of Aster’s initial customers had <10 terabyte databases, and I think indeed <5 terabyte. The “frontline” pitch succeeded for Aster before (MySpace again aside) any better-big-data-crunching story did.

Categories: Analytic technologies, Aster Data, Data warehouse appliances, Data warehousing, Theory and architecture

