Theory and architecture

Analysis of design choices in databases and database management systems.

April 20, 2009

Infobright update

For the past couple of quarters, Infobright has been MySQL’s partner of choice for larger data warehousing applications. Infobright’s stated business metrics, and I quote, include:

  • > 50 Customers in 7 Countries

  • > 25 Partners on 4 continents

  • A vibrant open source community

    • +1 million visitors

    • Approaching 10,000 downloads

    • 2,000 active community participants

These may be compared with analogous metrics Infobright offered in February.

Infobright has also made or promised a variety of technological enhancements. Ones that are either shipping now or promised soon include: Read more

April 15, 2009

Cloudera presents the MapReduce bull case

Monday was fire-drill day regarding MapReduce vs. MPP relational DBMS. The upshot was that I was quoted in Computerworld and paraphrased in GigaOm as being a little more negative on MapReduce than I really am, in line with my comment:

Frankly, my views on MapReduce are more balanced than [my] weary negativity would seem to imply.

Tuesday afternoon the dial turned a couple notches more positive yet, when I talked with Michael Olson and Jeff Hammerbacher of Cloudera. Cloudera is a new company, built around the open source MapReduce implementation Hadoop. So far Cloudera gives away its Hadoop distribution, without charging for any sort of maintenance or subscription, and just gets revenue from professional services. Presumably, Cloudera plans for this business model to change down the road.

Much of our discussion revolved around Facebook, where Jeff directed a huge and diverse Hadoop effort. Apparently, Hadoop played much of the role of an enterprise data warehouse at Facebook — at least for clickstream/network data — including:

Some Facebook data, however, was put into an Oracle RAC cluster for business intelligence. And Jeff does concede that query execution is slower in Hadoop than in a relational DBMS. Hadoop was also used to build the index for Facebook’s custom text search engine.

Jeff’s reasons for liking Hadoop over relational DBMS at Facebook included: Read more

March 20, 2009

Greenplum claims very fast load speeds, and Fox still throws away most of its MySpace data

Data warehouse load speeds are a contentious issue.  Vertica contrived a benchmark with a 5 1/2 terabyte/hour load rate.  Oracle has gotten dinged for very low load speeds, which then are hotly debated.  I was told recently of a Greenplum partner’s salesman steering a prospect who needed rapid load speeds away from Greenplum, which seemed odd to me.

Now Greenplum has come out swinging, claiming “consistent” load speeds of 4 terabytes/hour at its Fox Interactive Media account, and armed with a customer quote saying just that.  Note however that load speeds tend to be proportional to the number of disks, and there are a LOT of disks at that installation.

One way to think about load speeds is — how long would it take to load the entire database? It seems as if the Fox database could be loaded, perhaps not in one week, but certainly in less than two. Flipping that around, the Fox site only has enough capacity to hold less than 2 weeks of detailed data. (This is not uncommon in network event kinds of databases.) And a corollary of that is — worldwide storage sales are still constrained by cost, not by absolute limits on the amounts of data enterprises would like to store.
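The load-time reasoning above comes down to simple division, and a minimal sketch may make it concrete. The only figure taken from the post is Greenplum's claimed 4 terabytes/hour; the function names are hypothetical, and the sketch assumes the rate is sustained around the clock, which real installations may not achieve.

```python
HOURS_PER_DAY = 24

# From the post: Greenplum's claimed "consistent" load rate at Fox Interactive Media.
CLAIMED_RATE_TB_PER_HOUR = 4.0


def days_to_load(database_tb: float, rate_tb_per_hour: float) -> float:
    """Days needed to load a database of the given size at a sustained rate."""
    if rate_tb_per_hour <= 0:
        raise ValueError("load rate must be positive")
    return database_tb / rate_tb_per_hour / HOURS_PER_DAY


def tb_loaded(days: float, rate_tb_per_hour: float) -> float:
    """Terabytes loaded in the given number of days at a sustained rate."""
    return days * HOURS_PER_DAY * rate_tb_per_hour


if __name__ == "__main__":
    # How much data one and two weeks of nonstop loading would bring in.
    for days in (7, 14):
        loaded = tb_loaded(days, CLAIMED_RATE_TB_PER_HOUR)
        print(f"{days} days at {CLAIMED_RATE_TB_PER_HOUR} TB/hour: {loaded:.0f} TB")
```

At the claimed rate, a week of nonstop loading is 672 TB and two weeks is 1,344 TB. These are raw ingest volumes only; compression, retention policy, and actual sustained rates would all change how they map onto stored database size.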

March 5, 2009

Fox Interactive Media’s multi-hundred terabyte database running on Greenplum

Greenplum’s largest named account is Fox Interactive Media — the parent organization of MySpace — which has a multi-hundred terabyte database that it uses for hardcore data mining/analytics. Greenplum has been engaging in regrettable business practices, claiming that it is in the process of supplanting Aster Data at Fox/MySpace. In fact, MySpace’s use of Aster is more mission-critical than Fox’s use of Greenplum, and is increasing significantly.

Still, as Greenplum’s gushing customer video with Fox Interactive Media* illustrates, the Fox/Greenplum database is impressive on its own merits. Read more

March 5, 2009

MySpace’s multi-hundred terabyte database running on Aster Data

Aster Data has put up a blog post embedding and summarizing a video about its MySpace account. Basic metrics include:

The combined Aster deployment now has 200+ commodity hardware servers working together to manage 200+ TB of data that is growing at 2-3TB per day by collecting 7-10B events that happen on one of the world.

I’m pretty sure that’s counting correctly (i.e., user data).* Read more

February 23, 2009

The questionable benefits of terabyte-scale data warehouse virtualization

Vertica is virtualizing via VMware, and has suggested a few operational benefits to doing so that might or might not offset VMware’s computational overhead. But on the whole, it seems virtualization’s major benefits don’t apply to large-database MPP data warehousing. Read more

February 4, 2009

Draft slides on how to select an analytic DBMS

I need to finalize an already-too-long slide deck on how to select an analytic DBMS by late Thursday night.  Anybody see something I’m overlooking, or just plain got wrong?

Edit: The slides have now been finalized.

February 2, 2009

One vendor’s trash is another’s treasure

A few months ago, CEO Mayank Bawa of Aster Data commented to me on his surprise at how “profound” the relationship was between design choices in one aspect of a data warehouse DBMS and choices in other parts. The word choice in that was all Mayank, but the underlying thought is one I’ve long shared, and that I’m certain architects of many analytic DBMS share as well.

For that matter, the observation is no doubt true in many other product categories as well. But in the analytic database management arena, where there are literally 10-20+ competitors with different, non-stupid approaches, it seems most particularly valid. Here are some examples of what I mean. Read more

January 28, 2009

More Oracle notes

When I went to Oracle in October, the main purpose of the visit was to discuss Exadata. And so my initial post based on the visit was focused accordingly. But there were a number of other interesting points I’ve never gotten around to writing up. Let me now remedy that, at least in part. Read more

January 26, 2009

New England Database Day this Friday January 30

Dan Weinreb, to whose opinions I usually give great weight, spoke very favorably of last year’s New England Database Day conference. Well, this year’s is taking place on Friday. It’s at MIT and it’s free, with easy registration. A list of papers is here.

It’s pretty obvious who’s running the show. Sam Madden’s name is given as a contact; elsewhere it’s referred to as being organized by Madden and Mike Stonebraker.  Of the six identified papers, 2-3 look like the subjects or people could be taken straight from Vertica’s Database Column blog.  But that hardly means the event will be one long Vertica commercial.  For example, the other papers include one from Netezza and one on Flash memory data access methods.

I really doubt I’ll make it to Cambridge in time for the 9:00 am opening remarks ;), but I’ll try to swing by later on.
