Theory and architecture
Analysis of design choices in databases and database management systems. Related subjects include:
- Any subcategory
- Database diversity
- Explicit support for specific data types
- (in Text Technologies) Text search
Infobright update
For the past couple of quarters, Infobright has been MySQL’s partner of choice for larger data warehousing applications. Infobright’s stated business metrics, and I quote, include:
- >50 Customers in 7 Countries
- >25 Partners on 4 continents
- A vibrant open source community
- +1 million visitors
- Approaching 10,000 downloads
- 2,000 active community participants
These may be compared with analogous metrics Infobright offered in February.
Infobright has also made or promised a variety of technological enhancements. Ones that are either shipping now or promised soon include: Read more
Categories: Columnar database management, Data warehousing, Infobright, MySQL, Open source | 6 Comments |
Cloudera presents the MapReduce bull case
Monday was fire-drill day regarding MapReduce vs. MPP relational DBMS. The upshot was that I was quoted in Computerworld and paraphrased in GigaOm as being a little more negative on MapReduce than I really am, in line with my comment:
Frankly, my views on MapReduce are more balanced than [my] weary negativity would seem to imply.
Tuesday afternoon the dial turned a couple notches more positive yet, when I talked with Michael Olson and Jeff Hammerbacher of Cloudera. Cloudera is a new company, built around the open source MapReduce implementation Hadoop. So far Cloudera gives away its Hadoop distribution, without charging for any sort of maintenance or subscription, and just gets revenue from professional services. Presumably, Cloudera plans for this business model to change down the road.
Much of our discussion revolved around Facebook, where Jeff directed a huge and diverse Hadoop effort. Apparently, Hadoop played much of the role of an enterprise data warehouse at Facebook — at least for clickstream/network data — including:
- 2 1/2 petabytes of data managed via Hadoop
- 10 terabytes/day of data ingested via Hadoop (Edit: Some of these metrics have been updated in a subsequent post about Facebook.)
- Ad targeting queries run every 15 minutes in Hadoop
- Dashboard roll-up queries run every hour in Hadoop
- Ad-hoc research/analytic Hadoop queries run whenever
- Anti-fraud analysis done in Hadoop
- Text mining (e.g., of things written on people’s “walls”) done in Hadoop
- 100s or 1000s of simultaneous Hadoop queries
- JSON-based social network analysis in Hadoop
Some Facebook data, however, was put into an Oracle RAC cluster for business intelligence. And Jeff does concede that query execution is slower in Hadoop than in a relational DBMS. Hadoop was also used to build the index for Facebook’s custom text search engine.
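For readers who haven't looked inside a Hadoop workflow, here is a minimal sketch of what one of those periodic roll-up jobs could look like, written for Hadoop Streaming in Python. The log format, field layout, and file paths are my own illustrative assumptions, not anything Facebook-specific:

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming roll-up: count clickstream events per page.
# Assumes tab-delimited log lines of the form: timestamp <tab> user_id <tab> page,
# with the input path already restricted to one hour's worth of logs.
# Invocation sketch (paths and file names are illustrative):
#   hadoop jar hadoop-streaming.jar \
#     -input /logs/2009-04-14-09 -output /rollups/2009-04-14-09 \
#     -mapper "rollup.py map" -reducer "rollup.py reduce" -file rollup.py
import sys

def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                      # skip malformed records
        page = fields[2]
        print("%s\t1" % page)             # emit (page, 1)

def reducer():
    current_page, count = None, 0
    for line in sys.stdin:                # Hadoop delivers mapper output sorted by key
        page, n = line.rstrip("\n").split("\t")
        if page != current_page:
            if current_page is not None:
                print("%s\t%d" % (current_page, count))
            current_page, count = page, 0
        count += int(n)
    if current_page is not None:
        print("%s\t%d" % (current_page, count))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

You can test the same script locally with `cat logs.tsv | python rollup.py map | sort | python rollup.py reduce`. The point of the framework is that, on a cluster, many copies of the mapper and reducer run in parallel, which is where the scaling comes from.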
Jeff’s reasons for liking Hadoop over relational DBMS at Facebook included: Read more
Greenplum claims very fast load speeds, and Fox still throws away most of its MySpace data
Data warehouse load speeds are a contentious issue. Vertica contrived a benchmark with a 5 1/2 terabyte/hour load rate. Oracle has gotten dinged for very low load speeds, which then are hotly debated. I was told recently of a Greenplum partner’s salesman steering a prospect who needed rapid load speeds away from Greenplum, which seemed odd to me.
Now Greenplum has come out swinging, claiming “consistent” load speeds of 4 terabytes/hour at its Fox Interactive Media account, and armed with a customer quote saying just that. Note however that load speeds tend to be proportional to the number of disks, and there are a LOT of disks at that installation.
One way to think about load speeds is — how long would it take to load the entire database? It seems as if the Fox database could be loaded, perhaps not in one week, but certainly in less than two. Flipping that around, the Fox site only has enough capacity to hold less than 2 weeks of detailed data. (This is not uncommon in network event kinds of databases.) And a corollary of that is — worldwide storage sales are still constrained by cost, not by absolute limits on the amounts of data enterprises would like to store.
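To make that arithmetic explicit, here is a back-of-the-envelope calculation. The 300 TB figure is just an illustrative stand-in for a "multi-hundred terabyte" database, not a number Fox or Greenplum has confirmed:

```python
# Back-of-the-envelope reload-time arithmetic; the database size is assumed.
load_rate_tb_per_hour = 4.0     # Greenplum's claimed "consistent" load rate at Fox
database_size_tb = 300.0        # illustrative assumption for "multi-hundred terabytes"

reload_hours = database_size_tb / load_rate_tb_per_hour
print("Full reload: %.0f hours (about %.1f days)" % (reload_hours, reload_hours / 24))
# -> 75 hours, about 3.1 days at the claimed rate; even with generous slack for
#    real-world slowdowns, a full reload stays well under two weeks.
```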
Categories: Data warehousing, EAI, EII, ETL, ELT, ETLT, Fox and MySpace, Greenplum, Theory and architecture, Web analytics | 3 Comments |
Fox Interactive Media’s multi-hundred terabyte database running on Greenplum
Greenplum’s largest named account is Fox Interactive Media — the parent organization of MySpace — which has a multi-hundred terabyte database that it uses for hardcore data mining/analytics. Greenplum has been engaging in regrettable business practices, claiming that it is in the process of supplanting Aster Data at Fox/MySpace. In fact, MySpace’s use of Aster is more mission-critical than Fox’s use of Greenplum, and is increasing significantly.
Still, as Greenplum’s gushing customer video with Fox Interactive Media* illustrates, the Fox/Greenplum database is impressive on its own merits. Read more
Categories: Analytic technologies, Aster Data, Data warehousing, Fox and MySpace, Greenplum, Specific users, Theory and architecture, Web analytics | 3 Comments |
MySpace’s multi-hundred terabyte database running on Aster Data
Aster Data has put up a blog post embedding and summarizing a video about its MySpace account. Basic metrics include:
The combined Aster deployment now has 200+ commodity hardware servers working together to manage 200+ TB of data that is growing at 2-3TB per day by collecting the 7-10B events that happen each day on one of the world's largest web sites.
I’m pretty sure that’s counting correctly (i.e., user data).* Read more
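As a quick sanity check on those figures, dividing the daily growth by the daily event count gives an implied raw size per event. This is just arithmetic on the numbers quoted above:

```python
# Implied raw size per event, from the quoted MySpace/Aster figures.
daily_growth_tb = (2.0, 3.0)          # "growing at 2-3 TB per day"
daily_events_billion = (7.0, 10.0)    # "7-10B events" per day

low = daily_growth_tb[0] * 1e12 / (daily_events_billion[1] * 1e9)
high = daily_growth_tb[1] * 1e12 / (daily_events_billion[0] * 1e9)
print("Implied size per event: roughly %d-%d bytes" % (low, high))
# -> roughly 200-428 bytes per event, a plausible width for a row of clickstream detail.
```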
Categories: Analytic technologies, Application areas, Aster Data, Data warehousing, Fox and MySpace, Specific users, Theory and architecture, Web analytics | 11 Comments |
The questionable benefits of terabyte-scale data warehouse virtualization
Vertica is virtualizing via VMware, and has suggested a few operational benefits to doing so that might or might not offset VMware’s computational overhead. But on the whole, it seems virtualization’s major benefits don’t apply to large-database MPP data warehousing. Read more
Categories: Columnar database management, Data warehousing, Database compression, Theory and architecture, Vertica Systems | 2 Comments |
Draft slides on how to select an analytic DBMS
I need to finalize an already-too-long slide deck on how to select an analytic DBMS by late Thursday night. Anybody see something I’m overlooking, or just plain got wrong?
Edit: The slides have now been finalized.
One vendor’s trash is another’s treasure
A few months ago, CEO Mayank Bawa of Aster Data commented to me on his surprise at how “profound” the relationship was between design choices in one aspect of a data warehouse DBMS and choices in other parts. The word choice in that was all Mayank, but the underlying thought is one I’ve long shared, and that I’m certain architects of many analytic DBMS share as well.
For that matter, the observation is no doubt true in many other product categories as well. But in the analytic database management arena, where there are literally 10-20+ competitors with different, non-stupid approaches, it seems most particularly valid. Here are some examples of what I mean. Read more
Categories: Aster Data, Data warehousing, Exadata, Kognitio, Oracle, Theory and architecture, Vertica Systems | 22 Comments |
More Oracle notes
When I went to Oracle in October, the main purpose of the visit was to discuss Exadata. And so my initial post based on the visit was focused accordingly. But there were a number of other interesting points I’ve never gotten around to writing up. Let me now remedy that, at least in part. Read more
New England Database Day this Friday January 30
Dan Weinreb, to whose opinions I usually give great weight, spoke very favorably of last year’s New England Database Day conference. Well, this year’s is taking place on Friday. It’s at MIT and it’s free, with easy registration. A list of papers is here.
It’s pretty obvious who’s running the show. Sam Madden’s name is given as a contact; elsewhere the event is described as being organized by Madden and Mike Stonebraker. Of the six identified papers, 2-3 have subjects or authors who could have been taken straight from Vertica’s Database Column blog. But that hardly means the event will be one long Vertica commercial. For example, the other papers include one from Netezza and one on Flash memory data access methods.
I really doubt I’ll make it to Cambridge in time for the 9:00 am opening remarks ;), but I’ll try to swing by later on.
Categories: Michael Stonebraker, Theory and architecture, Vertica Systems | 3 Comments |