Application areas

Posts focusing on the use of database and analytic technologies in specific application domains. Related subjects include:

June 14, 2009

MMO games are still screwed up in their database technology

Two years ago I wrote about the database technology of Guild Wars. Not coincidentally, Guild Wars was the MMO RPG (Massively Multiplayer Online Role-Playing Game) I then played. I had the chance to interview Guild Wars’ lead developers. While much else they had to say was impressive, Guild Wars’ database architecture was — er, it was rather mind-boggling.

Since then, Linda and I have taken to playing Lord of the Rings Online, commonly known as LOTRO, developed by Turbine, Inc. I haven’t had the chance to interview any Turbine folks, despite repeated requests. But from afar, it would seem that Turbine’s technology choices leave quite a bit to be desired, in enterprise-like IT areas such as authentication, database management, and storage.

LOTRO and other Turbine games are commonly down, for scheduled maintenance or in some cases otherwise. There is scheduled multi-hour downtime to start many weeks. There are fairly frequent server restarts in addition to that. Lag and congestion are frequent. And so on and so forth. By way of contrast, Guild Wars very rarely goes down, and other technical difficulties are less common as well. Reliability is a key design goal for Guild Wars’ developers, and in my opinion they’ve achieved it.

Some of the reasons for Turbine’s difficulties seem related to the stresses of MMOs — e.g., they’re probably due to the problems caused by having many fictional characters moving through the same fictional space at once, with graphical detail much richer than Guild Wars’. But a couple of head-scratchers make me really wonder about how Turbine manages data.

Read more

June 10, 2009

Netezza Q1 earning call transcript

I finally read the Netezza Q1 earnings call transcript, put out by Seeking Alpha.  Highlights included:

One tip for the Netezza folks, by the way, from this former stock analyst — you should never use the word “certainly” about a deal you haven’t closed yet. “Almost surely” could be OK, but “certainly” — well, it certainly was not the thing to say.

June 8, 2009

Greenplum blogs about some customers

I’ve written some about Greenplum’s customers at eBay and Fox Interactive Media.  But as I recently grumped, I’m not in the mood right now to write much about other Greenplum customers.  Fortunately, Greenplum has filled the gap itself.  Marketing chief Paul Salazar just blogged about a number of other big Greenplum customers. And last month Paul blogged in considerable detail about what he characterizes as an enterprise data warehouse (EDW) conversion — Oracle replacement — at a large pharmaceutical company.

June 8, 2009

More on Fox Interactive Media’s use of Greenplum

Greenplum’s most important reference is probably its energetic advocate Fox Interactive Media, even ahead of much larger Greenplum user eBay, and notwithstanding Aster Data’s large presence in Fox subsidiary MySpace. I just ran across a “review” of Greenplum by FIM’s Brian Dolan, neatly summarizing his views about Greenplum’s strengths, weaknesses, and uses inside Fox.  Highlights include: Read more

May 29, 2009

Sneakernet to the cloud

Recently, Amazon CTO Werner Vogels put up a blog post which suggested that, now and in the future, the best way to get large databases into the cloud is via sneakernet.  In some circumstances, he is surely right. Possible implications include:

But for one-time moves of data sets — sure, sneaker net/snail mail should work just fine.

May 18, 2009

Followup on IBM System S/InfoSphere Streams

After posting about IBM’s System S/InfoSphere Streams CEP offering, I sent three followup questions over to Jeff Jones.  It seems simplest to just post the Q&A verbatim.

1.  Just how many processors or cores does it take to get those 5 million messages/sec through? A little birdie says 4,000 cores. Read more

May 13, 2009

Microsoft announced CEP this week too

Microsoft still hasn’t worked out all the kinks regarding when and how intensely to brief me. So most of what I know about their announcement earlier this week of a CEP/stream processing product* is what I garnered on a consulting call in March. That said, I sent Microsoft my notes from that call, they responded quickly and clearly to my question as to what remained under NDA, and for good measure they included a couple of clarifying comments that I’ll copy below.

*“in the SQL Server 2008 R2 timeframe,” about which Microsoft wrote “the first Community Technology Preview (CTP) of SQL Server 2008 R2 will be available for download in the second half of 2009 and the release is on track to ship in the first half of calendar year 2010.”

Perhaps it is more than coincidence that IBM rushed out its own announcement of an immature CEP technology — due to be more mature in a 2010 release — immediately after Microsoft revealed its plans. Anyhow, taken together, these announcements support my theory that the small independent CEP/stream processing vendors are more or less ceding broad parts of the potential stream processing market.

The main use cases Microsoft talks about for CEP are in the area of sensor data.

Read more

May 13, 2009

IBM System S Streams, aka InfoSphere Streams, aka stream processing, aka “please don’t call it CEP”

IBM has hastily announced System S Streams, a product that was supposed to be called InfoSphere Streams and introduced only in 2010. Apparently, the rush is because senior management wanted to talk about it later this week, and perhaps also because it was implicitly baked into some of IBM’s advertising already. Scrambling ensued. Even so, Jeff Jones and team got to me fast, and briefed me — fairly non-technically, unfortunately, but otherwise how I like it, namely on a harmless embargo and without any NDAs. That’s more than can be said for my clients at Microsoft, who also introduced CEP this week, but I digress …

*Indeed, as I draft this post-Celtics-game, the embargo is already expired.

Marketing aside, IBM System S/InfoSphere Streams is indeed a CEP/stream processing engine + language (with an Eclipse-based development environment). Apparently, IBM thinks InfoSphere Streams (if that’s what it winds up being renamed to) is or will be differentiated from other CEP packages in:

Read more

May 11, 2009

Facebook, Hadoop, and Hive

A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.

Updating the metrics in my Cloudera post,

Nothing else in my Cloudera post was called out as being wrong.

In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single cluster, due to be increased to 1000 soon. Facebook thinks this is the second-largest* Hadoop installation, or else close to it. What’s more, Facebook believes it is unusual in spreading all its apps across a single huge cluster, rather than doing different kinds of work on different, smaller sub-clusters.

Read more

May 4, 2009

37 Ways To Get More From Analytics, Version 2.0

As I hoped, there were some very helpful responses to my post listing ways to improve analytic effectiveness. Here’s a second draft incorporating them. Comments continue to be very welcome. I need to finalize this soon.

Read more

April 30, 2009

eBay’s two enormous data warehouses

A few weeks ago, I had the chance to visit eBay, meet briefly with Oliver Ratzesberger and his team, and then catch up later with Oliver for dinner. I’ve already alluded to those discussions in a couple of posts, specifically on MapReduce (which eBay doesn’t like) and the astonishingly great difference between high- and low-end disk drives (to which eBay clued me in). Now I’m finally getting around to writing about the core of what we discussed, which is two of the very largest data warehouses in the world.

Metrics on eBay’s main Teradata data warehouse include:

Metrics on eBay’s Greenplum data warehouse (or, if you like, data mart) include:

Read more

April 24, 2009

Some DB2 highlights

I chatted with IBM Thursday, about recent and imminent releases of DB2 (9.5 through 9.7). Highlights included:

April 16, 2009

Introduction to Tokutek

Tokutek has a paradoxical pitch: Tokutek writes data particularly quickly, and therefore you’re supposed to buy Tokutek for query-oriented uses. Highlights of the Tokutek story include:

Tokutek’s initial target market is the usual combination of clickstream/personalization/other network management. The idea is that many data warehouse technologies have trouble getting latency below, say, 15 seconds to 5 minutes, at least at very high update volumes. So if immediacy is more important than raw complex query performance, Tokutek’s performance profile could be attractive.

Read more

April 15, 2009

Cloudera presents the MapReduce bull case

Monday was fire-drill day regarding MapReduce vs. MPP relational DBMS. The upshot was that I was quoted in Computerworld and paraphrased in GigaOm as being a little more negative on MapReduce than I really am, in line with my comment

Frankly, my views on MapReduce are more balanced than [my] weary negativity would seem to imply.

Tuesday afternoon the dial turned a couple notches more positive yet, when I talked with Michael Olson and Jeff Hammerbacher of Cloudera. Cloudera is a new company, built around the open source MapReduce implementation Hadoop. So far Cloudera gives away its Hadoop distribution, without charging for any sort of maintenance or subscription, and just gets revenue from professional services. Presumably, Cloudera plans for this business model to change down the road.

Much of our discussion revolved around Facebook, where Jeff directed a huge and diverse Hadoop effort. Apparently, Hadoop played much of the role of an enterprise data warehouse at Facebook — at least for clickstream/network data — including:

Some Facebook data, however, was put into an Oracle RAC cluster for business intelligence. And Jeff does concede that query execution is slower in Hadoop than in a relational DBMS. Hadoop was also used to build the index for Facebook’s custom text search engine.

Jeff’s reasons for liking Hadoop over relational DBMS at Facebook included:

Read more

April 1, 2009

Business intelligence notes and trends

I keep not finding the time to write as much about business intelligence as I’d like to. So I’m going to do one omnibus post here covering a lot of companies and trends, then circle back in more detail when I can. Top-level highlights include:

A little more detail

Read more

March 31, 2009

Twitter is considering using MapReduce

From a Twitter job listing (formatting mine).  The most interesting section is “Additional preferred experience.” Read more

March 25, 2009

Aleri update

My skeptical remarks on the Aleri/Coral8 merger generated some pushback. Today I actually got around to talking with John Morell, who was marketing chief at Coral8 and has remained with the combined company. First, some quick metrics:

John is sticking by the company line that there will be an integrated Aleri/Coral8 engine in around 12 months, with all the performance optimization of Aleri and flexibility of Coral8, that compiles and runs code from any of the development tools either Aleri or Coral8 now has. While this is a lot faster than, say, the Informix/Illustra or Oracle/IRI Express integrations, John insists that integrating CEP engines is a lot easier. We’ll see.

I focused most of the conversation on Aleri’s forthcoming efforts outside the financial services market. John sees these as being focused around Coral8’s old “Continuous (Business) Intelligence” message, enhanced by Aleri’s Live OLAP. Aleri Live OLAP is an in-memory OLAP engine, real-time/event-driven, fed by CEP. Queries can be submitted via ODBO/MDX today. XMLA is coming. John reports that quite a few Coral8 customers are interested in Live OLAP, and positions the capability as one Coral8 would have had to develop had the company remained independent.

Read more

March 25, 2009

Kickfire update

I talked recently with my clients at Kickfire, especially newish CEO Bruce Armstrong. I also visited the Kickfire blog, which among other virtues features a fairly clear overview of Kickfire technology. (I did my own Kickfire overview in October.) Highlights of the current Kickfire story include:

March 20, 2009

Greenplum claims very fast load speeds, and Fox still throws away most of its MySpace data

Data warehouse load speeds are a contentious issue.  Vertica contrived a benchmark with a 5 1/2 terabyte/hour load rate.  Oracle has gotten dinged for very low load speeds, which then are hotly debated.  I was told recently of a Greenplum partner’s salesman steering a prospect who needed rapid load speeds away from Greenplum, which seemed odd to me.

Now Greenplum has come out swinging, claiming “consistent” load speeds of 4 terabytes/hour at its Fox Interactive Media account, and armed with a customer quote saying just that.  Note however that load speeds tend to be proportional to the number of disks, and there are a LOT of disks at that installation.

One way to think about load speeds is — how long would it take to load the entire database? It seems as if the Fox database could be loaded, perhaps not in one week, but certainly in less than two. Flipping that around, the Fox site only has enough capacity to hold less than 2 weeks of detailed data. (This is not uncommon in network event kinds of databases.) And a corollary of that is — worldwide storage sales are still constrained by cost, not by absolute limits on the amounts of data enterprises would like to store.
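The "how long would it take to load the entire database?" question is simple arithmetic, sketched roughly below. Only the 4 terabytes/hour rate comes from the Greenplum claim above. The function names, the assumption that loading runs around the clock, and the 400 TB sample size are hypothetical and chosen purely for illustration.

```python
# Back-of-envelope load-time arithmetic.
# Only the 4 TB/hour rate comes from the post; the function names, the
# 24-hours-a-day default, and the sample database size are hypothetical.

CLAIMED_RATE_TB_PER_HOUR = 4.0  # Greenplum's claimed rate at Fox Interactive Media


def hours_to_load(database_tb: float, rate_tb_per_hour: float) -> float:
    """Hours needed to load a database of the given size at a steady rate."""
    if rate_tb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return database_tb / rate_tb_per_hour


def tb_loaded(rate_tb_per_hour: float, days: float, hours_per_day: float = 24.0) -> float:
    """Terabytes loaded over `days` days, loading `hours_per_day` hours each day."""
    return rate_tb_per_hour * hours_per_day * days


if __name__ == "__main__":
    # Volume moved in one and two weeks of nonstop loading at the claimed rate.
    for days in (7, 14):
        print(days, "days:", tb_loaded(CLAIMED_RATE_TB_PER_HOUR, days), "TB")
    # Hypothetical 400 TB database, for illustration only.
    print(hours_to_load(400.0, CLAIMED_RATE_TB_PER_HOUR), "hours")
```

If loading only runs for part of each day, lower `hours_per_day` and the volume shrinks proportionally.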

March 7, 2009

Three Greenplum customers’ applications of MapReduce

Greenplum (and Truviso) advisor Joseph Hellerstein offers a few examples of MapReduce applications (specifically Greenplum MapReduce), namely:

The big aha moment occured for me during our panel discussion, which included Luke Lonergan from Greenplum, Roger Magoulas from O’Reilly, and Brian Dolan from Fox Interactive Media (which runs MySpace among other web properties).

Roger talked about using MapReduce to extract structured entities from text for doing tech trend analyses from billions of rows of online job postings.  Brian (who is a mathematician by training) was talking about implementing conjugate gradiant and Support Vector Machines in parallel SQL to support “hypertargeting” for advertisers.  I mentioned how Jonathan Goldman at LinkedIn was using SQL and MapReduce to do graph algorithms for social network analysis.

Incidentally: While it’s been some months since I asked, my sense is that the O’Reilly text extraction is home-grown, and primitive compared to what one could do via commercial products. That said, if the specific application is examining job postings, I’m not sure how much value more sophisticated products would add. After all, tech job listings are generally written in a style explicitly designed to ensure that most or all of their meaning is conveyed simply by a bag of keywords. And by the way, this effort has been underway for quite some time.

March 5, 2009

Fox Interactive Media’s multi-hundred terabyte database running on Greenplum

Greenplum’s largest named account is Fox Interactive Media — the parent organization of MySpace — which has a multi-hundred terabyte database that it uses for hardcore data mining/analytics. Greenplum has been engaging in regrettable business practices, claiming that it is in the process of supplanting Aster Data at Fox/MySpace. In fact, MySpace’s use of Aster is more mission-critical than Fox’s use of Greenplum, and is increasing significantly.

Still, as Greenplum’s gushing customer video with Fox Interactive Media* illustrates, the Fox/Greenplum database is impressive on its own merits.

Read more

March 5, 2009

MySpace’s multi-hundred terabyte database running on Aster Data

Aster Data has put up a blog post embedding and summarizing a video about its MySpace account. Basic metrics include:

The combined Aster deployment now has 200+ commodity hardware servers working together to manage 200+ TB of data that is growing at 2-3TB per day by collecting 7-10B events that happen on one of the world.

I’m pretty sure that’s counting correctly (i.e., user data).*
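One rough sanity check on the quoted figures is the implied data volume per event. The sketch below uses only the 2-3TB per day and 7-10B events numbers from the quote. It assumes decimal terabytes and that all of the daily growth is event data, neither of which the quote states, and the function name is hypothetical.

```python
# Implied bytes per event from the quoted Aster/MySpace figures.
# Assumptions (not in the source): decimal terabytes, and all daily
# growth attributed to the collected events.

BYTES_PER_TB = 10**12  # decimal terabyte, an assumption


def bytes_per_event(tb_per_day: float, events_per_day: float) -> float:
    """Average bytes stored per event, given daily growth and daily event count."""
    if events_per_day <= 0:
        raise ValueError("events_per_day must be positive")
    return tb_per_day * BYTES_PER_TB / events_per_day


# Smallest growth spread over the most events, and the reverse.
low = bytes_per_event(2, 10e9)
high = bytes_per_event(3, 7e9)

if __name__ == "__main__":
    print(f"roughly {low:.0f} to {high:.0f} bytes per event")
```

Under those assumptions, the quoted ranges work out to a few hundred bytes per event.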

Read more

February 26, 2009

Data warehousing business trends

I’ve talked with a whole lot of vendors recently, some here at TDWI, as well as users, fellow analysts, and so on. Repeated themes include:

Read more

February 26, 2009

HP and Neoview update

I had lunch with some HP folks at TDWI. Highlights (burgers and jokes aside) included:

Given the emphasis on trying to exploit HP’s other expertise in the data warehousing business, I suggested it was a pity that HP spun off Agilent (HP’s instrumentation division, aka HP Classic). Nobody much disagreed.

February 23, 2009

MapReduce user eHarmony chose Netezza over Aster or Greenplum

Depending on which IDG reporter you believe, eHarmony has either 4 TB of data or more than 12 TB, stored in Oracle but now analyzed on Netezza.  Interestingly, eHarmony is a Hadoop/MapReduce shop, but chose Netezza over Aster Data or Greenplum even so.  Price was apparently an important aspect of the purchase decision. Netezza also seems to have had a very smooth POC. Read more
