September 23, 2011

Some notes on Hadoop (mainly) and appliances

1. EMC Greenplum has evolved its appliance product line. As I read that, the latest announcement boils down to saying that you can neatly network together various Greenplum appliances in quarter-rack increments. If you take a quarter rack each of four different things, then Greenplum says “Hooray! Our appliance is all-in-one!” Big whoop.

2. That said, the Hadoop part of EMC ‘s story is based on MapR, which so far as I can tell is actually a pretty good Hadoop implementation. More precisely, MapR makes strong claims about performance and so on, and Apache Hadoop folks don’t reply “MapR is full of &#$!” Rather, they say “We’re going to close the gap with MapR a lot faster than the MapR folks like to think — and by the way, guys, thanks for the butt-kick.” A lot more precision about MapR may be found in this M. C. Srivas SlideShare.

3. On its latest earnings call, Oracle clearly said it would introduce a Hadoop appliance, versus just hinting at a Hadoop appliance the prior quarter. The money quote was:  Read more

September 22, 2011

DataStax pivots back to its original strategy

The DataStax and Cassandra stories are somewhat confusing. Unfortunately, DataStax chose to clarify them in what has turned out to be a crazy news week. I’m going to use this post just to report on the status of the DataStax product line, without going into any analysis beyond that.

Read more

September 19, 2011

Are there any remaining reasons to put new OLTP applications on disk?

Once again, I’m working with an OLTP SaaS vendor client on the architecture for their next-generation system. Parameters include:

So I’m leaning to saying:   Read more

September 5, 2011

Data management at Zynga and LinkedIn

Mike Driscoll and his Metamarkets colleagues organized a bit of a bash Thursday night. Among the many folks I chatted with were Ken Rudin of Zynga, Sam Shah of LinkedIn, and D. J. Patil, late of LinkedIn. I now know more about analytic data management at Zynga and LinkedIn, plus some bonus stuff on LinkedIn’s People You May Know application. :)

It’s blindingly obvious that Zynga is one of Vertica’s petabyte-scale customers, given that Zynga sends 5 TB/day of data into Vertica, and keeps that data for about a year. (Zynga may retain even more data going forward; in particular, Zynga regrets ever having thrown out the first month of data for any game it’s tried to launch.) This is game actions, for the most part, rather than log files; true logs generally go into Splunk.

I don’t know whether the missing data is completely thrown away, or just stashed on inaccessible tapes somewhere.

I found two aspects of the Zynga story particularly interesting. First, those 5 TB/day are going straight into Vertica (from, I presume, memcached/Membase/Couchbase), as Zynga decided that sending the data to some kind of log first was more trouble than it’s worth. Second, there’s Zynga’s approach to analytic database design. Highlights of that include: Read more

August 21, 2011

Hadoop evolution

I wanted to learn more about Hadoop and its futures, so I talked Friday with Arun Murthy of Hortonworks.* Most of what we talked about was:

Arun previously addressed these issues and more in a June slide deck.
Read more

July 27, 2011

MongoDB users and use cases

I spoke with Eliot Horowitz and Max Schierson of 10gen last month about MongoDB users and use cases. The biggest clusters they came up with weren’t much over 100 nodes, but clusters an order of magnitude bigger were under development. The 100 node one we talked the most about had 33 replica sets, each with about 100 gigabytes of data, so that’s in the 3-4 terabyte range total. In general, the largest MongoDB databases are 20-30 TB; I’d guess those really do use the bulk of available disk space.   Read more

July 18, 2011

HBase is not broken

It turns out that my impression that HBase is broken was unfounded, in at least two ways. The smaller is that something wrong with the HBase/Hadoop interface or Hadoop’s HBase support cannot necessarily be said to be wrong with HBase (especially since HBase is no longer a Hadoop subproject). The bigger reason is that, according to consensus, HBase has worked pretty well since the .90 release in January of this year.

After Michael Stack of StumbleUpon beat me up for a while,* Omer Trajman of Cloudera was kind enough to walk me through HBase usage. He is informed largely by 18 Cloudera customers using, plus a handful of other well-known HBase users such as Facebook, StumbleUpon, and Yahoo. Of the 18 Cloudera customers using HBase that Omer was thinking of, 15 are in HBase production, one is in HBase “early production”, one is still doing R&D in the area of HBase, and one is a classified government customer not providing such details. Read more

July 15, 2011

Soundbites: the Facebook/MySQL/NoSQL/VoltDB/Stonebraker flap, continued

As a follow-up to the latest Stonebraker kerfuffle, Derrick Harris asked me a bunch of smart followup questions. My responses and afterthoughts include:

Continuing with that discussion of DBMS alternatives:

And while we’re at it — going schema-free often makes a whole lot of sense. I need to write much more about the point, but for now let’s just say that I look favorably on the Big Four schema-free/NoSQL options of MongoDB, Couchbase, HBase, and Cassandra.

July 10, 2011

Cloudera and Hortonworks

My clients at Cloudera have been around for a while, in effect positioned as “the Hadoop company.” Their business, in a nutshell, consists of:

Hortonworks spun out of Yahoo last week, with parts of the Cloudera business model, namely Hadoop support, training, and I guess conferences. Hortonworks emphatically rules out professional services, and says that it will contribute all code back to Apache Hadoop. Hortonworks does grudgingly admit that it might get into the proprietary software business at some point — but evidently hopes that day will never actually come.

Read more

July 6, 2011

Petabyte-scale Hadoop clusters (dozens of them)

I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.

Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo’s latest stated figures are:

Read more

← Previous PageNext Page →

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.