October 10, 2010

EMC/Greenplum notes

I dropped by the former Greenplum for my quarterly consulting visit (scheduled for the first week of Q4 for a couple of reasons, one of them XLDB4). Much of what we discussed was purely advisory and/or confidential — duh! — but there were real, nonconfidential takeaways in two areas.

First, feelings about the EMC acquisition are still very positive.

I do still think two particular Greenplum long-timers won’t be EMC long-timers as well,* but both were in meetings with me and seemed fully engaged.

*and I’m not referring to Luke Lonergan’s driving habits or new Lamborghini.

I also got an update from Greenplum analytics chief Steven Hillion. Steven’s team seems focused more on customer-specific consulting than on productization, but a fair amount of parallel analytic technology has made it into the Greenplum DBMS even so. In particular:

I asked Steven to shoot me an email listing the analytic capabilities Greenplum currently has built in. In two tries, he didn’t exactly do that. Below, however, please find a copy of the somewhat relevant email that he did send in its place.

Curt:

I agree with the point you made yesterday, that even the most sophisticated users of data warehouse and analytics technologies are often unaware of the extent to which in-database modeling is already available. Greenplum is certainly culpable of not talking enough about all the analytics capabilities that we’ve built. Much of it is easy to use, and hugely scalable. In fact, a decent number of our customers have been using in-database modeling for some time now. But I think it needs to be more widely adopted.

As you know, I work closely with Greenplum’s customers to maximize the value they get from their data, using advanced analytics techniques to deliver insight into business problems. And, simply put, the work I do would be impossible without our in-db modeling functions, without MapReduce, without PL/R and PL/Java, and so on. For me, Greenplum is, first and foremost, a comprehensive analytics platform, and we’ve used it to do some pretty cool stuff – product recommendation engines, real-time fraud detection, customer segmentation using text analytics, churn analysis incorporating social network effects, and so on.

To give you one concrete example, we used one of our regression functions to build campaign optimization models that run daily on tens of millions of observations and thousands of features. And all it takes is a single line of SQL:

Table regr_example:

   y   |  x1  |  x2
-------+------+------
 19.01 |  4.3 | -5.6
   4.7 |    2 |  1.5
 -3.92 | -1.7 |  5.5

SELECT mregr_coef (y, array[1,x1,x2]) FROM regr_example;

                      mregr_coef
-------------------------------------------------------
 {7.1462507322788,0.232103104862361,-1.94030462800233}

It really is that simple. And what’s more, this function is fully parallelized, so it scales far better than traditional modeling tools. That’s not to say that traditional modeling tools won’t continue to be important. But the message I want to get across is that we regularly build models on massive data sets, directly in the database, to solve real-world problems for our customers.

Steven
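
For readers who want to see what a multiple-regression aggregate of this kind plausibly computes, here is a minimal Python sketch of ordinary least squares via the normal equations, run on the three rows from Steven's example. It is an assumption on my part that mregr_coef fits the same model, and this is not Greenplum's implementation. The helper names (partial_sums, merge, solve) are hypothetical. The sketch also hints at why such a function can parallelize: the sums X^T X and X^T y can be accumulated separately over chunks of rows and then added together before a single small solve.

```python
# Illustrative sketch only: ordinary least squares via the normal equations.
# NOT Greenplum's mregr_coef implementation; all function names are hypothetical.

def partial_sums(rows):
    """Accumulate X^T X and X^T y over one chunk of (y, x) rows."""
    k = len(rows[0][1])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for y, x in rows:
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty

def merge(a, b):
    """Add two partial results element-wise (the 'combine' step)."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty

def solve(xtx, xty):
    """Solve (X^T X) b = X^T y by Gaussian elimination with partial pivoting."""
    k = len(xty)
    m = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    b = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        s = sum(m[i][j] * b[j] for j in range(i + 1, k))
        b[i] = (m[i][k] - s) / m[i][i]
    return b

# The three rows of regr_example; the leading 1.0 is the intercept term,
# mirroring array[1,x1,x2] in the SQL call.
rows = [
    (19.01, [1.0, 4.3, -5.6]),
    (4.7, [1.0, 2.0, 1.5]),
    (-3.92, [1.0, -1.7, 5.5]),
]

# Pretend the rows are spread across two segments, then combine.
coef = solve(*merge(partial_sums(rows[:2]), partial_sums(rows[2:])))
print(coef)
```

With three observations and three parameters the fit is exact, so the printed coefficients should agree with the ones in the email up to floating-point rounding. How the rows are split between the two chunks should not change the result beyond rounding, which is the property a parallel implementation relies on.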

Comments

4 Responses to “EMC/Greenplum notes”

  1. Notes on the EMC Greenplum Data Computing Appliance | DBMS 2 : DataBase Management System Services on October 13th, 2010 10:13 am

    [...] big confidential part of my visit last week to EMC’s Data Computing Division, nee’ Greenplum, was of course this week’s announcement of the first EMC/Greenplum “Data Computing [...]

  2. EMC/Greenplum notes | DBMS 2 : DataBase Management System Services | IT Information Technology on November 28th, 2010 3:28 pm

    [...] data sets directly into the database, to solve real world problems for our customers. Steven Database Management System Notes – Google Blog Search by Chris [...]

  3. Analytic computing systems, aka analytic platforms | DBMS 2 : DataBase Management System Services on January 28th, 2011 3:41 am

    [...] fully parallel processes? For example, I like Netezza’s broad approach to linear algebra, Greenplum’s sparse vector manipulation, and a number of Aster Data’s packages. ParAccel’s list [...]

  4. How Revolution Analytics parallelizes R | DBMS 2 : DataBase Management System Services on December 20th, 2013 7:35 am

    [...] can parallelize the linear algebra that underlies so many algorithms. Netezza and Greenplum tried this, but I don’t think it worked out very well in either case. Lee cited a saying in [...]
