October 10, 2010

EMC/Greenplum notes

I dropped by the former Greenplum for my quarterly consulting visit (scheduled for the first week of Q4 for a couple of reasons, one of them XLDB4). Much of what we discussed was purely advisory and/or confidential — duh! — but there were real, nonconfidential takeaways in two areas.

First, feelings about the EMC acquisition are still very positive.

- Hiring has been rapid, on track to roughly quadruple Greenplum’s size over a year-and-a-half period. These don’t seem to be EMC imports, but rather outside hires, although EMC folks are surely helping in the recruiting.
- The former Greenplum is clearly going to pursue more product possibilities than it would have on its own. This augurs well for Greenplum customers.
- Griping about big-company bureaucracy is minimal.
- I didn’t hear one word about any unwelcome product/business strategy constraints. On the other hand …
- … the next Greenplum product announcement you’ll hear about will be one designed to appeal to the EMC customer base, i.e., to the enterprises that EMC is generally successful in selling to.

I do still think two particular Greenplum long-timers won’t be EMC long-timers as well,* but both were in meetings with me and seemed fully engaged.

*and I’m not referring to Luke Lonergan’s driving habits or new Lamborghini.

I also got an update from Greenplum analytics chief Steven Hillion. Steven’s team seems focused more on customer-specific consulting than on productization, but a fair amount of parallel analytic technology has made it into the Greenplum DBMS even so. In particular:

- Greenplum has had integrated MapReduce for quite a while.
- Greenplum has a selection of built-in analytic functions. According to my notes, these include but are not limited to:
  - Random sampling techniques
  - Various data mining functions
  - Linear regression and naïve Bayes (I’m not sure why these were mentioned separately from general data mining)
- Just to be clear, that means that Greenplum has built-in in-database modeling. I believe it has had this for over a year.
- Reminiscent of Netezza’s nzMatrix, Greenplum has built in some linear algebra capabilities as a building block for analytics. In Greenplum’s case that’s mainly the ability to do rapid dot products of (and hence distance computations between) sparse vectors. Steven is very proud of that.
  - I forget whether it was at Greenplum or a competitor, but somewhere last week I talked about doing a market basket analysis that amounted to clustering vectors in a vector space whose dimension was equal to the number of SKUs a retailer offered.
- However, Greenplum doesn’t have any particular analytic execution framework (other than SQL or MapReduce) analogous to Aster Data’s.
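
To see why sparse dot products lead directly to distance computations, here is a minimal Python sketch. It is not Greenplum's implementation; the dictionary representation, the function names `dot` and `distance`, and the SKU labels are all made up for illustration. The only point is that a distance between two sparse vectors can be assembled from three dot products, each of which touches only the nonzero entries, however large the nominal dimension (e.g. the number of SKUs) may be.

```python
import math

# Hypothetical representation: a sparse vector is a dict mapping a
# dimension label (here, a made-up SKU id) to its nonzero value.

def dot(a, b):
    """Dot product of two sparse vectors; iterates over the smaller one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[key] for key, value in a.items() if key in b)

def distance(a, b):
    """Euclidean distance via |a - b|^2 = a.a + b.b - 2 a.b."""
    squared = dot(a, a) + dot(b, b) - 2 * dot(a, b)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(squared, 0.0))

# Three toy market baskets in a space with one dimension per SKU.
basket_1 = {"sku_0017": 2, "sku_4410": 1}
basket_2 = {"sku_0017": 1, "sku_9003": 3}
basket_3 = {"sku_7777": 1}

overlap = dot(basket_1, basket_2)        # only sku_0017 is shared
d_12 = distance(basket_1, basket_2)
d_13 = distance(basket_1, basket_3)      # no shared SKUs at all
print(overlap, d_12, d_13)
```

A clustering routine over market baskets would call something like `distance` many times, which is why a fast sparse dot product is a useful building block.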

I asked Steven to shoot me an email listing the analytic capabilities Greenplum currently has built in. In two tries, he didn’t exactly do that. Below, however, please find a copy of the somewhat relevant email that he did send in its place.

Curt:

I agree with the point you made yesterday, that even the most sophisticated users of data warehouse and analytics technologies are often unaware of the extent to which in-database modeling is already available. Greenplum is certainly culpable of not talking enough about all the analytics capabilities that we’ve built. Much of it is easy to use, and hugely scalable. In fact, a decent number of our customers have been using in-database modeling for some time now. But I think it needs to be more widely adopted.

As you know, I work closely with Greenplum’s customers to maximize the value they get from their data, using advanced analytics techniques to deliver insight into business problems. And, simply put, the work I do would be impossible without our in-db modeling functions, without MapReduce, without PL/R and PL/Java, and so on. For me, Greenplum is, first and foremost, a comprehensive analytics platform, and we’ve used it to do some pretty cool stuff – product recommendation engines, real-time fraud detection, customer segmentation using text analytics, churn analysis incorporating social network effects, and so on.

To give you one concrete example, we used one of our regression functions to build campaign optimization models that run daily on tens of millions of observations and thousands of features. And all it takes is a single line of SQL:

Table regr_example:

   y   |  x1  |  x2
-------+------+------
 19.01 |  4.3 | -5.6
   4.7 |    2 |  1.5
 -3.92 | -1.7 |  5.5

SELECT mregr_coef(y, array[1,x1,x2]) FROM regr_example;

                      mregr_coef
-------------------------------------------------------
 {7.1462507322788,0.232103104862361,-1.94030462800233}

It really is that simple. And what’s more, this function is fully parallelized, so it scales far better than traditional modeling tools. That’s not to say that traditional modeling tools won’t continue to be important. But the message I want to get across is that we regularly build models on massive data sets, directly in the database, to solve real-world problems for our customers.

Steven
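
For readers who want to check the arithmetic in Steven's example, the following Python sketch recomputes the coefficients from the three rows of regr_example. It also illustrates one common way a regression like this can be parallelized: each partition of the data accumulates its own sums for the normal equations, and the partial sums are simply added together before solving. That scheme is an assumption made here for illustration, not a description of how mregr_coef is actually implemented, and the helper names `accumulate`, `merge`, and `solve` are hypothetical.

```python
def accumulate(rows):
    """Build the normal-equation sums (X'X and X'y) for one partition."""
    k = len(rows[0][1])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for y, x in rows:
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty

def merge(part_a, part_b):
    """Combine two partitions' sums; addition is all that is needed."""
    (xtx_a, xty_a), (xtx_b, xty_b) = part_a, part_b
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(xtx_a, xtx_b)]
    xty = [p + q for p, q in zip(xty_a, xty_b)]
    return xtx, xty

def solve(xtx, xty):
    """Solve (X'X) b = X'y by Gaussian elimination with partial pivoting."""
    n = len(xty)
    m = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - known) / m[r][r]
    return coef

# Rows of regr_example as (y, [1, x1, x2]); the leading 1 is the intercept.
rows = [
    (19.01, [1, 4.3, -5.6]),
    (4.7, [1, 2, 1.5]),
    (-3.92, [1, -1.7, 5.5]),
]

# Pretend the table is split across two segments (an illustrative assumption).
segment_a = accumulate(rows[:2])
segment_b = accumulate(rows[2:])
coef = solve(*merge(segment_a, segment_b))
print(coef)
```

With three rows and three coefficients the fit is exact, so the result should agree with the email's output up to floating-point rounding. Because `merge` only adds fixed-size sums, the per-partition work in `accumulate` is the part that would scale out across segments.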

Categories: Data warehousing, EMC, Greenplum, MapReduce, Parallelization, Predictive modeling and advanced analytics


Comments

4 Responses to “EMC/Greenplum notes”

Notes on the EMC Greenplum Data Computing Appliance | DBMS 2 : DataBase Management System Services on October 13th, 2010 10:13 am

[…] big confidential part of my visit last week to EMC’s Data Computing Division, nee’ Greenplum, was of course this week’s announcement of the first EMC/Greenplum “Data Computing […]
EMC/Greenplum notes | DBMS 2 : DataBase Management System Services | IT Information Technology on November 28th, 2010 3:28 pm

[…] data sets directly into the database, to solve real world problems for our customers. Steven Database Management System Notes – Google Blog Search by Chris […]
Analytic computing systems, aka analytic platforms | DBMS 2 : DataBase Management System Services on January 28th, 2011 3:41 am

[…] fully parallel processes? For example, I like Netezza’s broad approach to linear algebra, Greenplum’s sparse vector manipulation, and a number of Aster Data’s packages. ParAccel’s list […]
How Revolution Analytics parallelizes R | DBMS 2 : DataBase Management System Services on December 20th, 2013 7:35 am

[…] can parallelize the linear algebra that underlies so many algorithms. Netezza and Greenplum tried this, but I don’t think it worked out very well in either case. Lee cited a saying in […]
