EMC/Greenplum notes
I dropped by the former Greenplum for my quarterly consulting visit (scheduled for the first week of Q4 for a couple of reasons, one of them XLDB4). Much of what we discussed was purely advisory and/or confidential — duh! — but there were real, nonconfidential takeaways in two areas.
First, feelings about the EMC acquisition are still very positive.
- Hiring has been rapid, on track to roughly quadruple Greenplum’s size over a 1 1/2-year period. These don’t seem to be EMC imports, but rather outside hires, although EMC folks are surely helping in the recruiting.
- The former Greenplum is clearly going to pursue more product possibilities than it would have on its own. This augurs well for Greenplum customers.
- Griping about big-company bureaucracy is minimal.
- I didn’t hear one word about any unwelcome product/business strategy constraints. On the other hand …
- … the next Greenplum product announcement you’ll hear about will be one designed to be appealing to the EMC customer base — i.e., to enterprises that EMC is generally successful in selling to.
I do still think two particular Greenplum long-timers won’t be EMC long-timers as well,* but both were in meetings with me and seemed fully engaged.
*and I’m not referring to Luke Lonergan’s driving habits or new Lamborghini.
I also got an update from Greenplum analytics chief Steven Hillion. Steven’s team seems focused more on customer-specific consulting than on productization, but a fair amount of parallel analytic technology has made it into the Greenplum DBMS even so. In particular:
- Greenplum has had integrated MapReduce for quite a while.
- Greenplum has a selection of built-in analytic functions. According to my notes, these include but are not limited to:
  - Random sampling techniques
  - Various data mining functions
  - Linear regression and naïve Bayes (I’m not sure why these were mentioned separately from general data mining)
- Just to be clear, that means that Greenplum has built-in in-database modeling. I believe it has had this for over a year.
- Reminiscent of Netezza’s nzMatrix, Greenplum has built in some linear algebra capabilities as a building block for analytics. In Greenplum’s case that’s mainly the ability to do rapid dot products of (and hence distance computations between) sparse vectors — Steven is very proud of that.
- I forget whether it was at Greenplum or a competitor, but somewhere last week I talked about doing a market basket analysis that amounted to clustering vectors in a vector space whose dimension was equal to the number of SKUs a retailer offered.
- However, Greenplum doesn’t have any particular analytic execution framework (other than SQL or MapReduce) analogous to Aster Data’s.
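To make the sparse-vector point concrete, here is a minimal Python sketch of a sparse dot product and the distance computation that follows from it. It is not Greenplum's implementation: the dict-based representation, the function names, and the two sample baskets are all assumptions made for illustration. The idea is only that when vectors have one dimension per SKU but each basket touches a handful of SKUs, the work scales with the non-zero entries rather than with the full dimension.

```python
import math


def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the smaller dict so the cost tracks the sparser vector.
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v[idx] for idx, val in u.items() if idx in v)


def sparse_distance(u, v):
    """Euclidean distance via ||u - v||^2 = u.u - 2*u.v + v.v."""
    squared = sparse_dot(u, u) - 2 * sparse_dot(u, v) + sparse_dot(v, v)
    # Clamp tiny negative values that can arise from floating-point rounding.
    return math.sqrt(max(squared, 0.0))


# Hypothetical baskets: keys are SKU numbers, values are quantities bought.
basket_a = {101: 2, 205: 1, 9001: 1}
basket_b = {101: 1, 9001: 3, 77: 4}

print(sparse_dot(basket_a, basket_b))
print(sparse_distance(basket_a, basket_b))
```

A clustering routine for market basket analysis would call something like `sparse_distance` many times over, which is why a fast, parallel version of this one primitive is a useful building block.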
I asked Steven to shoot me an email listing the analytic capabilities Greenplum currently has built in. In two tries, he didn’t exactly do that. Below, however, please find a copy of the somewhat relevant email that he did send in its place.
Curt:
I agree with the point you made yesterday, that even the most sophisticated users of data warehouse and analytics technologies are often unaware of the extent to which in-database modeling is already available. Greenplum is certainly culpable of not talking enough about all the analytics capabilities that we’ve built. Much of it is easy to use, and hugely scalable. In fact, a decent number of our customers have been using in-database modeling for some time now. But I think it needs to be more widely adopted.
As you know, I work closely with Greenplum’s customers to maximize the value they get from their data, using advanced analytics techniques to deliver insight into business problems. And, simply put, the work I do would be impossible without our in-db modeling functions, without MapReduce, without PL/R and PL/Java, and so on. For me, Greenplum is, first and foremost, a comprehensive analytics platform, and we’ve used it to do some pretty cool stuff – product recommendation engines, real-time fraud detection, customer segmentation using text analytics, churn analysis incorporating social network effects, and so on.
To give you one concrete example, we used one of our regression functions to build campaign optimization models that run daily on tens of millions of observations and thousands of features. And all it takes is a single line of SQL:
Table regr_example:

   y   |  x1  |  x2
-------+------+------
 19.01 |  4.3 | -5.6
   4.7 |    2 |  1.5
 -3.92 | -1.7 |  5.5

SELECT mregr_coef (y, array[1,x1,x2]) FROM regr_example;

mregr_coef
-------------------------------------------------------
{7.1462507322788,0.232103104862361,-1.94030462800233}

It really is that simple. And what’s more, this function is fully parallelized, so it scales far better than traditional modeling tools. That’s not to say that traditional modeling tools won’t continue to be important. But the message I want to get across is that we regularly build models on massive data sets, directly in the database, to solve real-world problems for our customers.
Steven
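As a rough cross-check of the numbers in Steven's example, the following Python sketch fits the same three rows by ordinary least squares, solving the normal equations with Gaussian elimination. This is an assumption about what a multiple-regression coefficient function computes, not a description of how mregr_coef is implemented or parallelized inside Greenplum, and the helper names are made up for illustration. The leading 1 in each row plays the same role as the 1 in array[1,x1,x2]: it supplies the intercept term.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augment the matrix so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def regression_coefficients(ys, rows):
    """Least-squares coefficients from the normal equations (X^T X) b = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# The three rows of regr_example; the leading 1 is the intercept column.
ys = [19.01, 4.7, -3.92]
rows = [[1, 4.3, -5.6], [1, 2, 1.5], [1, -1.7, 5.5]]

coef = regression_coefficients(ys, rows)
print(coef)
```

With three observations and three coefficients the fit is exact, so the printed values should agree with the output in the email up to floating-point rounding. The interesting part of the Greenplum function is not this arithmetic but running it in parallel over tens of millions of rows.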