Greenplum

Analysis of data warehouse DBMS vendor Greenplum and its successor, EMC’s Data Computing division. Related subjects include:

November 19, 2013

How Revolution Analytics parallelizes R

I talked tonight with Lee Edlefsen, Chief Scientist of Revolution Analytics, and now think I understand Revolution’s parallel R much better than I did before.

There are four primary ways that people try to parallelize predictive modeling:

One confusing aspect of this discussion is that it could reference several heavily-overlapping but not identical categories of algorithms, including:

  1. External memory algorithms, which operate on datasets too big to fit in main memory, by — for starters — reading in and working on a part of the data at a time. Lee observes that these are almost always parallelizable.
  2. What Revolution markets as External Memory Algorithms, which are those external memory algorithms it has gotten around to implementing so far. These are all parallelized. They are also all in the category of …
  3. … algorithms that can be parallelized by:
    • Operating on data in parts.
    • Getting intermediate results.
    • Combining them in some way for a final result.
  4. Algorithms of the previous category in which the combining step specifically takes the form of summation, such as those discussed in the famous paper Map-Reduce for Machine Learning on Multicore. Not all of Revolution’s current parallel algorithms fall into this group.

To be clear, all Revolution’s parallel algorithms are in Category #2 by definition and Category #3 in practice. However, they aren’t all in Category #4.
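To make Categories #3 and #4 concrete, here is a toy sketch in Python (my own illustration, not Revolution’s code, which is of course R-based) of fitting linear regression by accumulating the sufficient statistics X'X and X'y one chunk at a time and combining them by summation:

```python
import numpy as np

def fit_ols_in_chunks(chunks):
    """Fit ordinary least squares by streaming over data chunks.

    Each chunk contributes only its partial sums X'X and X'y, so chunks
    can be processed independently (on disk blocks, cores, or cluster
    nodes) and combined at the end by simple summation.
    """
    xtx = None  # running sum of X'X
    xty = None  # running sum of X'y
    for X, y in chunks:
        if xtx is None:
            xtx = np.zeros((X.shape[1], X.shape[1]))
            xty = np.zeros(X.shape[1])
        xtx += X.T @ X          # intermediate result from this chunk
        xty += X.T @ y
    return np.linalg.solve(xtx, xty)   # combine: one final solve

# Toy usage: split one dataset into three "external memory" chunks.
rng = np.random.default_rng(0)
X = rng.normal(size=(9000, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=9000)
chunks = [(X[i:i + 3000], y[i:i + 3000]) for i in range(0, 9000, 3000)]
print(fit_ols_in_chunks(chunks))   # approximately beta_true
```

Because each chunk contributes only a couple of small matrices, the same loop works whether the chunks live on disk, on separate cores, or on separate cluster nodes, which is precisely what makes such algorithms both external-memory-friendly and parallelizable.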

Read more

November 10, 2013

RDBMS and their bundle-mates

Relational DBMS used to be fairly straightforward product suites, which boiled down to:

Now, however, most RDBMS are sold as part of something bigger.

Read more

September 8, 2013

Layering of database technology & DBMS with multiple DMLs

Two subjects in one post, because they were too hard to separate from each other

Any sufficiently complex software is developed in modules and subsystems. DBMS are no exception; the core trinity of parser, optimizer/planner, and execution engine merely starts the discussion. But increasingly, database technology is layered in a more fundamental way as well, to the extent that different parts of what would seem to be an integrated DBMS can sometimes be developed by separate vendors.
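To make the layering idea concrete before getting to examples, here is a deliberately minimal Python sketch (mine, not modeled on any particular vendor’s interface) of an execution layer written against an abstract storage layer, so that the two pieces could in principle come from different vendors:

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List

class StorageEngine(ABC):
    """The contract a query layer relies on; implementations could come
    from a different vendor (local files, HDFS, a key-value store...)."""

    @abstractmethod
    def scan(self, table: str) -> Iterable[Dict]:
        """Yield every row of a table as a dict."""

class InMemoryEngine(StorageEngine):
    """One possible lower layer: rows held in a Python dict."""
    def __init__(self, tables: Dict[str, List[Dict]]):
        self._tables = tables
    def scan(self, table):
        return iter(self._tables[table])

def select_where(engine: StorageEngine, table: str, predicate) -> List[Dict]:
    """A trivial 'execution engine' that only knows the StorageEngine API."""
    return [row for row in engine.scan(table) if predicate(row)]

# Usage: the query layer is unchanged if a different engine is plugged in.
engine = InMemoryEngine({"t": [{"id": 1, "x": 10}, {"id": 2, "x": 30}]})
print(select_where(engine, "t", lambda r: r["x"] > 20))
```

The point of the sketch is the narrow interface: as long as the contract between layers stays small and stable, the pieces above and below it can evolve, or be sold, separately.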

Major examples of this trend — where by “major” I mean “spanning a lot of different vendors or projects” — include:

Other examples on my mind include:

And there are several others I hope to blog about soon, e.g. current-day PostgreSQL.

In an overlapping trend, DBMS increasingly have multiple data manipulation APIs. Examples include:  Read more
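Since the post’s examples are truncated here, a generic toy illustration of what multiple DMLs over one store can look like (a sketch of the idea, not any specific product’s API) is a single SQLite table reachable both through ordinary SQL and through a key-value get/put interface:

```python
import sqlite3

class KVFacade:
    """Key-value style DML layered over the same table that SQL queries use."""
    def __init__(self, conn):
        self.conn = conn
        conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
    def put(self, key, value):
        self.conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))
    def get(self, key):
        row = self.conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None

conn = sqlite3.connect(":memory:")
store = KVFacade(conn)
store.put("greenplum", "analytic RDBMS")          # key-value style DML
print(store.get("greenplum"))
# ...while the same data remains reachable via ordinary SQL:
print(conn.execute("SELECT count(*) FROM kv").fetchall())
```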

August 6, 2013

Hortonworks, Hadoop, Stinger and Hive

I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger — but at my request we cycled through a bunch of other topics as well. Company-specific notes include:

Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:

*By the way — Teradata seems serious about pushing the UDA (Unified Data Architecture) as a core message.

Ecosystem notes, in Hortonworks’ perception, included:

I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.

Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:

Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and Cloudera started supporting them more than a year before Hortonworks will offer Hadoop 2.0.

Hortonworks’ approach to doing SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include:  Read more

June 6, 2013

Dave DeWitt responds to Daniel Abadi

A few days ago I posted Daniel Abadi’s thoughts in a discussion of Hadapt, Microsoft PDW (Parallel Data Warehouse)/PolyBase, Pivotal/Greenplum Hawq, and other SQL-Hadoop combinations. This is Dave DeWitt’s response. Emphasis mine.

Read more

June 2, 2013

SQL-Hadoop architectures compared

The genesis of this post is:

I love my life.

Per Daniel (emphasis mine): Read more

February 27, 2013

Hadoop distributions

Elephants! Elephants!
One elephant went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.

Elephants! Elephants!
Two elephants went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.

Elephants! Elephants!
Three elephants went out to play
Etc.

–  Popular children’s song

It’s Strata week, with much Hadoop news, some of which I’ve been briefed on and some of which I haven’t. Rather than delve into fine competitive details, let’s step back and consider some generalities. First, about Hadoop distributions and distro providers:

Most of the same observations could apply to Hadoop appliance vendors.

Read more

February 25, 2013

Greenplum HAWQ

My former friends at Greenplum no longer talk to me, so in particular I wasn’t briefed on Pivotal HD and Greenplum HAWQ. Pivotal HD seems to be yet another Hadoop distribution, with the idea that you use Greenplum’s management tools. Greenplum HAWQ seems to be Greenplum tied to HDFS.

The basic idea seems to be much like what I mentioned a few days ago — the low-level file store for Greenplum can now be something else one has heard of before, namely HDFS (Hadoop Distributed File System, which is also an option for, say, NuoDB). Beyond that, two interesting quotes in a Greenplum blog post are:

When a query starts up, the data is loaded out of HDFS and into the HAWQ execution engine.

and

In addition, it has native support for HBase, supporting HBase predicate pushdown, hive[sic] connectivity, and offering a ton of intelligent features to retrieve HBase data.

The first sounds like the invisible loading that Daniel Abadi wrote about last September on Hadapt’s blog. (Edit: Actually, see Daniel’s comment below.) The second sounds like a good idea that, again, would also be a natural direction for vendors such as Hadapt.
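For readers unfamiliar with the term, invisible loading in general means that data starts out in flat files and migrates into the database’s own storage as a side effect of the queries that touch it. Here is a toy Python sketch of that idea (my own illustration, not a description of HAWQ’s or Hadapt’s internals; the file layout and column names are made up):

```python
import csv, sqlite3, tempfile, os

# A small flat file standing in for data that lives in HDFS.
tmp = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="")
csv.writer(tmp).writerows([["id", "amount"], ["a", "10"], ["b", "32"]])
tmp.close()

conn = sqlite3.connect(":memory:")
loaded = set()   # remembers which files have already been pulled in

def query(sql, path):
    """Answer a query; on first touch, copy the file into DB storage as a
    side effect, so later queries run against the internal format."""
    if path not in loaded:
        with open(path, newline="") as f:
            rows = [(r["id"], float(r["amount"])) for r in csv.DictReader(f)]
        conn.execute("CREATE TABLE IF NOT EXISTS sales (id TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        loaded.add(path)             # the "load" happened invisibly
    return conn.execute(sql).fetchall()

print(query("SELECT sum(amount) FROM sales", tmp.name))  # triggers the load
print(query("SELECT count(*) FROM sales", tmp.name))     # already loaded
os.unlink(tmp.name)
```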

February 5, 2013

Comments on Gartner’s 2012 Magic Quadrant for Data Warehouse Database Management Systems — evaluations

To my taste, the most glaring mis-rankings in the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management Systems are that it is too positive on Kognitio and too negative on Infobright. Secondarily, it is too negative on HP Vertica, and too positive on ParAccel and Actian/VectorWise. So let’s consider those vendors first.

Gartner seems confused about Kognitio’s products and history alike.

Gartner is correct, however, to note that Kognitio doesn’t sell much stuff overall.

* non-existent

In the cases of HP Vertica, Infobright, ParAccel, and Actian/VectorWise, the 2012 Gartner Magic Quadrant for Data Warehouse Database Management Systems’ facts are fairly accurate, but I dispute Gartner’s evaluation. When it comes to Vertica: Read more

September 27, 2012

Hoping for true columnar storage in Oracle12c

I was asked to clarify one of my July comments on Oracle12c,

I wonder whether Oracle will finally introduce a true columnar storage option, a year behind Teradata. That would be the obvious enhancement on the data warehousing side, if they can pull it off. If they can’t, it’s a damning commentary on the core Oracle codebase.

by somebody smart who, however, seemed to have half-forgotten my post comparing (hybrid) columnar compression to (hybrid) columnar storage.

In simplest terms:
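The post’s point-by-point comparison is truncated here, but the general distinction can be illustrated generically (my own sketch, unrelated to Oracle’s or Teradata’s actual on-disk formats): with true columnar storage each column is laid out contiguously, so a scan that needs one column touches only that column’s bytes, rather than whole rows.

```python
# Toy comparison: bytes touched by a single-column aggregate under a
# row layout versus a columnar layout. Figures are illustrative only.
import struct

rows = [(i, float(i % 7), float(i % 11)) for i in range(100_000)]

# Row storage: every row serialized as one record (id, price, qty).
row_pages = [struct.pack("<qdd", *r) for r in rows]

# Columnar storage: each column serialized separately.
col_price = struct.pack(f"<{len(rows)}d", *(r[1] for r in rows))

bytes_row_scan = sum(len(p) for p in row_pages)   # must read whole rows
bytes_col_scan = len(col_price)                   # reads just one column
print(bytes_row_scan, bytes_col_scan)             # 2,400,000 vs 800,000
```

Compression, columnar or otherwise, shrinks both figures, but only a genuinely columnar layout lets a query skip the columns it never asks for.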
