Partnering with Cloudera
After I criticized the marketing of the Aster/Cloudera partnership, my clients at Aster Data and Cloudera ganged up on me and tried to persuade me I was wrong. Be that as it may, that conversation and others were helpful to me in understanding the core thesis:
- There are a lot of big datasets out there, where “big” commonly means “petabyte scale.”
- Owners of that much data commonly like to store it using free or quasi-free software, especially if the data isn’t structured in such a way that relational tables are a great fit in the first place. HDFS (Hadoop Distributed File System) is the default choice. (Of course, there always are exceptions.)
- Some kinds of analytics can be done perfectly well in Hadoop.
- Some kinds of analytics, of course, cannot be done well in Hadoop, with the most obvious examples being:
- Queries that involve serious joins.
- Anything that requires a lower latency than Hadoop provides.
- When doing analytics in Hadoop on data stored in HDFS, you often will want to include data you’re storing in your relational DBMS.
So Cloudera is promising fast, bidirectional connectors between Hadoop/HDFS and various DBMS, such as Aster Data nCluster, and will provide them on a services basis even before the productized versions ship. Here “fast” should and in multiple cases does mean “fully parallel,” with all data-owning nodes on either side (Hadoop or the DBMS) more or less equally involved. Indeed, Aster is (I think for the first time) bypassing its loader nodes, instead sending Hadoop data straight to its workers.
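A fully parallel transfer along these lines amounts to hash-partitioned routing: every sender applies the same hash function to the distribution key, so each row can be streamed straight to the worker that owns it, with no loader node in the middle. Here is a minimal sketch of that idea; the column name, row format, and worker count are illustrative, not Aster's or Cloudera's actual API:

```python
import hashlib

def worker_for(key, num_workers):
    # Stable hash on the distribution key. Every sender must use the
    # same function so a given key always maps to the same worker.
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return h % num_workers

def partition(rows, key_col, num_workers):
    # Split a batch of rows into per-worker buckets, which could then
    # be streamed to each data-owning node in parallel.
    buckets = [[] for _ in range(num_workers)]
    for row in rows:
        buckets[worker_for(row[key_col], num_workers)].append(row)
    return buckets

# Toy batch: 10 rows hash-partitioned across 4 hypothetical workers.
rows = [{"user_id": i, "clicks": i * 3} for i in range(10)]
buckets = partition(rows, "user_id", 4)
```

The point of the sketch is only that correct placement falls out of agreeing on one hash function; everything else (transport, batching, failure handling) is where the real engineering lives.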
Comments
What Cloudera is most likely doing is loading data directly into the workers via a SQL-MR function. And unless the data is correctly partitioned, as it would be via Aster’s loader function, it would either need to be repartitioned and redistributed, or potentially face a dramatic throughput reduction in subsequent processing.
This is a kludge, and although it will most likely work, it’s far from an optimal application of parallel processing.
Colin,
That’s a good point. However, the need can also be met by having Hadoop dispatch the data to the correct nodes, which basically means hashing on the same key used to distribute data in Aster Data nCluster.
I don’t see how it would work well in the case of round-robin distribution, but Kognitio aside, round-robin is generally not an essential or even preferred distribution method.
Curt,
I agree. But, consider the operational issues around having to do this.
It would be best to have nCluster pull the data out of HDFS, not to have the Hadoop nodes push it. This means the logic would at best be embedded in a query and a corresponding SQL-MR function, and would have to be modified whenever an nCluster worker is added. Or perhaps Cloudera is accessing some undocumented aspect of nCluster to base the distributed load upon, which could break the loading function. That could lead to inadvertent data loss, which is anathema to all database operations.
All in all, it’s a workable approach, but at best a brittle solution, in my opinion. Ask me why I’m familiar with this approach? Because they’re not the first to consider this methodology, as I most likely wasn’t either.
An Aster provided loader API to facilitate this approach would be a wonderful addition to an already great product offering. The API could enable both push and pull loading, letting the customer decide which approach was best for the problem being solved.
Colin,
Color me dense, but I fail to see the problem in nCluster nodes each saying, “Hey, HDFS node, send me the key-value pairs whose keys hash to this range.”
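The pull model described here can be sketched as each worker scanning (or asking the source to scan) with a hash-range predicate, keeping only the rows it owns. A toy illustration, with hypothetical names; a real connector would push the predicate down to the source rather than ship every row to every worker:

```python
import hashlib

def hash_of(key):
    # The same stable hash everywhere, so ownership is unambiguous.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

def pull_shard(source_rows, key_col, my_id, num_workers):
    # Worker my_id pulls only the rows whose keys hash to its range,
    # i.e. "send me the key-value pairs whose keys hash to this range."
    return [r for r in source_rows
            if hash_of(r[key_col]) % num_workers == my_id]

# Each of 4 hypothetical workers pulls its own shard; together the
# shards cover the whole source with no overlap.
rows = [{"user_id": i, "clicks": i} for i in range(100)]
shards = [pull_shard(rows, "user_id", w, 4) for w in range(4)]
```

One appeal of pull over push is visible even in the toy: the destination side controls placement, so adding a worker only changes `num_workers` on the pulling side rather than requiring changes to logic embedded on the Hadoop side.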
Curt,
So I’m assuming that you agree with me then that having nCluster pull is better than having Hadoop push in this instance.
And yes, there’s nothing wrong with asking for a particular key set. How you automate that for hands-off operation is a different matter.
But if this is pull instead of push, then how is the data getting transferred? Is it via Hive? If so, having multiple processes (nCluster workers) all making separate Hive requests is probably inefficient. Having one or more processes reading from HDFS and pushing to nCluster might be more efficient, if a loader API were available. Aster’s loader program is very fast (we’ve loaded a day’s worth of NYSE data in a couple of minutes, for example).
My point is that this approach is a workaround for not having a loader API. If that API were available, or if the loader program read from HDFS, then we wouldn’t even be having this conversation, right?
Colin,
I think we’re getting to a level of detail that we should buck to Aster or Cloudera. Normally they’re pretty fast to comment here when needed, but I wouldn’t be shocked if Hadoop World were to slow them down a bit.
Curt, thanks for waiting a bit for our response—indeed we were caught up in a very successful Hadoop World where both Bank of America and comScore spoke to the joint Aster Data and Hadoop solution.
It may be useful for your readers to know that for over a year now, a number of Aster Data customers have been using Aster Data nCluster and Hadoop together for big data analysis. So the use cases for integrated use of Aster Data nCluster and Cloudera are well known and not new territory for us.
What is new is Cloudera’s commitment to deliver a standardized Hadoop connector to nCluster. Cloudera is focusing on Sqoop as the standard database interface for CDH. In Cloudera’s latest release, CDH3, the Sqoop framework is enhanced for better performance, supportability, and improved integration with databases and other Hadoop technologies. In addition, Sqoop can now cover a broader range of integration use cases, thanks to support for incremental database updates to and from Hadoop, as well as fast-path import/export that works directly with database-specific batch tools. We have partnered with Cloudera so that they can incorporate IP from our existing connector, making it Sqoop-compatible (enhancing it as needed to do so), and certifying it against the latest releases of our respective platforms. The intent is that our joint customers can get the highest levels of ongoing support for implementations of the joint solution.
Hi Stephanie,
Thanks!
But I’m not sure that actually answers Colin’s questions. 😉
Best,
CAM
[…] Partnering with Cloudera (dbms2.com) […]
[…] forms can the inputs and outputs of a UDF take? And by the way, what’s your complete list of MapReduce integration […]