Partnering with Cloudera
After I criticized the marketing of the Aster/Cloudera partnership, my clients at Aster Data and Cloudera ganged up on me and tried to persuade me I was wrong. Be that as it may, that conversation and others were helpful to me in understanding the core thesis:
- There are a lot of big datasets out there, where “big” commonly means “petabyte scale.”
- Owners of that much data commonly like to store it using free or quasi-free software, especially if the data isn’t structured in such a way that relational tables are a great fit in the first place. HDFS (Hadoop Distributed File System) is the default choice. (Of course, there always are exceptions.)
- Some kinds of analytics can be done perfectly well in Hadoop.
- Some kinds of analytics, of course, cannot be done well in Hadoop, with the most obvious examples being:
- Queries that involve serious joins.
- Anything that requires a lower latency than Hadoop provides.
- When doing analytics in Hadoop on data stored in HDFS, you often will want to include data you’re storing in your relational DBMS.
So Cloudera is promising fast, bidirectional connectors between Hadoop/HDFS and various DBMS, such as Aster Data nCluster, and will provide them on a services basis even before the productized versions ship. Here “fast” should and in multiple cases does mean “fully parallel,” with all data-owning nodes on either side (Hadoop or the DBMS) more or less equally involved. Indeed, Aster is (I think for the first time) bypassing its loader nodes, instead sending Hadoop data straight to its workers.
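A fully parallel transfer along these lines amounts to hash-partitioned routing: every sender applies the same hash function to the distribution key, so each row can be streamed straight to the worker that owns it, with no loader node in the middle. Here is a minimal sketch of that idea; the column name, row format, and worker count are illustrative, not Aster's or Cloudera's actual API:

```python
import hashlib

def worker_for(key, num_workers):
    # Stable hash on the distribution key. Every sender must use the
    # same function so a given key always maps to the same worker.
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return h % num_workers

def partition(rows, key_col, num_workers):
    # Split a batch of rows into per-worker buckets, which could then
    # be streamed to each data-owning node in parallel.
    buckets = [[] for _ in range(num_workers)]
    for row in rows:
        buckets[worker_for(row[key_col], num_workers)].append(row)
    return buckets

# Toy batch: 10 rows hash-partitioned across 4 hypothetical workers.
rows = [{"user_id": i, "clicks": i * 3} for i in range(10)]
buckets = partition(rows, "user_id", 4)
```

The point of the sketch is only that correct placement falls out of agreeing on one hash function; everything else (transport, batching, failure handling) is where the real engineering lives.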
Comments
What Cloudera is most likely doing is loading data directly into the workers via a SQL-MR function. And unless the data is correctly partitioned, as it would be via Aster’s loader function, it would either need to be repartitioned and redistributed, or potentially face a dramatic throughput reduction in subsequent processing.
This is a kludge, and although it will most likely work, it’s far from an optimal application of parallel processing.
Colin,
That’s a good point. However, the need can also be met by having Hadoop dispatch the data to the correct nodes, which basically means hashing on the same key used to distribute data in Aster Data nCluster.
I don’t see how it would work well in the case of round-robin distribution, but Kognitio aside, round-robin is generally not an essential or even preferred distribution method.
Curt,
I agree. But, consider the operational issues around having to do this.
It would be best to have nCluster pull the data out of HDFS, not to have the Hadoop nodes push it. This means the logic would at best be embedded in a query and a corresponding SQL-MR function, and would have to be modified whenever an nCluster worker is added. Or perhaps Cloudera is accessing some undocumented aspect of nCluster to base the distributed load upon, which could break the loading function. That could lead to inadvertent data loss, which is anathema to all database operations.
All in all, it’s a workable approach, but at best a brittle solution, in my opinion. Ask me why I’m familiar with this approach? Because they’re not the first to consider this methodology, as I most likely wasn’t either.
An Aster provided loader API to facilitate this approach would be a wonderful addition to an already great product offering. The API could enable both push and pull loading, letting the customer decide which approach was best for the problem being solved.
Colin,
Color me dense, but I fail to see the problem in nCluster nodes each saying, “Hey, HDFS node, send me the key-value pairs whose keys hash to this range.”
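The pull model described here can be sketched as each worker scanning (or asking the source to scan) with a hash-range predicate, keeping only the rows it owns. A toy illustration, with hypothetical names; a real connector would push the predicate down to the source rather than ship every row to every worker:

```python
import hashlib

def hash_of(key):
    # The same stable hash everywhere, so ownership is unambiguous.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

def pull_shard(source_rows, key_col, my_id, num_workers):
    # Worker my_id pulls only the rows whose keys hash to its range,
    # i.e. "send me the key-value pairs whose keys hash to this range."
    return [r for r in source_rows
            if hash_of(r[key_col]) % num_workers == my_id]

# Each of 4 hypothetical workers pulls its own shard; together the
# shards cover the whole source with no overlap.
rows = [{"user_id": i, "clicks": i} for i in range(100)]
shards = [pull_shard(rows, "user_id", w, 4) for w in range(4)]
```

One appeal of pull over push is visible even in the toy: the destination side controls placement, so adding a worker only changes `num_workers` on the pulling side rather than requiring changes to logic embedded on the Hadoop side.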
Curt,
So I’m assuming that you agree with me then that having nCluster pull is better than having Hadoop push in this instance.
And yes, there’s nothing wrong with asking for a particular key set. How you automate that for hands-off operation is a different matter.
But if this is pull instead of push, then how is the data getting transferred? Is it via Hive? If so, having multiple processes (nCluster workers) all making separate Hive requests is probably inefficient. Having one or more processes reading from HDFS and pushing to nCluster might be more efficient, if a loader API were available. Aster’s loader program is very fast (we’ve loaded a day’s worth of NYSE data in a couple of minutes, for example).
My point is that this approach is a workaround for not having a loader API. If that API were available, or if the loader program read from HDFS, then we wouldn’t even be having this conversation, right?
Colin,
I think we’re getting to a level of detail that we should buck to Aster or Cloudera. Normally they’re pretty fast to comment here when needed, but I wouldn’t be shocked if Hadoop World were to slow them down a bit.
Curt, thanks for waiting a bit for our response—indeed we were caught up in a very successful Hadoop World where both Bank of America and comScore spoke to the joint Aster Data and Hadoop solution.
It may be useful for your readers to know that for over a year now, a number of Aster Data customers have been using Aster Data nCluster and Hadoop together for big data analysis. So the use cases for integrated use of Aster Data nCluster and Cloudera are well known and not new territory for us.
What is new is Cloudera’s commitment to deliver a standardized Hadoop connector to nCluster. Cloudera is focusing on Sqoop as the standard database interface for CDH. In Cloudera’s latest release, CDH3, the Sqoop framework is enhanced for better performance, supportability, and improved integration with databases and other Hadoop technologies. In addition, Sqoop can now cover a broader range of integration use cases, thanks to support for incremental database updates to and from Hadoop, as well as fast-path import/export that works directly with database-specific batch tools. We have partnered with Cloudera so that they can incorporate IP from our existing connector, making it Sqoop-compatible (enhancing it as needed to do so), and certifying it against the latest releases of our respective platforms. The intent is that our joint customers can get the highest levels of ongoing support for implementations of the joint solution.
Hi Stephanie,
Thanks!
But I’m not sure that actually answers Colin’s questions. 😉
Best,
CAM
[…] Partnering with Cloudera (dbms2.com) […]
[…] forms can the inputs and outputs of a UDF take? And by the way, what’s your complete list of MapReduce integration […]