I wanted to learn more about Hadoop and its futures, so I talked Friday with Arun Murthy of Hortonworks.* Most of what we talked about was:
- NameNode evolution, and the related issue of file-count limitations.
- JobTracker evolution.
Arun previously addressed these issues and more in a June slide deck.
*Arun has been working on Hadoop full time for over 6 years, and leading the Yahoo/Hortonworks MapReduce team for at least the last 2.
For both NameNode and JobTracker, Arun’s take was along the lines of:
- It’s worked well so far, but it needs to be enhanced.
- What we’re taking care of in the current development cycle will make things better.
- Future enhancements after that are of course likely.
- What’s more, we’re doing some refactoring* to improve pluggability, so that alternatives to the mainline Apache Hadoop version of any specific module can easily be swapped in.
*My word, not Arun’s.
In the case of both NameNode and JobTracker, the problem starts with an essential component of the Hadoop system being on a single server. That single server can be mirrored for high availability, but it’s a bottleneck. The current fix for NameNode is to federate it across multiple servers, each owning a portion of the namespace. A variety of further possibilities come to mind:
- Allow the metadata to live in persistent storage, in some sort of purpose-built system (I’d urge that solid-state memory be assumed in the design). (Right now NameNode keeps all the metadata it manages in RAM, which is why it has such capacity limitations.)
- Use HBase, as Zettaset does.
- Build on some other open-source DBMS, especially one that was designed for hybrid memory-centric use cases.
The Hadoop guys evidently don’t know which of those approaches will be tried next, in which combination. But if they choose wrong — well, as Arun points out, you’re welcome to implement your own alternative.
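To make the federation idea concrete, here is a minimal client-side sketch; it’s my illustration, not anything Arun showed me. In the federated design each NameNode owns a slice of the namespace, and a client-side mount table (ViewFS, in the 0.23-era code line) stitches the slices back into one logical file system. The host names and mount points below are made up, and the exact property names vary by Hadoop version.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederatedClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Client-side ViewFS mount table: each mount point is backed by a
        // different NameNode, so no single NameNode holds all the metadata.
        // (Property names are approximate and version-dependent.)
        conf.set("fs.defaultFS", "viewfs://cluster");
        conf.set("fs.viewfs.mounttable.cluster.link./ads",  "hdfs://namenode1:8020/ads");
        conf.set("fs.viewfs.mounttable.cluster.link./logs", "hdfs://namenode2:8020/logs");

        // The application still sees one logical namespace.
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/ads"))) {
            System.out.println(status.getPath());
        }
    }
}
```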
Meanwhile, Arun says the current improvement will take Hadoop’s capacity up to half a billion files or so,* or 60-100 petabytes of “storage” (presumably before compression, replication, and so on are taken into account). I didn’t ask Arun to walk me through the arithmetic of that, but I did ask why there were so many files in the first place. I’ve heard in the past of what amounted to “a new file for every update” kinds of scenarios, but that’s not the example he gave. Rather, he spoke of such slicing and dicing of Yahoo advertising data that there are huge numbers of calculated quasi-cubes. For example, advertiser-specific aggregates are run on 5-minute slices of web event data, and stored for 15-minute slices; the raw data is of course kept as well.
*Depending on who you ask, the current figure seems to be in the 70-100 million range, special cases perhaps aside.
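My own back-of-the-envelope on those capacity figures (mine, not Arun’s): 60-100 petabytes spread across half a billion files works out to roughly 120-200 MB per file, i.e. one to a few HDFS blocks each at the default 64 or 128 MB block size. In other words, the binding constraint is how many files and blocks the NameNode has to track in RAM, not how many bytes the cluster holds.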
Arun also addressed the debate over the fact that MapReduce uses the node-local file system, rather than HDFS, to create temporary files. One reason for this is surely the file-count limitation, but another is performance. Yes, this choice means you lose intermediate results if there’s a node failure; but Hadoop is designed to deal with nodes that fail. And by the way — even if you did use HDFS for intermediate result sets, you might well set its replication factor to 1.
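To make that last point concrete, here is a minimal sketch, mine rather than Arun’s, of writing an intermediate file to HDFS with a replication factor of 1 via Hadoop’s Java API. A single replica means no extra network copies, at the cost of losing the data if the node dies, which is exactly the trade-off that spilling map output to the local file system already makes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IntermediateOutputSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path tmp = new Path("/tmp/job_0001/map_0000.out"); // hypothetical path
        short replication = 1;   // one copy only: cheap to write, lost if the node fails
        long blockSize = fs.getDefaultBlockSize();
        int bufferSize = 4096;

        FSDataOutputStream out =
            fs.create(tmp, true /* overwrite */, bufferSize, replication, blockSize);
        out.writeBytes("intermediate key/value data would go here\n");
        out.close();
    }
}
```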
As for JobTracker — that’s getting split into at least two pieces, to separate its two main functions:
- Global Resource Manager, which will manage the allocation of resources across applications; that is apparently a fairly lightweight task.
- Application Master, one per application, which will manage the scheduling of that application’s work across nodes; that is apparently the part that consumes lots of resources.
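The division of labor may be easier to see in code. Below is a heavily simplified sketch of the conversation an Application Master has with the Global Resource Manager, written against the YARN client API as it later stabilized in the Hadoop 2.x line; the class and method names are from that API and may not match what was just checked in.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ApplicationMasterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The Application Master is the per-application scheduler.
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(conf);
        rm.start();

        // Step 1: register with the (global, lightweight) Resource Manager.
        rm.registerApplicationMaster("", 0, "");

        // Step 2: ask for containers; the Resource Manager only does the
        // cluster-wide arbitration, not per-task bookkeeping.
        Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcores */);
        Priority priority = Priority.newInstance(0);
        rm.addContainerRequest(new ContainerRequest(capability, null, null, priority));

        // Step 3: heartbeat; allocated containers come back asynchronously.
        AllocateResponse response = rm.allocate(0.0f);
        for (Container c : response.getAllocatedContainers()) {
            System.out.println("Got container " + c.getId() + " on " + c.getNodeId());
            // The Application Master would now launch its own tasks in these containers.
        }

        // Step 4: tell the Resource Manager we're done.
        rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
        rm.stop();
    }
}
```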
Apparently JobTracker has been a key barrier to scalability, being limited to 40-50,000 tasks per cluster, which is roughly 10-15 tasks/node multiplied by the limit-to-date of 4,000 or so nodes/cluster. The new architecture, in which multiple Application Masters compete for resources dispensed by a single Global Resource Manager, is supposed to allow at least 20-50 tasks/node, and a greater number of nodes as well. This capability was checked into the main Hadoop code line just this week.
Global Resource Manager also runs on a single node, high availability mirroring aside. Discussion of opportunities for further scaling was much like our discussion of scaling NameNode — there’s one option that could work, and another option that could also work, and in the meantime one can substitute one’s own version if one likes. Indeed, different Hadoop users already plug different schedulers into JobTracker today.
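As a small illustration of that last point: in the classic MapReduce code line, the scheduler the JobTracker uses is just a class named in its configuration, so swapping in the Fair Scheduler or the Capacity Scheduler is a one-property change. The property normally lives in mapred-site.xml; I’m setting it programmatically below only to show the knob.

```java
import org.apache.hadoop.conf.Configuration;

public class SchedulerPluggabilitySketch {
    public static void main(String[] args) {
        // Normally this is set in mapred-site.xml on the JobTracker node.
        Configuration conf = new Configuration();
        conf.set("mapred.jobtracker.taskScheduler",
                 "org.apache.hadoop.mapred.FairScheduler");
        // Alternatives include org.apache.hadoop.mapred.CapacityTaskScheduler,
        // or your own subclass of org.apache.hadoop.mapred.TaskScheduler.
        System.out.println(conf.get("mapred.jobtracker.taskScheduler"));
    }
}
```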
Another benefit of refactoring JobTracker is that non-MapReduce processing frameworks can be included in Hadoop. The one that gets mentioned repeatedly is MPI (Message Passing Interface). Like the SAS folks, the Hadoop folks seem to think that not all important algorithms can be parallelized via MapReduce, and that MPI is a great place to look for alternative strategies.
I’m still confused about issues such as:
- What resources the Global Resource Manager manages (CPU? RAM? I/O?).
- Whether the Global Resource Manager prioritizes work in the flexible way that, say, a good analytic DBMS workload manager does (I’m guessing it doesn’t, based on the discussion of alternative scheduling modules).
- What, if anything, still needs to be done to actually make MPI work within Hadoop.
But in any case, this all seems to be an important step in Hadoop’s evolution toward being (more of) a flexible, industrial-strength application execution environment.