Google recently received a patent for MapReduce. The first and most general claim is (formatting and emphasis mine):
A system for large-scale processing of data, comprising:
- a plurality of processes executing on a plurality of interconnected processors;
- the plurality of processes including a master process, for coordinating a data processing job for processing a set of input data, and worker processes;
- the master process, in response to a request to perform the data processing job, assigning input data blocks of the set of input data to respective ones of the worker processes;
- each of a first plurality of the worker processes including an application-independent map module for retrieving a respective input data block assigned to the worker process by the master process and applying an application-specific map operation to the respective input data block to produce intermediate data values, wherein at least a subset of the intermediate data values each comprises a key/value pair, and wherein at least two of the first plurality of the worker processes operate simultaneously so as to perform the application-specific map operation in parallel on distinct, respective input data blocks;
- a partition operator for processing the produced intermediate data values to produce a plurality of intermediate data sets, wherein each respective intermediate data set includes all key/value pairs for a distinct set of respective keys, and wherein at least one of the respective intermediate data sets includes respective ones of the key/value pairs produced by a plurality of the first plurality of the worker processes;
- and each of a second plurality of the worker processes including an application-independent reduce module for retrieving data, the retrieved data comprising at least a subset of the key/value pairs from a respective intermediate data set of the plurality of intermediate data sets and applying an application-specific reduce operation to the retrieved data to produce final output data corresponding to the distinct set of respective keys in the respective intermediate data set of the plurality of intermediate data sets, and wherein at least two of the second plurality of the worker processes operate simultaneously so as to perform the application-specific reduce operation in parallel on multiple respective subsets of the produced intermediate data values.
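To make the claim's moving parts concrete, here is a minimal single-machine sketch of the structure it describes: a coordinating "master", map workers, a partition operator, and reduce workers. It is an illustration only, not Google's implementation and not a reading of the patent's legal scope. Every name in it (`run_job`, `partition`, `map_words`, `reduce_counts`) is hypothetical, and a thread pool stands in for the "plurality of interconnected processors".

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(block):
    """Application-specific map operation (hypothetical): emit (word, 1) pairs."""
    return [(word, 1) for word in block.split()]


def reduce_counts(key, values):
    """Application-specific reduce operation (hypothetical): sum values for one key."""
    return key, sum(values)


def partition(pairs, num_partitions):
    """Partition operator: all pairs sharing a key land in the same intermediate set."""
    parts = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions][key].append(value)
    return parts


def run_job(blocks, map_op, reduce_op, num_partitions=2):
    """Stand-in for the master: hand blocks to map workers, partition the
    intermediate pairs, then hand each partition to a reduce worker."""
    def reduce_partition(part):
        return [reduce_op(key, values) for key, values in part.items()]

    # Threads are an assumption made for brevity; the claim speaks of processes.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_op, blocks))
        pairs = [pair for block_pairs in mapped for pair in block_pairs]
        parts = partition(pairs, num_partitions)
        reduced = list(pool.map(reduce_partition, parts))
    return dict(item for part in reduced for item in part)


result = run_job(["a b a", "b c", "a"], map_words, reduce_counts)
```

The application-independent pieces here are `run_job` and `partition`; only `map_words` and `reduce_counts` would change from one job to the next, which is the split the claim draws between "application-independent" modules and "application-specific" operations.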
The way a patent works is that you make a big claim and, just in case it’s later invalidated, you also make more specialized sub-claims. What’s more, in a software patent, you claim everything twice, once as a “system” and once as a “method.”
When a patent takes that long to issue and has a core claim that wordy, one can assume there was much back and forth with the PTO (Patent and Trademark Office) to whittle it down to something they felt they could approve. My guess is that the supposedly unique parts of the claim are concentrated in the areas I bolded above, and that the PTO doesn’t think the claim would be patentable unless most or all of them were included.
So should the claim have been approved even so? Let’s consider prior art. Oracle has long been able to parallelize à la MapReduce. I don’t see anything in the claim that isn’t preceded by what Oracle did, except maybe the emphasis on key/value pairs. (And the same statement applies to the other 15 claims in the patent, at least on a quick skim.) I forget the details of SenSage’s quasi-MapReduce, which also preceded the Google patent filing, but I imagine something similar would be true about it.
There is no doubt that Google popularized the ideas of MapReduce, which turns out to have been a worthy public service. In one great example of that popularization, the seminal paper on parallel data mining is almost laughable in how it deviates from MapReduce key/value pair formalism, yet it still seems to have been inspired by Google’s MapReduce. But that’s a different matter; popularization != invention, even though there’s a certain connection between the two in patent law. Google does often get credit for having “invented” MapReduce, including, regrettably, in the marketing materials of clients whom I can’t talk out of saying so and who may now be staring down the barrel of the Google patent (hello, Aster). But again, saying something doesn’t make it enforceable in court.
So what it all boils down to is:
Should Google’s patent on the idea of parallelizing the handling of sets of application-visible key/value pairs be regarded as valid?
The United States PTO, which is paid to think about these things, has evidently decided Yes. I disagree. In simplest terms, my reason is that key/value pairs have been around for decades, and so:
Anything which was known or obvious without special reference to key/value pairs doesn’t suddenly become non-obvious when key/value pairs are mixed in.
If Google ever tries to enforce its MapReduce patent, I’m available as an expert witness for the other side.