than for grep. This is because the sort map tasks spend about half their
time and I/O bandwidth writing intermediate output to their local disks.
The corresponding intermediate output for grep had negligible size.
A few things to note: the input rate is higher than the shuffle rate
and the output rate because of our locality optimization; most data is
read from a local disk and bypasses our relatively bandwidth constrained network. The shuffle rate is higher than the output rate
because the output phase writes two copies of the sorted data (we
make two replicas of the output for reliability and availability reasons).
We write two replicas because that is the mechanism for reliability and
availability provided by our underlying file system. Network bandwidth
requirements for writing data would be reduced if the underlying file
system used erasure coding [ 15] rather than replication.
The original article has further experiments that examine the
effects of backup tasks and machine failures [ 8].
6 Experience
We wrote the first version of the MapReduce library in February of
2003 and made significant enhancements to it in August of 2003,
including the locality optimization, dynamic load balancing of task execution across worker machines, etc. Since that time, we have been
pleasantly surprised at how broadly applicable the MapReduce library
has been for the kinds of problems we work on. It has been used
across a wide range of domains within Google, including:
• large-scale machine learning problems,
• clustering problems for the Google News and Froogle products,
• extracting data to produce reports of popular queries (e.g. Google
Zeitgeist and Google Trends),
• extracting properties of Web pages for new experiments and products (e.g. extraction of geographical locations from a large corpus of
Web pages for localized search),
• processing of satellite imagery data,
• language model processing for statistical machine translation, and
• large-scale graph computations.
Figure 4 shows the significant growth in the number of separate
MapReduce programs checked into our primary source-code management system over time, from 0 in early 2003 to almost 900 in September 2004, to about 4000 in March 2006. MapReduce has been so
successful because it makes it possible to write a simple program and
run it efficiently on a thousand machines in a half hour, greatly speeding up the development and prototyping cycle. Furthermore, it allows
programmers who have no experience with distributed and/or parallel
systems to exploit large amounts of resources easily.
Table I. MapReduce Statistics for Different Months.
Number of jobs (1000s)
Avg. completion time (secs)
Machine years used
map input data (TB)
map output data (TB)
reduce output data (TB)
Avg. machines per job
Unique implementations
map
reduce
Aug. ’04
29
634
217
3,288
758
193
157
Mar. ’06
171
874
2,002
52,254
6,743
2,970
268
Sep. ’07
2,217
395
11,081
403, 152
34,774
14,018
394
395
269
1958
1208
4083
2418
At the end of each job, the MapReduce library logs statistics about
the computational resources used by the job. In Table I, we show some
statistics for a subset of MapReduce jobs run at Google in various
months, highlighting the extent to which MapReduce has grown and
become the de facto choice for nearly all data processing needs at Google.
6. 1 Large-Scale Indexing
One of our most significant uses of MapReduce to date has been a
complete rewrite of the production indexing system that produces the
data structures used for the Google Web search service. The indexing
system takes as input a large set of documents that have been retrieved
by our crawling system, stored as a set of GFS files. The raw contents
for these documents are more than 20 terabytes of data. At the time
we converted the indexing system to use MapReduce in 2003, it ran as
a sequence of eight MapReduce operations. Since that time, because
of the ease with which new phases can be added, many new phases
have been added to the indexing system. Using MapReduce (instead
of the ad-hoc distributed passes in the prior version of the indexing
system) has provided several benefits.
• The indexing code is simpler, smaller, and easier to understand because the code that deals with fault tolerance, distribution, and parallelization is hidden within the MapReduce library. For example, the
size of one phase of the computation dropped from approximately
3800 lines of C++ code to approximately 700 lines when expressed
using MapReduce.
• The performance of the MapReduce library is good enough that we
can keep conceptually unrelated computations separate instead of
mixing them together to avoid extra passes over the data. This makes
it easy to change the indexing process. For example, one change that
took a few months to make in our old indexing system took only a
few days to implement in the new system.