[Figure: the user program forks a master and a set of workers (1); the master assigns map tasks and reduce tasks to workers (2); map workers read the input splits (split 0 through split 4) (3) and write intermediate files to their local disks (4); reduce workers perform remote reads of that intermediate data (5) and write the final output files, output file 0 and output file 1 (6). Phases shown left to right: Input files, Map phase, Intermediate files (on local disks), Reduce phase, Output files.]
Fig. 1. Execution overview.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call
in the user program returns to the user code.
After successful completion, the output of the MapReduce execution
is available in the R output files (one per reduce task, with file names
specified by the user). Typically, users do not need to combine these R
output files into one file; they often pass these files as input to another
MapReduce call or use them from another distributed application that
is able to deal with input that is partitioned into multiple files.
3.2 Master Data Structures
The master keeps several data structures. For each map task and
reduce task, it stores the state (idle, in-progress, or completed) and the
identity of the worker machine (for nonidle tasks).
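The per-task bookkeeping described here can be sketched as follows. This is an illustrative model, not code from the paper; the class and field names (`TaskState`, `TaskInfo`, `Master.assign`) are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class TaskState(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class TaskInfo:
    # The worker identity is only meaningful for non-idle tasks.
    state: TaskState = TaskState.IDLE
    worker: Optional[str] = None

class Master:
    """Hypothetical sketch: one entry per map task and per reduce task."""
    def __init__(self, n_map: int, n_reduce: int):
        self.map_tasks = [TaskInfo() for _ in range(n_map)]
        self.reduce_tasks = [TaskInfo() for _ in range(n_reduce)]

    def assign(self, tasks, i: int, worker: str):
        # Move an idle task to in-progress on the given worker machine.
        tasks[i].state = TaskState.IN_PROGRESS
        tasks[i].worker = worker
```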
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and
sizes of the R intermediate file regions produced by the map task.
Updates to this location and size information are received as map tasks
are completed. The information is pushed incrementally to workers
that have in-progress reduce tasks.
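The propagation of intermediate-region locations might look like the sketch below: when a map task completes, the master records the location and size of each of its R regions and forwards the update to any subscribed in-progress reducers. The names (`RegionIndex`, `map_completed`) are hypothetical.

```python
class RegionIndex:
    """Illustrative sketch of the master's intermediate-file bookkeeping."""
    def __init__(self, R: int):
        self.R = R
        # regions[map_id][r] = (worker_host, size_bytes) for reduce partition r
        self.regions = {}
        # Callbacks representing workers with in-progress reduce tasks;
        # updates are pushed to them incrementally as maps finish.
        self.reduce_subscribers = []

    def map_completed(self, map_id, worker_host, sizes):
        assert len(sizes) == self.R
        self.regions[map_id] = [(worker_host, s) for s in sizes]
        for notify in self.reduce_subscribers:
            notify(map_id, self.regions[map_id])

    def regions_for_reduce(self, r):
        # All currently known intermediate regions for reduce partition r.
        return {m: locs[r] for m, locs in self.regions.items()}
```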
3.3 Fault Tolerance
Since the MapReduce library is designed to help process very large
amounts of data using hundreds or thousands of machines, the library
must tolerate machine failures gracefully.
Handling Worker Failures
The master pings every worker periodically. If no response is received
from a worker in a certain amount of time, the master marks the worker
as failed. Any map tasks completed by the worker are reset to their
initial idle state and therefore become eligible for scheduling on other
workers. Similarly, any map task or reduce task in progress on a failed
worker is also reset to idle and becomes eligible for rescheduling.
Completed map tasks are reexecuted on a failure because their output is stored on the local disk(s) of the failed machine and is therefore
inaccessible. Completed reduce tasks do not need to be reexecuted
since their output is stored in a global file system.
When a map task is executed first by worker A and then later executed by worker B (because A failed), all workers executing reduce
tasks are notified of the reexecution. Any reduce task that has not
already read the data from worker A will read the data from worker B.
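The failure-handling rules above can be condensed into a small sketch. The function name, timeout value, and task representation are assumptions for illustration; the key logic follows the text: map tasks on a failed worker are reset whether completed or in-progress (their output lived on the failed machine's local disk), while completed reduce tasks are kept (their output is in the global file system).

```python
FAILURE_TIMEOUT = 10.0  # illustrative: seconds without a ping response

def check_workers(last_seen, tasks, now):
    """last_seen maps worker -> timestamp of last ping response.
    tasks is a list of dicts with keys 'kind', 'state', 'worker'."""
    failed = {w for w, t in last_seen.items() if now - t > FAILURE_TIMEOUT}
    for task in tasks:
        if task["worker"] in failed:
            # Map output on the failed machine's local disk is lost, so
            # even completed map tasks go back to idle; in-progress tasks
            # of either kind are also rescheduled. Completed reduce
            # output is in the global file system and is kept.
            if task["kind"] == "map" or task["state"] == "in-progress":
                task["state"] = "idle"
                task["worker"] = None
    return failed
```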
MapReduce is resilient to large-scale worker failures. For example,
during one MapReduce operation, network maintenance on a running
cluster was causing groups of 80 machines at a time to become unreachable for several minutes. The MapReduce master simply reexecuted the
work done by the unreachable worker machines and continued to make
forward progress, eventually completing the MapReduce operation.
Semantics in the Presence of Failures
When the user-supplied map and reduce operators are deterministic
functions of their input values, our distributed implementation produces the same output as would have been produced by a nonfaulting
sequential execution of the entire program.