5 Performance
In this section, we measure the performance of MapReduce on two computations running on a large cluster of machines. One computation
searches through approximately one terabyte of data looking for a particular pattern. The other computation sorts approximately one terabyte of data.
These two programs are representative of a large subset of the real
programs written by users of MapReduce—one class of programs
shuffles data from one representation to another, and another class
extracts a small amount of interesting data from a large dataset.
5.1 Cluster Configuration
All of the programs were executed on a cluster that consisted of approximately 1800 machines. Each machine had two 2GHz Intel Xeon
processors with Hyper-Threading enabled, 4GB of memory, two
160GB IDE disks, and a gigabit Ethernet link. The machines were
arranged in a two-level tree-shaped switched network with approximately 100-200 Gbps of aggregate bandwidth available at the root. All of
the machines were in the same hosting facility and therefore the round-trip time between any pair of machines was less than a millisecond.
Out of the 4GB of memory, approximately 1-1.5GB was reserved by
other tasks running on the cluster. The programs were executed on a
weekend afternoon when the CPUs, disks, and network were mostly idle.
5.2 Grep
The grep program scans through 10¹⁰ 100-byte records, searching for a relatively rare three-character pattern (the pattern occurs in 92,337 records).
The input is split into approximately 64MB pieces (M = 15000), and the
entire output is placed in one file (R = 1).
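For concreteness, the map phase of such a grep can be sketched as follows: the map function emits a record only if it contains the pattern, and the reduce function is the identity, so with R = 1 all matches end up in a single output file. The code below is a minimal standalone illustration, not the actual MapReduce library API; the function names and the in-memory output vector are our own.

#include <iostream>
#include <string>
#include <vector>

// Map function for distributed grep: emit a record iff it contains the
// pattern. Output is collected in memory here instead of being emitted
// through a MapReduce library.
void GrepMap(const std::vector<std::string>& records,
             const std::string& pattern,
             std::vector<std::string>* output) {
  for (const std::string& record : records) {
    if (record.find(pattern) != std::string::npos) {
      output->push_back(record);
    }
  }
}

int main() {
  // Toy stand-in for one ~64MB input split of 100-byte records.
  std::vector<std::string> split = {"....xyz....", "no match here",
                                    "another xyz record"};
  std::vector<std::string> matches;
  GrepMap(split, "xyz", &matches);  // rare three-character pattern
  std::cout << matches.size() << " matching records\n";
  return 0;
}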
Fig. 2. Data transfer rate over time (mr-grep).
Figure 2 shows the progress of the computation over time. The
Y-axis shows the rate at which the input data is scanned. The rate gradually picks up as more machines are assigned to this MapReduce computation and peaks at over 30 GB/s when 1764 workers have been
assigned. As the map tasks finish, the rate starts dropping and hits zero
about 80 seconds into the computation. The entire computation takes
approximately 150 seconds from start to finish. This includes about a
minute of startup overhead. The overhead is due to the propagation of
the program to all worker machines and delays interacting with GFS to
open the set of 1000 input files and to get the information needed for
the locality optimization.
5.3 Sort
The sort program sorts 10¹⁰ 100-byte records (approximately 1 terabyte
of data). This program is modeled after the TeraSort benchmark [12].
The sorting program consists of less than 50 lines of user code. The
final sorted output is written to a set of 2-way replicated GFS files (i.e.,
2 terabytes are written as the output of the program).
As before, the input data is split into 64MB pieces (M = 15000). We
partition the sorted output into 4000 files (R = 4000). The partitioning
function uses the initial bytes of the key to segregate it into one of R pieces.
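A minimal sketch of such a prefix-based partition function appears below. It interprets the first two bytes of the key as a 16-bit integer and maps it proportionally onto R buckets; the two-byte prefix and the implicit assumption of roughly uniform keys are illustrative choices, not the benchmark's actual function.

#include <cstdint>
#include <iostream>
#include <string>

// Map the first two bytes of the key, viewed as a 16-bit integer, onto
// one of R buckets. Assumes keys are roughly uniform over that prefix.
int Partition(const std::string& key, int R) {
  uint32_t prefix = 0;
  for (size_t i = 0; i < 2 && i < key.size(); ++i) {
    prefix = (prefix << 8) | static_cast<unsigned char>(key[i]);
  }
  return static_cast<int>((static_cast<uint64_t>(prefix) * R) >> 16);
}

int main() {
  const int R = 4000;  // number of output files, as in the benchmark
  std::cout << Partition(std::string("\x01\x00", 2), R) << "\n";  // low key
  std::cout << Partition(std::string("\xf0\x00", 2), R) << "\n";  // high key
  return 0;
}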
Our partitioning function for this benchmark has built-in knowledge of the distribution of keys. In a general sorting program, we would
add a prepass MapReduce operation that would collect a sample of the
keys and use the distribution of the sampled keys to compute split-points for the final sorting pass.
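The following standalone sketch illustrates that prepass under assumed names: a sample of keys is sorted, R - 1 split points are taken at evenly spaced quantiles, and the final sorting pass routes each key to the piece whose range contains it. With a representative sample, each of the R output pieces receives roughly the same amount of data, which keeps the reduce tasks balanced.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Pre-pass: sort a sample of keys and take R-1 evenly spaced quantiles
// as split points.
std::vector<std::string> ComputeSplitPoints(std::vector<std::string> sample,
                                            size_t R) {
  std::sort(sample.begin(), sample.end());
  std::vector<std::string> splits;
  for (size_t i = 1; i < R; ++i) {
    splits.push_back(sample[i * sample.size() / R]);
  }
  return splits;
}

// Final pass: route each key to the piece whose key range contains it.
size_t PartitionByRange(const std::string& key,
                        const std::vector<std::string>& splits) {
  return std::upper_bound(splits.begin(), splits.end(), key) - splits.begin();
}

int main() {
  std::vector<std::string> sample = {"pear", "apple", "fig", "kiwi",
                                     "mango", "plum", "date", "grape"};
  std::vector<std::string> splits = ComputeSplitPoints(sample, 4);  // R = 4
  std::cout << "banana -> piece " << PartitionByRange("banana", splits) << "\n";
  std::cout << "peach  -> piece " << PartitionByRange("peach", splits) << "\n";
  return 0;
}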
Fig. 3. Data transfer rate over time (mr-sort).
Figure 3 shows the progress of a normal execution of the sort program. The top-left graph shows the rate at which input is read. The rate
peaks at about 13 GB/s and dies off fairly quickly since all map tasks finish before 200 seconds have elapsed. Note that the input rate is less