contributed articles
DOI: 10.1145/1629175.1629198
MapReduce advantages over parallel databases
include storage-system independence and
fine-grain fault tolerance for large jobs.
BY JEFFREY DEAN AND SANJAY GHEMAWAT
MapReduce: A Flexible Data Processing Tool
MapReduce is a programming model for processing
and generating large data sets.[4] Users specify a
map function that processes a key/value pair to
generate a set of intermediate key/value pairs and
a reduce function that merges all intermediate
values associated with the same intermediate key.
We built a system around this programming model
in 2003 to simplify construction of the inverted
index for handling searches at Google.com. Since
then, more than 10,000 distinct programs have been
implemented using MapReduce at Google, including
algorithms for large-scale graph processing, text
processing, machine learning, and statistical machine
translation. The Hadoop open source implementation
of MapReduce has been used extensively outside of Google by a number of
organizations.[10, 11]
To help illustrate the MapReduce
programming model, consider the
problem of counting the number of
occurrences of each word in a large collection of documents. The user would
write code like the following pseudo-code:
    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
The map function emits each word
plus an associated count of occurrences
(just '1' in this simple example). The reduce function sums together all counts
emitted for a particular word.
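To make the pseudo-code concrete, the following is a minimal single-process sketch in Python. It is not Google's implementation or API: the driver `run_mapreduce`, the in-memory dictionary of documents, and whitespace-based word splitting are all assumptions made for illustration. It only shows how a map function, a grouping step, and a reduce function fit together.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, str]]:
    # key: document name, value: document contents.
    # Emits (word, "1") for each whitespace-separated word (an assumption;
    # the article does not say how words are tokenized).
    for word in value.split():
        yield word, "1"


def reduce_fn(key: str, values: Iterable[str]) -> Iterator[str]:
    # key: a word, values: the counts emitted for that word.
    yield str(sum(int(v) for v in values))


def run_mapreduce(
    inputs: Dict[str, str],
    mapper: Callable[[str, str], Iterator[Tuple[str, str]]],
    reducer: Callable[[str, Iterable[str]], Iterator[str]],
) -> Dict[str, List[str]]:
    # Hypothetical in-memory driver: run every map call, group the
    # intermediate values by key, then run one reduce call per key.
    intermediate: Dict[str, List[str]] = defaultdict(list)
    for key, value in inputs.items():
        for ikey, ivalue in mapper(key, value):
            intermediate[ikey].append(ivalue)
    return {
        ikey: list(reducer(ikey, ivalues))
        for ikey, ivalues in sorted(intermediate.items())
    }


if __name__ == "__main__":
    documents = {"doc1": "the quick fox", "doc2": "the lazy dog the end"}
    for word, counts in run_mapreduce(documents, map_fn, reduce_fn).items():
        print(word, counts[0])
```

The grouping step in the middle is the part the user never writes in the pseudo-code: it collects all intermediate values that share a key before the reduce function sees them.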
MapReduce automatically parallelizes and executes the program on a
large cluster of commodity machines.
The runtime system takes care of the
details of partitioning the input data,
scheduling the program’s execution
across a set of machines, handling
machine failures, and managing required inter-machine communication.
MapReduce allows programmers with
no experience with parallel and distributed systems to easily utilize the resources of a large distributed system. A
typical MapReduce computation processes many terabytes of data on hundreds or thousands of machines. Programmers find the system easy to use,
and more than 100,000 MapReduce
jobs are executed on Google’s clusters
every day.
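The runtime duties described above (partitioning the input, scheduling work, and re-executing failed tasks) can be imitated on one machine. The sketch below is only an analogy under stated assumptions: a thread pool stands in for a cluster, and the names `split_input`, `map_task`, `run_with_retry`, and `parallel_word_count` are hypothetical helpers, not part of MapReduce or Hadoop.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def split_input(records: List[str], num_splits: int) -> List[List[str]]:
    # Partition the input records into num_splits roughly equal pieces.
    return [records[i::num_splits] for i in range(num_splits)]


def map_task(split: List[str]) -> Counter:
    # Count words in one split; stands in for a map task with local summing.
    counts: Counter = Counter()
    for line in split:
        counts.update(line.split())
    return counts


def run_with_retry(
    task: Callable[[List[str]], Counter],
    split: List[str],
    max_attempts: int = 3,
) -> Counter:
    # Re-execute a failed task, loosely mimicking recovery from a
    # machine failure. Gives up after max_attempts tries.
    for attempt in range(1, max_attempts + 1):
        try:
            return task(split)
        except Exception:
            if attempt == max_attempts:
                raise
    raise RuntimeError("max_attempts must be at least 1")


def parallel_word_count(records: List[str], num_workers: int = 4) -> Dict[str, int]:
    # Schedule one task per split on a pool of workers, then merge results.
    splits = split_input(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(lambda s: run_with_retry(map_task, s), splits))
    total: Counter = Counter()
    for partial in partials:
        total += partial
    return dict(total)


if __name__ == "__main__":
    print(parallel_word_count(["a b a", "b c", "a"]))
```

Threads in CPython do not give true CPU parallelism, so this illustrates the division of labor rather than any speedup. The point is that the word-counting logic in `map_task` contains no partitioning, scheduling, or retry code.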
Illustration by Marius Watz
Compared to Parallel Databases
The query languages built into parallel database systems are also used to