the benchmarks, starting with data in a collection of files on disk, it is possible to run 50 separate MapReduce analyses over the data before it is possible to load the data into a database and complete a single analysis. Long load times may not matter if many queries will be run on the data after loading, but this is often not the case; data sets are often generated, processed once or twice, and then discarded. For example, the Web-search index-building system described in the MapReduce paper [4] is a sequence of MapReduce phases where the output of most phases is consumed by one or two subsequent MapReduce phases.
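To make that pipeline structure concrete, the toy sketch below chains two MapReduce phases so that the second consumes the first's output directly, with no intervening load step. The run_mapreduce helper and the phase functions are our own illustrative stand-ins, not the production index-building code:

```python
# Minimal sketch of chaining MapReduce phases so each phase reads the
# previous phase's output directly -- no database load step in between.
# `run_mapreduce` and the phase functions are hypothetical stand-ins.
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy in-memory stand-in for a distributed MapReduce run."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):      # map phase
            groups[key].append(value)
    out = []
    for key in sorted(groups):                 # shuffle/sort by key
        out.extend(reducer(key, groups[key]))  # reduce phase
    return out

# Phase 1: extract (term, doc_id) postings from raw documents.
def parse_map(doc):
    doc_id, text = doc
    for term in set(text.split()):
        yield term, doc_id

def parse_reduce(term, doc_ids):
    yield term, sorted(doc_ids)

# Phase 2: consume phase 1's output unchanged, computing document frequency.
def df_map(posting):
    term, doc_ids = posting
    yield term, len(doc_ids)

def df_reduce(term, counts):
    yield term, sum(counts)

docs = [(1, "the quick fox"), (2, "the lazy dog")]
postings = run_mapreduce(docs, parse_map, parse_reduce)  # phase 1 output
dfs = run_mapreduce(postings, df_map, df_reduce)         # fed straight to phase 2
print(dfs)  # [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```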
Conclusion
The conclusions about performance in the comparison paper were based on flawed assumptions about MapReduce and overstated the benefit of parallel database systems. In our experience, MapReduce is a highly effective and efficient tool for large-scale fault-tolerant data analysis. However, a few useful lessons can be drawn from this discussion:

Startup latency. MapReduce implementations should strive to reduce startup latency by using techniques like worker processes that are reused across different invocations (see the first sketch after this list);

Data shuffling. Careful attention must be paid to the implementation of the data-shuffling phase to avoid generating O(M*R) seeks in a MapReduce with M map tasks and R reduce tasks (a back-of-the-envelope calculation follows the list);

Textual formats. MapReduce users should avoid using inefficient textual formats (the encoding sketch below illustrates the cost);

Natural indices. MapReduce users should take advantage of natural indices (such as timestamps in log file names) whenever possible (see the file-selection sketch below); and

Unmerged output. Most MapReduce output should be left unmerged, since there is no benefit to merging if the next consumer is another MapReduce program.
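On the first lesson, here is a minimal sketch of worker reuse, with Python's multiprocessing.Pool standing in for long-lived remote worker processes; the pool and the task are illustrative assumptions, not Google's implementation:

```python
# Hedged sketch of reusing worker processes across invocations. A real
# MapReduce keeps workers alive on remote machines; a local process pool
# stands in here: processes are forked once, then reused for every
# subsequent job, so each job pays a task dispatch, not a process launch.
from multiprocessing import Pool

def map_task(shard):
    # Placeholder map work over one input shard.
    return sum(shard)

if __name__ == "__main__":
    with Pool(processes=4) as workers:               # pay startup cost once
        for job in range(50):                        # many invocations...
            shards = [[job, 1], [job, 2]]
            results = workers.map(map_task, shards)  # ...reuse the same workers
            print(job, results)
```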
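On the data-shuffling lesson, a back-of-the-envelope calculation shows why O(M*R) seeks hurt. The 10ms seek time and the roughly M+R seek count for the batched case are our own assumptions, not measured figures:

```python
# Back-of-the-envelope for the O(M*R) shuffle-seek problem. With M map
# tasks each writing R partition files, reducers collectively perform
# about M*R disk seeks to fetch their partitions.
M, R = 1000, 500
seek_time_s = 0.010                  # assume ~10 ms per seek (illustrative)
naive_seeks = M * R                  # one seek per (map task, reduce partition)
print(naive_seeks, naive_seeks * seek_time_s / 3600, "hours of pure seeking")

# If map output is instead batched into large sequential runs (e.g., one
# sorted file per map task), the seek count plausibly drops to about M + R.
batched_seeks = M + R
print(batched_seeks, batched_seeks * seek_time_s, "seconds of seeking")
```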
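On textual formats, the small sketch below contrasts a delimited text record with a fixed-width binary record. Google uses Protocol Buffers [8]; Python's struct module is only a stand-in for the idea:

```python
# Sketch of why textual record formats cost more than binary ones: text
# must be formatted on write and parsed back on read, and is often larger.
import struct

record = (1234567890, 98765, 3.14159)   # e.g. (timestamp, user_id, score)

# Textual encoding: digit strings formatted on write, parsed on read.
text = "%d|%d|%f\n" % record
fields = text.rstrip("\n").split("|")
decoded_text = (int(fields[0]), int(fields[1]), float(fields[2]))

# Binary encoding: fixed-width fields, no digit-string parsing at all.
packed = struct.pack("<qqd", record[0], record[1], record[2])
decoded_bin = struct.unpack("<qqd", packed)

print(len(text), len(packed))   # bytes per record for each format
```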
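On natural indices, here is a sketch of pruning input files by the timestamp embedded in their names before reading a single byte; the paths and the naming scheme are hypothetical:

```python
# Sketch of exploiting a natural index: timestamps in log-file names let
# a MapReduce select only the relevant inputs up front.
import glob
import os

def logs_in_range(pattern, start, end):
    """Keep only files whose name carries a date inside [start, end]."""
    selected = []
    for path in glob.glob(pattern):
        # Expect names like access_log.20091231 -- date after the dot.
        stamp = os.path.basename(path).rsplit(".", 1)[-1]
        if start <= stamp <= end:       # lexicographic == chronological
            selected.append(path)
    return sorted(selected)

# Only these files would be handed to the MapReduce as input shards:
inputs = logs_in_range("/logs/access_log.*", "20091201", "20091207")
```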
MapReduce provides many significant advantages over parallel databases. First and foremost, it provides fine-grain fault tolerance for large jobs; failure in the middle of a multi-hour execution does not require restarting the job from scratch. Second, MapReduce is very useful for handling data processing and data loading in a
heterogeneous system with many different storage systems. Third, MapReduce provides a good framework for the execution of more complicated functions than are supported directly in SQL.
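As one illustration of that third point, here is a hedged sketch of a map function that sessionizes a user's click events with stateful, procedural logic of the kind that is awkward to express directly in SQL; the 30-minute gap rule and the record layout are our own assumptions:

```python
# Sketch of a map function with stateful, procedural logic (sessionizing
# click events). The 30-minute session gap is an assumed parameter.
SESSION_GAP_S = 30 * 60

def sessionize_map(user_id, events):
    """Input: one user's (timestamp, url) events, sorted by time.
    Emits (user_id, session) pairs, one per derived session."""
    session, last_ts = [], None
    for ts, url in events:
        if last_ts is not None and ts - last_ts > SESSION_GAP_S:
            yield user_id, session       # gap too large: close the session
            session = []
        session.append(url)
        last_ts = ts
    if session:
        yield user_id, session           # flush the final session

events = [(0, "/a"), (60, "/b"), (4000, "/c")]
print(list(sessionize_map("u1", events)))
# [('u1', ['/a', '/b']), ('u1', ['/c'])]
```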
References
1. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D.J., Silberschatz, A., and Rasin, A. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In Proceedings of the Conference on Very Large Databases (Lyon, France, 2009); http://db.cs.yale.edu/hadoopdb/
2. Aster Data Systems, Inc. In-Database MapReduce for Rich Analytics; http://www.asterdata.com/product/mapreduce.php
3. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R.E. Bigtable: A distributed storage system for structured data. In Proceedings of the Seventh Symposium on Operating System Design and Implementation (Seattle, WA, Nov. 6–8). USENIX Association, 2006; http://labs.google.com/papers/bigtable.html
4. Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth Symposium on Operating System Design and Implementation (San Francisco, CA, Dec. 6–8). USENIX Association, 2004; http://labs.google.com/papers/mapreduce.html
5. DeWitt, D. and Stonebraker, M. MapReduce: A major step backwards. Blog post; http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/
6. DeWitt, D. and Stonebraker, M. MapReduce II. Blog post; http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/
7. Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (Lake George, NY, Oct. 19–22). ACM Press, New York, 2003; http://labs.google.com/papers/gfs.html
8. Google. Protocol Buffers: Google's Data Interchange Format. Documentation and open source release; http://code.google.com/p/protobuf/
9. Greenplum. Greenplum MapReduce: Bringing Next-Generation Analytics Technology to the Enterprise; http://www.greenplum.com/resources/mapreduce/
10. Hadoop. Documentation and open source release; http://hadoop.apache.org/core/
11. Hadoop. Users list; http://wiki.apache.org/hadoop/PoweredBy
12. Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD 2008 International Conference on Management of Data (Auckland, New Zealand, June 2008); http://hadoop.apache.org/pig/
13. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., and Stonebraker, M. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference (Providence, RI, June 29–July 2). ACM Press, New York, 2009; http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
14. Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure 13, 4 (2005), 277–298; http://labs.google.com/papers/sawzall.html
Jeffrey Dean (jeff@google.com) is a Google Fellow in the Systems Infrastructure Group of Google, Mountain View, CA.

Sanjay Ghemawat (sanjay@google.com) is a Google Fellow in the Systems Infrastructure Group of Google, Mountain View, CA.