express the type of computations supported by MapReduce. A 2009 paper
by Andrew Pavlo et al. (referred to here
as the “comparison paper”
13) compared the performance of MapReduce
and parallel databases. It evaluated
the open source Hadoop implementation10 of the MapReduce programming
model, DBMS-X (an unidentified commercial database system), and Vertica
(a column-store database system from
a company co-founded by one of the
authors of the comparison paper). Earlier blog posts by some of the paper’s
authors characterized MapReduce as
“a major step backwards.”
5, 6 In this
article, we address several misconceptions about MapReduce in these three
publications:
MapReduce cannot use indices and ˲
implies a full scan of all input data;
MapReduce input and outputs are ˲
always simple files in a file system; and
MapReduce requires the use of in- ˲
efficient textual data formats.
We also discuss other important issues:
MapReduce is storage-system inde- ˲
pendent and can process data without
first requiring it to be loaded into a database. In many cases, it is possible to
run 50 or more separate MapReduce
analyses in complete passes over the
data before it is possible to load the data
into a database and complete a single
analysis;
Complicated transformations are ˲
often easier to express in MapReduce
than in SQL; and
Many conclusions in the compari- ˲
son paper were based on implementation and evaluation shortcomings not
fundamental to the MapReduce model;
we discuss these shortcomings later in
this article.
We encourage readers to read the
original MapReduce paper4 and the
comparison paper13 for more context.
Heterogenous systems
Many production environments contain a mix of storage systems. Customer
data may be stored in a relational database, and user requests may be logged
to a file system. Furthermore, as such
environments evolve, data may migrate
to new storage systems. MapReduce
provides a simple model for analyzing
data in such heterogenous systems.
End users can extend MapReduce to
support a new storage system by defining simple reader and writer implementations that operate on the storage
system. Examples of supported storage
systems are files stored in distributed
file systems,
7 database query results,
2, 9
data stored in Bigtable,
3 and structured
input files (such as B-trees). A single
MapReduce operation easily processes
and combines data from a variety of
storage systems.
Now consider a system in which a
parallel DBMS is used to perform all
data analysis. The input to such analysis must first be copied into the parallel
DBMS. This loading phase is inconvenient. It may also be unacceptably slow,
especially if the data will be analyzed
only once or twice after being loaded.
For example, consider a batch-oriented
Web-crawling-and-indexing system
that fetches a set of Web pages and
generates an inverted index. It seems
awkward and inefficient to load the set
of fetched pages into a database just so
they can be read through once to generate an inverted index. Even if the cost of
loading the input into a parallel DBMS
is acceptable, we still need an appropriate loading tool. Here is another place
MapReduce can be used; instead of
writing a custom loader with its own ad
hoc parallelization and fault-tolerance
support, a simple MapReduce program
can be written to load the data into the
parallel DBMS.
indices
The comparison paper incorrectly said
that MapReduce cannot take advantage of pregenerated indices, leading
to skewed benchmark results in the
paper. For example, consider a large
data set partitioned into a collection
of nondistributed databases, perhaps
using a hash function. An index can
be added to each database, and the
result of running a database query using this index can be used as an input
to MapReduce. If the data is stored in
D database partitions, we will run D
database queries that will become the
D inputs to the MapReduce execution.
Indeed, some of the authors of Pavlo et
al. have pursued this approach in their
more recent work.
11
Another example of the use of indices is a MapReduce that reads from
Bigtable. If the data needed maps to a
sub-range of the Bigtable row space, we
would need to read only that sub-range
instead of scanning the entire Bigtable.
Furthermore, like Vertica and other column-store databases, we will read data
only from the columns needed for this
analysis, since Bigtable can store data
segregated by columns.
Yet another example is the processing of log data within a certain date
range; see the Join task discussion in
the comparison paper, where the Hadoop benchmark reads through 155
million records to process the 134,000
records that fall within the date range
of interest. Nearly every logging system we are familiar with rolls over to
a new log file periodically and embeds
the rollover time in the name of each
log file. Therefore, we can easily run a
MapReduce operation over just the log
files that may potentially overlap the
specified date range, instead of reading
all log files.
complex functions
Map and Reduce functions are often
fairly simple and have straightforward
SQL equivalents. However, in many
cases, especially for Map functions, the
function is too complicated to be expressed easily in a SQL query, as in the
following examples:
Extracting the set of outgoing links ˲
from a collection of HTML documents
and aggregating by target document;
Stitching together overlapping sat- ˲
ellite images to remove seams and to
select high-quality imagery for Google
Earth;
Generating a collection of inverted ˲
index files using a compression scheme
tuned for efficient support of Google
search queries;
Processing all road segments in the ˲
world and rendering map tile images
that display these segments for Google
Maps; and
Fault-tolerant parallel execution of ˲
programs written in higher-level languages (such as Sawzall14 and Pig Latin12) across a collection of input data.
Conceptually, such user defined
functions (UDFs) can be combined
with SQL queries, but the experience
reported in the comparison paper indicates that UDF support is either buggy
(in DBMS-X) or missing (in Vertica).
These concerns may go away over the
long term, but for now, MapReduce is a
better framework for doing more com-