values that are not present for a given
record. Row-based DBMSs generally
have trouble with the tables, often suffering poor performance. On the other
hand, column-based DBMSs (such as
Vertica) mitigate the problem by reading only the relevant attributes for any
query and automatically suppressing
the NULL values.3 These techniques
have been shown to provide good performance on RDF data sets,2 and we
expect the same would be true for simpler key-value data.
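The column-store advantage on sparse data can be seen in a toy sketch (plain Python lists and dicts, not Vertica's actual storage format; the attributes are hypothetical): a row store must reserve a slot for every attribute of every record, NULLs included, while a column store keeps one column per attribute and simply omits absent values.

```python
# Toy sketch of the two layouts (not Vertica's actual storage format);
# "color" and "size" are hypothetical attributes.
records = [
    {"id": 1, "color": "red"},
    {"id": 2, "size": "XL"},
    {"id": 3, "color": "blue", "size": "S"},
]
attrs = ["id", "color", "size"]

# Row store: every record reserves a slot for every attribute, NULLs included.
row_store = [[r.get(a) for a in attrs] for r in records]

# Column store: one column per attribute; absent values take no space at all.
col_store = {a: {r["id"]: r[a] for r in records if a in r} for a in attrs}

# A query over "color" reads only that column and never touches "size".
reds = [rid for rid, v in col_store["color"].items() if v == "red"]
print(reds)  # [1]
```

Here the query over `color` never examines the `size` column at all, which is the "reading only the relevant attributes" behavior described above.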
To the extent that semistructured
data fits the “cooking” paradigm discussed earlier (that is, the data is prepared for loading into a back-end data-processing system), then MR-style
systems are a good fit. If the semistructured data set is primarily for analytical
queries, we expect a parallel column
store to be a better solution.
Quick-and-dirty analyses. One disappointing aspect of many current parallel DBMSs is that they are difficult to
install and configure properly, as users
are often faced with a myriad of tuning
parameters that must be set correctly
for the system to operate effectively.
From our experiences with installing
two commercial parallel systems, an
open-source MR implementation provides the best “out-of-the-box” experience17; that is, we were able to get the
MR system up and running queries
significantly faster than either of the
DBMSs. In fact, it was not until we received expert support from one of the
vendors that we were able to get one
particular DBMS to run queries that
completed in minutes, rather than
hours or days.
Once a DBMS is up and running
properly, programmers must still
write a schema for their data (if one
does not already exist), then load the
data set into the system. This process
takes considerably longer in a DBMS
than in an MR system, because the
DBMS must parse and verify each datum in the tuples. In contrast, the default (therefore most common) way for
MR programmers to load their data is
to just copy it into the MR system’s underlying distributed block-based storage system.
If a programmer must perform some
one-off analysis on transient data, then
the MR model’s quick startup time is
clearly preferable. On the other hand,
professional DBMS programmers and
administrators are more willing to
pay in terms of longer learning curves
and startup times, because the performance gains from faster queries offset
the upfront costs.
Limited-budget operations. Another
strength of MR systems is that most
are open source projects available for
free. DBMSs, and in particular parallel
DBMSs, are expensive; though there
are good single-node open source
solutions, to the best of our knowledge, there are no robust, community-supported parallel DBMSs. Though
enterprise users with heavy demand
and big budgets might be willing to
pay for a commercial system and all
the tools, support, and service agreements those systems provide, users
with more modest budgets or requirements find open source systems more
attractive. The database community
has missed an opportunity by not providing a more complete parallel, open
source solution.
Powerful tools. MR systems are fundamentally powerful tools for ETL-style
applications and for complex analytics. Additionally, they are popular for
“quick and dirty” analyses and for users with limited budgets. On the other
hand, if the application is query-intensive, whether semistructured or rigidly
structured, then a DBMS is probably
the better choice. In the next section,
we discuss results from use cases that
demonstrate this performance superiority; the processing tasks range from
those MR systems ought to be good at to
quite complex queries.
DBMS “Sweet Spot”
To demonstrate the performance
trade-offs between parallel DBMSs and
MR systems, we published a benchmark comparing two parallel DBMSs
to the Hadoop MR framework on a variety of tasks.17 We wished to discover
the performance envelope of each approach when applied to areas inside
and outside their target application
space. We used two database systems:
Vertica, a commercial column-store
relational database, and DBMS-X, a
row-based database from a large commercial vendor. Our benchmark study
included a simple benchmark presented in the original MR paper from
Google,7 as well as four other analytical
tasks of increasing complexity we
think are common processing tasks
that could be done using either class
of systems. We ran all experiments on
a 100-node shared-nothing cluster at
the University of Wisconsin-Madison.
The full paper17 includes the complete
results and discussion from all our experiments, including load times; here,
we provide a summary of the most
interesting results. (The source code
for the benchmark study is available
at http://database.cs.brown.edu/projects/mapreduce-vs-dbms/.)
Hadoop is by far the most popular
publicly available version of the MR
framework (the Google version might
be faster but is not available to us), and
DBMS-X and Vertica are popular row- and column-store parallel database
systems, respectively.
In the time since publication of
Pavlo et al.,17 we have continued to
tune all three systems. Moreover, we
have received many suggestions from
the Hadoop community on ways to
improve performance. We have tried
them all, and the results here (as of
August 2009) represent the best we
can do with a substantial amount of
expert help on all three systems. In
fact, the time we’ve spent tuning Hadoop has now exceeded the time we
spent on either of the other systems.
Though Hadoop offers a good out-of-the-box experience, tuning it to obtain
maximum performance was an arduous task. Obviously, performance is a
moving target, as new releases of all
three products occur regularly.
Original MR Grep task. Our first
benchmark experiment is the “Grep
task” from the original MR paper,
which described it as “representative
of a large subset of the real programs
written by users of MapReduce.”7
For the task, each system must scan
through a data set of 100B (that is, 100-byte) records
looking for a three-character pattern.
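A minimal sketch of the task's map function (Python, with an illustrative pattern and in-memory records; the benchmark's actual implementation is in the published source code): the reduce phase is the identity, and on either DBMS the task is a single predicate query.

```python
# Hedged sketch of the Grep task's map function; the pattern and the sample
# records below are illustrative, not taken from the benchmark code.
PATTERN = b"XYZ"  # a three-character search pattern (hypothetical choice)

def map_fn(record: bytes):
    key, value = record[:10], record[10:]   # 10B unique key, 90B random value
    if PATTERN in value:
        yield key, value                    # emit only matching records

# Two 100B records; only the first contains the pattern in its value.
records = [b"0" * 10 + b"a" * 40 + b"XYZ" + b"a" * 47,
           b"1" * 10 + b"b" * 90]
matches = [kv for r in records for kv in map_fn(r)]
print(len(matches))  # 1

# The rough SQL equivalent on either DBMS is a single LIKE predicate:
#   SELECT * FROM data WHERE field LIKE '%XYZ%';
```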
Each record consists of a unique key
in the first 10B, followed by a 90B
random value. The search pattern is
found only in the last 90B once in every 10,000 records. We use a 1TB data
set spread over the 100 nodes (10GB/
node). The data set consists of 10 billion records, each 100B. Since this is
essentially a sequential search of the
data set looking for the pattern, it provides a simple measurement of how