quickly a software system can scan
through a large collection of records.
The task cannot take advantage of any
sorting or indexing and is easy to specify in both MR and SQL. Therefore, one
would expect a lower-level interface
(such as Hadoop) running directly on
top of the file system (HDFS) to execute faster than the more heavyweight
DBMSs.
However, the execution times in the
table here show a surprising result: The
database systems are about two times
faster than Hadoop. We explain some
of the reasons for this conclusion in the
section on architectural differences.
Web log task. The second task is a
conventional SQL aggregation with
a GROUP BY clause on a table of user
visits in a Web server log. Such data
is fairly typical of Web logs, and the
query is commonly used in traffic analytics. For this experiment, we used
a 2TB data set consisting of 155 million records spread over the 100 nodes
(20GB/node). Each system must calculate the total ad revenue generated for
each visited IP address from the logs.
Like the previous task, the records
must all be read, and thus there is no
indexing opportunity for the DBMSs.
One might think that Hadoop would
excel at this task since it is a straightforward calculation, but the results in
the table show that Hadoop is beaten
by the databases by a larger margin
than in the Grep task.
Join task. The final task we discuss
here is a fairly complex join operation
over two tables requiring an additional
aggregation and filtering operation.
The user-visit data set from the previous task is joined with an additional
100GB table of PageRank values for 18
million URLs (1GB/node). The join task
consists of two subtasks that perform a
complex calculation on the two data
sets. In the first part of the task, each
system must find the IP address that
generated the most revenue within a
particular date range in the user visits.
Once these intermediate records are
generated, the system must then calculate the average PageRank of all pages
visited during this interval.
DBMSs ought to be good at analytical queries involving complex join operations (see the table). The DBMSs are
a factor of 36 and 21 respectively faster
than Hadoop. In general, query times
Benchmark performance on a 100-node cluster.
Hadoop
Grep 284s
Web log 1,146s
join 1,158s
DBms-x
194s
740s
32s
Vertica
108x
268s
55s
Hadoop/DBms-x Hadoop/Vertica
1.5x 2.6x
1.6x 4.3x
36.3x 21.0x
for a typical user task fall somewhere
in between these extremes. In the next
section, we explore the reasons for
these results.
Architectural Differences
The performance differences between
Hadoop and the DBMSs can be explained by a variety of factors. Before
delving into the details, we should say
these differences result from implementation choices made by the two
classes of system, not from any fundamental difference in the two models.
For example, the MR processing model
is independent of the underlying storage system, so data could theoretically
be massaged, indexed, compressed,
and carefully laid out on storage during
a load phase, just like a DBMS. Hence,
the goal of our study was to compare
the real-life differences in performance
of representative realizations of the
two models.
Repetitive record parsing. One contributing factor for Hadoop’s slower
performance is that the default configuration of Hadoop stores data in
the accompanying distributed file
system (HDFS), in the same textual
format in which the data was generated. Consequently, this default
storage method places the burden of
parsing the fields of each record on
user code. This parsing task requires
each Map and Reduce task repeatedly
parse and convert string fields into
the appropriate type. Hadoop provides the ability to store data as key/
value pairs as serialized tuples called
SequenceFiles, but despite this ability
it still requires user code to parse the
value portion of the record if it contains multiple attributes. Thus, we
found that using SequenceFiles without compression consistently yielded
slower performance on our benchmark. Note that using SequenceFiles
without compression was but one of
the tactics for possibly improving Ha-
doop’s performance suggested by the
MR community.
In contrast to repetitive parsing in
MR, records are parsed by DBMSs when
the data is initially loaded. This initial
parsing step allows the DBMSs storage
manager to carefully lay out records
in storage such that attributes can be
directly addressed at runtime in their
most efficient storage representation.
As such, there is no record interpretation performed during query execution
in parallel DBMSs.
There is nothing fundamental about
the MR model that says data cannot be
parsed in advance and stored in optimized data structures (that is, trading off some load time for increased
runtime performance). For example,
data could be stored in the underlying file system using Protocol Buffers
( http://code.google.com/p/protobuf/),
Google’s platform-neutral, extensible
mechanism for serializing structured
data; this option is not available in Hadoop. Alternatively, one could move
the data outside the MR framework
into a relational DBMS at each node,
thereby replacing the HDFS storage
layer with DBMS-style optimized storage for structured data.
4
There may be ways to improve the
Hadoop system by taking advantage of
these ideas. Hence, parsing overhead
is a problem, and SequenceFiles are
not an effective solution. The problem
should be viewed as a signpost for guiding future development.
Compression. We found that enabling
data compression in the DBMSs delivered a significant performance gain.
The benchmark results show that using
compression in Vertica and DBMS-X on
these workloads improves performance
by a factor of two to four. On the other
hand, Hadoop often executed slower
when we used compression on its input
files; at most, compression improved
performance by 15%; the benchmark
results in Dean and Ghemawat7 also