did not use compression.
It is unclear to us why this improvement was insignficant, as essentially
all commercial SQL data warehouses
use compression to improve performance. We postulate that commercial DBMSs use carefully tuned compression algorithms to ensure that
the cost of decompressing tuples
does not offset the performance gains
from the reduced I/O cost of reading
compressed data. For example, we
have found that on modern processors standard Unix implementations
of gzip and bzip are often too slow to
provide any benefit.
Pipelining. All parallel DBMSs operate by creating a query plan that is
distributed to the appropriate nodes
at execution time. When one operator in this plan must send data to the
next operator, regardless of whether
that operator is running on the same
or a different node, the qualifying
data is “pushed” by the first operator
to the second operator. Hence, data is
streamed from producer to consumer;
the intermediate data is never written
to disk; the resulting “back-pressure”
in the runtime system will stall the producer before it has a chance to overrun
the consumer. This streaming technique differs from the approach taken
in MR systems, where the producer
writes the intermediate results to local data structures, and the consumer
subsequently “pulls” the data. These
data structures are often quite large,
so the system must write them out to
disk, introducing a potential bottleneck. Though writing data structures
to disk gives Hadoop a convenient way
to checkpoint the output of intermediate map jobs, thereby improving fault
tolerance, we found from our investigation that it adds significant performance overhead.
Scheduling. In a parallel DBMS,
each node knows exactly what it must
do and when it must do it according to
the distributed query plan. Because the
operations are known in advance, the
system is able to optimize the execution plan to minimize data transmission between nodes. In contrast, each
task in an MR system is scheduled on
processing nodes one storage block at
a time. Such runtime work scheduling at a granularity of storage blocks is
much more expensive than the DBMS
The commercial
DBms products
must move
toward one-button
installs, automatic
tuning that works
correctly, better
Web sites with
example code,
better query
generators,
and better
documentation.
compile-time scheduling. The former
has the advantage, as some have argued,
4 of allowing the MR scheduler
to adapt to workload skew and performance differences between nodes.
Column-oriented storage. In a column
store-based database (such as Vertica),
the system reads only the attributes
necessary for solving the user query.
This limited need for reading data represents a considerable performance
advantage over traditional, row-stored
databases, where the system reads all
attributes off the disk. DBMS-X and
Hadoop/HDFS are both essentially row
stores, while Vertica is a column store,
giving Vertica a significant advantage
over the other two systems in our Web
log benchmark task.
Discussion. The Hadoop community
will presumably fix the compression
problem in a future release. Furthermore, some of the other performance
advantages of parallel databases (such
as column-storage and operating directly on compressed data) can be
implemented in an MR system with
user code. Also, other implementations of the MR framework (such as
Google’s proprietary implementation)
may well have a different performance
envelope. The scheduling mechanism
and pull model of data transmission
are fundamental to the MR block-level
fault-tolerance model and thus unlikely to be changed.
Meanwhile, DBMSs offer transac-tion-level fault tolerance. DBMS researchers often point out that as databases get bigger and the number of
nodes increases, the need for finer-granularity fault tolerance increases as
well. DBMSs readily adapt to this need
by marking one or more operators in a
query plan as “restart operators.” The
runtime system saves the result of these
operators to disk, facilitating “operator
level” restart. Any number of operators
can be so marked, allowing the granularity of restart to be tuned. Such a
mechanism is easily integrated into the
efficient query execution framework of
DBMSs while allowing variable granularity restart. We know of at least two
separate research groups, one at the
University of Washington, the other at
the University of California, Berkeley,
that are exploring the trade-off between
runtime overhead and the amount of
work lost when a failure occurs.