We recently conducted a benchmark study using a popular open-source MR implementation and two
parallel DBMSs.17 The results show
that the DBMSs are substantially faster than the MR system once the data
is loaded, but that loading the data
takes considerably longer in the database systems. Here, we discuss the
source of these performance differences, including the limiting architectural factors we perceive in the two
classes of system, and conclude with
lessons the MR and DBMS communities can learn from each other, along
with future trends in large-scale data
analysis.
Parallel Database Systems
In the mid-1980s, the Teradata20 and Gamma9 projects pioneered a new
architectural paradigm for parallel
database systems based on a cluster of commodity computers called
“shared-nothing nodes” (or separate
CPU, memory, and disks) connected
through a high-speed interconnect.19
Every parallel database system built
since then essentially uses the techniques first pioneered by these two
projects: horizontal partitioning of
relational tables, along with the partitioned execution of SQL queries.
The idea behind horizontal partitioning is to distribute the rows of a
relational table across the nodes of
the cluster so they can be processed
in parallel. For example, partitioning
a 10-million-row table across a cluster of 50 nodes, each with four disks,
would place 50,000 rows on each of
the 200 disks. Most parallel database
systems offer a variety of partitioning
strategies, including hash, range, and
round-robin partitioning.8 Under a hash-partitioning physical layout, a hash function is applied to one or more attributes of each row as it is loaded to determine the target node and disk where the row should be stored.
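To make the idea concrete, here is a minimal Python sketch of how a hash-partitioning scheme could route rows to storage. The node and disk counts, the choice of custId as the partitioning attribute, and the place_row helper are hypothetical assumptions for this example; a real DBMS would use a stable hash function and its own catalog metadata.

# Illustrative sketch of hash partitioning, not an actual DBMS implementation.
# Assumes a cluster of 50 nodes with 4 disks each, partitioned on custId.
NUM_NODES = 50
DISKS_PER_NODE = 4

def place_row(row):
    # Hash the partitioning attribute to choose one of the 200 disks,
    # then derive the owning node and the disk within that node.
    # (A real system would use a hash function that is stable across restarts.)
    disk_id = hash(row["custId"]) % (NUM_NODES * DISKS_PER_NODE)
    return disk_id // DISKS_PER_NODE, disk_id % DISKS_PER_NODE

# Example: route one Sales row to its target (node, disk) pair.
print(place_row({"custId": 42, "amount": 17.50, "date": "12/10/2009"}))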
The use of horizontal partitioning
of tables across the nodes of a cluster
is critical to obtaining scalable performance of SQL queries8 and leads
naturally to the concept of partitioned
execution of the SQL operators: selection, aggregation, join, projection, and
update. As an example of how data partitioning is used in a parallel DBMS, consider the following SQL query:
SELECT custId, amount FROM Sales
WHERE date BETWEEN
'12/1/2009' AND '12/25/2009';
With the Sales table horizontally
partitioned across the nodes of the
cluster, this query can be trivially
executed in parallel by executing a
SELECT operator against the Sales
records with the specified date predicate on each node of the cluster. The
intermediate results from each node
are then sent to a single node that performs a MERGE operation in order to
return the final result to the application program that issued the query.

[Figure: Parallel database query execution plans. (a) Example operator pipeline for calculating a single-table aggregate. (b) Example operator pipeline for performing a join on two partitioned tables.]
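A minimal Python sketch of this select-then-merge strategy, assuming each node's fragment of Sales is simply a list of dictionaries in memory; the data, the local_select and merge helpers, and the two-node layout are illustrative assumptions, not part of any real system.

from datetime import date

def local_select(fragment, start, end):
    # SELECT custId, amount ... WHERE date BETWEEN start AND end,
    # evaluated only against the rows stored on this node.
    return [(r["custId"], r["amount"]) for r in fragment
            if start <= r["date"] <= end]

def merge(partial_results):
    # The coordinator node concatenates the per-node results
    # before returning them to the application.
    merged = []
    for rows in partial_results:
        merged.extend(rows)
    return merged

# Two node-local fragments of the Sales table.
fragments = [
    [{"custId": 1, "amount": 25.0, "date": date(2009, 12, 3)}],
    [{"custId": 2, "amount": 80.0, "date": date(2009, 11, 2)}],
]
partials = [local_select(f, date(2009, 12, 1), date(2009, 12, 25)) for f in fragments]
print(merge(partials))  # [(1, 25.0)]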
Suppose we would like to know
the total sales amount for each custId
within the same date range. This is
done through the following query:
SELECT custId, SUM(amount)
FROM Sales
WHERE date BETWEEN
'12/1/2009' AND '12/25/2009'
GROUP BY custId;
If the Sales table is round-robin
partitioned across the nodes in the
cluster, then the rows corresponding
to any single customer will be spread
across multiple nodes. The DBMS
compiles this query into the three-operator pipeline in Figure (a), then executes the query plan on all the nodes
in the cluster in parallel. Each SELECT
operator scans the fragment of the
Sales table stored at that node. Any
rows satisfying the date predicate are
passed to a SHUFFLE operator that dynamically repartitions the rows; this
is typically done by applying a hash
function on the value of the custId attribute of each row to map it to a
particular node. Since the same hash
function is used for the SHUFFLE operation on all nodes, rows for the same
customer are routed to the single node
where they are aggregated to compute
the final total for each customer.
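The following Python sketch mimics that SELECT, SHUFFLE, and SUM pipeline over in-memory fragments. Hashing custId modulo a node count stands in for the shuffle's hash function, and the node count, helper names, and data layout are assumptions made purely for illustration.

from collections import defaultdict

NUM_NODES = 4

def shuffle_target(cust_id):
    # The same hash function runs on every node, so all rows for a
    # given customer are routed to the same destination node.
    return cust_id % NUM_NODES

def run_pipeline(fragments, start, end):
    inboxes = defaultdict(list)        # destination node -> rows received
    for fragment in fragments:         # SELECT phase, conceptually parallel
        for row in fragment:
            if start <= row["date"] <= end:
                inboxes[shuffle_target(row["custId"])].append(row)   # SHUFFLE
    totals = defaultdict(float)        # SUM phase at each destination node
    for rows in inboxes.values():
        for row in rows:
            totals[row["custId"]] += row["amount"]
    return dict(totals)                # custId -> total amount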
As a final example of how SQL is parallelized using data partitioning, consider the following query for finding
the names and email addresses of customers who purchased an item costing
more than $1,000 during the holiday
shopping period:
SELECT C.name, C.email FROM
Customers C, Sales S
WHERE C.custId = S.custId
AND S.amount > 1000
AND S.date BETWEEN
'12/1/2009' AND
'12/25/2009';
Assume again that the Sales table is
round-robin partitioned, but we now
hash-partition the Customers table
on the Customers.custId attribute. The
DBMS compiles this query into the
operator pipeline in Figure (b) that is
executed in parallel at all nodes in the
cluster. Each SELECT operator scans