its fragment of the Sales table looking
for rows that satisfy the predicate
S.amount > 1000 and S.date
BETWEEN '12/1/2009' and
'12/25/2009'.
Qualifying rows are pipelined into
a shuffle operator that repartitions its
input rows by hashing on the Sales.custId
attribute. By using the same
hash function that was used when
loading rows of the Customer table
(hash-partitioned on the Customer.custId
attribute), the shuffle operators
route each qualifying Sales row to the
node where the matching Customer
tuple is stored, allowing the join operator (C.custId = S.custId) to execute
in parallel on all the nodes.
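For concreteness, a query of roughly this shape would drive the plan just described; the column list and exact schema here are illustrative assumptions, not taken from the text:

    SELECT C.custId, C.name, S.amount, S.date
    FROM   Sales S, Customer C
    WHERE  C.custId = S.custId                           -- join executed locally on each node
      AND  S.amount > 1000                               -- filter applied while scanning the Sales fragment
      AND  S.date BETWEEN '12/1/2009' AND '12/25/2009';

Each node scans only its own Sales partition, applies the filter, and shuffles the survivors on custId so they land on the node holding the matching Customer rows.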
Another key benefit of parallel
DBMSs is that the system automatically manages the various alternative
partitioning strategies for the tables
involved in the query. For example, if
Sales and Customers are each hash-partitioned on their custId attribute, the
query optimizer will recognize that the
two tables are both hash-partitioned
on the joining attributes and omit the
shuffle operator from the compiled
query plan. Likewise, if both tables are
round-robin partitioned, then the optimizer will insert shuffle operators for
both tables so tuples that join with one
another end up on the same node. All
this happens transparently to the user
and to application programs.
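The partitioning strategy is typically declared when each table is created; the syntax varies by vendor, so the following Greenplum-style sketch (with assumed column definitions) is only illustrative:

    CREATE TABLE Customer (
        custId INT,
        name   VARCHAR(100)
    ) DISTRIBUTED BY (custId);     -- hash-partitioned on custId

    CREATE TABLE Sales (
        custId INT,
        amount DECIMAL(10,2),
        date   DATE
    ) DISTRIBUTED BY (custId);     -- co-partitioned with Customer, so no shuffle is needed

    -- Declaring either table DISTRIBUTED RANDOMLY (round-robin) instead would
    -- force the optimizer to insert shuffle operators on both join inputs.

Other systems express the same choice differently (for example, Teradata's PRIMARY INDEX determines the hash distribution), but in every case it is the optimizer, not the application, that decides whether a shuffle is required.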
Many commercial implementations are available, including Teradata, Netezza, DataAllegro (Microsoft),
ParAccel, Greenplum, Aster, Vertica,
and DB2. All run on shared-nothing
clusters of nodes, with tables horizontally partitioned over them.
Mapping Parallel DBMSs onto MapReduce
An attractive quality of the MR programming model is simplicity; an MR program consists of only two functions—
Map and Reduce—written by a user to
process key/value data pairs.7 The input
data set is stored in a collection of partitions in a distributed file system deployed on each node in the cluster. The
program is then injected into a distributed-processing framework and executed in a manner to be described. The MR
model was first popularized by Google
in 2004, and, today, numerous open
source and commercial implementations are available. The most popular
MR system is Hadoop, an open-source
project under development by Yahoo!
and the Apache Software Foundation
(http://hadoop.apache.org/).
The semantics of the MR model are
not unique, as the filtering and transformation of individual data items
(tuples in tables) can be executed by a
modern parallel DBMS using SQL. For
Map operations not easily expressed
in SQL, many DBMSs support user-defined functions18; UDF extensibility
provides the equivalent functionality
of a Map operation. SQL aggregates
augmented with UDFs and user-defined aggregates provide DBMS users
the same MR-style reduce functionality. Lastly, the reshuffle that occurs
between the Map and Reduce tasks in
MR is equivalent to a GROUP BY operation in SQL. Given this, parallel DBMSs
provide the same computing model as
MR, with the added benefit of using a
declarative language (SQL).
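To make the correspondence concrete, consider the canonical MR job of counting occurrences per key over a log. Against a hypothetical table ClickLog(sourceIP, destURL, visitDate), the same computation is a single declarative statement:

    SELECT sourceIP, COUNT(*) AS visits     -- the Reduce: an aggregate applied per group
    FROM   ClickLog
    WHERE  visitDate >= '2009-12-01'        -- the Map: filter/transform each input tuple
    GROUP BY sourceIP;                      -- the GROUP BY plays the role of the MR reshuffle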
The linear scalability of parallel
DBMSs has been widely touted for
two decades10; that is, as nodes are
added to an installation, the database
size can be increased proportionally
while maintaining constant response
times. Several production databases
in the multi-petabyte range are run
by very large customers operating
on clusters of order 100 nodes.13 The
people who manage these systems
do not report the need for additional
parallelism. Thus, parallel DBMSs offer great scalability over the range of
nodes that customers desire. There is
no reason why scalability cannot be
increased dramatically to the levels
reported by Jeffrey Dean and Sanjay
Ghemawat,7 assuming there is customer demand.
Possible Applications
Even though parallel DBMSs are able
to execute the same semantic workload
as MR, several application classes are
routinely mentioned as possible use
cases in which the MR model might be
a better choice than a DBMS. We now
explore five of these scenarios, discussing the ramifications of using one class
of system over another:
ETL and “read once” data sets. The
canonical use of MR is characterized
by the following template of five operations:
˲ Read logs of information from several different sources;
˲ Parse and clean the log data;
˲ Perform complex transformations (such as “sessionalization”);
˲ Decide what attribute data to store; and
˲ Load the information into a DBMS or other storage engine.
These steps are analogous to the
extract, transform, and load phases in
ETL systems; the MR system is essentially “cooking” raw data into useful information that is consumed by another
storage system. Hence, an MR system
can be considered a general-purpose
parallel ETL system.
For parallel DBMSs, many products
perform ETL, including Ascential, Informatica, Jaspersoft, and Talend. The
market is large, as almost all major
enterprises use ETL systems to load
large quantities of data into data warehouses. One reason for this symbiotic
relationship is the clear distinction
as to what each class of system provides to users: DBMSs do not try to do
ETL, and ETL systems do not try to do
DBMS services. An ETL system is typically upstream from a DBMS, as the
load phase usually feeds data directly
into a DBMS.
Complex analytics. In many data-mining and data-clustering applications, the program must make multiple
passes over the data. Such applications
cannot be structured as single SQL aggregate queries, requiring instead a
complex dataflow program where the
output of one part of the application is
the input of another. MR is a good candidate for such applications.
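For a flavor of such a dataflow, one iteration of a k-means-style clustering, sketched here in PostgreSQL-flavored SQL over hypothetical tables Points(id, x, y) and Centroids(cId, x, y), already requires chaining statements so that the output of one pass feeds the next:

    -- Pass 1: assign every point to its nearest current centroid
    CREATE TEMP TABLE Assignments AS
    SELECT p.id,
           (SELECT c.cId
            FROM   Centroids c
            ORDER BY (p.x - c.x)*(p.x - c.x) + (p.y - c.y)*(p.y - c.y)
            LIMIT  1) AS cId
    FROM   Points p;

    -- Pass 2: recompute each centroid as the mean of its assigned points
    UPDATE Centroids c
    SET    x = s.ax, y = s.ay
    FROM  (SELECT a.cId, AVG(p.x) AS ax, AVG(p.y) AS ay
           FROM   Assignments a JOIN Points p ON p.id = a.id
           GROUP BY a.cId) s
    WHERE  c.cId = s.cId;

    -- A driver program repeats both passes (recreating Assignments each time)
    -- until the centroids stop moving.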
Semi-structured data. Unlike a
DBMS, MR systems do not require users to define a schema for their data.
Thus, MR-style systems easily store
and process what is known as “semi-structured” data. In our experience,
such data often looks like key-value
pairs, where the number of attributes
present in any given record varies; this
style of data is typical of Web traffic
logs derived from disparate sources.
With a relational DBMS, one way to
model such data is to use a wide table
with many attributes to accommodate multiple record types. Each unrequired attribute uses NULLs for the