contributed articles
DOI: 10.1145/1629175.1629197
MapReduce complements DBMSs since
databases are not designed for extract-
transform-load tasks, a MapReduce specialty.
By Michael Stonebraker, Daniel Abadi,
David J. DeWitt, Sam Madden, Erik Paulson,
Andrew Pavlo, and Alexander Rasin
MapReduce and Parallel DBMSs: Friends or Foes?
The MapReduce[7] (MR) paradigm has been hailed as a
revolutionary new platform for large-scale, massively
parallel data access.[16] Some proponents claim the
extreme scalability of MR will relegate relational
database management systems (DBMS) to the status
of legacy technology. At least one enterprise, Facebook,
has implemented a large data warehouse system
using MR technology rather than a DBMS.[14]
Here, we argue that using MR systems to perform
tasks that are best suited for DBMSs yields less than
satisfactory results,[17] concluding that MR is more
like an extract-transform-load (ETL) system than a
DBMS, as it quickly loads and processes large amounts of data in an
ad hoc manner. As such, it complements DBMS technology rather than
competing with it. We also discuss the
differences in the architectural decisions of MR systems and database
systems and provide insight into how
the systems should complement one
another.
The technology press has been focusing on the revolution of “cloud
computing,” a paradigm that entails
the harnessing of large numbers of
processors working in parallel to solve
computing problems. In effect, this
suggests constructing a data center by
lining up a large number of low-end
servers, rather than deploying a smaller set of high-end servers. Along with
this interest in clusters has come a
proliferation of tools for programming
them. MR is one such tool, an attractive option to many because it provides
a simple model through which users
are able to express relatively sophisticated distributed programs.
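As a rough illustration of that model, the following single-process Python sketch runs a word count through a map step, a grouping ("shuffle") step, and a reduce step. The function names (`map_words`, `reduce_counts`, `run_job`) and the sample input are hypothetical and chosen only for this example; a real MR system would distribute the same steps across many machines and add fault tolerance, which this sketch does not attempt.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce: combine all values that were emitted for one key."""
    return word, sum(counts)


def run_job(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[tuple[str, int]]],
    reducer: Callable[[str, list[int]], tuple[str, int]],
) -> dict[str, int]:
    # Shuffle: group intermediate values by key. Here this is one in-memory
    # dictionary; in a cluster the grouping happens across machines.
    groups: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Apply the reducer once per distinct key.
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical input records, one line of text each.
lines = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = run_job(lines, map_words, reduce_counts)
print(word_counts["the"], word_counts["fox"])
```

The user writes only the two small functions; the framework (here, `run_job`) owns the grouping and scheduling, which is what makes the model simple to program against.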
Given the interest in the MR model
both commercially and academically,
it is natural to ask whether MR systems should replace parallel database
systems. Parallel DBMSs first became
available commercially nearly two decades ago, and today systems are available
from about a dozen vendors.
As robust, high-performance computing platforms, they provide a high-level programming environment that
is inherently parallelizable. Although
it might seem that MR and parallel
DBMSs are different, it is possible to
write almost any parallel-processing
task as either a set of database queries
or a set of MR jobs.
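To make that equivalence concrete, here is a minimal sketch that computes the same per-key total twice: once as a SQL `GROUP BY` query and once as a map/shuffle/reduce pass. The `visits` table, its columns, and the sample rows are assumptions invented for this example, and SQLite (a single-node embedded database) merely stands in for a parallel DBMS; the point is only that both formulations express the same computation.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (site, revenue) pairs.
visits = [("a.com", 3), ("b.com", 5), ("a.com", 4), ("c.com", 1), ("b.com", 2)]


def total_by_sql(rows: list[tuple[str, int]]) -> dict[str, int]:
    """Express the task as a declarative database query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (site TEXT, revenue INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    cursor = conn.execute("SELECT site, SUM(revenue) FROM visits GROUP BY site")
    result = dict(cursor)  # each row is a (site, total) pair
    conn.close()
    return result


def total_by_mr(rows: list[tuple[str, int]]) -> dict[str, int]:
    """Express the same task as map, shuffle, and reduce steps."""
    groups: dict[str, list[int]] = defaultdict(list)
    for site, revenue in rows:          # map: emit (site, revenue)
        groups[site].append(revenue)    # shuffle: group values by key
    return {site: sum(values) for site, values in groups.items()}  # reduce


sql_totals = total_by_sql(visits)
mr_totals = total_by_mr(visits)
print(sql_totals == mr_totals)
```

In the SQL version the grouping and aggregation are left to the query engine, while in the MR version the programmer spells them out as code.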
Our discussions with MR users lead
us to conclude that the most common
use case for MR is more like an ETL system. As such, it is complementary to
DBMSs, not a competing technology,
since databases are not designed to be
good at ETL tasks. Here, we describe
what we believe is the ideal use of MR
technology and highlight the different
MR and parallel DBMS markets.