We generally expect ETL and complex analytics to be amenable to MR systems, and query-intensive workloads to be run by DBMSs. Hence, we expect the best solution is to interface an MR framework to a DBMS so that MR can do complex analytics and interface to a DBMS to do embedded queries. HadoopDB [4], Aster, Greenplum, Cloudera, and Vertica all have commercially available products or prototypes in this "hybrid" category.
Learning from Each Other
What can MR learn from DBMSs? MR
advocates should learn from parallel
DBMSs the technologies and techniques for efficient parallel query execution.
Engineers should stand on the shoulders of those who went before, rather
than on their toes. There are many
good ideas in parallel DBMS executors
that MR system developers would be
wise to adopt.
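One family of ideas from parallel DBMS executors is hash-partitioned (exchange-based) aggregation: repartition rows on the grouping key so each worker owns disjoint groups and computes complete results locally, with no global merge of partial states afterward. The following is a minimal illustrative sketch of that idea, not code from any particular system; the table, column names, and worker count are hypothetical, and the workers are simulated sequentially for clarity.

```python
from collections import defaultdict

def partition(rows, n_workers, key):
    """Hash-partition rows on the grouping key, as a parallel DBMS
    exchange operator would, so each worker owns disjoint groups."""
    parts = [[] for _ in range(n_workers)]
    for row in rows:
        parts[hash(row[key]) % n_workers].append(row)
    return parts

def local_aggregate(rows, key, value):
    """Per-worker aggregation: because rows were partitioned on the
    grouping key, each worker's sums are final for the groups it owns."""
    acc = defaultdict(int)
    for row in rows:
        acc[row[key]] += row[value]
    return acc

def parallel_sum(rows, key, value, n_workers=4):
    parts = partition(rows, n_workers, key)
    result = {}
    for part in parts:  # each iteration stands in for one worker
        result.update(local_aggregate(part, key, value))
    return result

rows = [{"dept": "a", "sales": 10}, {"dept": "b", "sales": 5},
        {"dept": "a", "sales": 7}]
print(parallel_sum(rows, "dept", "sales"))  # e.g. {'a': 17, 'b': 5} (key order may vary)
```

Because the repartition guarantees disjoint ownership, the final `update` calls never conflict; an MR shuffle achieves a similar effect, but DBMS executors add refinements (pipelining, pre-aggregation before the exchange) that MR implementations could adopt.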
We also feel that higher-level languages are invariably a good idea for
any data-processing system. Relational DBMSs have been fabulously
successful in pushing programmers
to a higher, more-productive level of
abstraction, where they simply state
what they want from the system, rather than writing an algorithm for how
to get what they want from the system.
In our benchmark study, we found that writing the SQL code for each task was substantially easier than writing the equivalent MR code. Efforts to build higher-level interfaces on top of MR/Hadoop should be accelerated; we applaud Hive [21] and other projects that point the way in this area.
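The gap between the two levels of abstraction is easy to see on even a trivial task. A grouped sum is one declarative line of SQL, whereas an MR-style program must spell out the map, shuffle, and reduce phases by hand. The sketch below is a hypothetical illustration (the `emp` table and its columns are invented for the example), with the MR-style phases written out in plain Python:

```python
from itertools import groupby
from operator import itemgetter

# Declaratively, the whole task is one line of SQL:
#   SELECT dept, SUM(sales) FROM emp GROUP BY dept;
# Imperatively, an MR-style program states *how*: a map phase emitting
# key/value pairs, a shuffle that sorts and groups by key, and a
# reduce phase folding each group.

def mr_group_sum(records):
    mapped = [(r["dept"], r["sales"]) for r in records]        # map
    mapped.sort(key=itemgetter(0))                             # shuffle/sort
    return {k: sum(v for _, v in grp)                          # reduce
            for k, grp in groupby(mapped, key=itemgetter(0))}

emp = [{"dept": "a", "sales": 10}, {"dept": "b", "sales": 5},
       {"dept": "a", "sales": 7}]
print(mr_group_sum(emp))  # {'a': 17, 'b': 5}
```

The imperative version also hard-codes one execution strategy (sort-based grouping); a declarative query leaves the optimizer free to pick hash- or sort-based plans as the data dictates.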
What can DBMSs learn from MR?
The out-of-the-box experience for most DBMSs is less than ideal when it comes to quickly setting up and running queries. The commercial DBMS products must move toward one-button installs, automatic tuning that works correctly, better Web sites with example code, better query generators, and better documentation.
Most database systems cannot deal
with tables stored in the file system (in
situ data). Consider the case where a
DBMS is used to store a very large data
set on which a user wishes to perform
analysis in conjunction with a smaller,
private data set. In order to access the
larger data set, the user must first load
the data into the DBMS. Unless the
user plans to run many analyses, it is
preferable to simply point the DBMS
at data on the local disk without a
load phase. There is no good reason
DBMSs cannot deal with in situ data.
Though some database systems (such
as PostgreSQL, DB2, and SQL Server)
have capabilities in this area, further
flexibility is needed.
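The cost the load phase imposes is easy to make concrete. The sketch below contrasts an in situ scan of raw data with the load-then-query path through a DBMS, using Python's built-in SQLite as a stand-in DBMS and an invented three-row sales file; both paths compute the same answer, but the second must copy every row into the database first:

```python
import csv, io, sqlite3

csv_data = "dept,sales\na,10\nb,5\na,7\n"  # stands in for a file on local disk

# In situ: scan the raw data directly, with no load phase.
def scan_sum(text, dept):
    rdr = csv.DictReader(io.StringIO(text))
    return sum(int(r["sales"]) for r in rdr if r["dept"] == dept)

# Load-then-query: copy everything into the DBMS first, even if
# only one analysis will ever run against it.
def load_then_query(text, dept):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE emp (dept TEXT, sales INT)")
    rdr = csv.DictReader(io.StringIO(text))
    con.executemany("INSERT INTO emp VALUES (?, ?)",
                    ((r["dept"], r["sales"]) for r in rdr))
    (total,) = con.execute(
        "SELECT SUM(sales) FROM emp WHERE dept = ?", (dept,)).fetchone()
    con.close()
    return total

print(scan_sum(csv_data, "a"), load_then_query(csv_data, "a"))  # 17 17
```

Systems that do support in situ access expose it as external or foreign tables (for example, PostgreSQL's file_fdw extension), which give the user the first path while keeping the SQL interface of the second.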
Conclusion
Most of the architectural differences discussed here are the result of the different focuses of the two classes of system. Parallel DBMSs excel at efficient querying of large data sets; MR-style systems excel at complex analytics and ETL tasks. Neither is good at what the other does well. Hence, the two technologies are complementary, and we expect MR-style systems performing ETL to live directly upstream from DBMSs.
Many complex analytical problems
require the capabilities provided by
both systems. This requirement motivates the need for interfaces between
MR systems and DBMSs that allow each
system to do what it is good at. The result is a much more efficient overall
system than if one tries to do the entire
application in either system. That is,
“smart software” is always a good idea.
Acknowledgments
This work is supported in part by National Science Foundation grants CRI-0707437, CluE-0844013, and CluE-
References
1. Abadi, D.J., Madden, S.R., and Hachem, N. Column-stores vs. row-stores: How different are they really? In Proceedings of the SIGMOD Conference on Management of Data. ACM Press, New York, 2008.
2. Abadi, D.J., Marcus, A., Madden, S.R., and Hollenbach, K. Scalable semantic Web data management using vertical partitioning. In Proceedings of the 33rd International Conference on Very Large Databases, 2007.
3. Abadi, D.J. Column-stores for wide and sparse data. In Proceedings of the Conference on Innovative Data Systems Research, 2007.
4. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D.J., Silberschatz, A., and Rasin, A. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In Proceedings of the Conference on Very Large Databases, 2009.
5. Boral, H. et al. Prototyping Bubba, a highly parallel database system. IEEE Transactions on Knowledge and Data Engineering 2, 1 (Mar. 1990), 4–24.
6. Chaiken, R., Jenkins, B., Larson, P., Ramsey, B., Shakib, D., Weaver, S., and Zhou, J. SCOPE: Easy and efficient parallel processing of massive data sets. In Proceedings of the Conference on Very Large Databases.
7. Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth Conference on Operating System Design and Implementation (Berkeley, CA, 2004).
8. DeWitt, D.J. and Gray, J. Parallel database systems: The future of high-performance database systems. Commun. ACM 35, 6 (June 1992), 85–98.
9. DeWitt, D.J., Gerber, R.H., Graefe, G., Heytens, M.L., Kumar, K.B., and Muralikrishna, M. GAMMA: A high-performance dataflow database machine. In Proceedings of the 12th International Conference on Very Large Databases. Morgan Kaufmann Publishers, Inc., 1986, 228–237.
10. Englert, S., Gray, J., Kocher, T., and Shah, P. A benchmark of NonStop SQL Release 2 demonstrating near-linear speedup and scaleup on large databases. SIGMETRICS Performance Evaluation Review 18, 1 (1990), 245–246.
11. Fushimi, S., Kitsuregawa, M., and Tanaka, H. An overview of the system software of a parallel relational database machine. In Proceedings of the 12th International Conference on Very Large Databases. Morgan Kaufmann Publishers, Inc., 1986, 209–219.
12. Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. Dryad: Distributed data-parallel programs from sequential building blocks. SIGOPS Operating System Review 41, 3 (2007), 59–72.
13. Monash, C. Some very, very, very large data warehouses. In NetworkWorld.com community blog, May 12, 2009; http://www.networkworld.com/
14. Monash, C. Cloudera presents the MapReduce bull case. In DBMS2.com blog, Apr. 15, 2009; http://www.
15. Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the SIGMOD Conference. ACM Press, New York, 2008, 1099–1110.
16. Patterson, D.A. Technical perspective: The data center is the computer. Commun. ACM 51, 1 (Jan. 2008), 105.
17. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S.R., and Stonebraker, M. A comparison of approaches to large-scale data analysis. In Proceedings of the 35th SIGMOD International Conference on Management of Data. ACM Press, New York, 2009, 165–178.
18. Stonebraker, M. and Rowe, L. The design of Postgres. In Proceedings of the SIGMOD Conference, 1986.
19. Stonebraker, M. The case for shared nothing. Data Engineering 9 (Mar. 1986), 4–9.
20. Teradata Corp. Database Computer System Manual, Release 1.3. Los Angeles, CA, Feb. 1985.
21. Thusoo, A. et al. Hive: A warehousing solution over a Map-Reduce framework. In Proceedings of the Conference on Very Large Databases, 2009.
Michael Stonebraker (firstname.lastname@example.org) is an adjunct professor in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology, Cambridge, MA.
Daniel J. Abadi (email@example.com) is an assistant professor in the Department of Computer Science at Yale University, New Haven, CT.
David J. DeWitt (firstname.lastname@example.org) is a technical fellow in the Jim Gray Systems Lab at Microsoft Inc., Madison, WI.
Samuel Madden (email@example.com) is a professor in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology, Cambridge, MA.
Erik Paulson (firstname.lastname@example.org) is a Ph.D. candidate in the Department of Computer Sciences at the University of Wisconsin-Madison, Madison, WI.
Andrew Pavlo (email@example.com) is a Ph.D. candidate in the Department of Computer Science at Brown University, Providence, RI.
Alexander Rasin (firstname.lastname@example.org) is a Ph.D. candidate in the Department of Computer Science at Brown University, Providence, RI.