global brand might need to respond
quickly to a trending topic on Twitter.
For those sorts of snap decision-related tasks, Hadoop is too slow, and other
tools have begun to emerge.
The Hadoop community has been
building real-time response capabilities into HBase, a software stack that
sits atop the basic Hadoop infrastructure. Cloudera’s Lipcon explains that
companies will use Hadoop to generate a complicated model of, say, movie preferences based on millions of
users, then store the result in HBase.
When a user gives a movie a good rating, the website using the tools can
factor that small bit of data into the
model to offer new, up-to-date recommendations. Later, when the latest
data is fed back into Hadoop, these
analyses run at a deeper level, analyzing more preferences and producing a
more accurate model. “This gives you
the sort of best of both worlds—the
better results of a complex model and
the fast results of an online model,”
Lipcon explains.
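In rough code, the hybrid pattern Lipcon describes might look like the sketch below. It assumes a hypothetical movie_model table and the third-party happybase Python client, neither of which is specified by Cloudera; the blending formula is purely illustrative. The nightly Hadoop job bulk-writes model output into HBase, and the Web tier reads it back with low latency and folds in a rating the user just submitted.

# Minimal sketch of the batch-plus-online pattern (assumptions: happybase
# client, hypothetical table and column names, illustrative blending rule).
import happybase

connection = happybase.Connection('hbase-thrift-host')  # HBase Thrift gateway
table = connection.table('movie_model')

# The nightly Hadoop job would bulk-write model output, e.g. per-user
# preference scores, into HBase rows like this one.
table.put(b'user:42', {b'model:score_movie_789': b'0.87'})

# The serving layer reads the precomputed row with low latency and adjusts
# it with a rating the user gave moments ago.
row = table.row(b'user:42')
base_score = float(row[b'model:score_movie_789'].decode())
fresh_rating = 4.5  # just submitted on the website
adjusted = 0.9 * base_score + 0.1 * (fresh_rating / 5.0)  # illustrative blend
print(adjusted)

The heavier re-modeling still happens in the nightly batch run; only the cheap adjustment runs online.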
Cloudant, another real-time engine,
uses a MapReduce-based framework to
query data, but the data itself is stored
as documents. As a result, Miller says,
Cloudant can track new and incoming information and only process the
changes. “We don’t require the daily
extraction of data from one system into
another, analysis in Hadoop, and re-injection back into a running application
layer,” he says. “That allows us to analyze results in real time.” And this, he
notes, can be a huge advantage. “Waiting until overnight to process today’s data means you’ve missed the boat.”

GraphLab, a new open source processing framework, uses some of the basic MapReduce principles, but pays more attention to the networked structure.
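In code, Cloudant’s change-driven approach follows the CouchDB model its API is built on: a MapReduce view is defined once, and the index is maintained incrementally, so only documents added or changed since the last query are reprocessed. A minimal sketch, assuming a hypothetical ratings database, made-up account and credentials, and Python’s requests library:

# Sketch of an incrementally maintained MapReduce view on a CouchDB-style
# store such as Cloudant (account URL, credentials, and fields are made up).
import requests

BASE = "https://example-account.cloudant.com"  # hypothetical account URL
DB = "ratings"
AUTH = ("apiuser", "apipassword")              # hypothetical credentials

# Define a MapReduce view in a design document. The view index is updated
# incrementally: on the next query, only new or changed documents are run
# through the map function again.
design = {
    "views": {
        "rating_stats": {
            "map": "function(doc) { if (doc.movie && doc.rating) "
                   "{ emit(doc.movie, doc.rating); } }",
            "reduce": "_stats"
        }
    }
}
requests.put(f"{BASE}/{DB}/_design/recs", json=design, auth=AUTH)

# Query the view grouped by movie; results reflect recent writes without a
# separate extract-analyze-reinject cycle through Hadoop.
resp = requests.get(f"{BASE}/{DB}/_design/recs/_view/rating_stats",
                    params={"group": "true"}, auth=AUTH)
print(resp.json())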
Miller says Cloudant’s document-oriented store approach, as opposed to the column-oriented store adopted in HBase, also makes it easier to run unexpected or ad hoc queries, another hot topic in the evolving Hadoop ecosystem. In 2009, Google publicly described its own ad hoc analysis tool, Dremel, and a project to develop an open source version, Drill, just launched this summer. “In between the real-time processing and batch computation there’s this big hole in the open source world, and we’re hoping to fill that with Drill,” says MapR’s Ted Dunning. LinkedIn’s “People You May Know” functionality would be an ideal target for Drill, he notes. Currently, the results are on a 24-hour delay. “They would like to have incremental results right away,” Dunning says.
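Because Drill had only just launched, the sketch below is speculative: it shows the kind of ad hoc, SQL-over-files query a Dremel-class tool targets, submitted through the HTTP query endpoint Drill exposes (assumed here at localhost:8047/query.json); the data file, fields, and query itself are hypothetical.

# Speculative sketch of an ad hoc query posted to Drill's REST endpoint
# (host, file path, and fields are hypothetical).
import requests

payload = {
    "queryType": "SQL",
    "query": "SELECT userId, COUNT(*) AS shared_connections "
             "FROM dfs.`/data/connections.json` "
             "GROUP BY userId ORDER BY shared_connections DESC LIMIT 10",
}
resp = requests.post("http://localhost:8047/query.json", json=payload)
print(resp.json())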
Although these efforts differ in their
approaches, they share the same essential goal. Whether it relates to discovering links within pools of DNA, generating
better song suggestions, or monitoring
trending topics on Twitter, these groups
are searching for new ways to extract insights from massive, expanding stores of
information. “A lot of people are talking
about big data, but most people are just
creating it,” says Guestrin. “The real value
is in the analysis.”
Further Reading
Anglade, T.
noSQL Tapes, http://www.nosqltapes.com.
Dean, J. and Ghemawat, S.
MapReduce: Simplified data processing
on large clusters, Proceedings of the 6th
Symposium on Operating Systems Design
and Implementation, San Francisco, 2004.
Ghemawat, S., Gobioff, H., and Leung, S.
The Google file system, Proceedings
of the 19th ACM Symposium on Operating
Systems Principles, Lake George, NY,
Oct. 19–22, 2003.
Low, Y., Gonzalez, J., Kyrola, A., Bickson, D.,
Guestrin, C., and Hellerstein, J.M.
GraphLab: A new parallel framework for
machine learning, The 26th Conference
on Uncertainty in Artificial Intelligence,
Catalina Island, CA, July 8–11, 2010.
White, T.
Hadoop: The Definitive Guide, O’Reilly Media,
Sebastopol, CA, 2009.
Gregory Mone is a Boston, MA-based writer and the author of the novel Dangerous Waters.

© 2013 ACM 0001-0782/13/01
Milestones
Supercomputing Visionaries Honored
ACM and the IEEE Computer Society honored high-performance computing innovators at the recent SC12 conference in Salt Lake City, UT. Among those honorees were the inventor of the first multicore processor, biomolecular modeling researchers, and an expert in managing software security flaws.
University of Notre Dame computer science and engineering professor Peter Kogge received the Seymour Cray Computer Engineering Award. Kogge developed the Space Shuttle I/O processor, invented the Kogge-Stone adder process for adding numbers in a computer, and helped create the first multicore processor (EXECUBE) at IBM. He recently spearheaded DARPA’s initiative to investigate a supercomputer capable of a quintillion operations per second.
Klaus Schulten and Laxmikant Kale, professors at the University of Illinois at Urbana-Champaign, received the Sidney Fernbach Award for their contributions to the development of “widely used parallel software for large biomolecular systems simulation.” Schulten directs the Center for Biomolecular Modeling and was the first to demonstrate that parallel computers can be used to solve the “many-body” problem in biomolecular modeling. Kale directs the Parallel Programming Laboratory; his work has focused on enhancing performance and productivity via adaptive runtime systems.
Mary Lou Soffa of the University of Virginia received the ACM-IEEE Computer Society Ken Kennedy Award for her work in detecting and managing software security flaws. Soffa developed software tools for debugging and testing programs to eliminate or reduce false alarms and improve operating efficiency. Her research produced automatic, practical solutions in software engineering, systems, and programming languages for improving software reliability, security, and productivity.