To mine that data, the task is split up.
“Each computer will look at its locally
available data and run a little segment
of the program on that one computer,”
explains Todd Lipcon, an engineer at
Palo Alto, CA-based Hadoop specialist
Cloudera. “It analyzes its local data and
then reports back the results.”
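Lipcon's description maps onto a pattern that is easy to sketch. The following Python simulation is illustrative only (the word-count task, the sample text, and every name in it are invented, not taken from Cloudera's software): each simulated machine runs a map step over its own local chunk of the data, and the reported results are then grouped by key and reduced.

```python
# A minimal, self-contained simulation of the MapReduce pattern Lipcon
# describes: each "machine" maps over only its local chunk of the data,
# then the per-machine results are shuffled by key and reduced.
# The word-count task and the sample data are illustrative only.
from collections import defaultdict

local_chunks = [                      # one list per simulated machine
    ["the quick brown fox", "jumps over the lazy dog"],
    ["the dog barks", "the fox runs"],
]

def map_phase(lines):
    """Run on one machine, over its locally available lines."""
    for line in lines:
        for word in line.split():
            yield word, 1             # emit (key, value) pairs

# Each machine processes its own chunk and "reports back" its pairs.
shuffled = defaultdict(list)
for chunk in local_chunks:
    for key, value in map_phase(chunk):
        shuffled[key].append(value)   # the shuffle groups values by key

def reduce_phase(key, values):
    return key, sum(values)           # combine all counts for one word

results = dict(reduce_phase(k, v) for k, v in shuffled.items())
print(results)                        # e.g. {'the': 4, 'fox': 2, ...}
```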
Although Hadoop is open source, companies like Cloudera and MapR Technologies of San Jose, CA, have found a market in developing additional services and packages that make it easier to use and more reliable.
MapR, for example, has helped ancestry.com use Hadoop to carry out pattern matches on its enormous library of DNA details. After a customer sends in a saliva sample, the company can extract the basic biological code and use a Hadoop-based program to search for DNA matches—that is, potential mystery relatives—across its database.
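The article does not say how that matching job itself is written, but one way to picture a Hadoop-style relative search is as a group-by on shared DNA segments, so that customers carrying the same segment meet at the same reducer. The sketch below is a hypothetical illustration of that idea; the segment identifiers, customer names, and matching rule are all invented.

```python
# Hypothetical sketch of a MapReduce-style relative search: not
# ancestry.com's actual pipeline, just the general "group by shared key"
# idea. Each customer's genome is reduced to a set of segment identifiers;
# customers sharing a segment are candidate relatives.
from collections import defaultdict
from itertools import combinations

customers = {                              # toy data, entirely made up
    "alice": {"seg_17a", "seg_42b", "seg_90c"},
    "bob":   {"seg_42b", "seg_55d"},
    "carol": {"seg_90c", "seg_55d"},
}

# Map: emit (segment, customer) for every segment a customer carries.
by_segment = defaultdict(set)
for name, segments in customers.items():
    for seg in segments:
        by_segment[seg].add(name)

# Reduce: any two customers listed under the same segment are a match.
matches = set()
for seg, names in by_segment.items():
    for a, b in combinations(sorted(names), 2):
        matches.add((a, b, seg))

for a, b, seg in sorted(matches):
    print(f"{a} and {b} share segment {seg}")
```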
Analyzing Networks
For all its strengths in large-scale data
processing, however, experts note
MapReduce was not designed to analyze data sets threaded with connections. A social network, for example, is
best represented in graph form, wherein each person becomes a vertex and
an edge drawn between two individuals signifies a connection. Google’s
own work supports the idea that Hadoop is not set up for this breed of data:
The company’s Pregel system, publicly
described for the first time in 2009,
was developed specifically to work with
graph structures, since MapReduce
had fallen short.
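That graph framing is easy to make concrete: each person becomes a key in an adjacency structure, and each connection is recorded under both endpoints. A minimal sketch, with invented names, follows.

```python
# A social network as a graph: people are vertices, connections are edges.
# Stored here as an adjacency set, the kind of structure graph systems such
# as Pregel and GraphLab operate on (the people named are invented examples).
from collections import defaultdict

edges = [("ana", "bo"), ("bo", "cy"), ("ana", "cy"), ("cy", "dee")]

adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)      # undirected: store the edge in both directions
    adjacency[v].add(u)

# Degree (number of connections) per person -- the quantity that makes
# celebrity vertices so awkward to partition, as discussed below.
for person, neighbors in sorted(adjacency.items()):
    print(person, len(neighbors), sorted(neighbors))
```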
Along with a handful of students,
University of Washington network
scientist Carlos Guestrin recently released a new open source processing
framework, GraphLab, that uses some
of the basic MapReduce principles but
pays more attention to the networked
structure. The data distribution phase
takes the connections into account. “If
I know that you and I are neighbors in
the graph, there will be some computation that needs to look at your data
and my data,” Guestrin explains. “So
GraphLab will try to look at our data on
the same machine.”
“Hadoop is going to have to evolve,” says Mike Miller. “It’s very clear that there is a need for other tools.”

The trick is that GraphLab partitions this data in a novel way. The standard method would have been to split the data into groups of highly connected vertices. In social networks, however, a few persons have a disproportionate number of connections. Those people, or vertices, could not be stuffed into single machines, which would end up forcing numerous computers to communicate. To avoid this inefficiency,
Guestrin says GraphLab’s algorithm
partitions data according to the edges,
so that closely linked edges are on the
same machines. A highly connected
individual such as the pop star Britney
Spears will still live on multiple pieces
of hardware, but far fewer than with the
standard technique. “It minimizes the
number of places that Britney Spears is
replicated,” Guestrin explains.
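The article does not detail GraphLab's partitioner, but the effect Guestrin describes can be illustrated with a toy comparison: assign edges rather than vertices to machines, then count how many machines each vertex ends up touching. The naive hash placement below is only a stand-in for the real algorithm.

```python
# Toy illustration of edge partitioning ("vertex cuts"): each edge is
# assigned to a machine, and a vertex is replicated on every machine that
# holds one of its edges. This is NOT GraphLab's real partitioner -- just a
# naive hash placement to show what "replicas of Britney Spears" means.
from collections import defaultdict

NUM_MACHINES = 4

# A star-shaped toy graph: one celebrity connected to many fans,
# plus a few fan-to-fan edges.
edges = [("celebrity", f"fan{i}") for i in range(20)]
edges += [("fan0", "fan1"), ("fan2", "fan3")]

def edge_partition(edges, num_machines):
    """Place each edge on a machine; record which machines see each vertex."""
    placement = defaultdict(set)
    for u, v in edges:
        machine = hash((u, v)) % num_machines
        placement[u].add(machine)
        placement[v].add(machine)
    return placement

replicas = edge_partition(edges, NUM_MACHINES)
print("celebrity replicas:", len(replicas["celebrity"]))
print("fan0 replicas:", len(replicas["fan0"]))
```

Because a vertex is replicated only on machines that hold one of its edges, even the star's center appears on at most four machines here regardless of how many fans it has; a smarter placement would push that number lower still, which is the quantity GraphLab minimizes.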
Hadoop, on the other hand, is agnostic to the structure of the data, according to Guestrin. Two pieces of information that should be analyzed on the same computer might end up in different clusters. “What ends up happening
is that they have to move data around
a lot,” Guestrin says. “We can be smart
about how data is placed and how data
is communicated between machines.”
Hadoop can often complete the
same tasks as GraphLab, but Guestrin
says his more efficient approach
makes it much faster. On several common benchmark tests, such as a name recognition task in which an algorithm analyzes text and assigns different categories to words, GraphLab has completed the job 60 times faster, using the same hardware.
Running in Real Time
In some cases, though, this is not
fast enough. Today’s companies often want results in real time. A hedge
fund might be looking to make a snap
decision based on the day’s events. A
ACM Member News

Daniel Spielman Wins MacArthur ‘Genius’ Award

If you received a $500,000 grant for being a “genius,” how would you react? In the case of Yale University computer scientist Daniel Spielman, he did not tell a soul other than his wife.
“I learned about the award in mid-September,” explains Spielman, who recently received a MacArthur Fellowship award. “But the MacArthur Foundation people asked me to keep it a secret until Oct. 1. Fortunately, I received an email before speaking to them. Otherwise, I would have tipped off my entire department with my screaming.”

As for spending the money? Spielman has no immediate plans. “It’s given in 20 installments over five years,” he says, “so I can’t do anything too crazy with it. But I plan to use it to provide more time to work on research.”

That is good news for the scientific community—and society. Spielman has devoted his career to the pursuit of abstract questions that address how to measure, predict, and regulate the environment and behavior. He helped develop error-correcting codes that are used in satellite video broadcasts, as well as optimization algorithms that support computational science and machine learning on massive datasets, among other applications.

“Most of my research is about the design of faster algorithms. They don’t just ‘speed things up.’ They change what is reasonable to do. You wouldn’t, say, browse the Internet on a phone if it took 10 minutes to load a page. Sophisticated algorithms compress the communications of the phone to enable it to transmit reliably without using too much power. Web pages are rarely delivered to your phone through the shortest route in the Internet. Rather, algorithms are used to manage information flow and prevent information traffic jams.”
— Dennis McCafferty