A map of the frequency
with which people in
different places reply to
each other on Twitter. The
brightness of each arc is
proportional to the log of
the number of tweets from
one place addressing
someone in another place,
with locations chunked to
20-mile squares. Communication is shown moving
clockwise from the person
sending the tweet to the
person being addressed.
Data from Twitter
streaming API, May 15 –
October 10, 2011.
See something or say
something: Los Angeles.
Red dots are locations of
Flickr pictures. Blue dots
are locations of Twitter
tweets. White dots are
locations that have been
posted to both.
This graph charts the
frequency of mentions
in the New York Times of
the five U.S. presidents
between 1984 and 2009.
It also depicts story
weights: the darkest lines
show front-page stories, and
the lighter lines indicate
stories buried deeper in
the paper.
May + June 2012
interactions
We describe the challenges of
each of these steps in turn, using
examples from our interviewees to
illustrate them. Each of these steps
provides the HCI practitioner with
ample room to improve the user
experience.
Acquiring data. The first challenge our analysts identified was
determining where the data in
their big-data systems came from.
How do they discover sources of
data? Increasingly, data is available from a wide variety of sources
and in many formats: Online databases
of public statistics are provided
by the U.S. government (http://data.gov)
and the United Nations (http://unstats.org);
private companies sell data from data marketplaces, such as Microsoft’s Azure
Marketplace and Infochimps.
Experts ran into many problems
with data available online, however. They struggled to figure out
what data was available; even
when it was available in machine-readable formats, it would often
be stored in a schema that made it
hard to use. In many of these systems, the data is available only after running an aggregation query or, worse, only in
PDF files filled with text. Once this
data was ready to go online, our
analysts would combine it with
information they collected themselves from sensor systems. This,
in turn, caused new challenges:
For example, it could take a third
dataset to link the zip codes in a
crime database to the area codes
in a phone directory.
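
The linking problem in that last example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, field names, and the function link_crimes_to_phones are hypothetical and do not come from the analysts we interviewed. The sketch also assumes each zip code maps to exactly one area code, which real data often violates.

```python
# Hypothetical sample records; none of these values come from the article.
crimes = [
    {"incident": "burglary", "zip": "98101"},
    {"incident": "theft", "zip": "98052"},
    {"incident": "vandalism", "zip": "99999"},  # zip missing from the crosswalk
]
phones = [
    {"name": "A. Smith", "area_code": "206"},
    {"name": "B. Jones", "area_code": "425"},
]
# The "third dataset": a crosswalk from zip code to area code.
# Assumption: one area code per zip code (a simplification).
zip_to_area = {"98101": "206", "98052": "425"}


def link_crimes_to_phones(crimes, phones, zip_to_area):
    """Join crime records to phone records through the zip-to-area crosswalk."""
    # Index the phone directory by area code for constant-time lookup.
    names_by_area = {}
    for entry in phones:
        names_by_area.setdefault(entry["area_code"], []).append(entry["name"])

    linked = []
    for crime in crimes:
        # Returns None when the crosswalk has no entry for this zip code.
        area = zip_to_area.get(crime["zip"])
        names = names_by_area.get(area, []) if area is not None else []
        linked.append(
            {"incident": crime["incident"], "area_code": area, "names": names}
        )
    return linked


linked = link_crimes_to_phones(crimes, phones, zip_to_area)
for row in linked:
    print(row)
```

Records whose zip code is missing from the crosswalk are kept with an empty match rather than dropped, so gaps in the linking dataset stay visible to the analyst.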
There are new opportunities for
improving standards for announcing data, helping people find data,
and formatting data so it can be
more easily entered.
Choosing architecture based on
cost and performance. Whatever
platform the big data analysis is