between nodes.
Naturally, distributed analysis of
big data comes with its own set of
“gotchas.” One of the major problems
is nonuniform distribution of work
across nodes. Ideally, each node will
have the same amount of independent computation to do before results
are consolidated across nodes. If this
is not the case, then the node with the
most work will dictate how long we
must wait for the results, and this will
obviously be longer than we would
have waited had work been distributed uniformly; in the worst case, all the
work may be concentrated in a single
node and we will get no benefit at all
from parallelism.
Whether this is a problem or not
will tend to be determined by how
the data is distributed across nodes;
unfortunately, in many cases this can
come into direct conflict with the imperative to distribute data in such a
way that processing at each node is local. Consider, for example, a dataset
that consists of 10 years of observations collected at 15-second intervals
from 1,000 sensor sites. There are
more than 20 million observations
for each site; and, because the typical analysis would involve time-series
calculations—say, looking for unusual values relative to a moving average
and standard deviation—we decide to
store the data ordered by time for each
sensor site (shown in Figure 5), distributed over 10 computing nodes so
that each one gets all the observations
for 100 sites (a total of two billion observations per node). Unfortunately,
this means that whenever we are interested in the results of only one or
a few sensors, most of our computing
nodes will be totally idle. Whether
the rows are clustered by sensor or by
time stamp makes a big difference in
the degree of parallelism with which
different queries will execute.
We could, of course, store the data
ordered by time, one year per node, so
that each sensor site is represented
in each node (we would need some
communication between successive
nodes at the beginning of the computation to “prime” the time-series calculations). This approach also runs
into the difficulty if we suddenly need
an intensive analysis of the past year’s
worth of data. Storing the data both
ways would provide optimal efficiency
for both kinds of analysis—but the
larger the dataset, the more likely it
is that two copies would be simply too
much data for the available hardware
resources.
Another important issue with distributed systems is reliability. Just as
a four-engine airplane is more likely
to experience an engine failure in a
given period than a craft with two of
the equivalent engines, so too is it 10
times more likely that a cluster of 10
machines will require a service call.
Unfortunately, many of the components that get replicated in clusters—
power supplies, disks, fans, cabling,
and so on—tend to be unreliable. It
is, of course, possible to make a cluster arbitrarily resistant to single-node
failures, chiefly by replicating data
across the nodes. Happily, there is
perhaps room for some synergy here:
data replicated to improve the efficiency of different kinds of analyses,
as noted here, can also provide redundancy against the inevitable node failure. Once again, however, the larger
the dataset, the more difficult it is to
maintain multiple copies of the data.
A meta-Definition
I have tried here to provide an overview of a few of the issues that can
arise when analyzing big data: the inability of many off-the-shelf packages
to scale to large problems; the paramount importance of avoiding sub-optimal access patterns as the bulk of
processing moves down the storage
hierarchy; and replication of data for
storage and efficiency in distributed
processing. I have not yet answered
the question I opened with: What is
“big data,” anyway?
I will take a stab at a meta-definition: big data should be defined at any
point in time as “data whose size forces us to look beyond the tried-and-true methods that are prevalent at
that time.” In the early 1980s, it was a
dataset that was so large that a robotic
“tape monkey” was required to swap
thousands of tapes in and out. In the
1990s, perhaps, it was any data that
transcended the bounds of Microsoft
Excel and a desktop PC, requiring serious software on Unix workstations to
analyze. Nowadays, it may mean data
that is too large to be placed in a rela-
tional database and analyzed with the
help of a desktop statistics/visualiza-tion package—data, perhaps, whose
analysis requires massively parallel
software running on tens, hundreds,
or even thousands of servers.
In any case, as analyses of ever-larg-er datasets become routine, the definition will continue to shift, but one
thing will remain constant: success at
the leading edge will be achieved by
those developers who can look past
the standard, off-the-shelf techniques
and understand the true nature of the
hardware resources and the full panoply of algorithms that are available to
them.
Related articles
on queue.acm.org
Flash Storage Today
Adam Leventhal
http://queue.acm.org/detail.cfm?id=1413262
A Call to Arms
Jim Gray
http://queue.acm.org/detail.cfm?id=1059805
You Don’t Know Jack about Disks
Dave Anderson
http://queue.acm.org/detail.cfm?id=864058
References
1. Codd, e.f. a relational model for large shared data
banks. Commun. ACM 13, 6 (June 1970), 377–387.
2. IbM 3850 Mass storage system; http://www.
columbia.edu/acis/history/mss.html.
3. IbM archives: IbM 3380 direct access storage device;
http://www-03.ibm.com/ibm/history/exhibits/storage/
storage_3380.html.
4. kimball, r. The Data Warehouse Toolkit: Practical
Techniques for Building Dimensional Data Warehouses.
John Wiley & sons, ny, 1996.
5. litke, a.M. What does the eye tell the brain?
Development of a system for the large-scale
recording of retinal output activity. IEEE Transactions
on Nuclear Science 51, 4 (2004), 1434–1440.
6. Postgresql: the world’s most advanced open source
database; http://www.postgresql.org.
7. the r Project for statistical Computing; http://www.r-
project.org.
8. sloan Digital sky survey; http://www.sdss.org.
9. throughput and Interface Performance. tom’s Winter
2008 hard Drive Guide; http://www.tomshardware.
com/reviews/hdd-terabyte-1tb, 2077-11.html.
10. WlCG ( Worldwide lhC Computing Grid); http://lcg.
web.cern.ch/lCG/public/.
11. zero-one-Infinity rule; http://www.catb.org/~esr/
jargon/html/z/ zero-one-Infinity-rule.html.
Adam Jacobs is senior software engineer at 1010data
Inc., where, among other roles, he leads the continuing
development of tenbase, the company’s ultra-high-performance analytical database engine. he has more
than 10 years of experience with distributed processing
of big datasets, starting in his earlier career as a
computational neuroscientist at Weill Medical College of
Cornell university (where he holds the position of Visiting
fellow) and at uCla.