twice the memory slots comes out.
The beauty of today’s mainstream
computer hardware, though, is that
it’s cheap and almost infinitely repli-cable. Today it is much more cost-effective to purchase eight off-the-shelf,
“commodity” servers with eight processing cores and 128GB of RAM each
than it is to acquire a single system
with 64 processors and a terabyte of
RAM. Although the absolute numbers
will change over time, barring a radical change in computer architectures,
the general principle is likely to remain true for the foreseeable future.
Thus, it’s not surprising that distributed computing is the most successful strategy known for analyzing very
large datasets.
Distributing analysis over multiple
computers has significant performance
costs: even with gigabit and 10-gigabit
Ethernet, both bandwidth (sequential
access speed) and latency (thus, random access speed) are several orders
of magnitude worse than RAM. At the
same time, however, the highest-speed
local network technologies have now
surpassed most locally attached disk
systems with respect to bandwidth,
and network latency is naturally much
lower than disk latency.
As a result, the performance cost of
storing and retrieving data on other
nodes in a network is comparable to
(and in the case of random access, potentially far less than) the cost of using
disk. Once a large dataset has been
distributed to multiple nodes in this
way, however, a huge advantage can be
obtained by distributing the
processing as well—so long as the analysis is
amenable to parallel processing.
Much has been and can be said
about this topic, but in the context
of a distributed large dataset, the criteria are essentially related to those
discussed earlier: just as maintaining locality of reference via sequential access is crucial to processes that
rely on disk I/O (because disk seeks
are expensive), so too, in distributed
analysis, processing must include a
significant component that is local
in the data—that is, does not require
simultaneous processing of many disparate parts of the dataset (because
communication between the different processing domains is expensive).
Fortunately, most real-world data
figure 5. Two ways to distribute 10 years of sensor data for 1,000 sites over 10 machines.
timestamp sensor
19990101000000 1
19990101000015 1
19990101000030 1
node 1
reading
timestamp sensor
19990101000000 1
19990101000000 2
19990101000000 3
node 1
reading
20081231235930
1
20081231235945 1
19990101000000 2
19990101000015 2
19990101000030 2
19990101000000
1000
19990101000015 1
19990101000015 2
19990101000015 3
19990101000015 4
20081231235930
2
20081231235945 2
19990101000000 3
19990101000015
1000
19990101000030 1
19990101000030 2
20081231235945
100
19991231235945
100
timestamp sensor
19990101000000 101
19990101000015 101
19990101000030 101
node 2
reading
timestamp sensor
20000101000000 1
20000101000000 2
20000101000000 3
node 2
reading
20081231235930
101
20081231235945 101
19990101000000 102
19990101000015 102
19990101000030 102
20000101000000
1000
20000101000015 1
20000101000015 2
20000101000015 3
20000101000015 4
20081231235930
102
20081231235945 102
19990101000000 103
20000101000015
1000
20000101000030 1
20000101000030 2
20081231235945
200
20001231235945
1000
timestamp sensor
19990101000000 901
19990101000015 901
19990101000030 901
node 10
reading
timestamp sensor
20080101000000 1
20080101000000 2
20080101000000 3
node 10
reading
20081231235930
901
20081231235945 901
19990101000000 902
19990101000015 902
19990101000030 902
20080101000000
1000
20080101000015 1
20080101000015 2
20080101000015 3
20080101000015 4
20081231235930
902
20081231235945 902
19990101000000 903
20080101000015
1000
20080101000030 1
20080101000030 2
20081231235945
1000
20081231235945
1000
analysis does include such a component. Operations such as searching,
counting, partial aggregation, record-wise combinations of multiple fields,
and many time-series analyses (if the
data is stored in the correct order)
can be carried out on each computing
node independently.
Furthermore, where communication between nodes is required, it
often occurs after data has been extensively aggregated; consider, for
example, taking an average of billions
of rows of data stored on multiple
nodes. Each node is required to communicate only two values—a sum and
a count—to the node that produces
the final result. Not every aggregation can be computed so simply, as a
global aggregation of local sub-aggre-gations (consider the task of finding a
global median, for example, instead
of a mean), but many of the important
ones can, and there are distributed algorithms for other, more complicated
tasks that minimize communication