technical Perspective
Distributing Your Data
and having it, too
By hagit Attiya
INTERCONNECTED SYSTEMS—the Internet, wireless networks, or sensor
nets—embrace virtually all computing
environments. Thus our data no longer
needs to be stored, nicely organized, in
centralized databases; it may instead
span a great many heterogeneous locations and be connected through
communication links. Records about
stock-exchange transactions, for example, reside in broker firms and other
financial entities around the world.
But data is worthless if it cannot be
efficiently processed to yield useful information. Enter a data set’s basic aggregates, such as the maximum, mean,
or median, which can be combined to
evaluate statistical properties of the
data and guide decisions and actions,
for example, by detecting trends stock
markets.
Some of these aggregates can be
computed quickly by means of a common network infrastructure—namely,
a “spanning tree”—that connects all
nodes storing information. For example, a simple recursive algorithm can
compute the mean: each node averages the means computed in each subtree rooted at a node; and the resulting
mean, together with the count, is then
forwarded to the parent, which in turn
averages the results from its own sub-trees.
Assuming a unit of time that allows
for receiving messages from all children and processing them; each iteration takes a single time unit, and the
number of time units is proportional
to the depth of the spanning tree, or diameter of the network, denoted D.
Essentially, the same algorithm
computes the maximal or minimal
value and similar aggregates. But the
mean is sensitive to outliers: when the
data has large variations the mean may
misrepresent the data. On days when
stock rates are highly fluctuating, a
single stock transaction at an extremely low quote can significantly sway the
mean, rendering it irrelevant. The me-
dian, other quartiles, or in general the
kth element of the data set are more significant in these situations.
But although the simple algorithm
described here is fine for the mean, it
does not work for computing the median, given that the median at an interior
node is not necessarily the median of
the medians in its sub-trees. A simple
divide-and-conquer approach can be
used instead. This algorithm starts with
the entire set of elements and in each
round randomly chooses a single pivot
element. The algorithm then counts
the number of elements larger than
the pivot, and it recourses in the corresponding subinterval. A fairly straight-forward analysis shows that when the
pivots are chosen uniformly at random
and the element counts are exact, then
the expected number of iterations is asymptotically bounded by the logarithm
of m, the number of nodes in the tree.
Putting this sequential algorithm to
work in a large heterogeneous network,
as done by Kuhn, Locher, and Wattenhofer, illustrates the challenges facing
designers of network algorithms today
and the innovations they must come
up with to tackle them.
The main barrier is in sampling the
pivot from a vast and scattered data
set. The three researchers sidestep
this barrier, however, by instead sending a search expedition to look for the
pivot within the data. The search takes
a random walk down the parent spanning tree, starting from the root and
randomly choosing to which child to
proceed. The choice is biased by the
size of the sub-trees rooted at the children, but a careful tuning of the biases
allows a pivot to be picked uniformly at
random within time 2D.
Instead of choosing one pivot at a
time, the algorithm of Kuhn, Locher,
and Wattenhofer finds D pivots within
time O(D) by staggering the search for
several pivots in overlapping time intervals. Using D pivots, the number of
candidates reduces by a factor of D in
each iteration, leading to a total time
complexity of O(D logD n), with n being
the number of nodes in the network.
Their paper then shows how to avoid
the randomized pivot selection so as
to make the algorithm deterministic,
though at some cost.
Armed with these algorithms, the
problem of computing data aggregates
distributively is now largely solved.
Data can be spread across a network,
yet we can mine it to deduce important
information.
And in case you were wondering: the
answer is negative. No other algorithm
can do better. The authors show that
any algorithm for finding the median
(and in general the kth item) must take
time proportional to D logD n, even if it
can flip coins. In the true tradition of
the theory of distributed computing,
the lower bound follows from the uncertainty inherent in the problem.
The idea is to construct many scenarios, each with a different median,
and use an information-theoretic argument to show that a lot of time is needed in order to disambiguate among
these scenarios.
hagit Attiya ( hagit@cs.technion.ac.il) is a professor of
computer science at the Technion–Israel Institute of
Technology, based in Haifa.