Distributed Selection: A Missing
Piece of Data Aggregation
By Fabian Kuhn, Thomas Locher, and Roger Wattenhofer
abstract
In this article, we study the problem of distributed selection
from a theoretical point of view. Given a general connected
graph of diameter D consisting of n nodes in which each
node holds a numeric element, the goal of a k-selection algorithm is to determine the kth smallest of these elements.
We prove that distributed selection indeed requires more
work than other aggregation functions such as, e.g., the
computation of the average or the maximum of all elements.
On the other hand, we show that the kth smallest element
can be computed efficiently by providing both a randomized
and a deterministic k-selection algorithm, dispelling the
misconception that solving distributed selection through
in-network aggregation is infeasible.
1. intRoDuction
There is a recent growing interest in distributed aggregation,
thanks to emerging application areas such as, e.g., data mining or sensor networks. 2, 8, 23, 24 The goal of distributed aggregation is to compute an aggregation function on a set of distributed values, each value stored at a node in a network. Typical
aggregation functions are max, sum, count, average, median,
variance, kth smallest, or largest value, or combinations thereof
such as, e.g., “What is the average of the 10% largest values?”
The database community classifies aggregation functions
into three categories: distributive (max, min, sum, count), algebraic (plus, minus, average, variance), and holistic (median,
kth smallest, or largest value). Combinations of these functions are believed to support a wide range of reasonable aggregation queries.*
It is well known that distributive and algebraic functions
can easily be computed using the so-called convergecast
operation executed on a pre-computed breadth first search
(BFS) tree: The root of the tree floods a message to the leaves
of the tree, asking the leaves to start the aggregation. The inner nodes of the spanning tree wait until they have received
the aggregated data from all their children, apply the aggregation function to their own data and the aggregated data,
and subsequently forward the aggregation result to their respective parent. Convergecast is fast, as it terminates after at
most 2DT time, where DT denotes the depth of the spanning
tree. Note that the depth of a BFS tree is at most the diameter
D of the original graph G, thus a single convergecast costs
merely 2D time. An example for such a spanning tree in the
context of sensor networks is depicted in Figure 1. However,
* We encourage the reader to think of a natural aggregation (single value
result) query that cannot be formulated by a combination of distributive,
algebraic, and holistic functions.
it is believed that holistic functions cannot be supported by
convergecast. After all, the very name “holistic” indicates
that one “cannot look into” the set of values, more precisely,
that all the values need to be centralized at one node in order
to compute the holistic function. Bluntly, in-network aggregation is considered to be practically impossible for holistic
functions.
For arbitrary k, a selection algorithm answers questions
about the kth smallest value in a set or network. The special
case of the k-selection problem where k = n/2 is the well-known median problem. Generally speaking, selection solves
aggregation queries about order statistics and percentiles.
Surprisingly, little is known about distributed (network) selection, although it is critical to the understanding of data
aggregation.
In this article, we shed some new light on the problem
of distributed selection for general networks with n nodes
and diameter D. In particular, we prove that distributed selection is strictly harder than convergecast by giving a lower
bound of Ω(D log n) on the time complexity in Section 5. In
D
other words, to the best of our knowledge, we are the first
to formally confirm the preconception about holistic functions being strictly more difficult than distributive or algebraic functions. In addition, in Section 4. 1, we present a novel Las Vegas algorithm which matches this lower bound with
high probability, improving the best randomized algorithm.
As for many networks this running time is strictly below collecting all values at one node, our new upper bound proves
that (contrary to common belief) in-network aggregation is
possible also for holistic functions; in fact, in network topologies where the diameter is large, e.g., in grids or in typical
wireless sensor networks, selection can be performed within the same asymptotic time bounds as convergecast. As a
third result, in Section 4. 2, we derandomize our algorithm
and arrive at a deterministic distributed selection algorithm
with a time complexity of O(D log2 n) which constitutes a
D
substantial improvement over prior art.
2. RelateD WoRK
Finding the kth smallest value among a set of n elements is
a classic problem which has been extensively studied in the
past approximately 30 years, both in distributed and non-distributed settings. The problem of finding the median,
i.e., the element for which half of all elements are smaller
and the other half is larger, is a special case of the k-selection
problem which has also received a lot of attention. Blum
et al. 1 proposed the first deterministic sequential algorithm
that, given an array of size n, computes the kth smallest element in O(n) time. The algorithm partitions the n elements
into roughly n/5 groups of five elements and determines the