Distributed Selection: A Missing
Piece of Data Aggregation

By Fabian Kuhn, Thomas Locher, and Roger Wattenhofer

abstract

In this article, we study the problem of distributed selection from a theoretical point of view. Given a general connected graph of diameter D consisting of n nodes in which each node holds a numeric element, the goal of a k-selection algorithm is to determine the kth smallest of these elements. We prove that distributed selection indeed requires more work than other aggregation functions such as, e.g., the computation of the average or the maximum of all elements. On the other hand, we show that the kth smallest element can be computed efficiently by providing both a randomized and a deterministic k-selection algorithm, dispelling the misconception that solving distributed selection through in-network aggregation is infeasible.

1. intRoDuction

There is a recent growing interest in distributed aggregation, thanks to emerging application areas such as, e.g., data mining or sensor networks. 2, 8, 23, 24 The goal of distributed aggregation is to compute an aggregation function on a set of distributed values, each value stored at a node in a network. Typical aggregation functions are max, sum, count, average, median, variance, kth smallest, or largest value, or combinations thereof such as, e.g., “What is the average of the 10% largest values?”

The database community classifies aggregation functions into three categories: distributive (max, min, sum, count), algebraic (plus, minus, average, variance), and holistic (median, kth smallest, or largest value). Combinations of these functions are believed to support a wide range of reasonable aggregation queries.*

It is well known that distributive and algebraic functions can easily be computed using the so-called convergecast operation executed on a pre-computed breadth first search (BFS) tree: The root of the tree floods a message to the leaves of the tree, asking the leaves to start the aggregation. The inner nodes of the spanning tree wait until they have received the aggregated data from all their children, apply the aggregation function to their own data and the aggregated data, and subsequently forward the aggregation result to their respective parent. Convergecast is fast, as it terminates after at most 2DT time, where DT denotes the depth of the spanning tree. Note that the depth of a BFS tree is at most the diameter D of the original graph G, thus a single convergecast costs merely 2D time. An example for such a spanning tree in the context of sensor networks is depicted in Figure 1. However,

* We encourage the reader to think of a natural aggregation (single value result) query that cannot be formulated by a combination of distributive, algebraic, and holistic functions.

it is believed that holistic functions cannot be supported by convergecast. After all, the very name “holistic” indicates that one “cannot look into” the set of values, more precisely, that all the values need to be centralized at one node in order to compute the holistic function. Bluntly, in-network aggregation is considered to be practically impossible for holistic functions.

For arbitrary k, a selection algorithm answers questions about the kth smallest value in a set or network. The special case of the k-selection problem where k = n/2 is the well-known median problem. Generally speaking, selection solves aggregation queries about order statistics and percentiles. Surprisingly, little is known about distributed (network) selection, although it is critical to the understanding of data aggregation.

In this article, we shed some new light on the problem of distributed selection for general networks with n nodes and diameter D. In particular, we prove that distributed selection is strictly harder than convergecast by giving a lower bound of Ω(D log n) on the time complexity in Section 5. In

D

other words, to the best of our knowledge, we are the first to formally confirm the preconception about holistic functions being strictly more difficult than distributive or algebraic functions. In addition, in Section 4. 1, we present a novel Las Vegas algorithm which matches this lower bound with high probability, improving the best randomized algorithm. As for many networks this running time is strictly below collecting all values at one node, our new upper bound proves that (contrary to common belief) in-network aggregation is possible also for holistic functions; in fact, in network topologies where the diameter is large, e.g., in grids or in typical wireless sensor networks, selection can be performed within the same asymptotic time bounds as convergecast. As a third result, in Section 4. 2, we derandomize our algorithm and arrive at a deterministic distributed selection algorithm with a time complexity of O(D log2 n) which constitutes a

D

substantial improvement over prior art.

2. RelateD WoRK

Finding the kth smallest value among a set of n elements is a classic problem which has been extensively studied in the past approximately 30 years, both in distributed and non-distributed settings. The problem of finding the median, i.e., the element for which half of all elements are smaller and the other half is larger, is a special case of the k-selection problem which has also received a lot of attention. Blum et al. 1 proposed the first deterministic sequential algorithm that, given an array of size n, computes the kth smallest element in O(n) time. The algorithm partitions the n elements into roughly n/5 groups of five elements and determines the

References:

Archives