1: kdTree := new KDTree (points)
2: pq := new PriorityQueue ( )
3: foreach p in points {pq . add( <p,kd Tree . nearest ( p )> )}
4: while ( pq.size ( ) != 0) do {
5: Pair <p,n> := pq.get ( ); // return closest pair
6: if ( p.isAlreadyClustered( ) ) continue;
7: if ( n.isAlreadyClustered ( ) ) {
8: pq.add ( <p, kd Tree.nearest ( p )> );
9: continue;
10: }
11: Cluster c := new Cluster (p,n);
12: dendrogram.add ( c );
13: kd Tree.remove ( p );
14: kd Tree.remove ( n );
15: kdTree.add (c);
16: Point m:= kdTree . nearest (c);
17: if (m != ptAtInfinity) pq.add (<c,m>);
18: }
The algorithm iterates over a priority queue whose entries
are ordered pairs of points <x, y>, such that y is the nearest
neighbor of x (we call this nearest(x)). In each iteration
of the while loop, the pair of points at the head of the priority queue—the closest pair—are clustered. These two points
are replaced by a new, representative point. The nearest
neighbor of this point is determined, and the pair is entered
into the priority queue.
To find the nearest neighbor of a point, we can scan
the entire data-set at each step, but this is too inefficient.
Instead, we use a spatial acceleration structure called a
kd-tree to find nearest neighbors. The kd-tree is built at the start
of the algorithm and is kept up to date as points are removed
and added from the space, as seen in Figure 4.
opportunities for Exploiting parallelism: Since each
iteration clusters the two closest points in the current data-set, it may seem that the algorithm is inherently sequential.
However, if we consider the data-set in Figure 3(a), we see
that points a and b, and points c and d can be clustered concurrently since neither cluster affects the other. Intuitively,
if the dendrogram is a long and skinny tree, there may be
few independent iterations, whereas if the dendrogram is a
bushy tree, there is parallelism that can be exploited since
the tree can be constructed bottom-up in parallel. As in the
case of Delaunay mesh refinement, the parallelism is very
data-dependent. In experiments on graphics scenes with
20,000 lights, we have found that on average about 100 clusters can be constructed concurrently11; thus, there is substantial parallelism that can be exploited.
2. 3. Discussion
Existing compile-time parallelization techniques for irregular programs are based on shape analysis, 20 which determines structural invariants in the data structures. The graph
in mesh refinement has no particular structure, and it is also
modified in each iteration of the loop in Figure 2, so compile-time parallelization will not work for this application.
Semi-static approaches using the inspector–executor model18
split computation into two phases: an inspector phase that
determines dependences between units of work and constructs a computation schedule and an executor phase that
executes the resulting schedule in parallel. This approach
does not work for mesh refinement since the dependence
structure changes when the underlying graph is modified by
the algorithm.
These considerations suggest that a fully dynamic
approach in which dependences are detected at runtime
is needed to parallelize codes like mesh refinement and
agglomerative clustering. One such approach to parallelizing mesh refinement has been proposed by Hudson et al., 8
and it has the following steps: ( 1) compute the cavities of
all bad triangles without making any modifications to the
graph, ( 2) build an interference graph in which nodes represent cavities and edges represent overlapping cavities, ( 3)
find a maximal independent set of nodes in this graph, and
( 4) retriangulate the cavities corresponding to these nodes
in parallel, without any synchronization. These steps are
then repeated for the new mesh until convergence. This
approach can be viewed as an extended inspector–executor
approach in which the execution of the inspector and executor are interleaved. However, this approach is very specific to
Delaunay mesh refinement. For example, it is not clear that
it can be used for applications like agglomerative clustering
in which iterations are performed over priority queues.
3. The GaLoiS aPPRoach
The analysis of Section 2 suggests that optimistic parallelization, in which computations are speculatively executed in
parallel and rolled back selectively when dependence conflicts are detected by the runtime system, is the only general-purpose approach to exploiting parallelism in irregular
applications. In this section, we argue that optimistic parallelization needs to be coupled with appropriate programming abstractions, and we describe an implementation of
these ideas in the Galois system.
The need for programming abstractions becomes evident if we consider the mesh refinement code in Figure 2.
The work-list of bad triangles determines a particular order
in which bad triangles are processed by the sequential program, and any auto-parallelized version of this code will be
forced to process bad triangles in the same order. The fact
that bad triangles can actually be processed in any order is
important for parallelization, but it is missing from this code.
Abstractly, we can view the processing of each bad triangle
as an operator that is applied to the graph to modify a small
region of it; the fact that bad triangles can be processed in
any order is equivalent to asserting that the applications of
this operator to the graph “commute” with each other. Since
the structure of the final graph may actually be different for
different operator orderings, we call this application-specific
commutativity for obvious reasons.
Opportunities to exploit commutativity may also arise in
abstract data type (ADT) implementations. Consider a set
ADT that is implemented using a linked list. With respect to
the semantics of sets, insert operations commute with each
other since all insertion orders produce the same set even
though the linked list representation internal to the ADT
may be different for different insertion orders. In this case,