iterator, the number of iterations executed by meshgen
in parallel varies from run to run (the same effect will be
seen on one processor if the scheduling policy is varied).
Therefore, we ran the codes a large number of times, and
determined a distribution for the numbers of committed
and aborted iterations. Figure 10(b) shows that on four processors, meshgen(d) committed roughly the same number
of iterations as it did on one processor, but also aborted
almost as many iterations due to cavity conflicts. The abort
ratio for meshgen(r) is much lower because the scheduling
policy reduces the likelihood of conflicts between processors. The lower abort ratio accounts for the performance
difference between meshgen(d) and meshgen(r). Because
the FGL code is carefully tuned by hand, the cost of an
aborted iteration is substantially less than the corresponding cost in meshgen, so FGL(r) performs only a little better
than FGL(d).
It seems counterintuitive that a randomized scheduling
policy could be beneficial, but a deeper investigation into
the source of cavity conflicts showed that the problem could
be attributed to our use of an STL queue to implement the
work-set. When a bad triangle is refined by the algorithm,
a cluster of smaller bad triangles may be created within
the cavity. In the queue data structure, these new bad triangles are adjacent to each other, so it is likely that they will
be scheduled together for refinement on different processors, leading to cavity conflicts. One conclusion from these
experiments is that domain knowledge is invaluable for
implementing a good scheduling policy.
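To make the effect concrete, the following sketch contrasts the two work-set policies (illustrative code only, not the Galois work-set API; the class names are ours). With a FIFO queue, the bad triangles created by one refinement are handed out back-to-back, often to different processors; a randomized work-set removes an arbitrary element instead, which tends to spread such clusters apart.

#include <cstddef>
#include <deque>
#include <random>
#include <utility>
#include <vector>

// FIFO work-set: new bad triangles created inside one cavity are enqueued
// together, so they are likely to be scheduled together (and conflict).
template <typename Work>
class FifoWorkset {
  std::deque<Work> items;
public:
  void add(const Work& w) { items.push_back(w); }
  bool empty() const { return items.empty(); }
  Work get() {                       // precondition: !empty()
    Work w = items.front();
    items.pop_front();
    return w;
  }
};

// Randomized work-set: get() removes an arbitrary element, so a cluster of
// neighboring bad triangles is unlikely to be refined simultaneously.
template <typename Work>
class RandomWorkset {
  std::vector<Work> items;
  std::mt19937 rng{std::random_device{}()};
public:
  void add(const Work& w) { items.push_back(w); }
  bool empty() const { return items.empty(); }
  Work get() {                       // precondition: !empty()
    std::uniform_int_distribution<std::size_t> pick(0, items.size() - 1);
    std::swap(items[pick(rng)], items.back());
    Work w = items.back();
    items.pop_back();
    return w;
  }
};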
Overhead Breakdown: The Galois system introduces some overhead over the reference code, even when running on one processor; meshgen(r) takes 58% longer than the reference code to execute on the same input. To understand the overheads of the Galois implementations, we instrumented the code using PAPI. We broke down the Galois overhead into four categories: (1) commit overhead, (2) abort overhead, (3) scheduler overhead, which includes time spent arbitrating conflicts, and (4) commutativity check overhead. The results, as seen in Figure 10(c), show that roughly three fourths of the Galois overhead goes into performing commutativity checks. It is clear that reducing
this overhead is key to reducing the overall overhead of the
Galois runtime.
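The structure of this breakdown can be sketched with a simple timer-based harness (illustrative only: the measurements reported here were gathered with PAPI, and the names below are ours). Each runtime activity is bracketed by a timer and charged to one of the four categories.

#include <array>
#include <chrono>
#include <cstdio>

enum Category { COMMIT, ABORT, SCHEDULER, COMMUTATIVITY, NUM_CATEGORIES };

struct OverheadCounters {
  std::array<double, NUM_CATEGORIES> seconds{};

  // Run a piece of runtime work and charge its wall-clock time to a category.
  template <typename F>
  void timed(Category c, F&& work) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    seconds[c] += dt.count();
  }

  // Print each category's share of the total measured overhead.
  void report() const {
    double total = 0;
    for (double s : seconds) total += s;
    const char* names[NUM_CATEGORIES] = {"commit", "abort", "scheduler",
                                         "commutativity"};
    for (int c = 0; c < NUM_CATEGORIES; ++c)
      std::printf("%-14s %5.1f%%\n", names[c],
                  total > 0 ? 100.0 * seconds[c] / total : 0.0);
  }
};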
4.2. Agglomerative clustering
For the agglomerative clustering problem, the two main
data structures are the kd-tree and the priority queue. The
kd-tree interface is essentially the same as set, but with the
addition of the nearest neighbor (nearest) method. The
priority queue is an instance of a Poset. Since the priority
queue is used to sequence iterations, the removal and insertion operations (get and add, respectively) are orchestrated
by the commit pool.
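The two interfaces can be sketched as follows (declarations only; the names and signatures are ours and are not the exact Galois classes).

// kd-tree: essentially a set, plus a nearest-neighbor query.
template <typename Point>
class KDTree {
public:
  void add(const Point& p);             // set-like insertion
  void remove(const Point& p);          // set-like removal
  bool contains(const Point& p) const;
  Point nearest(const Point& p) const;  // the extra method beyond set
};

// Priority queue as a Poset instance: because it orders the iterations,
// its add and get calls are sequenced through the commit pool at run time.
template <typename Item, typename Less>
class Poset {
public:
  void add(const Item& x);  // insert new work in priority order
  Item get();               // remove the next item to iterate over
  bool empty() const;
};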
To evaluate the agglomerative clustering algorithm,
we modified an existing graphics application called lightcuts that provides a scalable approach to illumination. 23
The code builds a light hierarchy based on a distance metric that factors in Euclidean distance, light intensity, and
light direction. We modified the objects used in the light
clustering code to use Galois interfaces and the Poset iterator for tree construction. The overall structure of the resulting code is shown in Figure 4. We will refer to this
Galois version as treebuild. We compared the running time
of treebuild against a sequential reference version.
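For reference, the following is a self-contained sequential sketch of the clustering structure being parallelized (illustrative only: it uses brute-force nearest-neighbor search instead of a kd-tree, plain Euclidean distance instead of the lightcuts metric, and a std::priority_queue instead of the Galois Poset iterator; all names are ours). Each step merges a node with its nearest neighbor into a new cluster, building the hierarchy bottom-up.

#include <cmath>
#include <functional>
#include <queue>
#include <vector>

struct Node {
  double x, y;
  int left = -1, right = -1;  // children in the hierarchy (-1 for leaves)
  bool active = true;         // false once merged into a parent
};

static double dist(const Node& a, const Node& b) {
  return std::hypot(a.x - b.x, a.y - b.y);
}

struct Pair {
  double d;
  int a, b;
  bool operator>(const Pair& o) const { return d > o.d; }
};

// Brute-force stand-in for the kd-tree's nearest() query.
static int nearest(const std::vector<Node>& nodes, int i) {
  int best = -1;
  double bestD = 0;
  for (int j = 0; j < (int)nodes.size(); ++j) {
    if (j == i || !nodes[j].active) continue;
    double d = dist(nodes[i], nodes[j]);
    if (best < 0 || d < bestD) { best = j; bestD = d; }
  }
  return best;
}

// Builds the hierarchy in place; the root is the last node appended.
void cluster(std::vector<Node>& nodes) {
  std::priority_queue<Pair, std::vector<Pair>, std::greater<Pair>> pq;
  for (int i = 0; i < (int)nodes.size(); ++i) {
    int j = nearest(nodes, i);
    if (j >= 0) pq.push({dist(nodes[i], nodes[j]), i, j});
  }
  while (!pq.empty()) {
    Pair p = pq.top();
    pq.pop();
    if (!nodes[p.a].active) continue;            // already clustered
    if (!nodes[p.b].active || nearest(nodes, p.a) != p.b) {
      int j = nearest(nodes, p.a);               // stale pair: refresh it
      if (j >= 0) pq.push({dist(nodes[p.a], nodes[j]), p.a, j});
      continue;
    }
    // Merge the pair into a new cluster node placed at the midpoint.
    Node parent{(nodes[p.a].x + nodes[p.b].x) / 2,
                (nodes[p.a].y + nodes[p.b].y) / 2, p.a, p.b};
    nodes[p.a].active = nodes[p.b].active = false;
    nodes.push_back(parent);
    int c = (int)nodes.size() - 1;
    int j = nearest(nodes, c);
    if (j >= 0) pq.push({dist(nodes[c], nodes[j]), c, j});
  }
}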
Figure 11 gives the performance results. These results
are similar to the Delaunay mesh generation results
discussed in Section 4.1, so we describe only the points
of note. The execution times in Figure 11(a) show that
despite the serial dependence order imposed by the priority queue, the Galois system is able to expose a significant
amount of parallelism. The mechanism that allows us to
do this is the commit pool, which allows threads to begin
execution of iterations even if earlier iterations have yet to
commit. The overhead introduced by the Galois system is
44% on a single processor. We see that due to the overhead
of managing the commit pool, the scheduler accounts for
a significant percentage of the overall Galois overhead, as
seen in Figure 11(c).
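The ordering constraint enforced by the commit pool can be illustrated with a small sketch (illustrative only, not the Galois implementation; the names are ours): threads may execute iterations speculatively out of order, but each iteration's effects become visible only after every earlier iteration in the serial order has committed.

#include <condition_variable>
#include <mutex>

// In-order commit sketch. Aborted iterations are retried and so eventually
// commit under the same serial index.
class CommitPool {
  std::mutex m;
  std::condition_variable cv;
  unsigned long nextToCommit = 0;  // serial index of the next iteration

public:
  // Called by a thread whose iteration (serial index `order`) has finished
  // its speculative work and passed its conflict checks.
  void commitInOrder(unsigned long order) {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] { return nextToCommit == order; });
    // ... apply the iteration's buffered updates here, in serial order ...
    ++nextToCommit;
    cv.notify_all();
  }
};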
4.3. Ongoing work
In recent work, we introduced the notion of logical data
partitioning into the Galois system.13 For mesh refinement,
each partition of the graph is mapped to a core, and each
core processes bad triangles in its own partition. This
mapping reduces the likelihood of conflicts since different cores work in different regions of the graph; unlike the
randomized schedule discussed in Section 4, this approach
also promotes locality of reference. Furthermore, commutativity checks, which are expensive, can be replaced with
locking on partitions. Over-decomposition of the graph
increases the likelihood that a core will have work to do
even if some of its partitions are locked by other cores.
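A minimal sketch of the partitioned scheduling idea follows (the names are ours, not the Galois partitioning API): the graph is over-decomposed into more partitions than cores, each core owns several partitions, and a per-partition lock stands in for the per-operation commutativity checks.

#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

template <typename Work>
struct Partition {
  std::mutex lock;            // taken instead of commutativity checks
  std::deque<Work> badItems;  // e.g., bad triangles whose cavities lie here
};

template <typename Work>
class PartitionedWorkset {
  std::vector<Partition<Work>> parts;
  std::size_t numCores;

public:
  // Over-decomposition: create several partitions per core.
  PartitionedWorkset(std::size_t cores, std::size_t partitionsPerCore)
      : parts(cores * partitionsPerCore), numCores(cores) {}

  // A core scans only the partitions mapped to it. If one of its partitions
  // is currently locked (another core's cavity spilled into it), it simply
  // moves on, so over-decomposition keeps the core busy.
  bool tryGetWork(std::size_t core, Work& out) {
    for (std::size_t p = core; p < parts.size(); p += numCores) {
      std::unique_lock<std::mutex> guard(parts[p].lock, std::try_to_lock);
      if (!guard.owns_lock() || parts[p].badItems.empty()) continue;
      out = parts[p].badItems.front();
      parts[p].badItems.pop_front();
      return true;
    }
    return false;
  }

  // New work is placed in the partition that contains it.
  void add(std::size_t partition, const Work& w) {
    std::lock_guard<std::mutex> guard(parts[partition].lock);
    parts[partition].badItems.push_back(w);
  }
};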
Figure 11. Agglomerative clustering results.
(a) Execution times (seconds) of the reference code and treebuild on 1 to 4 processors.
(b) Committed and aborted iterations in treebuild:

              Committed                  Aborted
# of proc.    Max     Min     Avg        Max    Min    Avg
1             57846   57846   57846      n/a    n/a    n/a
4             57870   57849   57861      3128   1887   2528

(c) Breakdown of Galois overhead:

Source of overhead    % of overhead
Abort                  1
Commit                 8
Scheduler             39
Commutativity         52