priority, it simply performs the standard rollback operations. The thread which was executing i1 then begins a new
iteration.
This situation is complicated when i2 is the iteration that
must be rolled back. Because the Galois runtime operates at the user level, there is no way to forcibly roll back an iteration running on another thread. Instead, i1 undoes the effects of i2 without explicitly rolling back its execution, and then sets a
flag on i2 telling it to roll back. When the thread running i2
invokes a shared method or attempts to commit, it checks
this flag and completes the rollback.
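To make this hand-off concrete, the following minimal sketch shows how such an abort flag might be maintained and polled; the types and function names (IterationRecord, requestAbort, pollAbort) are our own illustration rather than the actual Galois runtime interface, and a real implementation would also release locks and replay the undo log.

    // Illustrative sketch only; names are hypothetical, not the Galois runtime API.
    #include <atomic>

    struct IterationAborted {};                  // thrown to unwind an aborted iteration

    struct IterationRecord {
        std::atomic<bool> abortRequested{false};
        // ... undo log, owned abstract locks, etc.
    };

    // Run by i1's thread after it has undone i2's effects using i2's undo log:
    // it cannot roll back i2's execution directly, so it just sets a flag.
    void requestAbort(IterationRecord& i2) {
        i2.abortRequested.store(true, std::memory_order_release);
    }

    // Run by every iteration when it invokes a shared method or tries to
    // commit; a flagged iteration completes its own rollback here.
    void pollAbort(IterationRecord& self) {
        if (self.abortRequested.load(std::memory_order_acquire))
            throw IterationAborted{};
    }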
When an iteration has to be aborted, the callbacks in its
undo log are executed in LIFO order. Note that the arguments used by a callback must have the values they had when the callback was created. The LIFO ordering of the undo log ensures this, because any later changes to those arguments are undone first.
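As a rough illustration of this discipline, here is a sketch of an undo log built from closures that capture their argument values when an operation is recorded; the UndoLog class is hypothetical and much simplified relative to the actual runtime.

    // Minimal sketch of an undo log; illustrative only, not the Galois API.
    #include <functional>
    #include <vector>

    class UndoLog {
        std::vector<std::function<void()>> entries;   // push order = execution order
    public:
        // Record the inverse of an operation; arguments are captured by value,
        // so the callback sees the state that existed when it was created.
        void record(std::function<void()> inverse) {
            entries.push_back(std::move(inverse));
        }

        // On abort, run the callbacks in LIFO order, so later changes to the
        // same data are undone before earlier ones.
        void rollback() {
            for (auto it = entries.rbegin(); it != entries.rend(); ++it) (*it)();
            entries.clear();
        }
    };

    // Example (hypothetical): undoing a workset.add(elem) performed by an iteration:
    //   undoLog.record([&workset, elem] { workset.remove(elem); });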
3.4. Discussion
There is no analog of unordered-set iterators or catch-and-release objects in current TLS systems22,24 (in fact, most of these systems auto-parallelize programs written in FORTRAN and C, which have no notion of data abstraction). This may account for the limited performance of those systems.
The TM paper of Herlihy and Moss7 has inspired a vast
literature on transactions and TM (see Larus and Rajwar15
for a survey of the more important results). The starting
point for the transactional approach is an explicitly parallel
program, and the focus is on reducing the complexities and
overhead of synchronization through the use of the transactional model. In contrast, our starting point is a sequential program, and the focus is on auto-parallelization. The
Galois runtime system exploits optimistic parallelism just
as TM exploits optimistic synchronization. Hardware TM
can be used to implement catch-and-keep classes with low
overhead, but catch-and-release classes must be supported
in software. Herlihy and Koskinen have recently introduced
catch-and-release objects into a software TM to boost its
performance.6
Ni et al.16 have proposed extending the conventional transactional model with open nested transactions and abstract locking to allow more abstract conflict checking. Open nesting, however, is only a mechanism; it does not specify how the abstract
locks should be used. Semantic commutativity provides the
appropriate definition of semantic conflict for data structures, and open nesting is one possible means of implementing semantic commutativity.
4. EVALUATION
Our initial implementation of the Galois system was in C++.
Our evaluation platform was a 4-processor, 1.5 GHz Itanium 2, with 16 KB of L1, 256 KB of L2, and 3 MB of L3 cache per processor. The threading library was pthreads.
4.1. Delaunay mesh refinement
We first wrote a sequential Delaunay mesh refinement program without locks, threads, etc., to serve as a reference implementation. We then implemented a Galois version (which we call meshgen), as well as an explicitly parallel, fine-grain locking program (FGL) that uses locks on individual triangles.
The Galois version uses the set iterator and the runtime system described in Section 3.3. In all three implementations,
the mesh was represented by a graph that was implemented
as a set of triangles, where each triangle maintained a set of
its neighbors. For meshgen, code for commutativity checks
was added by hand to this graph class; ultimately, we would
like to generate this code automatically from high-level
commutativity specifications like those in Figure 8. We used
an STL queue to implement the work-set. We refer to these
default implementations of meshgen and FGL as meshgen(d)
and FGL(d).
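To give a flavor of what such hand-written checks involve, the sketch below shows a conservative commutativity test over invocations on the triangle graph; the types and method set are our own invention rather than meshgen's code, and a complete check would also have to handle adjacency (for example, reading a triangle's neighbor set does not commute with removing one of those neighbors).

    // Hypothetical, conservative commutativity check for the triangle graph;
    // not meshgen's actual code. Adjacency-related conflicts are elided.
    struct Triangle;   // mesh triangle (details omitted)

    enum class GraphOp { AddTriangle, RemoveTriangle, GetNeighbors };

    struct GraphCall {
        GraphOp         op;
        const Triangle* arg;   // the triangle passed to the method
    };

    // Calls on distinct triangles commute; calls on the same triangle commute
    // only if neither of them mutates the graph.
    bool commutes(const GraphCall& a, const GraphCall& b) {
        if (a.arg != b.arg) return true;
        bool aMutates = (a.op != GraphOp::GetNeighbors);
        bool bMutates = (b.op != GraphOp::GetNeighbors);
        return !aMutates && !bMutates;
    }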
To understand the effect of scheduling policy on performance, we implemented two more versions, FGL(r)
and meshgen(r), in which the work-set was implemented
by a data structure that returned a random element of the
work-set.
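One simple way to obtain such a randomized work-set, sketched below purely for illustration (it is not the data structure used in the experiments), is to keep the elements in a vector and pop a randomly chosen one:

    // Illustrative work-set that hands back a random element, in contrast to
    // the FIFO order of the STL queue used in the default versions.
    #include <cassert>
    #include <cstdlib>
    #include <utility>
    #include <vector>

    template <typename T>
    class RandomWorkset {
        std::vector<T> items;
    public:
        void add(const T& x)  { items.push_back(x); }
        bool empty() const    { return items.empty(); }

        // Swap a randomly chosen element to the back and pop it, so bad
        // triangles are processed in a scrambled order.
        T get() {
            assert(!items.empty());
            std::size_t i = static_cast<std::size_t>(std::rand()) % items.size();
            std::swap(items[i], items.back());
            T x = std::move(items.back());
            items.pop_back();
            return x;
        }
    };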
The input data set was generated automatically using Jonathan Shewchuk’s Triangle program.21 It had 10,156
triangles and boundary segments, of which 4,837 triangles
were bad.
Execution times and speedups: Execution times for the
five implementations on the Itanium machine are shown in
Figure 10(a). The reference version is the fastest on a single
processor. On four processors, FGL(d) and FGL(r) differ only
slightly in performance. Meshgen(r) performed almost as well as FGL; surprisingly, meshgen(d) was twice as slow as FGL.
Statistics on committed and aborted iterations: To understand these issues better, we determined the total number of committed and aborted iterations for different versions of meshgen, as shown in Figure 10(b). On one processor, meshgen executed and committed 21,918 iterations. Because of the inherent nondeterminism of the set iterator, the number of committed and aborted iterations in the four-processor runs varies from run to run, so Figure 10(b) reports the maximum, minimum, and average values.
Figure 10. Mesh refinement results.
(a) Execution times (in seconds) versus number of processors (1 to 4) for Reference, FGL(d), FGL(r), Meshgen(d), and Meshgen(r).
(b) Committed and aborted iterations for meshgen:

                         Committed                  Aborted
 # of proc.           Max     Min     Avg       Max     Min     Avg
 1                  21918   21918   21918       n/a     n/a     n/a
 4 (meshgen(d))     22128   21458   21736     28929   27711   28290
 4 (meshgen(r))     22101   21738   21909       265     151     188
 Source of overhead    % of overhead
 Abort                            10
 Commit                           10
 Scheduler                         3
 Commutativity                    77