transactional
memory
why is it only a research toy?
environment is that we want to capture the overheads
at the instruction level and eliminate any other nondeterminism introduced by real hardware. The simulator
eliminates all other bookkeeping operations introduced
by instrumentation and provides an accurate breakdown
of the STM overheads.
We study the performance of two STM algorithms: one
that fully validates (fv) the read set after each transactional read and one that uses a global version number
(gv#) to avoid the full validation, while maintaining the
correctness of the operations. The fv algorithm provides
more concurrency at a much higher price. The gv# is
deemed as one of the best tradeoffs for STM implementations.
Figure 4 presents the single-threaded overhead of
these algorithms over sequential runs, illustrating again
the substantial slowdowns that the algorithms induce.
Figure 5 breaks down these overheads into the various
STM components. For both algorithms, the overhead of
transactional reads dominates because of the frequency
of read operations relative to all other operations. The
effectiveness of the global version number in reducing
overheads is shown in the lower read overhead of gv#.
Figure 3 provides another baseline: the overhead of
compiler instrumentation. The performance is measured
on a 16-way POWER5 running AIX 5. 3. For the STMXLC
curve, we use the uninstrumented versions of the codes
and annotate transactional regions and functions using
the language extensions provided by the compiler. 38
Compiler over-instrumentation is more pronounced
in traditional, unmanaged languages such as C and
C++, where a typical compiler instrumentation without
interprocedural analysis may end up instrumenting every
memory reference in the transactional region (except
for stack accesses). For
example, our compiler
instrumentation more
than doubled the number
of dynamic read barriers
in delaunay, genome, and
kmeans. Interprocedural
analysis can help improve
the tightness of compiler 118.1 43. 8
instrumentation for some 8
cases but is generally fv 7 gv#
limited by the accuracy of
global analysis. 6
runtime (norm. to sequential)
STM operations performance. Given this baseline, 5
we now analyze in detail
which operations in the 4
STM cause the overhead.
3
For this purpose, we use a
cycle-accurate simulator of 2
the PowerPC architecture
that provides hooks for 1
instrumentation. The STM
operations and subop- 0 b+tree
erations are instrumented
with these simulator
hooks. The reason for this
FIGURE
4Single-Threaded Overhead of the STM Algorithms
49. 2
delaunay
kmeans
genome