1
0.8
0.6
0.4
0.2
0
1
2
— STM manual
— STMXLC
34 5
of threads
6
7
8
Vacation
Speedup
1
0.8
0.6
0.4
0.2
0
1
2
3
456
of threads
7
8
ally cannot support transactions calling
legacy codes that are not instrumented
(for example, third-party libraries) without seriously limiting concurrency, such
as by serializing transactions.
figure 4: single-threaded overhead of the stm algorithms.
fv
gv#
8
118.1
43.8
49. 2
7
evaluation
Here we use the following set of benchmarks:
˲ b+tree is an implementation of database indexing operations on a b-tree
data structure for which the data is
stored only on the tree leaves. This implementation uses coarse-grain transactions for every tree operation. Each
b+ tree operation starts from the tree
root and descends down to the leaves.
A leaf update may trigger a structural
modification to rebalance the tree. A
rebalancing operation often involves
recursive ascent over the child-parent
edges. In the worst case, the rebalancing operation modifies the entire tree.
Our workload inserts 2,048 items in a
b+tree of order 20. For this code we have
only a transactional version that is not
manually instrumented, therefore experimental results are presented only
in configurations where we can use our
compiler to provide instrumentation;
˲ delaunay implements the Delaunay
Mesh Refinement algorithm described
in Kulkarni et al. 15 The code produces
a guaranteed quality Delaunay mesh.
This is a Delaunay triangulation with
the additional constraint that no angle
in the mesh be less than 30 degrees.
The benchmark takes as input an unrefined Delaunay triangulation and
produces a new triangulation that satisfies this constraint. In the TM implementation of the algorithm, multiple
threads choose their elements from a
work-queue and refine the cavities as
separate transactions.
˲ genome, kmeans, and vacation are
part of the STAMP benchmark suite19
runtime (norm. to sequential)
6
5
4
3
2
1
0
b+tree
delaunay
version 0.9.4. For a detailed description
of these benchmarks see STAMP. 30
Baseline Performance. In Figure 2 we
present a performance comparison of
three STMs: the IBM, 31, 34 Intel, 14 and
Sun’s TL27 STMs. The runs are on a
quad-core, two-way hyperthreaded Intel
Xeon 2.3GHz box running Linux Fedora
Core 6. In these runs, we used the manually instrumented versions of the codes
that aggressively minimize the number
of barriers for the IBM and TL2 STMs.
Since we do not have access to low-level
APIs for the Intel STM, the curves for the
Intel STM are from codes instrumented
by its compiler, which incur additional
barrier overheads due to compiler instrumentation. 36 The graphs are scalability curves with respect to the serial,
non-transactionalized version. Therefore a value of 1 on the y-axis represents
performance equal to the serial version.
The performance of these STMs is mostly on par, with the IBM STM showing
better scalability on delaunay and TL2
obtaining better scalability on genome.
However, the overall performance obtained is very low: on kmeans the IBM
kmeans
genome
vacation
STM barely attains single thread performance at 4 threads, while on vacation
none of the STMs actually overcome the
overhead of transactional memory even
with 8 threads.
Compiler Instrumentation. The compiler is a necessary component of an
STM-based programming environment
that is to be adopted by mass programmers. Its basic role is to eliminate the
need for programmers to manually instrument memory references to STM
read- and write-barriers. While offering
convenience, compiler instrumentation does add another layer of overheads to the STM system by introducing
redundant barriers, often due to conservativeness of compiler analysis, as also
observed in Yoo. 36
Figure 3 provides another baseline:
the overhead of compiler instrumentation. The performance is measured
on a 16-way POWER5 running AIX 5. 3.
For the STMXLC curve, we use the un-instrumented versions of the codes
and annotate transactional regions and
functions using the language extensions provided by the compiler. 31