1 0.8 0.6 0.4 0.2
0
1
2
— STM manual
— STMXLC
34 5
of threads
6
7
8
Vacation
Speedup
1 0.8 0.6 0.4 0.2
0
1
2
3
456
of threads
7
8
ally cannot support transactions calling legacy codes that are not instrumented (for example, third-party libraries) without seriously limiting concurrency, such as by serializing transactions.
figure 4: single-threaded overhead of the stm algorithms.
fv
gv#
8
118.1
43.8
49. 2
7
Here we use the following set of benchmarks:
˲ b+tree is an implementation of database indexing operations on a b-tree data structure for which the data is stored only on the tree leaves. This implementation uses coarse-grain transactions for every tree operation. Each b+ tree operation starts from the tree root and descends down to the leaves. A leaf update may trigger a structural modification to rebalance the tree. A rebalancing operation often involves recursive ascent over the child-parent edges. In the worst case, the rebalancing operation modifies the entire tree. Our workload inserts 2,048 items in a b+tree of order 20. For this code we have only a transactional version that is not manually instrumented, therefore experimental results are presented only in configurations where we can use our compiler to provide instrumentation;
˲ delaunay implements the Delaunay Mesh Refinement algorithm described in Kulkarni et al. 15 The code produces a guaranteed quality Delaunay mesh. This is a Delaunay triangulation with the additional constraint that no angle in the mesh be less than 30 degrees. The benchmark takes as input an unrefined Delaunay triangulation and produces a new triangulation that satisfies this constraint. In the TM implementation of the algorithm, multiple threads choose their elements from a work-queue and refine the cavities as separate transactions.
˲ genome, kmeans, and vacation are part of the STAMP benchmark suite19
runtime (norm. to sequential)
6
5
4
3
2
1
0
b+tree
delaunay
version 0.9.4. For a detailed description of these benchmarks see STAMP. 30
Baseline Performance. In Figure 2 we present a performance comparison of three STMs: the IBM, 31, 34 Intel, 14 and Sun’s TL27 STMs. The runs are on a quad-core, two-way hyperthreaded Intel Xeon 2.3GHz box running Linux Fedora Core 6. In these runs, we used the manually instrumented versions of the codes that aggressively minimize the number of barriers for the IBM and TL2 STMs. Since we do not have access to low-level APIs for the Intel STM, the curves for the Intel STM are from codes instrumented by its compiler, which incur additional barrier overheads due to compiler instrumentation. 36 The graphs are scalability curves with respect to the serial, non-transactionalized version. Therefore a value of 1 on the y-axis represents performance equal to the serial version. The performance of these STMs is mostly on par, with the IBM STM showing better scalability on delaunay and TL2 obtaining better scalability on genome. However, the overall performance obtained is very low: on kmeans the IBM
kmeans
genome
vacation
STM barely attains single thread performance at 4 threads, while on vacation none of the STMs actually overcome the overhead of transactional memory even with 8 threads.
Compiler Instrumentation. The compiler is a necessary component of an STM-based programming environment that is to be adopted by mass programmers. Its basic role is to eliminate the need for programmers to manually instrument memory references to STM read- and write-barriers. While offering convenience, compiler instrumentation does add another layer of overheads to the STM system by introducing redundant barriers, often due to conservativeness of compiler analysis, as also observed in Yoo. 36
Figure 3 provides another baseline: the overhead of compiler instrumentation. The performance is measured on a 16-way POWER5 running AIX 5. 3. For the STMXLC curve, we use the un-instrumented versions of the codes and annotate transactional regions and functions using the language extensions provided by the compiler. 31
References:
Archives