1.0
0.8
speedup
0.6
0.4
0.2
STM manual STMXLC
genome
1.0
0.8
speedup
0.6
0.4
0.2
STM manual STMXLC
vacation
0
1
2
3456 number of threads
7
8
0
1
2
3456
number of threads
7
8
is stored only on the tree leaves (a b+ tree). This implementation uses coarse-grained transactions for every tree operation. Each b+ tree operation starts from the tree root and descends to the leaves. A leaf update may trigger a structural modification to rebalance the tree. A rebalancing operation often involves recursive ascent over the child-parent edges. In the worst case, the rebalancing operation modifies the entire tree. Our workload inserts 2,048 items in a b+ tree of order 20. For this code we have only a transactional version that is not manually instrumented; therefore, experimental results are presented only in configurations where we can use our compiler to provide instrumentation.
• delaunay implements the Delaunay Mesh Refinement algorithm described in Kulkarni et al. 29 The code produces a guaranteed quality Delaunay mesh. This is a Delaunay triangulation with the additional constraint that no angle in the mesh be less than 30 degrees. The benchmark takes as input an unrefined Delaunay triangulation and produces a new triangulation that satisfies this constraint. In the TM implementation of the algorithm, multiple threads choose their elements from a work queue and refine the cavities as separate transactions.
• genome, kmeans, and vacation are part of the STAMP benchmark suite30 version 0.9.4. For a detailed description of these benchmarks, see STAMP. 31
Baseline performance. Figure 2 presents a performance comparison of three STMs: IBM, 32, 33 Intel, 34 and Sun TL2.35 The runs are on a quad-core, two-way hyperthreaded Intel Xeon 2.3-GHz box running Linux Fedora Core 6. In
these runs, we used the manually instrumented versions of the codes, which aggressively minimize the number of barriers for the IBM and TL2 STMs. Since we do not have access to low-level APIs for the Intel STM, its curves are from codes instrumented by the Intel STM compiler, which incurs additional barrier overheads as a result of compiler instrumentation. 36 The graphs are scalability curves with respect to the serial, nontransactionalized version. Therefore, a value of 1 on the y-axis represents performance equal to the serial version.
The performance of these STMs is mostly on par, with the IBM STM showing better scalability on delaunay and TL2 obtaining better scalability on genome. The overall performance obtained is very low, however: on kmeans the IBM STM barely attains single-threaded performance at four threads, while on vacation none of the STMs actually overcomes the overhead of transactional memory even with eight threads.
Compiler instrumentation. The compiler is a necessary component of an STM-based programming environment that is to be adopted by mass programmers. Its basic role is to eliminate the need for programmers to manually instrument memory references to STM read and write barriers. While offering convenience, compiler instrumentation does add another layer of overheads to the STM system by introducing redundant barriers, often resulting from the conservativeness of compiler analysis, as observed in Yoo. 37
References:
Archives