environment is that we want to capture the overheads at the instruction level and eliminate any other nondeterminism introduced by real hardware. The simulator eliminates all other bookkeeping operations introduced by instrumentation and provides an accurate breakdown of the STM overheads.
We study the performance of two STM algorithms: one that fully validates (fv) the read set after each transactional read and one that uses a global version number (gv#) to avoid the full validation, while maintaining the correctness of the operations. The fv algorithm provides more concurrency at a much higher price. The gv# is deemed as one of the best tradeoffs for STM implementations.
Figure 4 presents the single-threaded overhead of these algorithms over sequential runs, illustrating again the substantial slowdowns that the algorithms induce. Figure 5 breaks down these overheads into the various STM components. For both algorithms, the overhead of transactional reads dominates because of the frequency of read operations relative to all other operations. The effectiveness of the global version number in reducing overheads is shown in the lower read overhead of gv#.
Figure 3 provides another baseline: the overhead of compiler instrumentation. The performance is measured on a 16-way POWER5 running AIX 5. 3. For the STMXLC curve, we use the uninstrumented versions of the codes and annotate transactional regions and functions using the language extensions provided by the compiler. 38
Compiler over-instrumentation is more pronounced in traditional, unmanaged languages such as C and C++, where a typical compiler instrumentation without interprocedural analysis may end up instrumenting every memory reference in the transactional region (except for stack accesses). For example, our compiler instrumentation more than doubled the number of dynamic read barriers in delaunay, genome, and kmeans. Interprocedural analysis can help improve the tightness of compiler 118.1 43. 8 instrumentation for some 8 cases but is generally fv 7 gv# limited by the accuracy of global analysis. 6
runtime (norm. to sequential)
STM operations performance. Given this baseline, 5 we now analyze in detail which operations in the 4 STM cause the overhead.
3
For this purpose, we use a cycle-accurate simulator of 2 the PowerPC architecture that provides hooks for 1 instrumentation. The STM operations and subop- 0 b+tree erations are instrumented with these simulator hooks. The reason for this
49. 2
delaunay
kmeans
genome
References:
Archives