other read
end free
malloc write
begin stack_range
desc kernel
100
90
runtime (norm. to sequential)
80
70
60
50
40
30
20
10
0
fv gv#
b+tree
fv gv#
delaunay
fv gv#
kmeans
fv gv#
genome
fv gv#
vacation
figure 6: Percentage of time spent in stm read sub-operations.
return validate sync read data
add metadata to read set
check if metadata is locked
read metadata
calculate metadata
check read after write
setup
call
other
100
90
of cycles (norm. to fv)
80
70
60
50
40
30
20
10
0
fv gv#
b+tree
fv gv#
delaunay
fv gv#
kmeans
fv gv#
genome
fv gv#
vacation
Compiler over-instrumentation is more pronounced in traditional, unmanaged languages, such as C and C++, where a compiler instrumentation without interprocedural analysis may end up instrumenting every memory reference in the transactional region (except for stack accesses). Indeed, our compiler instrumentation more than doubled the number of dynamic read barriers in delaunay, genome, and kmeans . Interprocedural analysis can help improve the tightness of compiler instrumentation for some cases, but is generally limited by the accuracy of global analysis.
STM Operations Performance. Given this baseline, we now analyze in detail which operations in the STM cause the overhead. For this purpose, we use a cycle-accurate simulator of the Power-PC architecture that provides hooks for instrumentation. The STM operations and suboperations are instrumented with these simulator hooks. The reason for this environment is that we want to capture the overheads at instruction level and eliminate any other nondeterminism introduced by real hardware. The simulator eliminates all other bookkeeping operations introduced by
instrumentation and provides an accurate breakdown of the STM overheads.
We study the performance of two STM algorithms: one that fully validates (“fv") the read set after each transactional read and one that uses a global version number (“gv#") to avoid the full validation, while maintaining the correctness of the operations. The fv algorithm provides more concurrency at a much higher price. The gv# is deemed as one of the best trade-offs for STM implementations.
Figure 4 presents the single-threaded overhead of these algorithms over sequential runs, illustrating again the substantial slowdowns that the algorithms induce. Figure 5 breaks down these overheads into the various STM components. For both algorithms, the overhead of transactional reads dominates due to the frequency of read operations relative to all other operations. The effectiveness of the global version number in reducing overheads is shown in the lower read overhead of “gv#.”
Figure 6 gives a fine-grain breakdown of the overheads of the transactional read operation. As expected, the overhead of validating the read set dominates transactional read time in the “fv” configuration. For both algorithms, the isync operations (necessary for ordering the metadata read and data read as well as the data read and validation) form a substantial component. In applications that perform writes before reads in the same transaction (delaunay, kmeans), the time spent checking whether a location has been written by prior writes in the same transaction forms a significant component of the total time. Interestingly, reading the data itself is a negligible amount of the total time, indicating the hurdles that must be overcome for the performance of these algorithms to be compelling.
Figure 7 gives a similar breakdown of the transactional commit operation. As before, the “fv" configuration suffers from having to validate the read set. Other dominant overheads for both configurations are that of having to acquire the metadata for the write set (which involves a sequence of load-linked/store-conditional operations) and the sync operations that are necessary for ordering the metadata acquires, data writes, and metadata releases. Once again, the data writes themselves form a small
References:
Archives