5Percentage of Time Spent in Different STM Operations
100
90
80
percent of cycles (norm. to fv)
70
60
50
40
30
20
other
read
end
free
malloc
write
begin
stack_range
descriptor check
kernel
10
0
fv gv# b+tree
fv gv# delaunay
fv gv# kmeans
fv gv# genome
Figure 6 gives a fine-grained breakdown of the overheads of the transactional read operation. As expected, the overhead of validating the read set dominates transactional read time in the fv configuration. For both algorithms, the isync operations (necessary for ordering the metadata read and data read, as well as the data read and validation) form a substantial component. In applications that perform writes before reads in the same transaction (delaunay, kmeans), the time spent checking whether a location has been written by prior transactional writes in the same transaction forms a significant component of the total time. Interestingly, reading the data itself is a negligible amount of the total time, indicating the hurdles that must be overcome for the performance of these algorithms to be compelling.
Figure 7 gives a similar breakdown of the transactional commit operation. As before, the fv configuration suffers from having to validate the read set. Other dominant overheads for both configurations are those of having to acquire the metadata for the write set (which involves a sequence of load-linked/store-conditional operations) and
the sync operations that are necessary for ordering the metadata acquires, data writes, and metadata releases. Once again, the data writes themselves form a small component of the total time.
Overhead optimizations. There have been many proposals on reducing STM overheads through compiler or runtime techniques. Most of these techniques are complementary to hardware acceleration for STM. • Redundant barrier elimination. One technique is to eliminate barriers to thread-local objects through escape analysis. Such analysis is typically
fv gv# quite effective in identify-
vacation ing thread-local accesses
that are close to the object
allocation site. It can elimi-
nate both read and write
barriers but is often more
effective on write barriers. For example, we observe that an intraprocedural escape analysis can eliminate 40 to 50 percent of write barriers in vacation, genome, and b+tree. Its impact on performance is more limited, however: from negligible to 12 percent. To target redundant read barriers, a whole-program analysis called Not-Accessed-In-Transaction39 eliminates some barriers to read-only objects in transactions.
• Barrier strength reduction. These optimizations do not eliminate barriers but identify at runtime special locations that require only lightweight barrier processing, such as dynamic tracking of thread-local objects40, 41 and runtime filtering of stack references and duplicate references.4 2
• Code generation optimizations. One common technique is to inline the fast path of barriers. It has the potential benefit of reducing function-call overhead, increasing ILP, and exposing reuse of common sub-bar-rier operations. In our experiments, compiler inlining achieved less than 2 percent overall improvement across our benchmark suite.
References:
Archives