FIGURE
5Percentage of Time Spent in Different STM Operations
100
90
80
percent of cycles (norm. to fv)
70
60
50
40
30
20
other
read
end
free
malloc
write
begin
stack_range
descriptor check
kernel
10
0
fv gv#
b+tree
fv gv#
delaunay
fv gv#
kmeans
fv gv#
genome
Figure 6 gives a fine-grained breakdown of the overheads of the transactional read operation. As expected,
the overhead of validating the read set dominates transactional read time in the fv configuration. For both algorithms, the isync operations (necessary for ordering the
metadata read and data read, as well as the data read and
validation) form a substantial component. In applications
that perform writes before reads in the same transaction
(delaunay, kmeans), the time spent checking whether a
location has been written by prior transactional writes
in the same transaction forms a significant component
of the total time. Interestingly, reading the data itself
is a negligible amount of the total time, indicating the
hurdles that must be overcome for the performance of
these algorithms to be compelling.
Figure 7 gives a similar breakdown of the transactional
commit operation. As before, the fv configuration suffers
from having to validate the read set. Other dominant
overheads for both configurations are those of having to
acquire the metadata for the write set (which involves a
sequence of load-linked/store-conditional operations) and
the sync operations that
are necessary for ordering
the metadata acquires,
data writes, and metadata
releases. Once again, the
data writes themselves
form a small component of
the total time.
Overhead optimizations. There have been
many proposals on
reducing STM overheads
through compiler or
runtime techniques. Most
of these techniques are
complementary to hardware acceleration for STM.
• Redundant barrier
elimination. One technique
is to eliminate barriers
to thread-local objects
through escape analysis.
Such analysis is typically
fv gv# quite effective in identify-
vacation ing thread-local accesses
that are close to the object
allocation site. It can elimi-
nate both read and write
barriers but is often more
effective on write barriers. For example, we observe that
an intraprocedural escape analysis can eliminate 40 to 50
percent of write barriers in vacation, genome, and b+tree.
Its impact on performance is more limited, however:
from negligible to 12 percent. To target redundant read
barriers, a whole-program analysis called Not-Accessed-In-Transaction39 eliminates some barriers to read-only
objects in transactions.
• Barrier strength reduction. These optimizations do not
eliminate barriers but identify at runtime special locations
that require only lightweight barrier processing, such as
dynamic tracking of thread-local objects40, 41 and runtime
filtering of stack references and duplicate references.4 2
• Code generation optimizations. One common technique is to inline the fast path of barriers. It has the
potential benefit of reducing function-call overhead,
increasing ILP, and exposing reuse of common sub-bar-rier operations. In our experiments, compiler inlining
achieved less than 2 percent overall improvement across
our benchmark suite.