Table 4. Compiler instrumentation cost with x86 (1 − speedup_STM-CE / speedup_STM-ME).

Threads   1      2      4      8      16
Min       0      0      0      0      0
Max       0.42   0.4    0.4    0.47   0.44
Avg       0.16   0.17   0.11   0.11   0.17
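As a small illustration of the metric in the Table 4 caption, the sketch below computes 1 − speedup(STM-CE) / speedup(STM-ME), reading the caption as a ratio of the two speedups. The function name and the sample speedups are hypothetical, chosen only to show the arithmetic; they are not measurements from the experiments.

```python
def instrumentation_cost(speedup_ce: float, speedup_me: float) -> float:
    """Relative cost of compiler instrumentation.

    speedup_ce: speedup with compiler instrumentation (STM-CE).
    speedup_me: speedup with manual instrumentation (STM-ME).
    Returns 0.0 when both variants perform the same and approaches 1.0
    as the compiler-instrumented variant falls further behind.
    """
    if speedup_me <= 0:
        raise ValueError("speedup_me must be positive")
    return 1.0 - speedup_ce / speedup_me


# Hypothetical speedups, for illustration only.
cost = instrumentation_cost(speedup_ce=4.0, speedup_me=5.0)
print(round(cost, 2))
```

Under this reading, a minimum of 0 in the table means that on at least one workload the compiler-instrumented code was as fast as the manually instrumented code.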
ing partially visible reads. By making readers only partially visible, the cost of reads is reduced, compared to fully visible reads, while improving the scalability of privatization support. To implement partially visible readers, Marathe et al.17 used timestamps, while Lev et al.16 used a variant of SNZI counters.10 In addition, Lev et al.16 avoided use of centralized privatization metadata to improve scalability.
STM-CE Performance
Compiler instrumentation often replaces more memory references with STM load and store calls than is strictly necessary, reducing the performance of the generated code; this is known as "over-instrumentation."3,8,25 Ideally, the compiler replaces memory accesses with STM calls only when they reference shared data. However, the compiler does not have information about all uses of variables in the whole program, nor the semantic information about variable use that is typically available only to the programmer (such as which variables are private to some threads and which are read-only). For this reason, the compiler conservatively generates more STM calls than necessary; unnecessary STM calls reduce performance because they are more expensive than the CPU instructions they replace.
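The effect of over-instrumentation can be sketched with a toy model. The code below is not a real STM: `CountingSTM`, `sum_conservative`, and `sum_ideal` are hypothetical names, and the "runtime" only counts barrier calls without providing any concurrency control. It contrasts a conservative translation, which routes accesses to a thread-private accumulator through STM calls, with an ideal one that instruments only the shared data.

```python
class CountingSTM:
    """Toy stand-in for an STM runtime: counts barrier calls only."""

    def __init__(self):
        self.loads = 0
        self.stores = 0

    def load(self, mem, key):
        self.loads += 1
        return mem[key]

    def store(self, mem, key, value):
        self.stores += 1
        mem[key] = value


def sum_conservative(stm, shared, n):
    # The "compiler" cannot prove the accumulator is thread-private,
    # so every access to it goes through an STM call as well.
    private = {"acc": 0}
    for i in range(n):
        x = stm.load(shared, i)               # necessary: shared data
        acc = stm.load(private, "acc")        # unnecessary: private data
        stm.store(private, "acc", acc + x)    # unnecessary: private data
    return stm.load(private, "acc")           # unnecessary: private data


def sum_ideal(stm, shared, n):
    # Only accesses to shared data are instrumented.
    acc = 0
    for i in range(n):
        acc += stm.load(shared, i)
    return acc


shared = [1, 2, 3, 4]
conservative, ideal = CountingSTM(), CountingSTM()
assert sum_conservative(conservative, shared, 4) == sum_ideal(ideal, shared, 4)
print(conservative.loads + conservative.stores, ideal.loads + ideal.stores)
```

Both versions compute the same result, but the conservative one issues more than three times as many barrier calls in this example, each of which would cost more than the plain load or store it replaces.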
Figure 4. STM-CE performance with 16-core x86.
[Plot: speedup (y-axis, 0 to 10) versus number of threads (x-axis: 1, 2, 4, 8, 16) for Bayes, Intruder, Kmeans High, Kmeans Low, Labyrinth, and SSCA2.]
additional overheads introduced by
compiler instrumentation remain acceptable, as STM-CE outperforms sequential code on 10 of 14 workloads
with only four threads and on all but
one workload overall.
Further optimizations. Ni et al.18 described optimizations that replace full STM load and store calls with specialized, faster versions of the same calls; for example, some STMs perform fast reads of memory locations previously accessed for writing inside the same transaction. While the compiler we used supports these optimizations, we have not yet implemented the lower-cost STM barriers in SwissTM. Compiler data structure analysis was used by Riegel et al.22 to optimize the code generated by the Tanger STM compiler. Adl-Tabatabai et al.1 proposed several optimizations in the Java context to eliminate transactional accesses to immutable data and data allocated inside current transactions. Harris et al.14 used flow-sensitive interprocedural compiler analysis, as well as runtime log filtering in Bartok-STM, to identify objects allocated in the current transaction and eliminate transactional accesses to them. Eddon and Herlihy9 used dataflow analysis of Java programs to eliminate some unnecessary transactional accesses.
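The idea of a specialized read barrier for locations already written in the same transaction can be sketched as follows. This is a minimal single-threaded toy, not the API of SwissTM or of any compiler runtime: `ToyTransaction`, `load`, and `load_after_write` are hypothetical names, and locking, validation, and rollback are all omitted.

```python
class ToyTransaction:
    """Toy transaction with a write log; no locking or validation."""

    def __init__(self, memory):
        self.memory = memory      # shared memory, modeled as a dict
        self.write_log = {}       # buffered writes of this transaction
        self.read_log = []        # addresses a real STM would validate

    def store(self, addr, value):
        self.write_log[addr] = value

    def load(self, addr):
        # General barrier: must check the write log first, and otherwise
        # read shared memory and record the read for later validation.
        if addr in self.write_log:
            return self.write_log[addr]
        self.read_log.append(addr)
        return self.memory[addr]

    def load_after_write(self, addr):
        # Specialized barrier: the caller (for example, the compiler)
        # guarantees that addr was written earlier in this transaction,
        # so the membership check and read logging are skipped.
        return self.write_log[addr]

    def commit(self):
        self.memory.update(self.write_log)


memory = {"x": 1, "y": 2}
tx = ToyTransaction(memory)
tx.store("x", 10)
total = tx.load_after_write("x") + tx.load("y")
tx.commit()
print(total, memory["x"])
```

The specialized call is only safe when the earlier write is guaranteed; in this sketch, calling `load_after_write` on an address that was never written raises `KeyError`.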
STM-CT performance. We also performed experiments with STM-CT (using both compiler instrumentation and transparent privatization) but defer the results to the companion technical report.6 Our experiments showed that, despite the high costs of transparent privatization and compiler over-instrumentation, STM-CT outperformed sequential code on all but four of the 14 workloads. However, STM-CT requires higher thread counts to outperform sequential code than previous STM variants for the same workloads, as it outperformed sequential code in only five of 14 workloads with four threads. The overheads of STM-CT are largely a simple combination of STM-CE and STM-MT overheads; the same techniques for reducing transparent privatization and compilation overheads are applicable here.
Programming model. The experiments we report here imply that STM-CE (compiler instrumentation with
explicit privatization) may be the most