Contradicting earlier results. The results reported by Cascaval et al.3 indicated STMs do not perform well on three of the STAMP applications we also used: kmeans, vacation, and genome. In our experiments, STM delivered good performance on all three. Three main reasons account for this considerable difference:
Workload characteristics. A close look at the experimental settings in Cascaval et al.3 reveals their workloads had higher contention than the default STAMP workloads. STM usually has the lowest performance in highly contended workloads, consistent with our previous experiments, as in Figure 1.
To evaluate the impact of workload characteristics, we ran both default STAMP workloads and STAMP workloads from Cascaval et al.3 on a machine with two quad-core Xeon CPUs that was more similar to the machine in Cascaval et al.3 than to the x86 machine we used in our earlier experiments. Figure 2a shows the slowdown of workloads from Cascaval et al.3 compared to default STAMP workloads; we used both low- and high-contention workloads for kmeans and vacation. Workload settings from Cascaval et al.3 do indeed degrade STM-ME performance. The performance impact is significant in kmeans (around 20% for high- and up to 200% for low-contention workloads) and in vacation (30% to 50% in both). The performance is least affected in genome (around 10%).
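As a reading aid, the percentages quoted in the text can be related to the slowdown axis of Figure 2. The sketch below assumes the axis is a ratio in which 1 means no change, so that a factor of 3 corresponds to a 200% slowdown. The article does not state the exact metric, so both function names and the timings used here are illustrative assumptions, not measured values.

```python
def slowdown_factor(baseline_time: float, degraded_time: float) -> float:
    """Ratio of degraded to baseline time (assumed definition; 1.0 = no change)."""
    if baseline_time <= 0:
        raise ValueError("baseline_time must be positive")
    return degraded_time / baseline_time


def slowdown_percent(factor: float) -> float:
    """Convert a slowdown factor into a percentage, e.g. 3.0 -> 200.0."""
    return (factor - 1.0) * 100.0


# Hypothetical timings in seconds, chosen only to illustrate the conversion.
factor = slowdown_factor(baseline_time=10.0, degraded_time=12.0)
percent = round(slowdown_percent(factor), 1)
```

Under this assumed definition, a run that takes 12 seconds instead of 10 has a slowdown factor of 1.2, which corresponds to a 20% slowdown.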
Figure 2. Impact of experimental settings3 on STM-ME performance.
(2a) Workload impact. Y-axis: slowdown; legend: 1, 2, 4, 8.
to a similar machine without hyper-threading. The figure shows that hyperthreading has a significant effect
on performance, especially with higher
thread counts. Slowdown in genome
with four threads is around 65% and
on two vacation workloads around
40%. The performance difference in
kmeans workloads is significant, even
with a single thread, due to differences
in CPUs not related to hyperthreading. Still, even with kmeans, slowdown
with four threads is much higher than
with one and two threads.
More-efficient STM. Part of the performance difference is due to a more efficient STM implementation. The results reported by Dragojević et al.7 suggest that SwissTM has better performance than TL2, which in turn performs comparably to the IBM STM in Cascaval et al.3
We also experimented with TL2,5 McRT-STM,1 and TinySTM.20 Tim Harris of Microsoft Research provided us with the Bartok STM14 performance results on a subset of STAMP. All these experiments confirm our general conclusion about good STM performance on a range of workloads.6
Further optimizations. In some workloads, performance degraded when we used too many concurrent threads. One possible way to improve performance in these cases would be to modify the thread scheduler so it avoids running more concurrent threads than is optimal for a given workload, based on the information provided by the STM runtime.
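The scheduling idea above can be illustrated with a small sketch. It is not the authors' design: it assumes the STM runtime can report commit and abort counts per sampling interval, and it uses the abort ratio as the contention signal. The class name `ConcurrencyThrottle`, its thresholds, and the one-thread-at-a-time policy are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConcurrencyThrottle:
    """Hypothetical controller that limits how many threads run transactions."""

    max_threads: int
    high_abort_ratio: float = 0.5  # assumed threshold: above this, shed a thread
    low_abort_ratio: float = 0.1   # assumed threshold: below this, add a thread
    allowed: int = 1               # current limit on concurrent threads

    def update(self, commits: int, aborts: int) -> int:
        """Adjust the limit from one interval's (assumed) STM statistics."""
        total = commits + aborts
        if total == 0:
            return self.allowed  # no transactions ran; keep the current limit
        abort_ratio = aborts / total
        if abort_ratio > self.high_abort_ratio and self.allowed > 1:
            self.allowed -= 1    # heavy contention: run fewer threads
        elif abort_ratio < self.low_abort_ratio and self.allowed < self.max_threads:
            self.allowed += 1    # little contention: allow one more thread
        return self.allowed


throttle = ConcurrencyThrottle(max_threads=4)
history = [throttle.update(commits=100, aborts=0) for _ in range(4)]
after_contention = throttle.update(commits=10, aborts=90)
```

With no aborts, the limit climbs one thread per interval until it reaches `max_threads`; an interval dominated by aborts then lowers it by one. A real scheduler would need a more careful signal than the raw abort ratio, since aborts do not always mean that fewer threads would be faster.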
(2b) Hyper-threading impact. Y-axis: slowdown; legend: 1, 2, 4.
Workloads shown in both panels: Genome, Kmeans high, Kmeans low, Vacation high, Vacation low.
STM-MT Performance