hardware supporting higher levels of
concurrency. Specifically, we experimented with a state-of-the-art STM
implementation, SwissTM [7], running three different STMBench7 [12]
workloads, all 10 workloads of the STAMP (0.9.10) [2] benchmark suite,
and four microbenchmarks, together covering both large- and small-scale
workloads. We considered two hardware platforms: a Sun Microsystems
UltraSPARC T2 CPU machine (referred to as SPARC in the rest of this
article) supporting 64 hardware threads and a machine with four
quad-core AMD Opteron x86 CPUs (referred to as x86 in the rest of this
article) supporting 16 hardware threads. Finally, we also considered
all combinations of privatization and
compiler support for STM (see Table
1). This constitutes the most exhaustive performance comparison of STM
to sequential code published to date.
The experiments in this article
(summarized in Table 2) show that
STM does indeed outperform sequential code in most configurations and
benchmarks, offering a viable paradigm for concurrent programming;
STM with manually instrumented
benchmarks and explicit privatization
outperforms sequential code by up to
29 times on SPARC with 64 concurrent
threads and by up to nine times on
x86 with 16 concurrent threads. More
important, STM performs well with
a small number of threads on many
benchmarks; for example, STM-ME
outperforms sequential code with four
threads on 14 of 17 workloads on our
SPARC machine and on 13 of 17 workloads on our x86 machine. Basically,
these results support the early hopes for good STM performance and
should motivate further research. Our
results contradict the results of Cascaval et al. [3] for three main reasons:
• STAMP workloads in Cascaval et al. [3] presented higher contention than
default STAMP workloads;
• We used hardware supporting more threads and, in the case of x86, did
not use hyperthreading; and
• We used a state-of-the-art STM implementation more efficient than those
used in Cascaval et al. [3].

Table 1. STM support.
Model     Instrumentation   Privatization
STM-ME    manual            explicit
STM-CE    compiler          explicit
STM-MT    manual            transparent
STM-CT    compiler          transparent

Clearly, and despite rather good
STM performance in our experiments,
there is room for improvement, and we
use this article to highlight promising
directions. Also, while use of STM involves several programming
challenges [3] (such as ensuring weak or strong
atomicity, semantics of privatization,
and support for legacy binary code),
alternative concurrency programming
approaches (such as fine-grain locking
and lock-free techniques) are no easier
to use than STM. Such a comparison
was covered previously [11, 13, 15, 23] and is beyond our scope here.

Evaluation Settings
We first briefly describe the STM library
used for our experimental evaluation,
SwissTM [7], along with benchmarks and
hardware settings. Note that our experiments with other state-of-the-art
STMs [5, 14, 18, 20], on which we report in the
companion technical report [6], confirm
the results presented here; SwissTM
and the benchmarks are available at
http://lpd.epfl.ch/site/research/tmeval.
The STM we used in our evaluation
reflects three main features:
Synchronization algorithm. SwissTM [7]
is a word-based STM that uses invisible (optimistic) reads, relying on a
time-based scheme to speed up read-set validation, as in Dice et al. [5]
and Riegel et al. [21]. SwissTM detects read/write conflicts lazily and
write/write conflicts eagerly. The two-phase contention manager uses
different algorithms for short and long transactions.
This design was chosen to provide
good performance across a range of
workloads [7];
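
The synchronization scheme described above can be illustrated with a toy model. The following Python sketch is not SwissTM code; it is a minimal, single-threaded simulation written under our own assumptions: a global commit counter acts as the time base, reads are recorded only in a private read set (so they stay invisible to other transactions), write/write conflicts are detected eagerly through per-word write ownership, and read/write conflicts are detected lazily by validating the read set at commit. The names `Word`, `Store`, `Tx`, and `Conflict` are hypothetical, and the two-phase contention manager is reduced to aborting whichever transaction detects the conflict.

```python
class Conflict(Exception):
    """Raised when a transaction has to abort."""


class Word:
    """One shared memory word (hypothetical layout, not SwissTM's)."""
    def __init__(self, value):
        self.value = value
        self.version = 0    # commit timestamp of the last committed write
        self.owner = None   # transaction holding the write lock, if any


class Store:
    """Holds the global commit counter that serves as the time base."""
    def __init__(self):
        self.clock = 0


class Tx:
    def __init__(self, store):
        self.store = store
        self.valid_ts = store.clock  # time at which the read set is known valid
        self.reads = {}              # Word -> version seen (invisible to others)
        self.writes = {}             # Word -> buffered value

    def _validate(self):
        return all(w.version == seen for w, seen in self.reads.items())

    def read(self, word):
        if word in self.writes:
            return self.writes[word]
        if word.version > self.valid_ts:
            # Time-based shortcut: the read set is revalidated only when a
            # word is newer than the snapshot, not on every read.
            now = self.store.clock
            if not self._validate():
                self.abort()
                raise Conflict("read set is stale")
            self.valid_ts = now
        self.reads[word] = word.version
        return word.value

    def write(self, word, value):
        if word.owner not in (None, self):
            self.abort()
            raise Conflict("write/write conflict")  # detected eagerly
        word.owner = self
        self.writes[word] = value

    def commit(self):
        if self.writes:
            self.store.clock += 1
            ts = self.store.clock
            # Read/write conflicts are detected lazily, here at commit.
            if ts > self.valid_ts + 1 and not self._validate():
                self.abort()
                raise Conflict("read/write conflict")
            for word, value in self.writes.items():
                word.value, word.version, word.owner = value, ts, None

    def abort(self):
        for word in self.writes:
            word.owner = None
        self.writes.clear()


def demo():
    store, x, y = Store(), Word(10), Word(20)
    t1, t2 = Tx(store), Tx(store)
    t1.read(x)                      # invisible read: t2 cannot see it
    t2.write(x, 11)
    t2.commit()                     # succeeds; t1 is not notified
    t1.write(y, 21)
    try:
        t1.commit()
        lazy_abort = False
    except Conflict:
        lazy_abort = True           # stale read of x found at commit
    t3, t4 = Tx(store), Tx(store)
    t3.write(x, 12)
    try:
        t4.write(x, 13)
        eager_abort = False
    except Conflict:
        eager_abort = True          # conflict found at the write itself
    t3.commit()
    return lazy_abort, eager_abort, x.value, y.value
```

In `demo()`, the first pair of transactions shows the lazy case: the reader learns that its read of `x` is stale only when it tries to commit. The second pair shows the eager case: the second writer is stopped at the write itself. A real implementation would also need atomic operations and memory fences, which this single-threaded model deliberately leaves out.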
Privatization. We implemented
privatization support in SwissTM using
a simple validation-barriers scheme
described in Spear et al. [24]. To ensure
safe privatization, each thread, after
committing transaction T, waits for all
other concurrent transactions to commit,
abort, or validate before executing
application code after T; and
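
The waiting rule for safe privatization described above can be sketched as follows. This is an illustrative Python model under stated assumptions, not the scheme of Spear et al. [24] as implemented in SwissTM: it assumes each running transaction publishes the timestamp at which it was last known valid, and that a committing thread blocks until every remaining transaction has committed, aborted, or validated at or after the commit timestamp. The class name `ValidationBarrier` and its methods are hypothetical.

```python
import threading


class ValidationBarrier:
    """Tracks, per running transaction, the time at which it was last valid."""

    def __init__(self):
        self._cond = threading.Condition()
        self._valid_at = {}  # transaction id -> timestamp of last validation

    def begin(self, tx_id, now):
        with self._cond:
            self._valid_at[tx_id] = now

    def validated(self, tx_id, now):
        """Called when a running transaction revalidates its read set."""
        with self._cond:
            self._valid_at[tx_id] = now
            self._cond.notify_all()

    def finished(self, tx_id):
        """Called when a transaction commits or aborts."""
        with self._cond:
            self._valid_at.pop(tx_id, None)
            self._cond.notify_all()

    def wait_after_commit(self, commit_ts, timeout=None):
        """Block until no running transaction predates commit_ts.

        Returns True once every remaining transaction has validated at or
        after commit_ts, or False if the optional timeout expires first.
        """
        def quiescent():
            return all(ts >= commit_ts for ts in self._valid_at.values())
        with self._cond:
            return self._cond.wait_for(quiescent, timeout)


def demo():
    barrier = ValidationBarrier()
    barrier.begin("T", 3)
    barrier.begin("U", 3)            # U runs concurrently with T
    barrier.finished("T")            # T commits at timestamp 5 ...
    blocked = not barrier.wait_after_commit(5, timeout=0)
    # ... and must wait, because U may still be using data T privatized.
    helper = threading.Timer(0.01, barrier.validated, args=("U", 5))
    helper.start()
    released = barrier.wait_after_commit(5, timeout=5)
    helper.join()
    return blocked, released
```

In `demo()`, the thread that committed T is first held back because U started before the commit, then released as soon as U validates. The wait is what makes this form of privatization costly: every committing thread pays for the slowest concurrent transaction.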
We conducted our experiments using the following benchmarks:
STMBench7. STMBench7 [12] is a synthetic STM benchmark that models
realistic large-scale CAD/CAM/CASE
workloads, defining three different
workloads with different amounts
of contention: read-dominated (10%
write operations), read/write (60% write
operations), and write-dominated (90%
write operations). The main characteristics of STMBench7 are a large data
structure and long transactions compared to other STM benchmarks. In
this sense, STMBench7 is very challenging for STM implementations;
STAMP. Consisting of eight different applications representative of
a. Intel’s C/C++ STM compiler generates only x86 code, so it was not
used in our experiments on SPARC.
Table 2. Summary of STM speedup over sequential code.
                       STM-ME              STM-CE              STM-MT              STM-CT
Hardware  HW threads   min    max    avg   min    max    avg   min    max    avg   min    max    avg
SPARC     64           1.4    29.7   9.1   -      -      -     1.2    23.6   5.6   -      -      -
x86       16           0.54   9.4    3.4   0.8    9.3    3.1   0.34   5.2    1.8   0.5    5.3    1.7
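
To make the figures in Table 2 easier to read, the short Python sketch below encodes them and shows the arithmetic behind a speedup. It assumes the usual definition, sequential running time divided by STM running time, so a value below 1 means the STM version was slower than sequential code. The helper names are hypothetical. The ratio of two average speedups is only a rough indication of the cost of a feature; it is not a result reported in the article.

```python
# (min, max, avg) speedup over sequential code, copied from Table 2.
# Compiler-instrumented variants were not measured on SPARC.
TABLE_2 = {
    ("SPARC", "STM-ME"): (1.4, 29.7, 9.1),
    ("SPARC", "STM-MT"): (1.2, 23.6, 5.6),
    ("x86", "STM-ME"): (0.54, 9.4, 3.4),
    ("x86", "STM-CE"): (0.8, 9.3, 3.1),
    ("x86", "STM-MT"): (0.34, 5.2, 1.8),
    ("x86", "STM-CT"): (0.5, 5.3, 1.7),
}


def speedup(sequential_seconds, stm_seconds):
    """Speedup of an STM run relative to the sequential run."""
    return sequential_seconds / stm_seconds


def always_beats_sequential(hardware, model):
    """True if even the worst workload ran faster than sequential code."""
    minimum, _, _ = TABLE_2[(hardware, model)]
    return minimum > 1.0


def avg_ratio(hardware, baseline, variant):
    """Ratio of average speedups, a rough cost estimate for `variant`."""
    return TABLE_2[(hardware, baseline)][2] / TABLE_2[(hardware, variant)][2]


if __name__ == "__main__":
    for (hardware, model), (lo, hi, avg) in sorted(TABLE_2.items()):
        print(f"{hardware:5} {model}: min {lo}, max {hi}, avg {avg}")
    # Rough cost of transparent privatization with manual instrumentation.
    print(avg_ratio("SPARC", "STM-ME", "STM-MT"))
    print(avg_ratio("x86", "STM-ME", "STM-MT"))
```

For example, `avg_ratio("SPARC", "STM-ME", "STM-MT")` divides 9.1 by 5.6, which suggests that transparent privatization costs roughly a factor of 1.6 in average speedup on that machine.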