quest is potentially in a transaction’s
read or write set by examining the transaction’s signatures. If so, the memory
reference is a potential conflict and the
STM can either abort a transaction or
turn to software validation.
Both HASTM and SigTM accelerate
read-set tracking and validation for
STM systems. Nevertheless, the two architectures differ. SigTM encodes read
and write sets whose size exceeds
the capacity of the private caches, so capacity
and conflict misses do not force software validation, as they do with HASTM. On
the other hand, SigTM uses probabilistic signatures, which never miss a
true conflict, but may produce false
conflicts due to address aliasing in a
Bloom filter. Moreover, the hardware
signatures are relatively compact
and easy to manipulate, so they can
be saved and restored across context
switches and other interruptions. In
HASTM, the mark bits may be lost if
a processor is used to run other tasks.
SigTM signatures, by contrast,
track physical addresses, so their
content must be discarded when the
virtual page mapping is modified.
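The aliasing behavior described above can be illustrated with a small Bloom-filter sketch (hypothetical illustrative code, not the SigTM hardware): every inserted address always tests positive, but distinct addresses may alias to the same bits and produce a false conflict.

```python
import hashlib

class Signature:
    """Toy Bloom-filter signature (illustrative sketch, not SigTM hardware)."""

    def __init__(self, bits=64, hashes=2):
        self.bits = bits        # size of the bit vector
        self.hashes = hashes    # number of hash functions
        self.vector = 0         # compact state: a single integer

    def _positions(self, addr):
        # Derive `hashes` bit positions from the address.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{addr}:{i}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def insert(self, addr):
        for pos in self._positions(addr):
            self.vector |= 1 << pos

    def might_contain(self, addr):
        # No false negatives: every inserted address tests positive.
        # False positives occur when a different address aliases to set bits.
        return all((self.vector >> pos) & 1 for pos in self._positions(addr))
```

Because the entire state is one small bit vector, saving and restoring a signature across a context switch is cheap, which is the property the text attributes to hardware signatures.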
Hardware acceleration for read set
management has been shown to improve the performance of lock-based,
direct-update, and deferred-update
STM systems by a factor of two.22,26
Additional improvements are possible with hardware mechanisms that
target version management for the objects written by the STM.31
Nevertheless, since most programs read significantly more objects than they write,
these performance improvements are
small.
Hardware Transactional Memory.
The interest in a full hardware implementation of TM (HTM) dates back to the
two initial papers on TM, by Knight16
and by Herlihy and Moss,13 respectively.
HTM systems require no software instrumentation of memory references
within transaction code. The hardware
manages data versions and tracks
conflicts transparently as software
performs ordinary read and write accesses. Eliminating instrumentation
reduces program overhead and eliminates the need to specialize function
bodies so they can be called within
and outside of a transaction.
HTM systems rely on a computer’s
cache hierarchy and the cache coherence
protocol to implement versioning and conflict detection. Caches
observe all reads and writes issued by
a processor, can buffer a significant
amount of data, and can be searched
efficiently because of their associative
organization. All HTMs modify the
first-level caches, but the approach
extends to higher-level caches, both
private and shared. To illustrate the
organization and operation of HTM
systems, we will describe the TCC architecture in some detail and briefly
mention the key attributes of alternative designs.
The Transactional Coherence and
Consistency (TCC) system is a deferred-
update HTM that performs conflict detection when a transaction attempts
to commit.21 To track the read set and
write set, each cache block is annotated with R and W tracking bits, which
are set on the first read or write access
to the block from within the transaction. Cache blocks in the write set act
as a write buffer and do not propagate
the memory updates until the transaction commits.
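A minimal Python sketch of this bookkeeping (class and field names are hypothetical, and a real cache tracks blocks rather than word addresses) associates R and W bits and a buffered value with each cache block:

```python
class CacheBlock:
    """One cache line with TCC-style transactional annotations (sketch)."""
    def __init__(self, data=0):
        self.data = data        # globally visible memory value
        self.buffered = None    # deferred update, hidden until commit
        self.r = False          # R bit: block belongs to the read set
        self.w = False          # W bit: block belongs to the write set

class TxCache:
    """Toy model of read/write-set tracking in a transactional cache."""
    def __init__(self):
        self.blocks = {}        # block address -> CacheBlock

    def block(self, addr):
        return self.blocks.setdefault(addr, CacheBlock())

    def tx_read(self, addr):
        b = self.block(addr)
        b.r = True              # first read marks the read set
        # Reads observe the transaction's own buffered writes.
        return b.buffered if b.w else b.data

    def tx_write(self, addr, value):
        b = self.block(addr)
        b.w = True              # first write marks the write set
        b.buffered = value      # buffer the update; memory is untouched
```

The key property the sketch captures is that a transaction sees its own writes while the globally visible value stays unchanged until commit.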
TCC commits transactions using
a two-phase protocol. First, the hardware acquires exclusive access to all
cache blocks in the write set using coherence messages. If this step is successful, the transaction is considered
validated. Next, the hardware instantaneously resets all W bits in the cache,
which atomically commits the updates
by this transaction. The new versions
of the data are now globally accessible
by all processors through the normal
coherence protocol of a multicore
chip. If validation fails because another processor is also trying to commit a
conflicting transaction, the hardware
reverts to a software handler, which
may abort the transaction or attempt
to commit it under a contention management policy. When a transaction
commits or aborts, all tracking bits are
simultaneously cleared using a gang
reset operation. Absent conflicts, multiple transactions may be committing
in parallel.
Conflict detection occurs as other
processors receive the coherence messages from the committing transaction. Hardware looks up the received
block address in the local caches. If the
block is in a cache and has its R or W
bit set, there is a read-write or a write-write conflict between the committing
and the local transaction. The hardware signals a software handler, which
aborts the local transaction and potentially retries it after a backoff period.
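Commit and conflict detection together can be sketched with per-processor read and write sets (a deliberate simplification; the real hardware operates on cache blocks and coherence messages, and the names here are hypothetical):

```python
class TxState:
    """Per-processor transactional state as address sets (sketch)."""
    def __init__(self):
        self.r = set()          # addresses with the R bit set
        self.w = set()          # addresses with the W bit set
        self.aborted = False

    def read(self, addr):
        self.r.add(addr)

    def write(self, addr):
        self.w.add(addr)

    def commit(self, others):
        # Phase 1 (validate): acquire exclusive access to the write set;
        # modeled here as always succeeding for the committing processor.
        # Phase 2 (commit): remote caches snoop the committed addresses.
        for other in others:
            if not other.aborted and self.w & (other.r | other.w):
                # Read-write or write-write conflict: a software handler
                # aborts the remote transaction.
                other.aborted = True
                other.r.clear(); other.w.clear()   # gang reset on abort
        self.r.clear(); self.w.clear()             # gang reset on commit
```

In this model, a committer whose write set overlaps a remote read or write set aborts the remote transaction, while non-overlapping transactions commit undisturbed, mirroring the parallel-commit case above.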
Similar hardware techniques can
support HTM systems with direct
memory updates or early detection
of conflicts.23 For direct updates, the
hardware transparently logs the original value in a memory block before its
first modification by a transaction. If
the transaction aborts, the log is used
to undo any memory updates. For
early conflict detection, the hardware
acquires exclusive access to the cache
block on the first write and maintains
it until the transaction commits. Under light contention, most HTM designs perform similarly. Under heavier
contention, deferred updates and late
conflict detection lead to fewer pathological scenarios that can be easily
handled with a backoff policy.3
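The direct-update variant can be sketched as a simple undo log (hypothetical names; real hardware logs original values at cache-block granularity):

```python
class UndoLogTx:
    """Sketch of direct-update versioning with an undo log."""
    def __init__(self, memory):
        self.memory = memory    # shared store: address -> value
        self.log = {}           # address -> original value

    def write(self, addr, value):
        if addr not in self.log:
            # Log the original value before the first modification only.
            self.log[addr] = self.memory.get(addr, 0)
        self.memory[addr] = value   # update memory in place

    def abort(self):
        # Replay the log to undo every in-place update.
        for addr, old in self.log.items():
            self.memory[addr] = old
        self.log.clear()

    def commit(self):
        self.log.clear()        # updates are already in memory
```

The trade-off the text describes is visible here: commit is cheap because updates are already in place, but abort must walk the log to restore every modified location.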
The performance of an HTM thread
is typically within 2%–10% of the performance of non-transactional code.
An HTM system can outperform a lock-based STM by a factor of four and the
corresponding hardware-accelerated
STM by a factor of two.22 Nevertheless,
HTM systems face several system challenges that are not an issue for STM
implementations. The caches used to
track the read set, write set, and data
versions have finite capacity and may
overflow on a long transaction. The
transactional state in caches is large
and is difficult to save and restore at interrupts, context switching, and paging