figure 1. x89-tso block diagram.
H/W thread
H/W thread
Write buffer
Write buffer
Lock
Shared memory
thread.
The behavior of the storage subsystem is described in
more detail below, but the main points are:
˲ ˲ The store buffers are FIFO and a reading thread must
read its most recent buffered write, if there is one, to that
address; otherwise reads are satisfied from shared memory.
˲ ˲ An MFENCE instruction flushes the store buffer of that
thread.
˲ ˲ To execute a LOCK’d instruction, a thread must first obtain the global lock. At the end of the instruction, it flushes
its store buffer and relinquishes the lock. While the lock is
held by one thread, no other thread can read.
˲ ˲ A buffered write from a thread can propagate to the
shared memory at any time except when some other thread
holds the lock.
More precisely, the possible interactions between the
threads and the storage subsystem are described by the
following events:
˲ ˲ Wp [a]=u, for a write of value u to address a by thread p
˲ ˲ rp [a]=u, for a read of u from a by thread p
˲ ˲ Fp, for an MFENCE memory barrier by thread p
˲ ˲ lp, at the start of a LOCK’d instruction by thread p
˲ ˲ up, at the end of a LOCK’d instruction by thread p
˲ ˲ tp, for an internal action of the storage subsystem, prop-
agating a write from p’s store buffer to the shared memory
For example, suppose a particular hardware thread p has
come to the instruction inC [ 56] (which adds 1 to the value
at address 56), and p’s store buffer contains a single write to
56, of value 0. In one execution we might see read and write
events, rp[ 56]=0 and Wp[ 56]= 1, followed by two tp events as
the two writes propagate to shared memory. Another execu-
tion might start with the write of 0 propagating to shared
memory, where it could be overwritten by another thread.
Executions of loCK;inC [ 56] would be similar but bracketed
by lp and up events.
1. rp[a]=u: p can read u from memory at address a if p is
not blocked, there are no writes to a in p’s store buffer,
and the memory does contain u at a.
2. rp[a]=u: p can read u from its store buffer for address a
if p is not blocked and has u as the newest write to a in
its buffer.
3. Wp[a]=u: p can write u to its store buffer for address a at
any time.
4. tp: if p is not blocked, it can silently dequeue the oldest
write from its store buffer and place the value in memory at the given address, without coordinating with any
hardware thread.
5. Fp:if p’sstorebufferisempty, itcanexecutean MFenCe
(note that if a hardware thread encounters an MFenCe
instruction when its store buffer is not empty, it can
take one or more tp steps to empty the buffer and
proceed, and similarly in 7 below).
6. lp: if the lock is not held, it can begin a LOCK’d
instruction.
7. up: if p holds the lock, and its store buffer is empty, it
can end a LOCK’d instruction.
Technically, the formal versions of these rules27 define a
labeled transition system (with the events as labels) for the
storage subsystem, and we define the behavior of the whole
system as a parallel composition of that and transition systems for each thread, synchronizing on the non-t labels as
in CCS. 25
Additionally, we tentatively impose a progress condition, that each memory write is eventually propagated from
the relevant store buffer to the shared memory. This is not
stated in the documentation and is hard to test. We are
assured that it holds at least for AMD processors.
For write-back cacheable memory, and the fragment
of the instruction set that we consider, we treat lFenCe
and sFenCe semantically as no-ops. This follows the Intel
and AMD documentation, both of which imply that these
fences do not order store/load pairs which are the only
reorderings allowed in x86-TSO. Note, though, that elsewhere it is stated that the Intel sFenCe flushes the store
buffer. 5, vol.3A, § 11. 10
3. 2. Litmus tests
For our introductory SB example from Section 1, x86-TSO
permits the given behavior for the same reasons as set forth
there. For each of the examples in Section 2 (IRIW, n6, and
n5/n4b), x86-TSO permits the given final state if and only
if it is observable in our testing of actual processors, i.e.,
for IRIW it is forbidden (in contrast to IWP and AMD3.14),
for n6 it is allowed (in contrast to IWP), and for n5/n4b it
is forbidden (in contrast to the Intel SDM rev. 29–34). For
all the other relevant tests from the current Intel and AMD
manuals the stated behavior agrees with x86-TSO. We now
go through Examples 8–1 to 8–10 from rev. 34 of the Intel
SDM, and the three other tests from AMD3.15, and explain
the x86-TSO behavior in each case.