unsupported by any examples. It is hard to see precisely
what this prose means, especially without additional
knowledge or assumptions about the microarchitecture
of particular implementations. The uncertainty about
x86 behavior that at least some systems programmers
had about earlier IA- 32 processors can be gauged from an
extensive discussion about the correctness of a proposed
optimization to a Linux spinlock implementation. 1 The
discussion is largely in microarchitectural terms, not just
in terms of the specified architecture, and seems to have
been resolved only with input from Intel staff. We return to
this optimization in Section 4, where we can explain why it
is sound with respect to x86-TSO.
n6
and following some testing, IRIW is not observable in
practice, even without MFenCes. It appears that some
JVM implementations depend on this fact, and would not
be correct if one assumed only the IWP/AMD3.14/x86-CC
architecture. 15
Second, more seriously, x86-CC and IWP are unsound with
respect to current processors. The following example, n6, due
to Paul Loewenstein [personal communication, Nov. 2008]
shows a behavior that is observable (e.g., on an Intel Core 2
Proc 0
Proc 1
2. 2. i WP/amD3.14/x86-cc
In August 2007, an Intel White Paper4 (IWP) gave a somewhat
more precise model, with 8 informal-prose principles P1–P8
supported by 10 examples (known as litmus tests). This was
incorporated, essentially unchanged, into later revisions of
the Intel SDM (including rev. 26–28), and AMD gave similar, though not identical, prose and tests in rev. 3. 14 of their
manual3, vol. 2, § 7. 2 (AMD3.14). These are essentially causal-consistency models, 9 and they allow different processors to
see writes to independent locations in different orders, as
in the IRIW litmus test11 below. a AMD3.14 allows this explicitly, while IWP allows it implicitly, as IRIW is not ruled out by
the stated principles. Microarchitecturally, IRIW can arise
Mov [x]¬ 1
Mov eAX¬[x]
Mov ebX¬[y]
Allowed Final State: Proc 0:eAX= 1 ∧ Proc 0:ebX=0 ∧ [x]= 1
Mov [y]¬ 2
Mov [x]¬ 2
IRIW
Proc 0
Mov [x]¬ 1
Proc 2
Mov eAX¬[x]
Mov ebX¬[y]
Forbidden Final State: Proc 2:eAX= 1 ∧ Proc 2:ebX=0
∧ Proc 3:eCX= 1 ∧ Proc 3:edX=0
Proc 1
Mov [y]¬ 1
Proc 3
Mov eCX¬[y]
Mov edX¬[x]
duo) but that is disallowed by x86-CC and by any interpretation we can make of IWP principles P1, 2, 4 and 6. 27, A. 5
To see why this could be allowed by multiprocessors with
FIFO store buffers, suppose that first the Proc 1 write of
[y]= 2 is buffered, then Proc 0 buffers its write of [x]= 1, reads
[x]= 1 from its own store buffer, and reads [y]=0 from main
memory, then Proc 1 buffers its [x]= 2 write and flushes its
buffered [y]= 2 and [x]= 2 writes to memory, then finally Proc
0 flushes its [x]= 1 write to memory.
The AMD3.14 manual is not expressed in terms of a
clearly identified set of principles, and the main text (vol. 2,
§ 7. 2) leaves the ordering of stores to a single location unconstrained, though elsewhere the manual describes a microarchitecture with store buffers and cache protocols that
strongly implies that memory is coherent. In the absence of
an analogue of the IWP P6, the reasoning prohibiting n6 does
not carry over.
from store buffers that are shared between some but not all
processors.
However, both require that, in some sense, causality is
respected, as in the IWP principle “P5. In a multiprocessor
system, memory ordering obeys causality (memory ordering
respects transitive visibility).”
We used these informal specifications as the basis for a
formal model, x86-CC, 31 for which a key issue was giving a
reasonable interpretation to this “causality,” which is not
defined in IWP or AMD3.14. Apart from that, the informal
specifications were reasonably unambiguous—but they
turned out to have two serious flaws.
First, they are arguably rather weak for programmers. In
particular, they admit the IRIW behavior above but, under
reasonable assumptions on the strongest x86 memory
barrier, MFenCe, adding MFenCes would not suffice to
recover sequential consistency (instead, one would have to
make liberal use of x86 LOCK’d instructions). 31, § 2. 12 Here,
the specifications seem to be much looser than the behavior
of implemented processors: to the best of our knowledge,
a We use Intel assembly syntax throughout except that we use an arrow ← to
indicate the direction of data flow, so Mov [x]← 1 is a write of 1 to address x
and Mov eAX←[x] is a read from address x into register eAX. Initial states
are all 0 unless otherwise specified.
2. 3. intel sDm rev. 29–34 (nov. 2008–mar. 2010)
The most recent substantial change to the Intel memory-model specification, at the time of writing, was in revision
29 of the Intel SDM (revisions 29–34 are essentially identical
except for the LFENCE text). This is in a similar informal-prose style to previous versions, again supported by litmus
tests, but is significantly different to IWP/x86-CC/AMD3.14.
First, the IRIW final state above is forbidden, 5, Example 8–7, vol. 3A
and the previous coherence condition: “P6. In a multiprocessor system, stores to the same location have a total order”
has been replaced by: “Any two stores are seen in a consistent
order by processors other than those performing the stores” (we
label this P9).
Second, the memory barrier instructions are now inclu-
ded. It is stated that reads and writes cannot pass MFENCE
instructions, together with more refined properties for
SFENCE and LFENCE
Third, same-processor writes are now explicitly ordered:
“Writes by a single processor are observed in the same order by
all processors” (P10) (we regarded this as implicit in the IWP
“P2. Stores are not reordered with other stores”).
This revision appears to deal with the unsoundness, admitting the n6 behavior above, but, unfortunately, it is still problematic. The first issue is, again, how to interpret “causality”