as used in P5. The second issue is one of weakness: the new
P9 says nothing about observations of two stores by those
two processors themselves (or by one of those processors
and one other). The following examples (which we call n5
and n4b) illustrate potentially surprising behavior that arguably violates coherence. Their final states are not allowed in
x86-CC, are not allowed in a pure store-buffer implementation or in x86-TSO, and we have not observed them on actual
processors. However, the principles stated in revisions 29–34
of the Intel SDM appear, presumably unintentionally, to
allow them. The AMD3.14 vol. 2, § 7. 2 text taken alone would
allow them, but the implied coherence from elsewhere in
the AMD manual would forbid them. These points illustrate
n5
Proc 0
Proc 1
Mov [x]¬ 1
Mov eAX¬[x]
Mov [x]¬ 2
Mov ebX¬[x]
Forbidden Final State: Proc 0:eAX= 2 ∧ Proc 1:ebX= 1
n4b
Proc 0 Proc 1
Mov eAX¬[x]
Mov [x]¬ 1
Mov eCX¬[x]
Mov [x]¬ 2
Forbidden Final State: Proc 0:eAX= 2 ∧ Proc 1:eCX= 1
once again the difficulty of writing unambiguous and correct
loose specifications in informal prose.
2. 4. amD3.15 (nov. 2009)
In November 2009, AMD produced a new revision, 3. 15, of
their manuals. The main difference in the memory-model
specification is that IRIW is now explicitly forbidden.
Summarizing the key litmus-test differences, we have the
IWP/x86-CC
3. 14
29–34 3. 15
IRIW
n6
n5/n4b
√* / √
×*/×
×*/×
√
√*
×*
×
√*
√*
×
√*
×*
Actual
processors
Not observed
Observed
Not observed
following, where √ and × entries are explicit in the specifica-
tion text and starred entries indicate possible deductions,
some of which may not have been intended.
There are also many non-differences: tests for which the
behaviors coincide in all three cases. We return to these,
and go through the other tests from the Intel and AMD
documentation, in Section 3. 2.
3. ouR x86-tso PRoGRammeR’s moDeL
Given these problems with the informal specifications, we
cannot produce a useful rigorous model by formalizing the
“principles” they contain, as we attempted with x86-CC. 31
Instead, we have to build a reasonable model that is consistent with the given litmus tests, with observed processor
behavior, and with what we know of the needs of programmers, the vendors’ intentions, and the folklore in the area.
We emphasize that our aim is a programmer’s model, of
the allowable behaviors of x86 processors as observed by
assembly programs, not of the internal structure of processor implementations, or of what could be observed on
hardware interfaces. We present the model in an abstract-machine style to make it accessible, but are concerned
only with its external behavior; its buffers and locks are
highly abstracted from the microarchitecture of processor
implementations.
The fact that store buffering is observable, as in the SB
and n6 examples, but IRIW is not (and IRIW is explicitly
forbidden in the SDM revs. 29–34 and AMD3.15), together
with additional tests that prohibit many other reorderings,
strongly suggests that, apart from store buffering, all processors share the same view of memory. Moreover, different processors or hardware threads do not observably share
store buffers. This is in sharp contrast to x86-CC, where each
processor has a separate view order of its memory accesses
and other processors’ writes. To the best of our knowledge,
for the usual write-back memory, no other aspects of the
microarchitecture (the out-of-order execution, cache hierarchies and protocols, interconnect topology, and so on)
are observable to the programmer, except in so far as they
affect performance.
This is broadly similar to the SPARC Total Store Ordering
(TSO) memory model, 2, 32 which is essentially an axiomatic
description of the behavior of store-buffer multiprocessors. Accordingly, we have designed a TSO-like model for
x86, called x86-TSO. 27 It is defined mathematically in two
styles: an abstract machine with explicit store buffers and
an axiomatic model that defines valid executions in terms
of memory orders; they are formalized in HOL420 and are
proved equivalent. The abstract machine conveys the
programmer-level operational intuition behind x86-TSO;
we describe it informally in the next subsection. The axiomatic model supports constraint-based reasoning about
example programs (e.g. by our memevents tool in Section
3. 3); it is similar to that of SPARCv8, 2, App. K but we also deal
with x86 CISC instructions with multiple memory accesses
and with x86 barriers and atomic (or LOCK’d) instructions.
The x86 supports a range of atomic instructions: one can
add a loCK prefix to many read–modify–write instructions
(Add, inC, etc.), and the XChG instruction is implicitly
LOCK’d. There are three main memory barriers: MFenCe,
sFenCe and lFenCe.
3. 1. the abstract machine
Our programmer’s model of a multiprocessor x86 system
is illustrated in Figure 1. At the top of the figure are a number of hardware threads, each corresponding to a single
in-order stream of instruction execution. (In this programmer’s model there is no need to consider physical processors explicitly; it is the hardware threads that correspond to
the Proc n columns in the tests we give.) They interact with a
storage subsystem, drawn as the dotted box.
The state of the storage subsystem comprises a shared
memory that maps addresses to values, a global lock to
indicate when a particular hardware thread has exclusive access to memory, and one store buffer per hardware