the other thread must acquire the lock
before its access. Thus, it is also possible to define data races as conflicting accesses not ordered by synchronization,
as is done in Java. These definitions are
essentially equivalent. 1, 12
A program that does not allow a data
race is said to be data-race-free. The data-race-free model guarantees sequential consistency only for data-race-free
programs. 1, 3 For programs that allow
data races, the model does not provide
any guarantees.
The restriction on data races is
not onerous. In addition to locks for
avoiding data races, modern programming languages generally also provide
a mechanism, such as Java’s volatile variables, for declaring that certain variables or fields are to be used
for synchronization between threads.
Conflicting accesses to such variables
may occur simultaneously—since they
are explicitly identified as
synchronization (vs. ordinary), they do not create a
data race.
To write Figure 1 correctly under data-race-free, we need simply identify the
shared variables X and Y as synchronization variables. This would require the
implementation to do whatever is necessary to ensure sequential consistency,
in spite of those simultaneous accesses.
It would also obligate the implementation to ensure that these synchronization accesses are performed indivisibly;
if a 32-bit integer is used for synchronization purposes, it should not be visibly
accessed as two 16-bit halves.
This “sequential consistency for da-
ta-race-free programs” approach allevi-
ates the problems discussed with pure
sequential consistency. Most important
hardware and compiler optimizations
continue to be allowed for ordinary ac-
cesses—care must be taken primarily
at the explicitly identified (infrequent)
synchronization accesses since these
are the only ones through which such
optimizations and granularity consid-
erations affect program outcome. Fur-
ther, synchronization-free sections of
the code appear to execute atomically
and the requirement to explicitly iden-
tify concurrent accesses makes it easier
for humans and compilers to under-
stand the code. (For more detail, see our
technical report. 11)
industry Practice and evolution
Hardware memory models. Most hardware supports relaxed models that are
weaker than sequential consistency.
These models take an implementation-or performance-centric view, where the
desirable hardware optimizations drive
the model specification. 1, 2, 20 Typical
driving optimizations relax the program
figure 3. hardware may not execute atomic or indivisible writes.
Assume a fence imposes program order. Assume core 3’s and core 4’s caches have X and Y.
The two writes generate invalidations for these caches. These could reach the caches in a different
order, giving the result shown and a deduction that X’s update occurs both before and after Y’s.
initially X = y = 0
core 1
X = 1;
core 3 core 4
r1 = X; r3 = Y;
fence; fence;
r2 = Y; r4 = X;
Can r1 = 1, r2 = 0, r3 = 1, r4 = 0, violating write atomicity?
core 2
Y = 1;
order requirement of sequential consistency. For example, Sparc’s TSO guarantees that a thread’s memory accesses
will become visible to other threads in
program order, except for the case of a
write followed by a read. Such models
additionally provide fence instructions
to enable programmers to explicitly impose orderings that are otherwise not
guaranteed; for example, TSO programmers may insert a fence between
a thread’s write and read to ensure the
execution preserves that order.
Such a program-orderings + fences
style of specification is simple, but
many subtleties make it inadequate. 1, 2
First, this style implies that a write is
an atomic or indivisible operation that
becomes visible to all threads at once.
As Figure 3 illustrates, however, hardware may make writes visible to different threads at different times through
write buffers and shared caches. Incorporating such optimizations increases
the complexity of the memory model
specification. Thus, the full TSO specification, which incorporates one of the
simplest atomicity optimizations, is
much more involved than the simple
description here. PowerPC implements
more aggressive forms of the optimization, with a specification that is complex and difficult to interpret even for
experts. The x86 documentation from
both AMD and Intel was ambiguous on
this issue; recent updates now clarify
the intent, but remain informal.
Second, in well-written software, a
thread usually relies on synchronization
interactions to reason about the ordering or visibility of memory accesses on
other threads. Thus, it is usually overkill
to require that two program-ordered
accesses always become visible to all
threads in the same order or a write appears atomic to all threads regardless of
the synchronization among the threads.
Instead, it is sufficient to preserve ordering and atomicity only among mutually
synchronizing threads. Some hardware
implementations attempt to exploit
this insight, albeit often through ad hoc
techniques, thereby further complicat-ing the memory model.
Third, modern processors perform
various forms of speculation (for example, on branches and addresses)
which can result in subtle and complex
interactions with data and control dependences, as illustrated in Figure 4. 1, 29