on other hardware, there are apparent
mismatches, most probably caused by
the lack of a well-understood programming language model when the hardware was designed. On x86, it is almost
sufficient to map synchronization loads
and stores directly to ordinary load and
store instructions. The hardware provides sufficient guarantees to ensure
that ordinary memory operations are
not visibly reordered with synchronization operations. However it fails to prevent reordering of a synchronization
store followed by a synchronization
load; thus this implementation does
not prevent the incorrect outcome for
Figure 1.
This may be addressed by translating a synchronization store to an ordinary store instruction followed by an expensive fence. The sole purpose of this
fence is to prevent reordering of the synchronization store with a subsequent
synchronization load. In practice, such
a synchronization load is unlikely to follow closely enough (Dekker’s algorithm
is not commonly used) to really constrain the hardware. But the only available fence instruction constrains all
memory reordering around it, including that involving ordinary data accesses, and thus overly constrains the hardware. A better solution would involve
distinguishing between two flavors of
loads and stores (ordinary and synchronization), roughly along the lines of
Itanium’s ld.acq and st.rel. 23 This,
however, requires a change to the instruction set architecture, usually a difficult proposition.
We suspect the current situation
makes the fence instruction more expensive than necessary, in turn motivating additional language-level complexity such as C++ low-level atomics or
lazySet() in Java.
Lessons Learned
Data-race-free provides a simple and
consistent model for threads and
shared variables. We are convinced it
is the best model today to target during
initial software development. Unfortunately, its lack of any guarantees in the
presence of data races and mismatch
with current hardware implies three
significant weaknesses:
Debugging. Accidental introduction
of a data race results in “undefined be-
havior,” which often takes the form of
surprising results later during program
execution, possibly long after the data
race has resulted in corrupted data. Al-
though the usual immediate result of
a data race is that an unexpected, and
perhaps incomplete value is read, or
that an inconsistent value is written,
we point out in prior work12 that other
results, such as wild branches, are also
possible as a result of compiler opti-
mizations that mistakenly assume the
absence of data races. Since such races
are difficult to reproduce, the root cause
of such misbehavior is often difficult to
identify, and such bugs may easily take
weeks to track down. Many tools to aid
such debugging (for example, CHESS30
and RaceFuzzer35) also assume sequen-
tial consistency, somewhat limiting
their utility.
implications for Languages
In spite of the dramatic convergence
in the debate on memory models,
the state of the art imposes a difficult
choice: a language that supposedly has
strong safety and security properties,
but no clear definition of what value a
shared-memory read may return (the
Java case), versus a language with clear
semantics, but that requires abandoning security properties promised by languages such as Java (the C++ case). Unfortunately, modern software needs to
be both parallel and secure, and requiring a choice between the two should
not be acceptable.
A pessimistic view would be to
abandon shared-memory altogether.
However, the intrinsic advantages of a
global address space are, at least anecdotally, supported by the widespread
use of threads despite the inherent
challenges. We believe the fault lies not
in the global address space paradigm,
but in the use of undisciplined or “wild
shared-memory,” permitted by current
systems.
Data-race-free was a first attempt to
formalize a shared-memory discipline
via a memory model. It proved inadequate because the responsibility for
following this discipline was left to the
programmer. Further, data-race-free
by itself is, arguably, insufficient as a
discipline for writing correct, easily
debuggable, and maintainable shared-memory code; for example, it does not
completely eliminate atomicity violations or non-deterministic behavior.
Moving forward, we believe a critical
research agenda to enable “parallelism
for the masses” is to develop and promote disciplined shared-memory models
that:
˲ ˲ are simple enough to be easily teach-able to undergraduates; that is, minimally provide sequential consistency to
programs that obey the required discipline;
˲ ˲ enable the enforcement of the discipline; that is, violations of the discipline
should not have undefined or horrendously complex semantics, but should