CONCURRENC Y
Consider per-CPU locking. Per-CPU locking (that is,
acquiring a lock based on the current CPU identifier) can
be a convenient technique for diffracting contention,
as a per-CPU lock is not likely to be contended (a CPU
can run only one thread at a time). If one has short hold
times and operating modes that have different coherence
requirements, one can have threads acquire a per-CPU
lock in the common (noncoherent) case, and then force
the uncommon case to grab all the per-CPU locks to
construct coherent state. Consider this concrete (if trivial)
example: if one were implementing a global counter that
is frequently updated but infrequently read, one could
implement a per-CPU counter protected by its own lock.
Updates to the counter would update only the per-CPU
copy, and in the uncommon case in which one wanted to
read the counter, all per-CPU locks could be acquired and
their corresponding values summed.
Two notes on this technique: first, it should be
employed only when the data indicates that it’s necessary, as it clearly introduces substantial complexity into
the implementation; second, be sure to have a single
order for acquiring all locks in the cold path: if one case
acquires the per-CPU locks from lowest to highest and
another acquires them from highest to lowest, deadlock
will (naturally) result.
Know when to broadcast—and when to signal.
Virtually all condition variable implementations allow threads
waiting on the variable to be awakened either via a signal
(in which case one thread sleeping on the variable is
awakened) or via a broadcast (in which case all threads
sleeping on the variable are awakened). These constructs
have subtly different semantics: because a broadcast will
awaken all waiting threads, it should generally be used
to indicate state change rather than resource availability.
If a condition broadcast is used when a condition signal
would have been more appropriate, the result will be a
thundering herd: all waiting threads will wake up, fight
over the lock protecting the condition variable, and
(assuming that the first thread to acquire the lock also
consumes the available resource) sleep once again when
they discover that the resource has been consumed.
This needless scheduling and locking activity can have
a serious effect on performance, especially in Java-based
systems, where notifyAll() (i.e., broadcast) seems to have
entrenched itself as a preferred paradigm; changing these
calls to notify() (i.e., signal) has been known to result in
substantial performance gains. 7
Learn to debug postmortem. Among some Cassandras of concurrency, a deadlock seems to be a particular
bogeyman of sorts, having become the embodiment of
all that is difficult in lock-based multithreaded programming. This fear is somewhat peculiar, because deadlocks
are actually among the simplest pathologies in software:
because (by definition) the threads involved in a deadlock
cease to make forward progress, they do the implementer
the service of effectively freezing the system with all state
intact. To debug a deadlock, one need have only a list of
threads, their corresponding stack backtraces, and some
knowledge of the system. This information is contained
in a snapshot of state so essential to software development that its very name reflects its origins at the dawn of
computing: it is a core dump.
Debugging from a core dump—postmortem debugging—is an essential skill for those who implement
parallel systems: problems in highly parallel systems are
not necessarily reproducible, and a single core dump is
often one’s only chance to debug them. Most debuggers
support postmortem debugging, and many allow user-defined extensions. 8 We encourage practitioners to understand their debugger’s support for postmortem debugging
(especially of parallel programs) and to develop extensions specific to debugging their systems.
Design your systems to be composable. Among the
more galling claims of the detractors of lock-based systems is the notion that they are somehow uncomposable:
“Locks and condition variables do not support modular
programming,” reads one typically brazen claim, “
building large programs by gluing together smaller programs[:]
locks make this impossible.” 9 The claim, of course,
is incorrect. For evidence one need only point at the
composition of lock-based systems such as databases and
operating systems into larger systems that remain entirely
unaware of lower-level locking.
There are two ways to make lock-based systems completely composable, and each has its own place. First (and
most obviously), one can make locking entirely internal
to the subsystem. For example, in concurrent operating
systems, control never returns to user level with in-kernel
locks held; the locks used to implement the system itself
are entirely behind the system call interface that constitutes the interface to the system. More generally, this
model can work whenever a crisp interface exists between