number of readers. Because the number of
readers must be updated atomically, acquiring the lock as a reader requires the
same bus transaction—a read-to-own—
as acquiring a mutex, and contention
on that line can hurt every bit as much.
There are still many situations where
long hold times (for example, performing I/O under a lock as reader) more than
pay for any memory contention, but one
should gather data to make sure that
the readers/writer lock is having the desired effect
on scalability. And even in those situations where a readers/writer lock is appropriate, an additional note of caution
is warranted around blocking semantics. If, for example, the lock implementation blocks new readers when a writer
is blocked (a common paradigm to avoid
writer starvation), one cannot recursively
acquire a lock as reader: if a writer blocks
between the initial acquisition as reader
and the recursive acquisition as reader,
deadlock will result when the recursive
acquisition is blocked. None of this is to
say that readers/writer locks shouldn’t
be used, just that they shouldn’t be romanticized.
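To make the deadlock hazard concrete, consider a minimal sketch using POSIX readers/writer locks. (Whether the recursive acquisition actually deadlocks is left to the implementation by POSIX; the interleaving in the comment assumes an implementation that blocks new readers behind a waiting writer.)

    #include <pthread.h>

    static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;

    void
    reader(void)
    {
            (void) pthread_rwlock_rdlock(&rw);      /* initial acquisition */

            /*
             * If a writer calls pthread_rwlock_wrlock(&rw) at this point
             * and the implementation blocks new readers to avoid writer
             * starvation, the acquisition below waits on the writer, which
             * in turn waits on us: deadlock.
             */
            (void) pthread_rwlock_rdlock(&rw);      /* recursive acquisition */

            (void) pthread_rwlock_unlock(&rw);
            (void) pthread_rwlock_unlock(&rw);
    }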
Know when to broadcast—and when
to signal. Virtually all condition variable
implementations allow threads waiting
on the variable to be awoken either via a
signal (in which case one thread sleeping on the variable is awoken) or via a
broadcast (in which case all threads
sleeping on the variable are awoken).
These constructs have subtly different
semantics: because a broadcast will
awaken all waiting threads, it should
generally be used to indicate state
change rather than resource availability.
If a condition broadcast is used when
a condition signal would have been
more appropriate, the result will be a
thundering herd: all waiting threads will
wake up, fight over the lock protecting
the condition variable and (assuming
that the first thread to acquire the lock
also consumes the available resource)
sleep once again when they discover
that the resource has been consumed.
This needless scheduling and locking activity can have a serious effect on
performance, especially in Java-based
systems, where notifyAll() (that is,
broadcast) seems to have entrenched
itself as a preferred paradigm; changing
these calls to notify() (or signal) has
been known to result in substantial performance gains.6
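The distinction is straightforward to express in code. In the following sketch using POSIX condition variables (the routine and variable names are ours, for illustration), the arrival of a single resource merits a signal, while a state change that every waiter must observe merits a broadcast:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
    static int navailable = 0;          /* count of available resources */
    static int stopping = 0;            /* terminal state change */

    void
    resource_put(void)
    {
            (void) pthread_mutex_lock(&lock);
            navailable++;
            (void) pthread_cond_signal(&cv);        /* one resource: wake one waiter */
            (void) pthread_mutex_unlock(&lock);
    }

    void
    subsystem_stop(void)
    {
            (void) pthread_mutex_lock(&lock);
            stopping = 1;
            (void) pthread_cond_broadcast(&cv);     /* state change: wake them all */
            (void) pthread_mutex_unlock(&lock);
    }

    int
    resource_get(void)
    {
            int rv = 0;

            (void) pthread_mutex_lock(&lock);

            while (navailable == 0 && !stopping)
                    (void) pthread_cond_wait(&cv, &lock);

            if (navailable > 0) {
                    navailable--;       /* consume the resource */
                    rv = 1;
            }

            (void) pthread_mutex_unlock(&lock);

            return (rv);                /* zero denotes shutdown */
    }

Had resource_put() used a broadcast, every sleeping consumer would wake to fight over the lock, and all but one would go right back to sleep; because only one thread can consume the new resource, a signal suffices.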
Design your systems to be composable.
Among the more galling claims of the
detractors of lock-based systems is the
notion that they are somehow uncomposable: “Locks and condition variables do not support modular programming,” reads one typically brazen claim,
“building large programs by gluing together smaller programs: locks make
this impossible.”8 The claim, of course,
is incorrect; for evidence one need only
point at the composition of lock-based
systems like databases and operating
systems into larger systems that remain
entirely unaware of lower-level locking.
There are two ways to make lock-based systems completely composable,
and each has its own place. First (and
most obviously), one can make locking entirely internal to the subsystem.
For example, in concurrent operating
systems, control never returns to user level with in-kernel locks held; the locks
used to implement the system itself are
entirely behind the system call interface that constitutes the interface to the
system. More generally, this model can
work whenever there is a crisp interface
between software components: as long
as control flow is never returned to the
caller with locks held, the subsystem will
remain composable.
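As a sketch of this first model, consider a trivial (and hypothetical) counter subsystem: its lock is invisible to consumers because no interface routine ever returns control with the lock held.

    #include <pthread.h>
    #include <stdint.h>

    typedef struct counter {
            pthread_mutex_t c_lock;     /* internal; callers never see it */
            uint64_t c_value;
    } counter_t;

    uint64_t
    counter_bump(counter_t *cp)
    {
            uint64_t rv;

            (void) pthread_mutex_lock(&cp->c_lock);
            rv = ++cp->c_value;
            (void) pthread_mutex_unlock(&cp->c_lock);   /* never return locked */

            return (rv);
    }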
Second (and perhaps counterintuitively), one can achieve concurrency
and composability by having no locks
whatsoever. In this case, there must be
no global subsystem state; all subsystem
state must be captured in per-instance
state, and it must be up to consumers of
the subsystem to assure that they do not
access their instance in parallel. By leaving locking up to the client of the subsystem, the subsystem itself can be used
concurrently by different subsystems
and in different contexts. A concrete example of this is the AVL tree implementation that is used extensively in the Solaris kernel. As with any balanced binary
tree, the implementation is sufficiently
complex to merit componentization,
but by not having any global state, the
implementation may be used concurrently by disjoint subsystems—the only
constraint is that manipulation of a single AVL tree instance must be serialized.
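The consumer side of this second model might look like the following sketch, in which a hypothetical tree interface stands in for the actual Solaris AVL routines: the tree code contains no locks and no global state, so each consumer serializes access to its own instance with its own lock.

    #include <pthread.h>

    typedef struct tree tree_t;                     /* opaque; contains no locks */
    extern void tree_insert(tree_t *, void *);      /* hypothetical interface */

    typedef struct consumer {
            pthread_mutex_t c_lock;     /* the consumer's lock, not the tree's */
            tree_t *c_tree;             /* this consumer's private instance */
    } consumer_t;

    void
    consumer_insert(consumer_t *cp, void *item)
    {
            /*
             * Serializing manipulation of our instance is our job; other
             * subsystems may operate on their own trees concurrently.
             */
            (void) pthread_mutex_lock(&cp->c_lock);
            tree_insert(cp->c_tree, item);
            (void) pthread_mutex_unlock(&cp->c_lock);
    }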
The Concurrency Buffet
It’s difficult to communicate over a decade of accumulated wisdom in a single
article, and space does not permit us to explore the more arcane (albeit important) techniques we have used to deliver concurrent software that is both
high-performing and reliable. Despite
our attempt to elucidate some of the
important lessons that we have learned
over the years, concurrent software remains, in a word, difficult. Some have
become fixated on this difficulty, viewing the coming of multicore computing
as cataclysmic for software. This fear is
unfounded, for it ignores the fact that
relatively few software engineers need
to actually write multithreaded code:
for most, concurrency can be achieved
by standing on the shoulders of those
subsystems that are highly parallel in
implementation. Those practitioners
implementing a database or an operating system or a virtual machine will
continue to sweat the details of writing
multithreaded code, but for everyone
else, the challenge is not how to implement those components but rather how
to best use them to deliver a scalable
system. And while lunch might not be
exactly free, it is practically all-you-can-eat. The buffet is open. Enjoy!
References
1. Barroso, L.A., Gharachorloo, K., McNamara, R., Nowatzyk, A., Qadeer, S., Sano, B., Smith, S., Stets, R., and Verghese, B. Piranha: A scalable architecture based on single-chip multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture. ACM, NY, 282–293, 2000.
2. Cantrill, B. Postmortem object type identification. In Proceedings of the 5th International Workshop on Automated Debugging, 2003.
3. Cantrill, B. Hidden in plain sight. Queue 4, 1 (Feb. 2006), 26–36.
4. Cantrill, B. A spoonful of sewage. A. Oram and G. Wilson, eds. Beautiful Code. O’Reilly, 2007.
5. DeWitt, D. and Gray, J. Parallel database systems: The future of high performance database systems. Commun. ACM 35, 6 (June 1992), 85–98.
6. McKusick, K. A conversation with Jarod Jenson. Queue 4, 1 (Feb. 2006), 16–24.
7. Oskin, M. The revolution inside the box. Commun. ACM 51, 7 (July 2008), 70–78.
8. Peyton-Jones, S. Beautiful concurrency. A. Oram and G. Wilson, eds. Beautiful Code. O’Reilly, 2007.
9. Shavit, N. Transactions are tomorrow’s loads and stores. Commun. ACM 51, 8 (Aug. 2008), 90–90.
10. Sutter, H. and Larus, J. Software and the concurrency revolution. Queue 3, 7 (Sept. 2005), 54–62.
Bryan Cantrill is a Distinguished Engineer at Sun
Microsystems, where he works on concurrent systems.
Along with colleagues Mike Shapiro and Adam
Leventhal, Bryan developed DTrace, a facility for dynamic
instrumentation of production systems that was directly
inspired by Bryan’s frustration in understanding the
behavior of concurrent systems.
Jeff Bonwick is a Fellow at Sun Microsystems, where
he works on concurrent systems. Best known for
inventing and leading the development of Sun’s Zettabyte
File System (ZFS), Bonwick has also written (or rather,
rewritten) many of the most parallel subsystems in the
Solaris kernel, including the synchronization primitives,
the kernel memory allocator, and the thread-blocking mechanism.