ber of readers. Because the number of readers must be updated atomically, acquiring the lock as a reader requires the same bus transaction—a read-to-own— as acquiring a mutex, and contention on that line can hurt every bit as much. There are still many situations where long hold times (for example, performing I/O under a lock as reader) more than pay for any memory contention, but one should be sure to gather data to make sure that it is having the desired effect on scalability. And even in those situations where a readers/writer lock is appropriate, an additional note of caution is warranted around blocking semantics. If, for example, the lock implementation blocks new readers when a writer is blocker (a common paradigm to avoid writer starvation), one cannot recursively acquire a lock as reader: if a writer blocks between the inital acquistion as reader and the recursive acquisition as reader, deadlock will result when the recursive acquisition is blocked. None of this is to say that readers/writer locks shouldn’t be used, just that they shouldn’t be romanticized.

Know when to broadcast—and when to signal. Virtually all condition variable implementations allow threads waiting on the variable to be awoken either via a signal (in which case one thread sleeping on the variable is awoken) or via a broadcast (in which case all threads sleeping on the variable are awoken). These constructs have subtly different semantics: because a broadcast will awaken all waiting threads, it should generally be used to indicate state change rather than resource availability. If a condition broadcast is used when a condition signal would have been more appropriate, the result will be a thundering herd: all waiting threads will wake up, fight over the lock protecting the condition variable and (assuming that the first thread to acquire the lock also consumes the available resource) sleep once again when they discover that the resource has been consumed. This needless scheduling and locking activity can have a serious effect on performance, especially in Java-based systems, where notifyAll() (that is, broadcast) seems to have entrenched itself as a preferred paradigm; changing these calls to notify() (or signal) has been known to result in substantial performance gains. 6

Design your systems to be composable. Among the more galling claims of the detractors of lock-based systems is the notion that they are somehow uncomposable: “Locks and condition variables do not support modular programming,” reads one typically brazen claim, “building large programs by gluing together smaller programs: locks make this impossible.” 8 The claim, of course, is incorrect; for evidence one need only point at the composition of lock-based systems like databases and operating systems into larger systems that remain entirely unaware of lower-level locking.

There are two ways to make lock-based systems completely composable, and each has its own place. First (and most obviously) one can make locking entirely internal to the subsystem. For example, in concurrent operating systems, control never returns to user-level with in-kernel locks held; the locks used to implement the system itself are entirely behind the system call interface that constitutes the interface to the system. More generally, this model can work whenever there is a crisp interface between software components: as long as control flow is never returned to the caller with locks held, the subsystem will remain composable.

Secondly, (and perhaps counterin-tuitively), one can achieve concurrency and composability by having no locks whatsoever. In this case, there must be no global subsystem state; all subsystem state must be captured in per-instance state, and it must be up to consumers of the subsystem to assure that they do not access their instance in parallel. By leaving locking up to the client of the subsystem, the subsystem itself can be used concurrently by different subsystems and in different contexts. A concrete example of this is the AVL tree implementation that is used extensively in the Solaris kernel. As with any balanced binary tree, the implementation is sufficiently complex to merit componentization, but by not having any global state, the implementation may be used concurrently by disjoint subsystems—the only constraint is that manipulation of a single AVL tree instance must be serialized.

 

the concurrency Buffet It’s difficult to communicate over a decade of accumulated wisdom in a single article, and space does not permit us ex-

ploration of the more arcane (albeit important) techniques we have used to deliver concurrent software that are both high-performing and reliable. Despite our attempt to elucidate some of the important lessons that we have learned over the years, concurrent software remains, in a word, difficult. Some have become fixated on this difficulty, viewing the coming of multicore computing as cataclysmic for software. This fear is unfounded, for it ignores the fact that relatively few software engineers need to actually write multithreaded code: for most, concurrency can be achieved by standing on the shoulders of those subsystems that are highly parallel in implementation. Those practitioners implementing a database or an operating system or a virtual machine will continue to sweat the details of writing multithreaded code, but for everyone else, the challenge is not how to implement those components but rather how to best use them to deliver a scalable system. And while lunch might not be exactly free, it is practically all-you-can-eat. The buffet is open. Enjoy!

 

References

1. barroso, l. a., Gharachorloo, K., mcnamara, r., nowatzyk, a., Qadeer, s., sano, b., smith, s., stets, r., and verghese, b. piranha: a scalable architecture based on single-chip multiprocessing. in Proceedings of the 27th Annual international Symposium on Computer Architecture. aCm, ny, 282–293, 2000.

2. Cantrill, b. postmortem object type identification. in Proceedings of the 5th International Workshop on Automated Debugging, 2003.

3. Cantrill, b. hidden in plain sight. Queue 4, 1 (feb. 2006), 26–36.

4. Cantrill, b. a spoonful of sewage. a. oram and G. Wilson, eds. Beautiful Code. o’reilly, 2007.

5. de Witt, d. and Gray, J. parallel database systems: the future of high performance database systems. Commun. ACM 35, 6 (June 1992), 85–98.

6. mcKusick, K. a conversation with Jarod Jenson. Queue 4, 1 (feb. 2006), 16–24.

7. oskin, m. the revolution inside the box. Commun. ACM 51, 7 (July 2008), 70–78.

8. peyton-Jones, s. beautiful concurrency. a. oram and G. Wilson, eds. Beautiful Code. o’reilly, 2007.

9. shavit, n. transactions are tomorrow’s loads and stores. Commun. ACM 51, 8 (aug. 2008), 90–90.

10. sutter, h. and larus, J. software and the concurrency revolution. Queue 3, 7 (sept. 2005), 54–62.

 

Bryan Cantrill is a distinguished engineer at sun microsystems, where he works on concurrent systems. along with colleagues mike shapiro and adam leventhal, bryan developed dtrace, a facility for dynamic instrumentation of production systems that was directly inspired by bryan’s frustration in understanding the behavior of concurrent systems.

Jeff Bonwick is a fellow at sun microsystems, where he works on concurrent systems. best known for inventing and leading the development of sun’s zettabyte filesystem (zfs), bonwick has also written (or rather, rewritten) many of the most parallel subsystems in the solaris kernel, including the synchronization primitives, kernel memory allocator, and thread-blocking mechanism.

References:

Archives