CONCURRENC Y
and the size of the hash table must be contained in the
table itself—but the implementation complexity is modest
given the alternatives, and (assuming hash tables are
doubled when they are resized) the cost in terms of
memory is only a factor of two. For an example of this
in production code, the reader is directed to the file
descriptor locking in Solaris, the source code for which
can be found by searching the Internet for “flist_grow.”
Be aware of false sharing. There are a variety of different protocols for keeping memory coherent in caching
multiprocessor systems. Typically, these protocols dictate
that only a single cache may have a given line of memory
in a dirty state. If a different cache wishes to write to the
dirty line, the new cache must first read-to-own the dirty
line from the owning cache. The size of the line used for
coherence (the coherence granularity) has an important
ramification for parallel software: because only one cache
may own a line a given time, one wishes to avoid a situation where two (or more) small, disjoint data structures
are both contained within a single line and accessed in
parallel by disjoint caches. This situation—called false
sharing—can induce suboptimal scalability in otherwise
scalable software. This most frequently arises in practice
when one attempts to defract contention with an array of
locks: the size of a lock structure is typically no more than
the size of a pointer or two and is usually quite a bit less
than the coherence granularity (which is typically on the
order of 64 bytes). Disjoint CPUs acquiring different locks
can therefore potentially contend for the same cache line.
False sharing is excruciating to detect dynamically: it
requires not only a bus analyzer, but also a way of translating from the physical addresses of the bus to the virtual
addresses that make sense to software, and then from
there to the actual structures that are inducing the false
sharing. (This process is so arduous and error-prone that
we have experimented—with some success—with static
mechanisms to detect false sharing. 10) Fortunately, false
sharing is rarely the single greatest scalability inhibitor
in a system, and it can be expected to be even less of an
issue on a multicore system (where caches are more likely
to be shared among CPUs). Nonetheless, this remains an
issue that the practitioner should be aware of, especially
when creating arrays that are designed to be accessed in
parallel. (In this situation, array elements should be padded out to be a multiple of the coherence granularity.)
Consider using nonblocking synchronization routines
to monitor contention. Many synchronization primitives
have different entry points to specify different behavior if
the primitive is unavailable: the default entry point will
typically block, whereas an alternative entry point will
return an error code instead of blocking. This second variant has a number of uses, but a particularly interesting
one is the monitoring of one’s own contention: when an
attempt to acquire a synchronization primitive fails, the
subsystem can know that there is contention. This can be
especially useful if a subsystem has a way of dynamically
reducing its contention. For example, the Solaris kernel
memory allocator has per-CPU caches of memory buffers.
When a CPU exhausts its per-CPU caches, it must obtain
a new series of buffers from a global pool. Instead of
simply acquiring a lock in this case, the code attempts to
acquire the lock, incrementing a counter when this fails
(and then acquiring the lock through the blocking entry
point). If the counter reaches a predefined threshold, the
size of the per-CPU caches is increased, thereby dynamically reducing contention.
When reacquiring locks, consider using generation
counts to detect state change. When lock ordering
becomes complicated, at times one will need to drop
one lock, acquire another, and then reacquire the first.
This can be tricky, as state protected by the first lock
may have changed during the time that the lock was
dropped—and reverifying this state may be exhausting,
inefficient, or even impossible. In these cases, consider
associating a generation count with the data structure;
when a change is made to the data structure, a generation count is bumped. The logic that drops and reacquires
the lock must cache the generation before dropping the
lock, and then check the generation upon reacquisition:
if the counts are the same, the data structure is as it was
when the lock was dropped and the logic may proceed; if
the count is different, the state has changed and the logic
may react accordingly (for example, by reattempting the
larger operation).
Use wait- and lock-free structures only if you absolutely must. Over our careers, we have each implemented
wait- and lock-free data structures in production code, but
we did this only in contexts in which locks could not be
acquired for reasons of correctness. Examples include the
implementation of the locking system itself, 11 the subsystems that span interrupt levels, and dynamic instrumentation facilities. 12 These constrained contexts are the