torically been acute: there was a time
when multiprocessors were so rare
that many software development shops
simply didn’t have access to one. Fortunately, with the rise of multicore CPUs,
this is no longer a problem: there is no
longer any excuse for not being able to
find at least a two-processor (dual-core)
machine, and with only a little effort,
most will be able (as of this writing) to
run their code on an eight-processor
(two-socket, quad-core) machine.
But if the physical situation has
improved, the second of these problems—knowing how to put load on the
system—has worsened: production
deployments have become increasingly complicated, with loads that are difficult and expensive to simulate in development. It is essential that load generation and simulation be treated as a first-class problem; the earlier this development problem is tackled, the earlier you will be able to get critical data that may have tremendous implications for the software. And while a test load should mimic its production equivalent as closely as possible, timeliness is more important than absolute accuracy: the absence of a perfect load simulation should not prevent you from simulating load altogether, as it is much better to put a multithreaded system under the wrong kind of load than under no load whatsoever.
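As a rough illustration of what even imperfect load can look like, the sketch below (not taken from the text above) spawns a handful of POSIX threads that simply hammer a stand-in operation in a loop; the thread count, iteration count, and the operation_under_test() function are hypothetical placeholders for whatever code path you actually need to exercise.

/*
 * Minimal load-generation sketch (hypothetical): spawn a handful of
 * threads that repeatedly exercise a stand-in operation so the system
 * is at least under some concurrent load.  Replace
 * operation_under_test() with the code path you actually care about.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS        8               /* arbitrary; scale to the machine */
#define NCALLS          1000000         /* arbitrary per-thread iteration count */

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void
operation_under_test(void)
{
        /* Stand-in workload: a mutex-protected counter bump. */
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
}

static void *
hammer(void *arg)
{
        (void) arg;
        for (int i = 0; i < NCALLS; i++)
                operation_under_test();
        return (NULL);
}

int
main(void)
{
        pthread_t tids[NTHREADS];

        for (int i = 0; i < NTHREADS; i++) {
                if (pthread_create(&tids[i], NULL, hammer, NULL) != 0) {
                        perror("pthread_create");
                        exit(1);
                }
        }
        for (int i = 0; i < NTHREADS; i++)
                (void) pthread_join(tids[i], NULL);

        (void) printf("%ld operations completed\n", counter);
        return (0);
}

Even a crude driver like this puts the synchronization primitives under concurrent pressure, which is the precondition for the instrumentation discussed next.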
Once a system is loaded—be it in development or in production—it is useless to software development if the impediments to its scalability can't be understood. Understanding scalability inhibitors on a production system requires the ability to safely dynamically instrument its synchronization primitives, and in developing Solaris, our need for this was so historically acute that it led one of us (Bonwick) to develop a technology to do this (lockstat) in 1997. This tool became instantly essential—we quickly came to wonder how we ever resolved scalability problems without it—and it led the other of us (Cantrill) to further generalize dynamic instrumentation into DTrace, a system for nearly arbitrary dynamic instrumentation of production systems
that first shipped in Solaris in 2004, and
has since been ported to many other systems, including FreeBSD and MacOS.3
(The instrumentation methodology in
lockstat has been reimplemented to be
a DTrace provider, and the tool itself
has been reimplemented to be a DTrace
consumer.)
Today, dynamic instrumentation
continues to provide us with the data
we need not only to find those parts of the system that are inhibiting scalability, but also to gather sufficient data to understand which techniques will be best suited to reduce that contention.
Prototyping new locking strategies is expensive, and one’s intuition is frequently wrong; before breaking up a lock or
rearchitecting a subsystem to make it
more parallel, we always strive to have
the data in hand indicating that the subsystem's lack of parallelism is a clear inhibitor to system scalability!
Know when—and when not—to break
up a lock. Global locks can naturally become scalability inhibitors, and when
gathered data indicates a single hot
lock, it is reasonable to want to break up
the lock into per-CPU locks, a hash table
of locks, per-structure locks, and so on.
This might ultimately be the right course
of action, but before blindly proceeding
down that (complicated) path, carefully
examine the work done under the lock:
breaking up a lock is not the only way to
reduce contention, and contention can
be (and often is) more easily reduced by
reducing the hold time of the lock. This
can be done by algorithmic improvements (many scalability improvements
have been had by reducing execution
under the lock from quadratic time to
linear time!) or by finding activity that
is needlessly protected by the lock. As
a classic example of this latter case: if
data indicates that you are spending
time (say) deallocating elements from
a shared data structure, you could dequeue and gather the data that needs to be freed with the lock held, and defer the actual deallocation of the data until after the lock is dropped. Because the
data has been removed from the shared
data structure under the lock, there is
no data race (other threads see the removal of the data as atomic), and lock
hold time has been reduced with only
a modest increase in implementation
complexity.
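To make the shape of that example concrete, here is a sketch in C with POSIX threads. The node_t type, its fields, and the list_reap() helper are hypothetical names invented for illustration rather than drawn from any particular system.

/*
 * Sketch of the deferred-deallocation idiom described above: unlink the
 * doomed elements while holding the lock, but free them only after the
 * lock has been dropped.
 */
#include <pthread.h>
#include <stdlib.h>

typedef struct node {
        struct node *n_next;
        int n_dead;                     /* marked for deletion */
        /* ... payload ... */
} node_t;

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static node_t *list_head;               /* protected by list_lock */

static void
list_reap(void)
{
        node_t **npp, *np, *doomed = NULL;

        pthread_mutex_lock(&list_lock);

        /*
         * With the lock held, unlink each dead node and chain it onto a
         * private "doomed" list; other threads see each removal as atomic.
         */
        for (npp = &list_head; (np = *npp) != NULL; ) {
                if (np->n_dead) {
                        *npp = np->n_next;
                        np->n_next = doomed;
                        doomed = np;
                } else {
                        npp = &np->n_next;
                }
        }

        pthread_mutex_unlock(&list_lock);

        /* The deallocation itself happens outside the lock. */
        while ((np = doomed) != NULL) {
                doomed = np->n_next;
                free(np);
        }
}

int
main(void)
{
        /* Build a small list with some nodes marked dead, then reap it. */
        for (int i = 0; i < 10; i++) {
                node_t *np = malloc(sizeof (node_t));
                np->n_dead = (i % 2 == 0);
                np->n_next = list_head;
                list_head = np;
        }
        list_reap();
        return (0);
}

The only work done while list_lock is held is the unlinking; the free() calls, which may be comparatively expensive, no longer contribute to the lock's hold time.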
Be wary of readers/writer locks. If there
is a novice error when trying to break up
a lock, it is this: seeing that a data structure is frequently accessed for reads and
infrequently accessed for writes, it can
be tempting to replace a mutex guarding the structure with a readers/writer
lock to allow for concurrent readers.
This seems reasonable, but unless the
hold time for the lock is long, this solution will scale no better (and indeed,
may well scale worse) than having a single lock. Why? Because the state associated with the readers/writer lock must
itself be updated atomically, and in the
absence of a more sophisticated (and
less space-efficient) synchronization
primitive, a readers/writer lock will use a
single word of memory to store the num-