torically been acute: there was a time when multiprocessors were so rare that many software development shops simply didn’t have access to one. Fortunately, with the rise of multicore CPUs, this is no longer a problem: there is no longer any excuse for not being able to find at least a two-processor (dual core) machine, and with only a little effort, most will be able (as of this writing) to run their code on an eight-processor (two socket, quad core) machine.
But if the physical situation has improved, the second of these problems—knowing how to put load on the system—has worsened: production deployments have become increas-
velopment or in production—it is useless to software development if the impediments to its scalability can’t be understood. Understanding scalability inhibitors on a production system requires the ability to safely dynamically instrument its synchronization primitives, and in developing Solaris, our need for this was so historically acute that it led one of us (Bonwick) to develop a technology to do this (lockstat) in 1997. This tool became instantly essential—we quickly came to wonder how we ever resolved scalability problems without it—and it led the other of us (Cantrill) to further generalize dynamic instrumentation into DTrace, a system for nearly arbitrary dynamic
ingly complicated, with loads that are difficult and expensive to simulate in development. It is essential that load generation and simulation be treated as a first-class problem; the earlier this development problem is tackled, the earlier you will be able to get critical data that may have tremendous implications for the software. And while a test load should mimic its production equivalent as closely as possible, timeliness is more important than absolute accuracy: the absence of a perfect load simulation should not prevent you from simulating load altogether, as it is much better to put a multithreaded system under the wrong kind of load than under no load whatsoever.
Once a system is loaded—be it in de-
instrumentation of production systems that first shipped in Solaris in 2004, and has since been ported to many other systems including FreeBSD and MacOS. 3 (The instrumentation methodology in lockstat has been reimplemented to be a DTrace provider, and the tool itself has been reimplemented to be a DTrace consumer.)
Today, dynamic instrumentation continues to provide us with the data we need to not only to find those parts of the system that are inhibiting scalability, but to gather sufficient data to understand which techniques will be best suited to reduce that contention. Prototyping new locking strategies is expensive, and one’s intuition is frequently wrong; before breaking up a lock or
rearchitecting a subsystem to make it more parallel, we always strive to have the data in hand indicating the subsystem’s lack of parallelism is a clear inhibitor to system scalability!
Know when—and when not—to break up a lock. Global locks can naturally become scalability inhibitors, and when gathered data indicates a single hot lock, it is reasonable to want to break up the lock into per-CPU locks, a hash table of locks, per-structure locks, and so on. This might ultimately be the right course of action, but before blindly proceeding down that (complicated) path, carefully examine the work done under the lock: breaking up a lock is not the only way to reduce contention, and contention can be (and often is) more easily reduced by reducing the hold time of the lock. This can be done by algorithmic improvements (many scalability improvements have been had by reducing execution under the lock from quadratic time to linear time!) or by finding activity that is needlessly protected by the lock. As a classic example of this latter case: if data indicates that you are spending time (say) deallocating elements from a shared data structure, you could de-queue and gather the data that needs to be freed with the lock held, and defer the actually deallocation of the data until after the lock is dropped. Because the data has been removed from the shared data structure under the lock, there is no data race (other threads see the removal of the data as atomic), and lock hold time has been reduced with only a modest increase in implementation complexity.
Be wary of readers/writer locks. If there is a novice error when trying to break up a lock, it is this: seeing that a data structure is frequently accessed for reads and infrequently accessed for writes, it can be tempting to replace a mutex guarding the structure with a readers/writer lock to allow for concurrent readers. This seems reasonable, but unless the hold time for the lock is long, this solution will scale no better (and indeed, may well scale worse) than having a single lock. Why? Because the state associated with the readers/writer lock must itself be updated atomically, and in the absence of a more sophisticated (and less space-efficient) synchronization primitive, a readers/writer lock will use a single word of memory to store the num-
References:
Archives