it possible to speed up the program (i.e., reduce the total runtime). Amdahl’s law expresses this as:
_____ 1_____ (1-P) + P/S
Here P is the fraction of the program that can be parallelized, and S is the number of execution units.
This is the theory. Making it a reality is another issue. Simply writing a normal program by itself is a problem, as can be seen in the relentless stream of bug fixes available for programs. Trying to split a program into multiple pieces that can be executed in parallel adds a whole dimension of additional problems:
• Unless the program consists of multiple independent pieces from the onset and should have been written as separate programs in the first place, the individual pieces have to collaborate. This usually takes the form of sharing data in memory or on secondary storage.
• Write access to shared data cannot happen in an uncontrolled fashion. Allowing a program to see an inconsistent, and hence unexpected, state must be avoided at all times. This is a problem if the state is represented by the content of multiple memory locations. Processors are not able to modify an arbitrary number (in most cases not even two) of independent memory locations atomically.
• To deal with multiple memory locations, “traditional” parallel programming has had to resort to synchronization. With the help of mutex (mutual exclusion) directives, a program can ensure that it is alone in executing an operation protected by the mutex object. If all read or write accesses to the protected state are performed while holding the mutex lock, it is guaranteed that the program will never see an inconsistent state. Today’s programming environments (e.g., POSIX) allow for an arbitrary number of mutexes to coexist, and there are special types of mutexes that allow for multiple readers to gain access concurrently. The latter is allowed since read accesses do
not change the state. These mechanisms allow for reasonable scalability if used correctly.
• Locking mutexes open a whole new can of worms, though. Using a single program-wide mutex would in most cases dramatically hurt program performance by decreasing the portion of the program that can run in parallel (P in the formula). Using more mutexes increases not only P, but also the overhead associated with locking and unlocking the mutexes. This is especially problematic if, as it should be, the critical regions are only lightly contended. Dealing with multiple mutexes also means the potential for deadlocks exists. Deadlocks happen if overlapping mutexes are locked by multiple threads in a different order. This is a mistake that happens all too easily. Often the use of mutexes is hidden in library functions and not immediately visible, complicating the whole issue.
The programmer is caught between two problems:
• Increasing the part of the program that can be executed in parallel (P).
• Increasing the complexity of the program code and therefore the potential for problems.
An incorrectly functioning program can run as fast as you can make it run, but it will still be useless. Therefore, the parallelization must go only so far as not to introduce problems of the second kind. How much parallelism this is depends on the experience and knowledge of the programmer. Over the years many projects have been developed that try to automatically catch problems related to locking. None is succeeding in solving the problem for programs of sizes that appear in the real world. Static analysis is costly and complex. Dynamic analysis has to depend on heuristics and on the quality of the test cases.
For complex projects it is not possible to convert the whole project at once to allow for more parallelism. Instead, programmers iteratively add ever more fine-grained locking. This can be a long process, and if the testing of the intermediate steps isn’t thorough enough, problems that are not caused by the most recently added set of changes might pop up. Also, as experience has shown, it is sometimes very hard to get rid of the big locks. For an example, look at the BKL (big kernel lock) discussions on the Linux kernel mailing list. The BKL was introduced when Linux first gained SMP (symmetric multiprocessing) support in the mid-90s, and we still haven’t gotten rid of it in 2008.
References:
Archives