the sequential (single-processor) time
and the parallel time, is:
S = 1 / (1 − p + p/n)
In other words, S does not grow linearly in n. For example, given an application and a 10-processor machine,
Amdahl’s Law says that even if we manage to parallelize 90% of the application, but not the remaining 10%, then
we end up with a fivefold speedup, but
not a 10-fold speedup. Doubling the
number of cores to 20 will only raise us
to a sevenfold speedup. So the remaining 10%, those we continue to execute
sequentially, cut our utilization of the
10-processor machine in half, and limit
us to a 10-fold speedup no matter how
many cores we add.
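To make the arithmetic concrete, here is a small sketch (my own illustration, not taken from the article) that simply evaluates Amdahl's formula for an application that is 90% parallelizable:

public class Amdahl {
    // Speedup predicted by Amdahl's Law when a fraction p of the work
    // can be parallelized across n processors.
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        System.out.println(speedup(0.9, 10));   // ~5.3: the fivefold speedup
        System.out.println(speedup(0.9, 20));   // ~6.9: the sevenfold speedup
        System.out.println(speedup(0.9, 1000)); // ~9.9: approaching the 10-fold limit
    }
}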
What are the 10% we found difficult
to parallelize? In many mainstream
applications they are the parts of the
program involving interthread interaction and coordination, which on
multicore machines are performed by
concurrently accessing shared data
structures. Amdahl’s Law tells us it is
worthwhile to invest an effort to derive
as much parallelism as possible from
these 10%, and a key step on the way to
doing so is to have highly parallel concurrent data structures.
Unfortunately, concurrent data
structures are difficult to design.
There is a kind of tension between
correctness and performance: the
more one tries to improve performance, the more difficult it becomes to reason about the correctness of the resulting algorithm. Some experts
blame the widely accepted threads-and-objects programming model
(that is, threads communicating via
shared objects), and predict that its eventual demise will save us. My experience with the alternatives suggests
this model is here to stay, at least
for the foreseeable future. So let us,
in this article, consider correctness
and performance of data structures
on multicore machines within the
threads-and-objects model.
In the concurrent world, in contrast
to the sequential one, correctness has
two aspects: safety, guaranteeing that
nothing bad happens, and liveness,
guaranteeing that eventually something good will happen.
The safety aspects of concurrent
data structures are complicated by the
need to argue about the many possible
interleavings of methods called by different threads. It is infinitely easier and
more intuitive for us humans to specify
how abstract data structures behave in
a sequential setting, where there are no
interleavings. Thus, the standard approach to arguing the safety properties
of a concurrent data structure is to specify the structure’s properties sequentially, and find a way to map its concurrent
executions to these “correct” sequential
ones. There are various approaches for
doing this, called consistency conditions. Some familiar conditions are serializability, linearizability, sequential
consistency, and quiescent consistency.
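As a toy illustration of this approach (my own sketch, not the article's), consider a counter: its behavior is trivial to specify sequentially, and the concurrent version below is linearizable because each call appears to take effect instantaneously at a single atomic instruction, so every concurrent execution maps to a correct sequential one.

import java.util.concurrent.atomic.AtomicInteger;

// Sequential specification: getAndIncrement() returns 0, 1, 2, ...
// with no value skipped or repeated.
class SequentialCounter {
    private int value = 0;
    int getAndIncrement() { return value++; }
}

// A linearizable concurrent counter: each call appears to take effect
// at the moment of its atomic fetch-and-increment instruction, so the
// concurrent history can be mapped to a correct sequential one.
class ConcurrentCounter {
    private final AtomicInteger value = new AtomicInteger(0);
    int getAndIncrement() { return value.getAndIncrement(); }
}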
When considering liveness in a concurrent setting, the good thing one expects to happen is that method calls
eventually complete. The terms under which liveness can be guaranteed
are called progress conditions. Some
familiar conditions are deadlock-freedom, starvation-freedom, lock-freedom, and wait-freedom. These
conditions capture the properties an
implementation requires from the underlying system scheduler in order to
guarantee that method calls complete.
For example, deadlock-free implementations depend on strong scheduler
support, while wait-free ones do all the
work themselves and are independent
of the scheduler.
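A small sketch of my own may help make the distinction concrete: the lock-based counter below is blocking, so its progress depends on the scheduler eventually running whichever thread holds the lock, while the compare-and-set version is lock-free, since some thread always completes no matter how the scheduler treats the others.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class LockedCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private int value = 0;

    // Blocking: if the scheduler suspends the thread holding the lock,
    // every other caller waits. Progress depends on scheduler support.
    int getAndIncrement() {
        lock.lock();
        try { return value++; } finally { lock.unlock(); }
    }
}

class LockFreeCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // Nonblocking: a thread retries only because some other thread's
    // compareAndSet succeeded, so system-wide progress is guaranteed
    // (lock-freedom) even though an individual thread can starve.
    int getAndIncrement() {
        while (true) {
            int v = value.get();
            if (value.compareAndSet(v, v + 1)) return v;
        }
    }
}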
Finally, we have the performance
of our data structures to consider. Historically, uniprocessors are modeled
as Turing machines, and one can argue the theoretical complexity of data
structure implementations on uniprocessors by counting the number of
steps—the machine instructions—that
method calls might take. There is an immediate correlation between the theoretical number of uniprocessor steps and
the observed time a method will take.
In the multiprocessor setting, things
are not that simple. In addition to the
actual steps, one needs to consider
whether steps by different threads require a shared resource or not, because these resources have a bounded capacity to handle simultaneous requests. For example, multiple instructions accessing the same location in memory cannot be serviced at the same time.
In its simplest form, our theoretical complexity model requires us to consider a new element: stalls.2,7–10 When threads concurrently access a shared resource, one succeeds and others incur stalls. The overall complexity of the algorithm, and hence the time it might take to complete, is correlated to the number of operations together with the number of stalls (obviously this is a crude model that does not take into account the details of cache coherence). From an algorithmic design point of view, this model introduces a continuum starting from centralized structures, where all threads share data by accessing a small set of locations, incurring many stalls, to distributed structures with multiple locations, in which the number of stalls is greatly reduced, yet the number of steps necessary to properly share data and move it around increases significantly.
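A counter again illustrates the two ends of this continuum (a sketch of mine, not the article's): a single shared location on which every increment contends and incurs stalls, versus a striped counter that spreads increments over many locations, trading stalls for the extra steps needed to read the total.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Centralized: one memory location, few steps per call, but every
// concurrent increment contends on the same location and stalls.
class CentralizedCounter {
    private final AtomicInteger value = new AtomicInteger(0);
    void increment() { value.getAndIncrement(); }
    int read()       { return value.get(); }
}

// Distributed ("striped"): increments are spread over many slots, so
// stalls are rare, but reading the total takes many more steps.
class StripedCounter {
    private final AtomicIntegerArray slots;

    StripedCounter(int stripes) { slots = new AtomicIntegerArray(stripes); }

    void increment() {
        // Hash the calling thread to a slot; contention occurs only
        // between threads that happen to map to the same stripe.
        int i = (int) (Thread.currentThread().getId() % slots.length());
        slots.getAndIncrement(i);
    }

    int read() {
        int sum = 0;                       // more steps: visit every slot
        for (int i = 0; i < slots.length(); i++) sum += slots.get(i);
        return sum;                        // exact only when the counter is quiescent
    }
}

(Production classes such as java.util.concurrent.atomic.LongAdder take this striped approach.)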