in Figure 2 calls tryPop(), which attempts to acquire the lock and remove the first node from the stack. If it acquires the lock, it throws an exception if the stack is empty; otherwise it returns the node referenced by top. If tryPop() fails to acquire the lock, it returns null and is called again until it succeeds.
What are the safety, liveness, and performance properties of our implementation? Well, because we use a single lock to protect the structure, it is obvious that its behavior is “atomic” (the technical term for this is linearizable15). In other words, the outcomes of our concurrent execution are equivalent to those of a sequential execution in which each push or pop takes effect at some non-overlapping instant during its method call. In particular, we could think of each operation as taking effect when the executing thread acquires the lock. Linearizability is a desirable property because linearizable objects can be composed without having to know anything about their actual implementation.
Figure 3. The lock-free tryPush() and tryPop() methods.

public class LockFreeStack<T> {
  private AtomicReference<Node> top =
    new AtomicReference<Node>(null);
  ...

  protected boolean tryPush(Node node) {
    Node oldTop = top.get();
    node.next = oldTop;
    return top.compareAndSet(oldTop, node);
  }

  protected Node tryPop() throws EmptyException {
    Node oldTop = top.get();
    if (oldTop == null) {
      throw new EmptyException();
    }
    Node newTop = oldTop.next;
    if (top.compareAndSet(oldTop, newTop)) {
      return oldTop;
    } else {
      return null;
    }
  }
}
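The text notes that tryPop() is called again until it succeeds; since the figure elides the surrounding class, here is a self-contained sketch of those retry loops. The Node and EmptyException definitions and the push()/pop() wrapper shapes are assumptions filled in for illustration, not the article's code:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch only: shows how callers might drive tryPush()/tryPop() in a loop.
public class LockFreeStackSketch<T> {
    static class Node {
        final Object value;
        Node next;
        Node(Object value) { this.value = value; }
    }
    static class EmptyException extends Exception {}

    private final AtomicReference<Node> top = new AtomicReference<>(null);

    protected boolean tryPush(Node node) {
        Node oldTop = top.get();
        node.next = oldTop;
        return top.compareAndSet(oldTop, node);
    }

    protected Node tryPop() throws EmptyException {
        Node oldTop = top.get();
        if (oldTop == null) throw new EmptyException();
        Node newTop = oldTop.next;
        return top.compareAndSet(oldTop, newTop) ? oldTop : null;
    }

    // The public operations simply retry until their CAS succeeds.
    public void push(T value) {
        Node node = new Node(value);
        while (!tryPush(node)) { /* spin; a contention manager could go here */ }
    }

    @SuppressWarnings("unchecked")
    public T pop() throws EmptyException {
        while (true) {
            Node node = tryPop();
            if (node != null) return (T) node.value;
        }
    }
}
```

A bare spin like this is exactly what invites the contention problems discussed later in the text.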
Figure 4. The EliminationBackoffStack<T>.

[Diagram: an elimination array in front of the lock-free stack (top → d → e → f); thread A’s pop() returns b after meeting thread B’s push(b), which returns ok, while thread C’s pop() returns d from the stack.]

Each thread selects a random location in the array. If thread A’s pop() and thread B’s push() calls arrive at the same location at about the same time, then they exchange values without accessing the shared lock-free stack. A thread C that does not meet another thread eventually pops the shared lock-free stack.
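A single meeting point of the kind the caption describes can be sketched with the JDK's java.util.concurrent.Exchanger, which is not what the article uses but captures the same rendezvous idea: a pusher offers its value, a popper offers null, and if the two meet within a timeout they exchange without touching the shared stack. The class and method names here are our own:

```java
import java.util.concurrent.Exchanger;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// One elimination slot, sketched with Exchanger (illustrative assumption).
public class EliminationSlot<T> {
    private final Exchanger<T> exchanger = new Exchanger<>();

    // Returns the value received from a partner thread; throws
    // TimeoutException if no partner arrived, in which case the caller
    // would fall back to the shared lock-free stack.
    public T visit(T value, long timeoutMillis)
            throws InterruptedException, TimeoutException {
        return exchanger.exchange(value, timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

A full elimination array would hold many such slots, with each thread picking one at random, as the caption describes.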
from the system, all threads accessing
the stack will be delayed whenever one
is preempted. Modern operating systems can deal with these issues, and
will have to become even better at handling them in the future.
In terms of progress, the locking scheme is deadlock-free: if several threads all attempt to acquire the lock, one will succeed. But it is not starvation-free: some thread could be unlucky enough to always fail in its CAS when attempting to acquire the lock.
The centralized nature of the lock-based stack implementation introduces
a sequential bottleneck: only one thread
at a time can complete the update of the
data structure’s state. This, Amdahl’s
Law tells us, will have a very negative effect on scalability, and performance will
not improve as the number of cores/
threads increases.
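To make the Amdahl's Law point concrete, here is a back-of-the-envelope calculation; the 10% sequential fraction is purely an assumption for illustration:

```java
// Amdahl's Law: speedup = 1 / (s + (1 - s) / n), where s is the
// sequential fraction of the work and n is the number of cores.
public class Amdahl {
    static double speedup(double s, int n) {
        return 1.0 / (s + (1.0 - s) / n);
    }

    public static void main(String[] args) {
        // If even 10% of the work is the serialized stack update,
        // speedup is capped at 10 no matter how many cores we add.
        System.out.println(speedup(0.10, 8));    // ~4.7
        System.out.println(speedup(0.10, 64));   // ~8.8
        System.out.println(speedup(0.10, 1024)); // ~9.9
    }
}
```

The sequential bottleneck at the lock is exactly this s term, which is why throughput flattens as cores are added.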
But there is another separate phenomenon here: memory contention. Threads failing their CAS attempts on the lock retry the CAS even while the lock is still held by the last CAS “winner” updating the stack. These repeated attempts cause increased traffic on the machine’s shared bus or interconnect. Since these are bounded resources, the result is an overall slowdown in performance, and in fact, as the number of cores increases, we will see performance deteriorate below that obtainable on a single core. Luckily, we can deal with contention quite easily by adding a contention manager into the code (Line 21 in figures 1 and 2).
The most popular type of contention manager is exponential backoff:
every time a CAS fails in tryPush() or
tryPop(), the thread delays for a certain random time before attempting
the CAS again. A thread will double the
range from which it picks the random
delay upon CAS failure, and will cut
it in half upon CAS success. The randomized nature of the backoff scheme
makes the timing of the thread’s attempts to acquire the lock less dependent on the scheduler, reducing the
chance of threads falling into a repetitive pattern in which they all try to CAS
at the same time and end up starving.
Contention managers1, 12, 19 are key tools
in the design of multicore data structures, even when no locks are used, and
I expect them to play an even greater
role as the number of cores grows.