If an even number of threads passes
through a balancer, the outputs are
evenly balanced on the top and bottom wires, but the balancer’s state remains unchanged.
The idea behind the Elimination-Tree<T>20,22 is to place an Elimination Array in front of the bit in every
balancer as in Figure 6. If two popping
threads meet in the array, they leave
on opposite wires, without needing to touch the bit, since it would anyhow have remained in its original state. If two
pushing threads meet in the array, they
also leave on opposite wires. If a push
or pop call does not manage to meet
another in the array, it toggles the bit
and leaves accordingly. Finally, if a
push and a pop meet, they eliminate,
exchanging items as in the EliminationBackoffStack. It can be shown
that this implementation provides
a quiescently consistent stack,a in
which, in most cases, it takes a thread
O(log k) steps to complete a push or a
pop, where k is the number of lock-free
stacks on its output wires.
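To make the balancer's cases concrete, the following is a minimal Java sketch of a single elimination balancer; it is an illustration under simplifying assumptions, not the published algorithm. A single java.util.concurrent.Exchanger with a short timeout stands in for the elimination array, a ticket counter arbitrarily sends two like operations to opposite wires, and the names (EliminationBalancer, Visit, traverse) are mine. An Elimination-Tree<T> would route each call through log k such balancers before it reaches one of the k lock-free stacks, which is where the O(log k) step bound comes from.

import java.util.concurrent.Exchanger;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

class EliminationBalancer<T> {
    static final int TOP = 0, BOTTOM = 1, ELIMINATED = -1;

    private final AtomicBoolean toggle = new AtomicBoolean(false);   // the balancer's bit
    private final Exchanger<Visit> slot = new Exchanger<>();         // stand-in for the elimination array
    private final AtomicLong tickets = new AtomicLong();

    /** One thread's traversal through this balancer. */
    final class Visit {
        final boolean isPush;
        final T item;                        // item being pushed (null for a pop)
        final long ticket = tickets.getAndIncrement();
        volatile T received;                 // item obtained if a pop is eliminated
        Visit(boolean isPush, T item) { this.isPush = isPush; this.item = item; }
    }

    /** Returns the output wire (TOP or BOTTOM), or ELIMINATED if a push and a
     *  pop met here and exchanged an item without going any further. */
    int traverse(Visit me) {
        try {
            Visit other = slot.exchange(me, 10, TimeUnit.MICROSECONDS);
            if (me.isPush != other.isPush) {             // a push met a pop: eliminate
                if (!me.isPush) me.received = other.item;
                return ELIMINATED;
            }
            // two pushes or two pops: leave on opposite wires, bit untouched
            return me.ticket < other.ticket ? TOP : BOTTOM;
        } catch (TimeoutException noPartner) {
            // nobody to meet in the array: toggle the bit and follow it
            boolean old;
            do { old = toggle.get(); } while (!toggle.compareAndSet(old, !old));
            return old ? BOTTOM : TOP;
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(ie);
        }
    }
}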
Figure 6. The Elimination Tree<T>. Each balancer in Tree[4] is an elimination balancer. The state depicted is the same as in Figure 5. From this state, a push of item 6 by thread A will not meet any others on the elimination arrays and so will toggle the bits and end up on the 2nd stack from the top. Two pops by threads B and C will meet in the top balancer’s array and end up going up and down without touching the bit, ending up popping the last two values 5 and 6 from the top two lock-free stacks. Finally, threads D and E will meet in the top array and “eliminate” each other, exchanging the value 7 and leaving the tree. This does not ruin the tree’s state since the states of all the balancers would have been the same even if the threads had both traversed all the way down without meeting: they would have anyhow followed the same path down and ended up exchanging values via the same stack.

A Pool Made of Stacks
The collection of stacks accessed in parallel in the elimination tree provides quiescently consistent LIFO ordering with a high degree of parallelism. However, each method call involves a logarithmic number of memory accesses, each involving a CAS operation, and these accesses are not localized, that is, threads are repeatedly accessing locations they did not access recently.

This brings us to the final two issues one must take into account when designing concurrent data structures: the machine’s memory hierarchy and its coherence mechanisms. Mainstream multicore architectures are cache coherent, where on most machines the L2 cache (and in the near future the L3 cache as well) is shared by all cores. A large part of the machine’s performance on shared data is derived from the threads’ ability to find the data cached. The shared caches are unfortunately a bounded resource, both in their size and in the level of access parallelism they offer. Thus, the data structure design needs to attempt to lower the overall number of accesses to memory, and to maintain locality as much as possible.

What are the implications for our stack design? Consider completely relaxing the LIFO property in favor of a Pool<T> structure in which there is no temporal ordering on push() and pop() calls. We will provide a concurrent lock-free implementation of a pool that supports high parallelism, high locality, and has a low cost in terms of the overall number of accesses to memory.

How useful is such a concurrent pool? I would like to believe that most concurrent applications can be tailored to use pools in place of queues and stacks
(perhaps with some added liveness
conditions)…time will tell.
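As a purely illustrative way to read this relaxed contract, a pool could be specified by a Java interface along the following lines; the interface and its empty-pool policy are my assumptions, not something given in the article.

// A pool promises only that pop() returns some item that was pushed and not
// yet removed; no temporal ordering relates push() and pop() calls.
public interface Pool<T> {
    void push(T item);
    T pop();   // behavior on an empty pool (block, spin, or fail) is a separate policy choice
}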
Our overall concurrent pool design
is quite simple. As depicted in Figure
7, we allocate a collection of n concurrent lock-free stacks, one per computing thread (alternatively, we could
allocate one stack per collection of
threads on the same core, depending
on the specific machine architecture).
Each thread will push and pop from
its own assigned stack. If, when it attempts to pop, it finds its own stack
is empty, it will repeatedly attempt
to “steal” an item from another randomly chosen stack.b The pool has, in
Figure 7. The concurrent Pool<T>. Each thread performs push() and pop() calls on a lock-free stack and attempts to steal from other stacks when a pop() finds the local stack empty. In the figure, thread C will randomly select the top lock-free stack, stealing the value 5. If the lock-free stacks are replaced by lock-free deques, thread C will pop the oldest value, returning 1.
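A few lines of Java can sketch the design in Figure 7. This is an assumption-laden illustration rather than the article's code: java.util.concurrent.ConcurrentLinkedDeque stands in for the per-thread lock-free stacks (or deques), thread ids are assumed to run from 0 to n-1, and the termination detection of footnote b is omitted.

import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ThreadLocalRandom;

class StackPool<T> {
    private final ConcurrentLinkedDeque<T>[] stacks;    // one stack per thread

    @SuppressWarnings("unchecked")
    StackPool(int nThreads) {
        stacks = new ConcurrentLinkedDeque[nThreads];
        for (int i = 0; i < nThreads; i++) stacks[i] = new ConcurrentLinkedDeque<>();
    }

    public void push(int myId, T item) {
        stacks[myId].push(item);                         // LIFO push on the local stack
    }

    public T pop(int myId) {
        T item = stacks[myId].poll();                    // try the local stack first
        while (item == null) {                           // empty: repeatedly try to steal
            int victim = ThreadLocalRandom.current().nextInt(stacks.length);
            item = stacks[victim].pollLast();            // steal the oldest item (deque variant)
            // no termination detection here: this loop spins until an item appears
        }
        return item;
    }
}

Stealing with pollLast() takes the oldest item, matching the deque variant described in the caption; replacing it with poll() would steal in LIFO order from a plain pool of stacks.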
a To keep things simple, pop operations should
block until a matching push appears.
b One typically adds a termination detection protocol14 to the structure to guarantee that threads
will know when there remain no items to pop.