the common case, the same O(1) complexity per method call as the original
lock-free stack, yet provides a very high
degree of parallelism. The act of stealing itself may be expensive, especially
when the pool is almost empty, but
there are various techniques to reduce
the number of steal attempts if they
are unlikely to succeed. The randomization serves to guarantee an even distribution of threads
over the stacks, so that if there are
items to be popped, they will be found
quickly. Thus, our construction has
relaxed the specification by removing
the causal ordering on method calls
and replacing the deterministic liveness and complexity guarantees with probabilistic ones.
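The construction above can be sketched in a few lines. In this minimal Python rendering, each thread pushes and pops on its own designated stack, and a thread that finds its stack empty attempts to steal from victims chosen uniformly at random, giving up after a bounded number of attempts. The per-stack locks stand in for Treiber-style lock-free stacks, and the class and parameter names are my own, not the article's:

```python
import random
import threading

class RelaxedPool:
    """Sketch of the relaxed pool: one stack per thread; pop() tries the
    local stack first, then steals from randomly chosen victims.
    (Locked lists stand in for the lock-free stacks of the real design.)"""

    def __init__(self, nthreads):
        self.stacks = [[] for _ in range(nthreads)]
        self.locks = [threading.Lock() for _ in range(nthreads)]

    def push(self, tid, item):
        # Fast path: a thread always pushes onto its own stack.
        with self.locks[tid]:
            self.stacks[tid].append(item)

    def pop(self, tid, max_steals=100):
        # Fast path: pop from the local stack.
        with self.locks[tid]:
            if self.stacks[tid]:
                return self.stacks[tid].pop()
        # Slow path: randomized stealing, bounded to limit wasted attempts
        # when the pool is probably empty.
        for _ in range(max_steals):
            victim = random.randrange(len(self.stacks))
            with self.locks[victim]:
                if self.stacks[victim]:
                    return self.stacks[victim].pop()
        return None  # pool is empty with high probability
```

Note that `pop()` returning `None` is only probabilistically correct: items may exist that the bounded random search failed to find, which is exactly the relaxation of the deterministic guarantees described above.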
As the reader can imagine, the O(1)
step complexity does not tell the whole
story. Threads accessing the pool will
tend to pop items that they themselves recently pushed onto their own
designated stack, thereby exhibiting good cache locality. Moreover,
since chances of a concurrent stealer
are low, most of the time a thread accesses its lock-free stack alone. This
observation allows designers to create
a lock-free “stack-like” structure called
a Deque^c that allows the frequently accessing local thread to use only loads and stores in its methods, resorting to more expensive CAS-based method calls only when chances of synchronization with a conflicting stealing thread are high.3,6
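The idea can be sketched as follows: a simplified, single-owner Python rendering in the spirit of the ABP/Chase–Lev deques cited above, with a lock-based cell standing in for a hardware CAS. The class and method names are my own, and a real implementation must also handle memory ordering and array growth:

```python
import threading

class CASCell:
    """Toy compare-and-swap cell (real deques use a hardware CAS)."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def compare_and_set(self, expect, update):
        with self._lock:
            if self._value == expect:
                self._value = update
                return True
            return False

class WorkStealingDeque:
    """The owner pushes/pops at the bottom with plain loads and stores;
    stealers CAS on `top`. The owner falls back to a CAS only when the
    deque may be down to its last item, where a conflict is possible."""

    def __init__(self, capacity=64):
        self.items = [None] * capacity
        self.bottom = 0          # owner-only index: plain reads/writes
        self.top = CASCell(0)    # shared with stealers: CAS required

    def push(self, x):           # owner only: no CAS on the fast path
        self.items[self.bottom % len(self.items)] = x
        self.bottom += 1

    def pop(self):               # owner only
        if self.bottom <= self.top.get():
            return None          # empty
        self.bottom -= 1
        x = self.items[self.bottom % len(self.items)]
        t = self.top.get()
        if self.bottom > t:
            return x             # no conflict possible: plain ops suffice
        # Last item: may race with a stealer, so resolve with a CAS.
        self.bottom = t + 1
        if self.top.compare_and_set(t, t + 1):
            return x
        return None              # a stealer won the race

    def pop_top(self):           # stealers: always pay for a CAS
        t = self.top.get()
        if self.bottom <= t:
            return None          # empty
        x = self.items[t % len(self.items)]
        if self.top.compare_and_set(t, t + 1):
            return x
        return None              # lost the race; the caller may retry
```

The asymmetry is the point: the common-case local `push()` and `pop()` touch only owner-private state, while the rare synchronization cost is pushed onto the stealers.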
The end result is a pool implementation that is tailored to the costs of
the machine’s memory hierarchy and
synchronization operations. The big
hope is that as we go forward, many of
these architecture-conscious optimizations, which can greatly influence performance, will move into the realm of
compilers and concurrency libraries,
and the need for everyday programmers to be aware of them will diminish.
The pool structure ended our sequence of relaxations. I hope the reader has come to realize how strongly the choice of structure depends on the machine's size and the application's concurrency requirements. For example, small collections of threads can effectively share a lock-based or lock-free stack, slightly larger ones an elimination stack, but for hundreds of threads we will have to bite the bullet and move from a stack to a pool (though within the pool implementation, threads residing on the same core or machine cluster could use a single stack quite effectively).

^c This Deque supports push() and pop() methods with the traditional LIFO semantics and an additional popTop() method for stealers that pops the first-in (oldest) item.5
In the end, we gave up the stack’s
LIFO ordering in the name of performance. I imagine we will have to do the
same for other data structure classes.
For example, I would guess that search
structures will move away from being
comparison based, allowing us to use
hashing and similar naturally parallel
techniques, and that priority queues
will have a relaxed priority ordering
in place of the strong one imposed by
deleting the minimum key. I can’t wait
to see what these and other structures
will look like.
As we go forward, we will also need
to take into account the evolution of
hardware support for synchronization. Today’s primary construct, the
CAS operation, works on a single
memory location. Future architectures
will most likely support synchronization techniques such as transactional
memory,13,21 allowing threads to read and write multiple locations in one indivisible step. Perhaps
more important than the introduction
of new features like transactional memory is the fact that the relative costs of
synchronization and coherence are
likely to change dramatically as new
generations of multicore chips roll out.
We will have to make sure to consider
this evolution path carefully as we set
our language and software development goals.
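To see why multi-location atomicity matters, consider moving an item between two containers: no single-word CAS can update both in one step, but a transaction can. The sketch below models only the semantics, using a global lock as a toy stand-in for hardware or software transactional memory; the class and function names are my own, and a real HTM/STM would detect conflicts and retry rather than serialize everything:

```python
import threading

class ToyTransactionalMemory:
    """Stand-in for transactional memory: atomic() runs a block over
    multiple memory locations as one indivisible step. (A single global
    lock models only the semantics, not the optimistic concurrency of a
    real HTM/STM.)"""
    def __init__(self):
        self._lock = threading.Lock()

    def atomic(self, fn):
        with self._lock:
            return fn()

tm = ToyTransactionalMemory()
source, target = [1, 2, 3], []

def move_top():
    # Reads and writes two locations indivisibly: a transfer no
    # single-word CAS can express as one atomic step.
    if source:
        target.append(source.pop())
        return True
    return False

tm.atomic(move_top)
```

If the relative costs of such transactions fall as the hardware evolves, the design trade-offs behind structures like the pool above will shift with them.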
Concurrent data structure design
has, for many years, been moving forward at a glacial pace. Multicore processors are about to heat things up, leaving us, the data structure designers
and users, with the interesting job of
directing which way they flow. Let’s try
to get it right.
References
1. Agarwal, A. and Cherian, M. Adaptive backoff synchronization techniques. In Proceedings of the 16th International Symposium on Computer Architecture (May 1989), 396–406.
2. Anderson, J. and Kim, Y. An improved lower bound for the time complexity of mutual exclusion. In Proceedings of the 20th Annual ACM Symposium on Principles of Distributed Computing (2001), 90–99.
3. Arora, N.S., Blumofe, R.D. and Plaxton, C.G. Thread scheduling for multiprogrammed multiprocessors. Theory of Computing Systems 34, 2 (2001), 115–144.
4. Aspnes, J., Herlihy, M. and Shavit, N. Counting networks. J. ACM 41, 5 (1994), 1020–1048.
5. Blumofe, R.D. and Leiserson, C.E. Scheduling multithreaded computations by work stealing. J. ACM 46, 5 (1999), 720–748.
6. Chase, D. and Lev, Y. Dynamic circular work-stealing deque. In Proceedings of the 17th Annual ACM Symposium on Parallelism in Algorithms and Architectures. ACM Press, NY, 2005, 21–28.
7. Cypher, R. The communication requirements of mutual exclusion. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures (1995), 147–156.
8. Dwork, C., Herlihy, M. and Waarts, O. Contention in shared memory algorithms. J. ACM 44, 6 (1997).
9. Fich, F.E., Hendler, D. and Shavit, N. Linear lower bounds on real-world implementations of concurrent objects. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society, Washington, D.C., 2005.
10. Gibbons, P.B., Matias, Y. and Ramachandran, V. The queue-read queue-write PRAM model: Accounting for contention in parallel algorithms. SIAM J. Computing 28, 2 (1999), 733–769.
11. Hendler, D., Shavit, N. and Yerushalmi, L. A scalable lock-free stack algorithm. J. Parallel and Distributed Computing 70, 1 (Jan. 2010), 1–12.
12. Herlihy, M., Luchangco, V., Moir, M. and Scherer III, W.N. Software transactional memory for dynamic-sized data structures. In Proceedings of the 22nd Annual Symposium on Principles of Distributed Computing. ACM, NY, 2003, 92–101.
13. Herlihy, M. and Moss, E. Transactional memory: Architectural support for lock-free data structures. SIGARCH Comput. Archit. News 21, 2 (1993).
14. Herlihy, M. and Shavit, N. The Art of Multiprocessor Programming. Morgan Kaufmann, San Mateo, CA.
15. Herlihy, M. and Wing, J. Linearizability: A correctness condition for concurrent objects. ACM Trans. Programming Languages and Systems 12, 3 (July 1990).
16. Moir, M., Nussbaum, D., Shalev, O. and Shavit, N. Using elimination to implement scalable and lock-free FIFO queues. In Proceedings of the 17th Annual ACM Symposium on Parallelism in Algorithms and Architectures. ACM Press, NY, 2005, 253–262.
17. Moir, M. and Shavit, N. Concurrent data structures. Handbook of Data Structures and Applications. D. Metha and S. Sahni, eds. Chapman and Hall/CRC Press, 2007, 47-14–47-30.
18. Scherer III, W.N., Lea, D. and Scott, M.L. Scalable synchronous queues. In Proceedings of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM Press, NY, 2006, 147–156.
19. Scherer III, W.N. and Scott, M.L. Advanced contention management for dynamic software transactional memory. In Proceedings of the 24th Annual ACM Symposium on Principles of Distributed Computing. ACM, NY, 2005, 240–248.
20. Shavit, N. and Touitou, D. Elimination trees and the construction of pools and stacks. Theory of Computing Systems 30 (1997), 645–670.
21. Shavit, N. and Touitou, D. Software transactional memory. Distributed Computing 10, 2 (Feb. 1997).
22. Shavit, N. and Zemach, A. Diffracting trees. ACM Transactions on Computer Systems 14, 4 (1996).
23. Treiber, R.K. Systems Programming: Coping with Parallelism. Technical Report RJ 5118 (Apr. 1986). IBM Almaden Research Center, San Jose, CA.
Nir Shavit is a professor of computer science at Tel Aviv University and a member of the Scalable Synchronization Group at Oracle Labs. He is a recipient of the 2004 ACM/EATCS Gödel Prize.
© 2011 ACM 0001-0782/11/0300 $10.00