Doi: 10.1145/1506409.1506431
Scalable Synchronous Queues
By William N. Scherer III, Doug Lea, and Michael L. Scott
abstract
In a thread-safe concurrent queue, consumers typically
wait for producers to make data available. In a synchronous
queue, producers similarly wait for consumers to take the
data. We present two new nonblocking, contention-free synchronous queues that achieve high performance through a
form of dualism: The underlying data structure may hold
both data and, symmetrically, requests.
We present performance results on 16-processor SPARC
and 4-processor Opteron machines. We compare our algorithms to commonly used alternatives from the literature
and from the Java SE 5.0 class java.util.concurrent
.SynchronousQueue both directly in synthetic
microbenchmarks and indirectly as the core of Java’s
ThreadPoolExecutor mechanism. Our new algorithms
consistently outperform the Java SE 5.0 SynchronousQueue
by factors of three in unfair mode and 14 in fair
mode; this translates to factors of two and ten for the
ThreadPoolExecutor. Our synchronous queues have been
adopted for inclusion in Java 6.
1. intRoDuction
Mechanisms to transfer data between threads are among
the most fundamental building blocks of concurrent systems. Shared memory transfers are typically effected via
a concurrent data structure that may be known variously as a
buffer, a channel, or a concurrent queue. This structure serves
to “pair up” producers and consumers. It can also serve to
smooth out fluctuations in their relative rates of progress by
buffering unconsumed data. This buffering, in systems that
provide it, is naturally asymmetric: A consumer that tries to
take data from an empty concurrent queue will wait for a
producer to perform a matching put operation; however, a
producer need not wait to perform a put unless space has
run out. That is, producers can “run ahead” of consumers,
but consumers cannot “run ahead” of producers.
A synchronous queue provides the “pairing up” function
without the buffering; it is entirely symmetric: Producers
and consumers wait for one another, “shake hands,” and
leave in pairs. For decades, synchronous queues have played
a prominent role in both the theory and practice of concurrent programming. They constitute the central synchronization primitive of Hoare’s CSP8 and of languages derived from
it, and are closely related to the rendezvous of Ada. They are
also widely used in message-passing software and in stream-style “hand-off” algorithms. 2, Chap. 8 (In this paper we focus on
synchronous queues within a multithreaded program, not
across address spaces or distributed nodes.)
Unfortunately, design-level tractability of synchronous
queues has often come at the price of poor performance.
“Textbook” algorithms for put and take may repeatedly suffer from contention (slowdown due to conflicts
with other threads for access to a cache line) and/or
blocking (loops or scheduling operations that wait for activity in
another thread). Listing 1, for example, shows one of the
most commonly used implementations, due to Hanson. 3
It employs three separate semaphores, each of which is a
potential source of contention and (in acquire operations)
blocking.a
The synchronization burden of algorithms like Hanson’s
is especially significant on modern multicore and multiprocessor machines, where the OS scheduler may take
thousands of cycles to block or unblock threads. Even an
uncontended semaphore operation usually requires special
read-modify-write or memory barrier (fence) instructions,
each of which can take tens of cycles.b
Listing 1: hanson’s synchronous queue. semaphore sync indicates
whether item is valid (initially, no); send holds 1 minus the number
of pending puts; recv holds 0 minus the number of pending takes.
00 public class HansonSQ<E> {
01 E item = null;
02 Semaphore sync = new Semaphore(0);
03 Semaphore send = new Semaphore( 1);
04 Semaphore recv = new Semaphore(0);
05
06
07
08
09
10
11
12
13
14 public void put(E x) {
15 send.acquire();
16 item = x;
17 recv.release();
18 sync.acquire();
19 }
20 }
Public E take() {
recv.acquire();
E x = item;
sync.release();
send.release();
return x;
}
a
Semaphores are the original mechanism for scheduler-based synchronization (they date from the mid-1960s). Each semaphore contains a counter and
a list of waiting threads. An acquire operation decrements the counter and
then waits for it to be nonnegative. A release operation increments the
counter and unblocks a waiting thread if the result is nonpositive. In effect, a
semaphore functions as a non-synchronous concurrent queue in which the
transferred data is null.
b
Read-modify-write instructions (e.g., compare_and_swap [CAS]) facilitate constructing concurrent algorithms via atomic memory updates.
Fences enforce ordering constraints on memory operations.
A previous version of this paper was published in
Proceedings of the 11th ACM Symposium on Principles and Practice
of Parallel Programming, Mar. 2006.