thread happens to retry its dequeue operation first once data
becomes available. Further, each invocation of the totalized
method introduces performance-degrading contention for
memory–interconnect bandwidth.
As an alternative, suppose we could register a request for a
hand-off partner. Inserting this reservation could be done in a
nonblocking manner, and checking to see whether a partner
has arrived to fulfill our reservation could consist of reading a
Boolean flag in the request data structure. A dual data structure16, 19 takes precisely this approach: Objects may contain
both data and reservations. We divide partial methods into
separate, first-class request and follow-up operations, each of
which has its own invocation and response. A total queue, for
example, would provide dequeue_request and dequeue_
followup methods (Listing 2). By analogy with Lamport’s
bakery algorithm, 10 the request operation returns a unique
ticket that represents the reservation and is then passed as an
argument to the follow-up method. The follow-up, for its part,
returns either the desired result (if one is matched to the ticket)
or, if the method’s precondition has not yet been satisfied, an
error indication.
The key difference between a dual data structure and
a “totalized” partial method is that linearization of the
p_request call allows the dual data structure to determine the fulfillment order for pending requests. In addition, unsuccessful follow-ups, unlike unsuccessful calls
to totalized methods, are readily designed to avoid bus or
memory contention. For programmer convenience, we provide demand methods, which wait until they can return successfully. Our implementations use both busy-wait spinning
and scheduler-based suspension to effect waiting in threads
whose preconditions are not met.
When reasoning about progress, we must deal with the fact
that a partial method may wait for an arbitrary amount of time
(perform an arbitrary number of unsuccessful follow-ups)
before its precondition is satisfied. Clearly it is desirable that
requests and follow-ups be nonblocking. In practice, good
system performance will also typically require that unsuccessful follow-ups not interfere with other threads’ progress. We
define a data structure as contention-free if none of its follow-up
operations, in any execution, performs more than a constant
number of remote memory accesses across all unsuccessful
invocations with the same request ticket. On a machine with
an invalidation-based cache coherence protocol, a read of
Listing 2: combined operations: dequeue pseudocode (enqueue is
symmetric).
datum dequeue(SynchronousQueue Q) {
reservation r = Q.dequeue_reserve();
do {
datum d = Q.dequeue_followup(r);
if (failed != d) return d;
/ else delay -- spinning and/or scheduler-based /
**
while (!timed_out());
if ( Q.dequeue_abort(r)) return failed;
return Q.dequeue_followup(r);
}
location o by thread t is said to be remote if o has been written
by some thread other than t since t last accessed it; a write by
t is remote if o has been accessed by some thread other than t
since t last wrote it. On a machine that cannot cache remote
locations, an access is remote if it refers to memory allocated
on another node. Compared to the local-spin property, 13 contention freedom allows operations to block in ways other than
busy-wait spinning; in particular, it allows other actions to be
performed while waiting for a request to be satisfied.
3. aLgoRithm DescRiPtions
In this section we discuss various implementations of synchronous queues. We start with classic algorithms used
extensively in production software, then we review newer
implementations that improve upon them. Finally, we
describe our new algorithms.
3. 1. classic synchronous queues
Perhaps the simplest implementation of synchronous queues
is the naive monitor-based algorithm that appears in Listing 3.
In this implementation, a single monitor serializes access to
a single item and to a putting flag that indicates whether a
producer has currently supplied data. Producers wait for the
flag to be clear (lines 15–16), set the flag ( 17), insert an item
( 18), and then wait until a consumer takes the data ( 20–21).
Consumers await the presence of an item (05–06), take it (07),
and mark it as taken (08) before returning. At each point where
their actions might potentially unblock another thread, producer and consumer threads awaken all possible candidates
(09, 20, 24). Unfortunately, this approach results in a number
of wake-ups quadratic in the number of waiting producer and
consumer threads; coupled with the high cost of blocking or
Listing 3: naive synchronous queue.
00 public class NaiveSQ<E> {
01 boolean putting = false;
02 E item = null;
03
04
05
06
07
08
09
10
11
12
13 public synchronized void put (E e) {
14 if (e == null) return;
15 while (putting)
16 wait();
17 putting = true;
18 item = e;
19 notifyAll();
20 while (item != null)
21 wait();
22 putting = false;
23 notifyAll();
24 }
25 }
public synchronized E take() {
while (item == null)
wait();
E e = item;
item = null;
notifyAll();
return e;
}