isz1 and isz2. But, again, there is a conflict-free implementation
based on adding a Boolean “zeroness” snapshot as well as per-thread counters. isz simply returns this snapshot. When dec
reduces a per-thread value to zero or below, it reads and sums
all per-thread values and updates the snapshot if necessary.
3.5. Discussion
The rule pushes state and history dependence to an extreme:
it makes a statement about a single history. In broadly commutative interfaces, the arguments and system states for
which a set of operations commutes often collapse into
fairly well-defined classes (e.g., file creation might commute
whenever the containing directories are different). In practice, implementations scale for whole classes of states and
arguments, not just for specific histories.
On the other hand, there can be limitations on how broadly
an implementation can scale. It is sometimes the case that a
set of operations commutes in more than one class of situation, but no single implementation can scale for all classes.
For instance, in our modified reference counter, H′1, H′2,
and H′3 all SIM-commute in H′, and we described a scalable
implementation for each situation. However, H′4 does not
SIM-commute, even though it is a union of SIM-commutative
pieces: if the two dec operations were reordered to the start of
the region, then the isz operations would have to return different values. Any reasonable counter implementation must
fail to scale in H′4, because isz must return different values
depending on whether it ran before or after the dec invocations, and this requires communication between the cores
that ran dec and isz. This can be proved using a converse of
the rule: when a history contains a non-SIM-commutative
region, no non-degenerate implementation can be scalable
in that region.6 (The non-degeneracy condition eliminates
implementations that, for example, never respond to any
invocation, or always respond with an error return value.)
In our experience, real-world interface operations rarely
demonstrate such mutually exclusive implementation choices.
For example, the POSIX implementation in Section 5 scales
quite broadly, with only a handful of cases that would require
4. DEFINING COMMUTATIVE INTERFACES
This section demonstrates more situations of interface-level
reasoning enabled by the rule, using POSIX, the standard
interface for Unix-like operating systems.
The following sections explore four general classes of
changes that make POSIX operations commute in more situations, enabling more scalable implementations.
4.1. Decompose compound operations
Many POSIX APIs combine several operations into one, limiting the combined operation’s commutativity. For example,
fork both creates a new process and snapshots the current process’s entire memory state, file descriptor state, signal mask,
and several other properties. As a result, fork fails to commute with most other operations in the same process, including memory writes, address space operations, and many file
descriptor operations. However, applications often follow fork
with exec, which undoes most of fork’s suboperations. With
3.3. Rule
We can now formally state the scalable commutativity rule.
Assume an interface specification that has a correct
implementation, and a history H = X ∥ Y exhibited by that
implementation. Whenever Y SIM-commutes in H, there exists
a correct implementation of the interface whose steps in Y are conflict-free.
Since, given reasonable workload assumptions, conflict-free
operations empirically scale on modern multicore hardware,
this implementation is scalable in Y.
Our proof of the rule constructs the scalable implementation from the correct reference implementation, and relies
on our abstract machine definition and our definition of
3.4. Example
Consider a reference counter interface with four operations:
reset(v) sets the counter to a specific value v; inc and dec
increment and decrement the counter and return its new
value; and isz returns Z if the counter value is zero and NZ
otherwise. The caller is expected to never decrement below
zero, and once the counter reaches zero, the caller should
not invoke inc.
Consider a counter history H whose region H1 consists of concurrent isz invocations and whose region H2 consists of concurrent dec invocations.
The region H1 SIM-commutes in H, so the rule tells us that a
correct implementation exists that is conflict-free for H1. In fact,
this is already true of a simple shared-counter implementation:
its isz reads the shared counter, but does not write it.
But H2 does not SIM-commute in H, so no scalable implementation is implied—and, in fact, none is possible. The
problem is that the caller can reason about order via the dec
return values. Only a degenerate implementation, such as
one that refused to respond to certain requests, could avoid
tracking this order in a non-conflict-free way.
We can make dec commute by eliminating its return value.
If we modify the specification so that inc and dec return nothing, then any region consisting exclusively of these operations
commutes in any history. In H′, a version of H under this modified specification, the region H′2, unlike H2, SIM-commutes, so there must be an implementation that is conflict-free there. Per-thread counters
give us such an implementation: each dec can modify
its local counter, while isz sums the per-thread values.
Per-thread and per-core sharding of data structures like
this is a common and long-standing pattern in scalable software.
The rule highlights at least one more opportunity in this
history. H′3 also SIM-commutes in H′. However, the per-thread
counter implementation is not conflict-free for H′3: dec3 will
write one component of the state that is read and summed by