could be spread across nodes. The log is accessed by at most
one thread per node, and it provides coordination and consistency across nodes.
A variable log Tail contains the index of the next available
entry. Each node has a replica of the data structure and a
variable local Tail indicating how far in the log the replica
has been updated. A node elects a temporary leader thread
called a combiner to write to the buffer (Section 3. 3).
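The layout just described can be sketched as a few declarations. This is a minimal sketch with assumed names (`LogEntry`, `localTail`, and so on), not NR's actual code: one shared log with a global Tail, and one replica plus a local tail per NUMA node.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of NR's layout (assumed names): a shared log with a global
// tail, and per-node replicas that each track how far into the log
// they have applied operations.
struct LogEntry {
    int op = 0;                    // compact operation description
    std::atomic<bool> full{false}; // set once the entry has been filled
};

struct Node {
    std::vector<int> replica;      // stand-in for the replicated structure
    size_t localTail = 0;          // next log index to apply locally
};

struct NR {
    std::vector<LogEntry> log;     // shared circular buffer (Section 3.4)
    std::atomic<size_t> tail{0};   // index of the next available entry
    std::vector<Node> nodes;       // one replica per NUMA node
    explicit NR(size_t logSize, size_t numNodes)
        : log(logSize), nodes(numNodes) {}
};
```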
The combiner writes many operations (a batch) to the
log at a time. To do so, it first allocates space in the log by advancing the Tail with a CAS. Because at most one combiner per node issues this CAS, its contention is limited by the number of nodes (typically 2–8). In addition, the cost of a CAS is amortized over many operations due to batching.
• Read and Write to the Log in Parallel: Combiners can
concurrently read the log to update their local replicas.
Moreover, combiners can also write to the log concurrently: after reserving new entries with a CAS, they can fill those entries in parallel.
• Read Locally in Parallel: Read-only operations in the
data structure execute against the local replica, and so
they can proceed in parallel if the replica is fresh.
Checking for freshness might fetch a cache line across
nodes, but this fetch populates the local cache and benefits many local readers. Readers execute in parallel
with combiners on different nodes, and with the local
combiner when it is filling entries in the log.
• Use Compact Representation of Shared Data: Operations
often have a shorter description than the effects they
produce, and thus communicating the operation via
the log incurs less communication across cores than
sharing the modifications to the data structure. For
example, clearing a dictionary might modify many
parts of the data structure, but we only communicate
the operation description across nodes.
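Taken together, the log protocol outlined above might look like the following. This is a minimal single-process sketch with assumed names (`reserve`, `fill`, `apply`), not NR's implementation; in particular, the real log is a circular buffer shared across NUMA nodes.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the log protocol (assumed names). One CAS reserves a whole
// batch of entries; filling an entry and publishing it via the `full`
// flag needs no further synchronization, and replicas consume the log
// up to the first unfilled entry.
struct Entry {
    int op = 0;                      // compact operation description
    std::atomic<bool> full{false};   // published with release semantics
};

std::vector<Entry> sharedLog(64);
std::atomic<size_t> logTail{0};      // index of the next available entry

// Reserve `batch` consecutive entries with a single CAS on the tail;
// the CAS cost is amortized over the whole batch.
size_t reserve(size_t batch) {
    size_t start = logTail.load(std::memory_order_relaxed);
    while (!logTail.compare_exchange_weak(start, start + batch,
                                          std::memory_order_acq_rel)) {
        // `start` was reloaded by the failed CAS; retry.
    }
    return start;                    // reserved range is [start, start+batch)
}

// Fill a reserved entry; storing `full` last makes the entry visible.
void fill(size_t idx, int op) {
    sharedLog[idx].op = op;
    sharedLog[idx].full.store(true, std::memory_order_release);
}

// Apply consecutive filled entries to a replica (here just a counter),
// stopping at the first reserved-but-empty entry; returns the new
// local tail.
size_t apply(size_t localTail, int& replica) {
    size_t end = logTail.load(std::memory_order_acquire);
    while (localTail < end &&
           sharedLog[localTail].full.load(std::memory_order_acquire)) {
        replica += sharedLog[localTail].op;
        ++localTail;
    }
    return localTail;
}
```

Note how `apply` mirrors the reader rule from Figure 3: consumption of the log stops at a reserved entry that has not yet been filled.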
A complication that must be addressed is how to recycle
the log. This must be done without much coordination, for
performance, but must also ensure that a log entry is recycled only after it has been applied at all the replicas. Roughly
speaking, NR uses a lightweight lazy mechanism that
reduces synchronization by delegating the responsibility for recycling to one of the threads.
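The recycling idea can be illustrated as follows. This is a sketch under assumed names (`recyclable_prefix`, `localTails`); NR's actual mechanism has more machinery, but the invariant is the same: entries that every replica has already applied are safe to reuse.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of lazy recycling (assumed names): entries before the minimum
// of all per-node local tails have been applied at every replica, so
// one designated thread can compute that minimum and reuse the prefix
// without coordinating with the other threads.
size_t recyclable_prefix(const std::vector<size_t>& localTails) {
    // Entries in [0, min) of the circular buffer are safe to reuse.
    return *std::min_element(localTails.begin(), localTails.end());
}
```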
In what follows, we describe these ideas in more detail.
3.3. Intra-node coordination: combining
To execute an operation, a thread posts its operation in a
reserved slot^a and tries to become the combiner by acquiring
the combiner lock. The combiner reads the slots of the
threads in the node and forms a batch B of operations to
execute. The combiner then proceeds to write the operations in B to the log, and to update the local replica with the
entries from the log.
To avoid small inefficient batches, the combiner in NR
waits if the batch size is smaller than a parameter min_batch.
Rather than waiting idly, the combiner refreshes the local
replica from the log, though it might need to refresh it again
after finally adding the batch to the log. Figure 2 depicts the
general ideas.
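The combining step might be sketched like this. The names (`slots`, `collect_batch`, `combinerLock`) are assumptions, and operation posting, the min_batch wait, and the log append are simplified away.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <optional>
#include <vector>

// Sketch of intra-node combining (assumed names). Each thread posts its
// operation in a per-thread slot; whichever thread acquires the combiner
// lock collects the pending slots into a batch B that it will write to
// the log and apply to the local replica.
constexpr size_t kThreadsPerNode = 4;
std::array<std::optional<int>, kThreadsPerNode> slots;  // pending operations
std::mutex combinerLock;

std::vector<int> collect_batch() {
    std::lock_guard<std::mutex> g(combinerLock);  // become the combiner
    std::vector<int> batch;
    for (auto& slot : slots) {
        if (slot) {                // this thread posted an operation
            batch.push_back(*slot);
            slot.reset();          // the combiner takes ownership of it
        }
    }
    // A real combiner would now reserve log entries for `batch`, fill
    // them, and update the local replica; if batch.size() < min_batch,
    // it would instead refresh the replica from the log and retry.
    return batch;
}
```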
3.4. Inter-node coordination: circular buffer
Node Replication replicates the data structure across nodes
using a log realized as a shared circular buffer that stores
update operations on the data structure. This buffer can be
allocated from the memory of one of the NUMA nodes, or it
Figure 2. NR replicates the data structure across nodes. A shared
log stores updates that are later applied to each replica. Here, there
are two nodes and hence two replicas of a tree. The replicas are not
in sync, because the right replica has incorporated more updates
from the shared log. Threads in the same node share the replica in
that node; they coordinate access to the replica using a lock and
a technique called flat combining (Section 3.3). Flat combining is
particularly efficient in UMA systems. Effectively, NR treats each
node as a separate UMA system.
Figure 3. The shared log in NR is realized as a circular buffer, shown
here as an array for simplicity. There is a global log Tail variable
that indicates the first unreserved entry in the log. Each node has a
local Tail variable that indicates the next operation in the log to be
executed on each local replica. The figure shows only one thread for
each node—the thread that is currently chosen as the combiner for
that node—but there are other threads. Thread 1’s replica executed
5 operations from the log. Thread 2’s replica executed 3 more
operations and found an “empty” reserved entry that is not yet filled.
A combiner must wait for all empty entries preceding its batch in the
log. Readers can return when they find an empty entry (Section 3.6).
^a We call slots the locations where threads post operations for the combiners; we call entries the locations in the shared log.