3.2. Basic idea
Node Replication replicates the data structure on each
NUMA node, so that threads can execute operations on a
replica that is local to their node. Replication brings two
benefits. First, an operation can access the data structure in memory that is local to the node. Second, operations can
execute concurrently across nodes on different replicas.
Replication, however, raises the question of how threads coordinate access to the replicas and keep them in sync.
For efficiency, NR uses different mechanisms to coordinate threads within nodes and across nodes. At the highest
level, NR leverages the fact that coordination within a node
is cheaper than across nodes.
Within each node, NR uses flat combining (a technique
from concurrent computing5). Flat combining batches operations from multiple threads and then executes the batch
using a single thread, called the combiner. The combiner is
analogous to a leader in distributed systems. In NR, we
batch operations from threads in the same node, using one
combiner per node. The combiner of a node checks whether threads within the node have outstanding update operations and then executes all such operations on behalf of those threads. Which thread is the combiner? The choice is made dynamically among the threads of a node that have outstanding operations, and it changes over time: a combiner abdicates once it has executed the outstanding updates, up to a maximum number of operations.
Batching can gather many operations, because there are
many threads per node (e.g., 28 in our machine). Batching in
NR is advantageous because it localizes synchronization
within a node.
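To make the mechanism concrete, the following minimal C++ sketch shows one way per-node flat combining could be structured. It is not NR's actual code: the slot layout, the use of a try-lock to elect the combiner, and all names are illustrative simplifications, and the cap on operations per combining pass is omitted.

#include <atomic>
#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

struct Slot {
    std::atomic<bool> pending{false};   // an operation has been published but not yet executed
    std::function<int()> op;            // the outstanding operation
    int result = 0;                     // filled in by whichever thread acts as combiner
};

class NodeCombiner {
    std::mutex combiner_lock;           // at most one combiner per node at a time
    std::vector<Slot> slots;            // one slot per thread on this NUMA node
public:
    explicit NodeCombiner(std::size_t threads_per_node) : slots(threads_per_node) {}

    // Called by the thread with node-local index `tid`.
    int execute(std::size_t tid, std::function<int()> op) {
        Slot& mine = slots[tid];
        mine.op = std::move(op);
        mine.pending.store(true, std::memory_order_release);    // publish the operation

        while (mine.pending.load(std::memory_order_acquire)) {
            if (combiner_lock.try_lock()) {                      // become the combiner
                for (Slot& s : slots) {                          // execute the whole batch
                    if (s.pending.load(std::memory_order_acquire)) {
                        s.result = s.op();
                        s.pending.store(false, std::memory_order_release);
                    }
                }
                combiner_lock.unlock();                          // abdicate after the pass
            }
        }
        return mine.result;
    }
};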
Across nodes, threads coordinate through a shared log (a
technique from distributed systems1). The combiner of each
node reserves entries in the log, writes the outstanding
update operations to the log, brings the local replica up to date by replaying the log if necessary, and executes the local
outstanding update operations.
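The following sketch, again illustrative rather than NR's actual code, shows how a combiner might reserve log entries with a CAS on the tail, write its batch, and replay the log on its local replica. The log here is a fixed-size array; recycling of old entries (the real log is circular) is ignored.

#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

struct LogEntry {
    int op = 0;                          // operation code
    long arg = 0;                        // operation argument
    std::atomic<bool> valid{false};      // set once the entry's contents have been written
};

constexpr std::size_t LOG_SIZE = 1 << 20;
LogEntry shared_log[LOG_SIZE];           // one log shared by all nodes
std::atomic<std::size_t> log_tail{0};    // index of the next free entry

struct Replica {
    std::size_t applied = 0;             // log index up to which this replica has been replayed
    void apply(int op, long arg) {
        (void)op; (void)arg;             // here the single-threaded data structure would run the operation
    }
};

// Executed by the combiner of one node; only that combiner touches `local`.
void combine(Replica& local, const std::vector<std::pair<int, long>>& batch) {
    // Reserve a consecutive range of entries with a CAS on the log tail.
    std::size_t start = log_tail.load(std::memory_order_relaxed);
    std::size_t end;
    do {
        end = start + batch.size();
    } while (!log_tail.compare_exchange_weak(start, end, std::memory_order_acq_rel));

    // Write the outstanding update operations into the reserved entries.
    for (std::size_t i = 0; i < batch.size(); ++i) {
        LogEntry& e = shared_log[start + i];
        e.op = batch[i].first;
        e.arg = batch[i].second;
        e.valid.store(true, std::memory_order_release);
    }

    // Bring the local replica up to date, which also executes our own batch.
    while (local.applied < end) {
        LogEntry& e = shared_log[local.applied];
        while (!e.valid.load(std::memory_order_acquire)) { }    // another combiner is still writing
        local.apply(e.op, e.arg);
        ++local.applied;
    }
}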
Node Replication applies an optimization to read-only
operations (operations that do not change the state of the
data structure). Such operations execute without going through the log, by reading the local replica directly. To guarantee consistency (linearizability8), the operation must ensure that the local replica is fresh: the log must be replayed at least up to the last operation that completed before the read started.
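A sketch of this read-only path, reusing the shared_log and Replica declarations from the sketch above: it snapshots the log tail when the read starts and replays the local replica at least up to that point before reading it. Synchronization between readers and the node's combiner on the same replica (NR uses a readers-writer lock per replica) is omitted here for brevity.

long read_only(Replica& local, long (*read_op)(const Replica&)) {
    // Every update that completed before this read started occupies an index below `needed`.
    std::size_t needed = log_tail.load(std::memory_order_acquire);

    // The local replica must be at least this fresh before it may be read.
    while (local.applied < needed) {
        LogEntry& e = shared_log[local.applied];
        while (!e.valid.load(std::memory_order_acquire)) { }
        local.apply(e.op, e.arg);
        ++local.applied;
    }
    return read_op(local);   // read the local replica directly, without appending to the log
}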
We have considered an additional optimization, which
dedicates a thread to run the combiner for each node; this
thread replays the log proactively. This optimization is sensible for systems that have many threads per node, which is
an ongoing trend in processor architecture. However, we
have not employed this optimization in the results we present here.
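For completeness, the replay half of this optimization could look like the sketch below (reusing the declarations above): a dedicated per-node thread keeps the replica fresh in the background. This is illustrative only and, as noted, the optimization is not used in our results.

#include <thread>

void proactive_replayer(Replica& local, std::atomic<bool>& stop) {
    while (!stop.load(std::memory_order_acquire)) {
        std::size_t tail = log_tail.load(std::memory_order_acquire);
        while (local.applied < tail) {                            // replay any new entries
            LogEntry& e = shared_log[local.applied];
            while (!e.valid.load(std::memory_order_acquire)) { }
            local.apply(e.op, e.arg);
            ++local.applied;
        }
        std::this_thread::yield();                                // nothing new; poll again later
    }
}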
The techniques above provide a number of benefits:
• Reduce Cross-Node Synchronization and Contention: NR
appends to the log without acquiring locks; instead, it
uses the atomic Compare-And-Swap (CAS) instruction
on the log tail to reserve new entries in the log. The CAS
instruction incurs little cross-node synchronization
3. NODE REPLICATION (NR)
Node Replication is a NUMA-aware algorithm for concurrent data structures. Unlike traditional algorithms, which
target a specific data structure, NR implements all data
structures at once. Furthermore, NR is designed to work well
under operation contention. Specifically, under update-heavy contended workloads, the performance of some algorithms drops as we add more cores; in contrast, NR can avoid these drops, so that the parallelizable parts of the application can
benefit from more cores without being hindered by the data
structures. NR cannot always outperform specialized data
structures with tailored optimizations, but it can be competitive in a broad class of workloads.
While NR can provide any concurrent data structure, it
does not automatically convert entire single-threaded applications to multiple threads. Applications have a broad interface, unlike data structures, so they are less amenable to
black-box methods.
3.1. API
To work with an arbitrary data structure, NR expects a single-threaded implementation of the data structure provided as
four generic methods:
Create() → ptr
Execute(ptr, op, args) → result
IsReadOnly(ptr, op) → Boolean
Destroy()
The Create method creates an instance of the data structure, returning its pointer. The Execute method takes a
data structure pointer, an operation, and its arguments; it
executes the operation on the data structure, returning
the result. The method must produce side effects only on
the data structure and it must not block. Operation results
must be deterministic, but we allow nondeterminism
inside the operation execution and the data structure
(e.g., levels of nodes in a skip list). Similarly, operations
can use randomization internally, but results should not
be random (results can be pseudorandom with a fixed initial seed). The IsReadOnly method indicates whether an operation is read-only; we use this information for read-only
optimizations in NR. The Destroy method deallocates the
replicas and the log. NR provides a new method
ExecuteConcurrent that can be called concurrently from
different threads.
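Rendered as C++ declarations, the interface could look as follows. The concrete Op, Args, and Result types are assumptions chosen for illustration, and the exact signature of ExecuteConcurrent is not specified here, so the one shown is likewise an assumption.

using Op = int;
struct Args { long key; long value; };
using Result = long;

// Provided by the developer: a single-threaded (not thread-safe) implementation.
void*  Create();                              // return a pointer to a new, empty instance
Result Execute(void* ds, Op op, Args args);   // run one operation on the instance, return its result
bool   IsReadOnly(void* ds, Op op);           // true if `op` does not modify the data structure
void   Destroy();                             // deallocate the replicas and the log

// Provided by NR on top of the methods above: safe to call concurrently from different threads.
Result ExecuteConcurrent(Op op, Args args);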
For example, to implement a hash table, a developer
provides a Create method that creates an empty hash
table; an Execute method that recognizes three op parameters (insert, lookup, remove) with the args parameter
being a key-value pair or a key; and an IsReadOnly method
that returns true for op=lookup and false otherwise. The
Execute method implements the three operations of a
hash table in a single-threaded setting (not thread-safe).
NR then provides a concurrent (thread-safe) implementation of the hash table via its ExecuteConcurrent method. For convenience, the developer may
subsequently write three simple wrappers (insert, lookup,
remove) that invoke ExecuteConcurrent with the appropriate op parameter.
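A minimal sketch of the developer's side of this hash table, reusing the declarations from the previous listing; the op codes, the use of std::unordered_map, and the -1 "not found" sentinel are illustrative choices rather than the paper's code.

#include <unordered_map>

enum HashOp : Op { INSERT = 0, LOOKUP = 1, REMOVE = 2 };

void* Create() {                                    // an empty hash table
    return new std::unordered_map<long, long>();
}

Result Execute(void* ds, Op op, Args args) {        // single-threaded, not thread-safe
    auto* table = static_cast<std::unordered_map<long, long>*>(ds);
    switch (op) {
        case INSERT: (*table)[args.key] = args.value; return 0;
        case REMOVE: return table->erase(args.key);             // number of keys removed
        case LOOKUP: {
            auto it = table->find(args.key);
            return it == table->end() ? -1 : it->second;        // -1 signals "not found"
        }
    }
    return -1;
}

bool IsReadOnly(void*, Op op) { return op == LOOKUP; }

// Thin wrappers over the concurrent method that NR provides.
Result insert(long key, long value) { return ExecuteConcurrent(INSERT, Args{key, value}); }
Result lookup(long key)             { return ExecuteConcurrent(LOOKUP, Args{key, 0}); }
Result remove(long key)             { return ExecuteConcurrent(REMOVE, Args{key, 0}); }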