systems either implement a simple service (e.g., get/put) that can partition requests across cores;10 or they develop sophisticated concurrent data structures from scratch to support more complex operations,11 which requires expertise in concurrent algorithms. This is where our black-box approach comes in handy: NR provides these concurrent data structures automatically from single-threaded implementations. For Redis, we were able to convert a single-threaded sorted set into a concurrent one with just 20 new lines of wrapper code. The result outperforms data structures obtained from other methods by up to 14x.
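To illustrate the shape of such wrapper code, here is a minimal sketch in C++. The BlackBox engine below is our stand-in, not NR's actual interface: it serializes operations with a single mutex, whereas NR maintains per-node replicas fed by a shared log. Only the wrapping pattern is the point.

// Stand-in for the NR engine: the execute/execute_ro interface is our
// assumption for illustration; real NR replicates the structure per node.
#include <mutex>
#include <set>
#include <string>

template <typename T>
class BlackBox {
    T inner_;                 // the single-threaded data structure
    std::mutex m_;            // NR would use per-node replicas and a log
public:
    template <typename F>
    auto execute(F op) {      // apply an update operation atomically
        std::lock_guard<std::mutex> g(m_);
        return op(inner_);
    }
    template <typename F>
    auto execute_ro(F op) {   // apply a read-only operation
        std::lock_guard<std::mutex> g(m_);
        return op(static_cast<const T&>(inner_));
    }
};

// Roughly the flavor of the ~20 lines of wrapper code: a single-threaded
// sorted set (std::set) becomes a concurrent one.
class ConcurrentSortedSet {
    BlackBox<std::set<std::string>> bb_;
public:
    void add(const std::string& v) {
        bb_.execute([&](std::set<std::string>& s) { s.insert(v); });
    }
    bool contains(const std::string& v) {
        return bb_.execute_ro([&](const std::set<std::string>& s) {
            return s.count(v) > 0;
        });
    }
};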
Although NR is powerful, easy to use, and efficient, it
has three limitations. First, it incurs space overhead due
to replication: it consumes n times more memory, where
n is the number of nodes. Thus, NR is best suited for
smaller structures that occupy just a fraction of the available memory (e.g., up to hundreds of MB). Second, NR is
blocking: a thread that stops executing operations can
block the progress of other threads; in practice, we did
not find that to be a problem as long as threads keep executing operations on the data structure. Finding a nonblocking variant of NR is an interesting research
direction. Finally, NR may be outperformed by non-black-box algorithms crafted for a given data structure: for example, a lock-free skip list running on low-contention workloads, or a NUMA-aware stack.2
Thus, the generality of black-box methods has some cost.
However, in some cases NR outperforms even the crafted
algorithms; we observe this for the same lock-free skip
list running instead on high-contention workloads.
We plan to make the source code for NR available on our project page at https://research.vmware.com/projects/nodereplication.
2. BACKGROUND
2.1. NUMA architectures
Our work is motivated by recent trends in computer architecture. To support a large number of cores, data center servers have adopted a NUMA architecture with many processor sockets or nodes (see Figure 1). Each node has many processor cores and a shared cache, while individual cores have private caches. Sharing a cache line within a node is more efficient than across nodes because the cache coherence protocol operates more efficiently within a node. Each node has some local memory, and a core can access local memory faster than memory in a remote node. A similar architecture, Non-Uniform Cache Access (NUCA), has a single shared memory, but nodes have local caches as in NUMA. Our ideas are applicable to NUCA too. NUMA is everywhere now. A high-performance Intel server might have eight processors (nodes), each with 28 cores, while a typical server might have two processors, each with 8–16 cores. AMD and Oracle have similar machines. To best use these cores, we need appropriate concurrent data structures.
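To make the local/remote asymmetry concrete, here is a short sketch using Linux's libnuma interface (our example, unrelated to this paper's system) to place a buffer on a chosen node; compile with -lnuma.

#include <numa.h>
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {          // kernel or hardware lacks NUMA
        std::fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    std::printf("nodes: %d\n", numa_max_node() + 1);

    // Pages of this buffer live on node 0: a core on node 0 reaches them
    // faster than a core on any other node.
    std::size_t len = 1 << 20;
    void* buf = numa_alloc_onnode(len, 0);
    if (buf == nullptr) return 1;
    /* ... use buf ... */
    numa_free(buf, len);
    return 0;
}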
2.2. Concurrent data structures
Concurrent data structures permit many threads to operate
on common data using a high-level interface. When a data
structure is accessed concurrently by many threads, its
semantics are typically defined by a property called linearizability,8 which provides strong consistency. Linearizability
requires that each operation appear to take effect instantly at
some point between the operation’s invocation and response.
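As a concrete illustration (ours, not from the original text), a counter built on an atomic read-modify-write is trivially linearizable: each operation takes effect at a single instant between its invocation and response.

#include <atomic>

class Counter {
    std::atomic<long> value_{0};
public:
    long increment() {
        // Linearization point: the atomic fetch_add itself.
        return value_.fetch_add(1) + 1;
    }
    long read() {
        // Linearization point: the atomic load.
        return value_.load();
    }
};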
The key challenge in designing concurrent data structures is dealing with operation contention, which occurs
when an operation often affects the output of another
operation. More precisely, given an execution, we say that
an operation affects another if the removal of the first
causes the second to return a different result. For example, a write of a new value affects a subsequent read. A
workload has operation contention if a large fraction of
operations affect a large fraction of operations occurring
soon after them. Examples include a storage system where
users read and write a popular object, a priority queue
where threads often remove the minimum element, a
stack where threads push and pop data, and a bounded
queue where threads enqueue and dequeue data. Non-examples include read-only workloads and write-only
workloads where writes do not return a result. Operation
contention is challenging because operations must
observe each other across cores.
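The sketch below (our illustration) exhibits such contention: every remove-min on a shared priority queue affects the result of the remove-mins that follow, so the threads must continually observe each other's operations.

#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

int main() {
    // Min-heap shared by all threads; the mutex stands in for any
    // concurrency-control scheme.
    std::priority_queue<int, std::vector<int>, std::greater<int>> pq;
    std::mutex m;
    for (int i = 0; i < 1000000; i++) pq.push(i);

    auto worker = [&] {
        for (;;) {
            std::lock_guard<std::mutex> g(m);
            if (pq.empty()) return;
            pq.pop();   // removing the minimum affects every later remove-min
        }
    };
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; i++) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    return 0;
}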
Much work has been devoted to designing and implementing efficient concurrent data structures; we provide a broad overview in Calciu, Sen et al.3 Unfortunately, each data structure requires its own algorithm with novel techniques, which involve considerable work from experts in the field. To get a sense of the effort involved, consider that a new concurrent data structure often merits a scientific publication for its algorithm alone.
Unfortunately, most existing concurrent data structures and techniques are for Uniform Memory Access (UMA), including some prior black-box methods.5,6,16 These algorithms are not sensitive to the asymmetry and limitations of NUMA, which hinders their performance.9 There are some recent NUMA-aware algorithms,2,12,14 but they cover few data structures. Moreover, these solutions are not applicable when applications compose data structures and wish to modify several of them with a single composed operation (e.g., remove an item from a hash table and a skip list simultaneously). This is the case in the Redis application, which we describe later in the paper.
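A sketch of why black-box methods compose naturally (our illustration, with hypothetical types): the composite of two structures is itself just another sequential structure, so one wrapped operation can update both atomically.

#include <set>
#include <string>
#include <unordered_map>

// Two structures bundled as one sequential object; the std::set stands in
// for a skip list. Wrapping Composite with a black-box engine (as in the
// earlier sketch) makes remove() a single atomic operation on both.
struct Composite {
    std::unordered_map<std::string, int> table;  // hash table
    std::set<std::string> index;                 // ordered index
    void remove(const std::string& key) {
        table.erase(key);
        index.erase(key);
    }
};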
Figure 1. NUMA architecture of a modern server in a data center.
The server has many processor sockets, herein called nodes. Each
node has many processor cores and some local memory. Nodes are
connected by an interconnect, so that cores in one node can access
the remote memory of another node, but these accesses come at a
cost. Typically, cores have local caches, and cores on a node share a last-level cache.