How to Implement Any
Concurrent Data Structure
By Irina Calciu, Siddhartha Sen, Mahesh Balakrishnan, and Marcos K. Aguilera
DOI: 10.1145/3282506
Abstract
We propose a method called Node Replication (NR) to
implement any concurrent data structure. The method
takes a single-threaded implementation of a data structure
and automatically transforms it into a concurrent (
thread-safe) implementation. The result is designed to work well
with and harness the power of modern servers, which are
complex Non-Uniform Memory Access (NUMA) machines
with many processor sockets and subtle performance characteristics. Using NR requires no expertise in concurrent
data structure design, and the result is free of concurrency
bugs. NR represents a paradigm shift of how concurrent
algorithms are developed: rather than designing for a data
structure, we design for the architecture.
1. INTRODUCTION
Concurrent data structures are everywhere in the software
stack, from the kernel (e.g., priority queues for scheduling), to
application libraries (e.g., tries for memory allocation), to applications (e.g., balanced trees for indexing). These data structures, when inefficient, can cripple the performance of the
system.
Due to recent architectural changes, high-performance
servers today are Non-Uniform Memory Access (NUMA)
machines. Such machines have multiple processor sockets,
herein called nodes, each with some local cache and memory.
Although cores in a node can access the memory in other
nodes, it is faster to access local memory and to share cache
lines within a node than across nodes. To fully harness the
power of NUMA, data structures must take this asymmetry
into consideration: they must be NUMA-aware to reduce cross-node communication and minimize accesses to remote
caches and memory.
Unfortunately, there are few NUMA-aware concurrent
data structures, and designing new ones is hard. The key
challenge is how to deal with contention on the data structure, where simple techniques limit concurrency and scale
poorly, while efficient techniques are complex, error-prone,
and rigid (Section 2).
We propose a new technique, called Node Replication
(NR), to obtain NUMA-aware data structures, by automatically transforming any single-threaded data structure into a
corresponding concurrent (thread-safe) NUMA-aware structure. NR is general and black-box: it requires no inner knowledge of the structure and no expertise in NUMA software
design. The resulting concurrent structure provides strong
consistency in the form of linearizability. 8
Node Replication combines ideas from two disciplines:
distributed systems and shared-memory algorithms. NR
maintains per-node replicas of an arbitrary data structure and
The original version of this paper, titled “Black-box
Concurrent Data Structures for NUMA Architectures,”
was published in ASPLOS 2017. For more information,
please check https://research.vmware.com/projects/
nodereplication.
synchronizes them via a shared log (an idea from distributed
systems1). The shared log is realized by a hierarchical,
NUMA-aware design that uses flat combining5 within nodes
and lock-free appending across nodes (ideas from shared-memory algorithms). With this interdisciplinary approach,
only a handful of threads need to synchronize across nodes,
so most synchronization occurs efficiently within each node.
Node Replication represents a paradigm shift of how concurrent algorithms are designed. Currently, each new concurrent data structure requires its own design, and our
community of experts has spent decades writing papers and
developing algorithms for all kinds of structures (skip lists,
queues, priority queues, and hash tables, etc). However, computer architectures are now fluid with the introduction of new
memory features (non-volatility, in-memory processing),
new memory models (NUMA, non-coherent caches), new
processing elements (GPU, FPGA, TPU), new processor features (transactional memory, SGX), and more. Unfortunately,
the old algorithms do not work well in the new architectures,
so the community has to redesign the algorithms for each
new architecture.
Node Replication shows there is a better way to design
algorithms, by using a black-box approach that is independent of the data structure. Thus, rather than designing for a
data structure, we design for the architecture. This approach
significantly reduces the design effort to a few architectures,
instead of the product of the number of architectures and
the number of data structures. While we demonstrate the
black-box approach for NUMA here, we envision its general
applicability to other new architectures as they emerge.
Node Replication cannot always outperform algorithms
that specialize for a single data structure and architecture.
However, perhaps surprisingly, NR performs well in many
cases, particularly when there is contention, where an operation often affects the output of other operations. On a contended priority queue and a dictionary, NR can outperform
lock-free algorithms by up to 2.4x and 3.1x with 112 threads;
and NR can outperform a lock-based solution by 8x and 30x
on the same data structures. To demonstrate the benefits to
a real application, we apply NR to the data structures of the
Redis storage server. Many systems have shown how servers
can scale the handling of network requests and minimize
Remote Procedure Calls (RPC) bottlenecks. 10 There is less
research on how to scale the servicing of the requests. These