support that. LF requires a mechanism to garbage collect
memory, such as hazard pointers [13] or epoch reclamation [4];
these mechanisms can reduce performance by 5x. We do not
use these mechanisms, so the reported numbers for LF are
better than they would be in practice.
Summary of results. On the real data structures (Section 4.1),
we find that NR outperforms the other methods at many
threads under high operation contention, with the exception
of NUMA-aware algorithms tailored to the data structure. The
other methods, including lock-free algorithms, tend to lose
significant performance beyond one NUMA node. We also
find that NR consumes more memory than the other methods.
On a real application's data structures (Section 4.2), NR
outperforms the alternatives by 2.6x–14x on workloads with
10% updates, and by 1.1x–4.4x with 100% updates.
Testbed. We use a Dell server with 512GB RAM and 56
cores on four Intel Xeon E7-4850v3 processors at 2.2GHz.
Each processor is a NUMA node with 14 cores, a shared
35MB L3 cache, and private L2/L1 caches of 256KB/64KB
per core. Each core has 2 hyper-threads, for a total of 112
hyper-threads. Cache lines are 64 bytes.
4.1. Real data structures
These experiments use two real data structures: a skip list
priority queue and a skip list dictionary. (Additional results
using two other data structures are given in Calciu, Sen
et al. [3], but they are qualitatively similar to the ones
we present here.) A priority queue provides two update operations
and one read-only operation: insert(i) inserts element i,
deleteMin() removes and returns the smallest element, and
findMin() returns the smallest element without removing it.
We implement these operations using a skip list that orders
the elements and keeps the minimum at the beginning of the
list. A dictionary provides operations to insert, look up, and
remove elements; we use a skip list to implement the dictionary.
NR, FC, and FC+ internally use the same single-threaded
skip list implementation [15]; FC uses the flat
combining implementation from Hendler et al. [5]. For LF,
we use the skip-list-based priority queue and skip list dictionary
from Herlihy and Shavit [7].
We use the benchmark from the flat combining paper [5],
which runs a mix of generic add, remove, and read operations.
We map these operations to each data structure as follows:

generic    priority queue    dictionary
add        insert(rnd, v)    insert(rnd, v)
remove     deleteMin()       delete(rnd)
read       findMin()         lookup(rnd)
Here, rnd indicates a key chosen at random and v is an arbitrary
value. We use the same ratio of add and remove operations to
keep the data structure size approximately constant over
time, and the results aggregate these two operations as
"update operations." We consider two ratios of update-to-read
operations: 10% and 100% updates (90% and 0% reads). For
the priority queue, we choose random keys from a uniform
distribution. For the dictionary, we vary the operation contention
by drawing keys from two distributions: uniform (low
contention) and zipf with parameter 1.5 (high contention).
Coupled data structures. In some applications, data
structures are read or updated together. For example, Redis
implements sorted sets using a hash table and a skip list,
which are updated atomically by each request. NR can provide
these atomic updates by treating the coupled data structures
as a single larger data structure with combined operations.
Fake update operations. Some update operations become
read-only during execution (e.g., a remove of a nonexistent
key). Black-box methods must know whether an operation is
read-only at invocation time. If updates often become read-only,
one can first attempt to execute them as read-only operations
and, if that is not possible, re-execute them as updates (e.g.,
remove(key) first tries to look up the key). This requires a
simple wrapper around remove(). We did not implement this
optimization.
We have evaluated NR to answer five broad questions: How
does NR scale with the number of cores for different data
structures and workloads? How does NR compare with other
concurrent data structures? What is the benefit of NR to real
applications? How does NR behave on different NUMA
architectures? What are the benefits of NR's techniques, and
what are its costs? Here, we highlight the most representative
results and focus on the first three questions; the complete
set of results is available in Calciu, Sen et al. [3]. We
report on two classes of experiments:
• Real Data Structures (Section 4.1): We run micro-benchmarks
on real data structures: a skip list priority
queue and a skip list dictionary.
• Real Application (Section 4.2): We run macro-benchmarks
on the data structures of a real application: the Redis
storage server, modified to use many threads.
We compare NR against the other methods (baselines) shown
in Figure 4. Single Lock (SL) and Readers-Writer Lock (RWL)
are methods often used in practice; they protect the entire
data structure with a single spinlock or a single readers-writer
lock. For RWL, we use the same readers-writer lock as NR
(Section 3.8). FC uses flat combining to implement the entire
data structure. FC can be used as a black-box method, but it
can also apply data-structure-specific optimizations that combine
operations for faster execution (hence its name); we use these
optimizations whenever possible. FC+ improves on FC by
using a readers-writer lock to execute read-only operations
more efficiently. Lock-Free Baseline (LF) is a lock-free
algorithm specialized for a specific data structure; this baseline
is available only for some data structures. In the real
application (Redis), threads must atomically update multiple
data structures, but existing lock-free algorithms do not
Figure 4. Other methods for comparison (baselines).
SL One big lock (spinlock)
RWL One big readers-writer lock
FC Flat combining
FC+ Flat combining with readers-writer lock
LF Lock-free algorithm