(a) For 10% updates, all methods drop in performance at
the NUMA node boundaries due to the cross-node
overheads; but NR drops little, making it the best
after one NUMA node. At max threads, NR is better
than LF, FC+, FC, RWL, SL by 1.7x, 6x, 7x, 27x, 41x.
Checking the CPU performance counters, NR had the
fewest L3 cache misses and L3 cache misses served
from remote caches, indicating lower cross-node
traffic.
(b) For 100% updates, LF loses its advantage due to
higher operation contention: even within a NUMA
node, NR is close to LF. After one node, NR is best as
before. At max threads, NR is better than LF, FC+, FC,
SL, RWL by 2.4x, 2.5x, 3.3x, 8x, 9.4x.
(c) In some methods, one thread outperforms many
threads, but not when there is work outside the data
structure, as in many real applications. In such applications, we need more threads to scale the application and we want the shared data structure to not
become a bottleneck.
(d) Node Replication remains the best method even as
we vary the amount of external work e and cache pollution. With e=512, NR is better than FC+, LF, FC, SL,
RWL by 1.7x, 1.8x, 2.8x, 12.6x, 16.9x.
(e) The cost of NR is that it consumes more memory,
namely, 148MB of memory at 112 threads ( 4.4x the
other methods): 12MB for the log and 34MB for each
of the four replicas. Technically, NR has another cost:
it executes an operation many times, one per replica.
However, this cost is relatively small as NR makes up
for it with better overall performance.
Results for dictionary. For the dictionary, we see the following results (see Figure 6). When there are updates, performance depends on the level of contention. With low
contention (uniform keys), LF outperforms other methods
(it is off the charts): at maximum threads, it is 7x and 14x better than NR for 10% and 100% updates, respectively. This is
due to the parallelism of the skip list unhindered by contention. Excluding LF, NR outperforms the other methods (with
100% updates, it does so after threads grow beyond a node).
However, with high contention (zipf keys), LF loses its
benefit, becoming the worst method for 100% updates.
There is a high probability of collisions in the vicinity of the
hot keys and the skip list starts to suffer from many failed
CASs: with uniform keys, the skip list has ≈300K failed CASs,
but with the zipf keys this number increases to >7M. NR is
the best method after 8 threads. Contention in the data
structure does not disrupt the NR log. On the contrary, data
structure contention improves cache locality with NR. With
maximum threads and 10% updates, NR is better than LF,
FC+, FC, RWL, SL by 3.1x, 4.0x, 6.8x, 16x, 30x. With 100%
updates, NR is better by 2.8x, 1.8x, 2.4x, 5.7x, 4.3x.
4. 2. Redis
We now consider the data structures of the Redis server, made
concurrent using various black-box methods, including NR.
We evaluate the sorted set data structure in Redis, which
sorts items based on a score. In Redis, sorted sets are
Between operations the benchmark optionally does work
by writing e random locations external to the data structure.
This work causes cache pollution and reduces the arrival
rate of operations. We first populate the data structure with
200,000 items, and then measure the performance of the
methods for various workload mixes. In each experiment,
we fix a method, a ratio of update-to-read operations, an
external work amount e, and a number of threads.
Results for priority queue. For the priority queue, we see
the following results (see Figure 5).
Figure 5. Performance of priority queue made concurrent using
different methods. Vertical lines show the boundaries between
NUMA nodes.
0
10
20
30
40
50
60
0 10 20 30 40 50 60 70 80 90 100 110
ops
/
us
threads
NR
LF
FC+
FC
RWL
SL
(a) 10% update rate, e=0
0
2
4
6
8
10
12
0 10 20 30 40 50 60 70 80 90 100 110
op
s/
u
s
threads
NR
LF
FC+
FC
RWL
SL
(b) 100% update rate, e=0
0
1
2
3
4
5
6
7
0 10 20 30 40 50 60 70 80 90 100 110
op
s/
u
s
threads
NR
FC+
LF
FC
RWL
SL
(c) 100% update rate, e=512
0.0
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
1 2 4 8 16 32 64 128 256 512
o
ps
/
us
Work
NR
FC+
LF
FC
RWL
SL
(d) 100% update rate, max threads
(e) NR others memory at max threads (MB) 148 34