distribution to generate client load. We modified the benchmark to support hybrid read/write workloads using the
update-read mix of the YCSB benchmark (0%, 10% updates)
in addition to 100% updates.
To overcome the significant overheads of the Redis RPC
and approximate a high-performance RPC, 10 we invoke
Redis’s operations directly at the server after the RPC layer,
instead of generating requests from remote clients.
In each experiment, we create a single sorted set with
10,000 items. We launch multiple threads that repeatedly
read or update a uniformly distributed random item using
ZRANK or ZINCRBY, respectively. In each experiment, we fix
an update ratio, a method, and a number of cores, and we
measure the aggregate throughput.
Results. We see the following results (see Figure 7). For
10% updates, we see that all methods except NR drop after
threads grow beyond a single node, making NR the best
method for maximum threads. NR is better than FC+, RWL,
FC, SL by 2.6x, 3.9x, 4.9x, 14x, respectively. For 100%
updates, NR is better by 1.1x, 3.7x, 1.1x, 4.4x, respectively.
For 0% updates, RWL, NR and FC+ scale well and have
almost identical performance, while FC and SL do not scale
(the graph is omitted).
While its scalability is not perfect, NR is the best method
here. As discussed, the goal is to reduce data structure bottlenecks so that the rest of the application benefits from
adding cores.
5. CONCLUSION
Node Replication is a general black-box method to transform single-threaded data structures into NUMA-aware concurrent data structures. Lock-free data structures are
considered state-of-the-art, but they were designed for UMA.
Creating new lock-free algorithms for NUMA is a herculean
effort, as each data structure requires highly specialized new
techniques. NR also required comparable effort, but once
Figure 7. Performance of Redis application.
0
1
2
3
4
5
6
0 10 20 30 40 50 60 70 80 90 100 110
o
ps/
us
threads
NR
FC+
RWL
FC
SL
(a) 10% update rate
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0 10 20 30 40 50 60 70 80 90 100 110
o
ps
/us
threads
NR
FC+
FC
RWL
SL
(b) 100% update rate
0
2
4
6
8
10
0 10 20 30 40 50 60 70 80 90 100 110
op
s/
us
threads
LF
NR
FC+
RWL
FC
SL
(a) uniform keys, 10% update rate
0.0
0.2
0.4
0.6
0.8
1.0
1. 2
1. 4
0 10 20 30 40 50 60 70 80 90 100 110
ops
/
us
threads
LF
NR
FC+
FC
SL
RWL
(b) uniform keys, 100% update rate
0.0
5.0
10.0
15.0
20.0
25.0
30.0
35.0
0 10 20 30 40 50 60 70 80 90 100110
op
s/
u
s
threads
NR
LF
FC+
FC
RWL
SL
(c) zipf keys, 10% update rate
0.0
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
0 10 20 30 40 50 60 70 80 90 100 110
op
s/
u
s
threads
NR
FC+
FC
SL
RWL
LF
(d) zipf keys, 100% update rate
(e) NR others
memory at max threads (MB)148 34
Figure 6. Performance of skip list dictionary.
implemented by a composed data structure that combines a
hash table (for fast lookup) and a skip list (for fast rank/
range queries). Every element in the sorted set is kept in
both hash table and skip list. These underlying data structures must be updated atomically without the possibility
that a user observes an update reflected in the hash table
without it being reflected in the skip list, and vice versa.
For read operations, we use the ZRANK command, which
returns the rank of an item in the sorted order. ZRANK finds
the item in the hash table; if present, it finds its rank in the
skip list. For update operations we use ZINCRBY, which
increases the score of an item by a chosen value. ZINCRBY
finds the item in the hash table; if present, it updates its
score, and deletes and reinserts it into the skip list.
We used the redis-benchmark utility provided in the