3.8. Better readers-writer lock
Vyukov’s distributed readers-writer lock uses a per-reader
lock to reduce reader overhead; the writer must acquire the
locks from all readers. We modify this algorithm to reduce
writer overhead as well, by adding an additional writer lock.
To enter the critical section, the writer must acquire the
writer lock and wait for all the reader locks to be released,
without acquiring them; to exit, it releases its lock. A reader
waits if the writer lock is taken, then acquires its local lock,
and checks the writer lock again; if this lock is taken, the
reader releases its local lock and restarts; otherwise, it enters
the critical section; to exit, it releases the local lock. With
this scheme, the writer and readers incur just one atomic
write each on distinct cache lines to enter the critical section. Readers may starve if writers keep coming, but this is
unlikely with NR, as often only one thread wishes to be a
writer at a time (the combiner) and that thread has significant work outside the critical section.
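To make the protocol concrete, here is a minimal C++ sketch of this lock. The class name, the MAX_READERS bound, the 128-byte padding, and the use of sequential consistency for the store-then-check handshake are our illustrative assumptions, not details from the NR implementation.

#include <atomic>
#include <cstddef>

constexpr int MAX_READERS = 64;          // assumed upper bound on reader slots
constexpr std::size_t CACHE_LINE = 128;  // assumed cache-line size for padding

struct alignas(CACHE_LINE) PaddedFlag {
  std::atomic<bool> locked{false};
};

class DistributedRWLock {
  PaddedFlag writer_;                 // the additional writer lock
  PaddedFlag readers_[MAX_READERS];   // one lock per reader, on distinct lines

 public:
  void write_lock() {
    // Acquire the writer lock, then wait for every reader lock to be
    // released, without acquiring any of them.
    bool expected = false;
    while (!writer_.locked.compare_exchange_weak(
        expected, true, std::memory_order_seq_cst))
      expected = false;
    for (int i = 0; i < MAX_READERS; i++)
      while (readers_[i].locked.load(std::memory_order_seq_cst)) { /* spin */ }
  }
  void write_unlock() { writer_.locked.store(false, std::memory_order_release); }

  void read_lock(int slot) {
    for (;;) {
      // Wait until the writer lock appears free.
      while (writer_.locked.load(std::memory_order_seq_cst)) { /* spin */ }
      // One atomic write on the reader's own cache line; sequential
      // consistency keeps this store ordered before the re-check below.
      readers_[slot].locked.store(true, std::memory_order_seq_cst);
      if (!writer_.locked.load(std::memory_order_seq_cst))
        return;  // no writer arrived in between: enter the critical section
      // A writer slipped in: release the local lock and restart.
      readers_[slot].locked.store(false, std::memory_order_seq_cst);
    }
  }
  void read_unlock(int slot) {
    readers_[slot].locked.store(false, std::memory_order_release);
  }
};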
3.9. Practical considerations
We now discuss some important practical considerations
that arose when we implemented NR.
Software and hardware threads. So far, we have assumed
that software threads correspond one-to-one with hardware
threads, and we have used the term thread interchangeably to refer to either of them. In practice, however, applications may have many more software threads than available
hardware threads. To handle this situation, we can have
more combiner slots than hardware threads, and then
assign each software thread to a combiner slot. Beyond a
certain number of software threads, they can share combiner slots using CAS to insert requests. When a software
thread waits for the local combiner, it yields instead of
spinning, so that the underlying hardware thread can run
other software threads to generate larger combiner batches
and increase efficiency.
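A sketch of this slot-sharing scheme follows; the Request type, the NUM_SLOTS constant, and the convention that the combiner clears a slot after executing its request are assumptions made for illustration.

#include <atomic>
#include <thread>

struct Request { /* operation descriptor and result slot */ };

constexpr int NUM_SLOTS = 256;  // assumed: more slots than hardware threads

std::atomic<Request*> slots[NUM_SLOTS];  // at most one pending request per slot

void submit(int my_slot, Request* req) {
  // Several software threads may share a slot, so the request is
  // installed with a CAS rather than a plain store.
  Request* expected = nullptr;
  while (!slots[my_slot].compare_exchange_weak(expected, req,
                                               std::memory_order_release)) {
    expected = nullptr;
    std::this_thread::yield();  // slot occupied: let other software threads run
  }
  // Wait for the local combiner to execute the request and clear the
  // slot; yielding lets other software threads fill their own slots,
  // which grows the combiner's batch.
  while (slots[my_slot].load(std::memory_order_acquire) == req)
    std::this_thread::yield();
}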
Log length. NR uses a circular array for its log; if the array
gets full, threads pause until older entries are consumed.
This is undesirable, so one should use a large log, but how
large? The solution is to dynamically resize the log if it gets
full. This is done by writing a special log entry that announces the new size, so that all replicas agree on it after consuming that entry. This scheme gradually adjusts the log size until it is large enough.
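The sketch below shows one way to encode the special entry; the Kind tag, the field names, and the Replica wrapper are our assumptions.

#include <cstddef>

enum class Kind { OP, RESIZE };

struct Entry {
  Kind kind;
  std::size_t new_size;  // meaningful only when kind == RESIZE
  // ... operation payload when kind == OP ...
};

struct Replica {
  std::size_t log_size;

  void consume(const Entry& e) {
    if (e.kind == Kind::RESIZE) {
      // Every replica consumes this entry at the same log position, so
      // all replicas switch to the new size at the same point in the
      // history of operations.
      log_size = e.new_size;
      return;
    }
    // ... apply a regular operation to the local replica ...
  }
};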
Memory allocation. Memory allocation can become a
performance bottleneck. We need an allocator that (1) avoids too much coordination across threads, and (2) allocates memory local to each node. We use a simple allocator
in which threads get buffers from local pools. The allocator
incurs coordination only if a buffer is allocated in one thread
and freed in another; this requires returning the buffer to
the allocating thread’s pool. This is done in batches to
reduce coordination.
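An illustrative sketch of this allocator follows; the Pool layout, the NUM_THREADS and BATCH constants, and the mutex-protected return path are assumptions, and a real allocator would refill from the OS rather than fail.

#include <cstddef>
#include <mutex>
#include <vector>

constexpr int NUM_THREADS = 64;    // assumed thread count
constexpr std::size_t BATCH = 64;  // remote frees returned per batch

struct Buffer { int owner; /* payload */ };

struct Pool {
  std::vector<Buffer*> free_list;  // local: no coordination needed
  std::mutex mu;                   // taken only for batched returns
  std::vector<Buffer*> returned;   // buffers handed back by other threads
};

Pool pools[NUM_THREADS];

Buffer* alloc(int me) {
  Pool& p = pools[me];
  if (p.free_list.empty()) {
    // Reclaim batches returned by other threads: the only coordination
    // on the allocation path.
    std::lock_guard<std::mutex> g(p.mu);
    p.free_list.swap(p.returned);
  }
  if (p.free_list.empty()) return nullptr;  // in practice: refill from the OS
  Buffer* b = p.free_list.back();
  p.free_list.pop_back();
  return b;
}

// Remote frees are staged locally, one batch per owning thread.
thread_local std::vector<Buffer*> staged[NUM_THREADS];

void free_buf(int me, Buffer* b) {
  if (b->owner == me) {              // common case: free to the local pool
    pools[me].free_list.push_back(b);
    return;
  }
  std::vector<Buffer*>& batch = staged[b->owner];
  batch.push_back(b);
  if (batch.size() < BATCH) return;  // defer: no coordination yet
  Pool& owner = pools[b->owner];     // return the whole batch at once
  std::lock_guard<std::mutex> g(owner.mu);
  owner.returned.insert(owner.returned.end(), batch.begin(), batch.end());
  batch.clear();
}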
Inactive replica. If the threads on a node execute no operations on the data structure, the replica of that node stops
replaying entries from the log, causing the log to fill up.
This problem is solved by periodically running a thread per
node that refreshes the local replica if the node has no operations to execute.
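One possible shape for this per-node refresher is sketched below; the Node type, its two methods, and the 1 ms period are all assumptions.

#include <chrono>
#include <thread>

struct Node {
  bool has_local_ops();    // assumed: did a local thread issue an operation?
  void refresh_replica();  // assumed: acquire the writer lock, replay the log
};

void refresher_thread(Node& node) {
  for (;;) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    if (!node.has_local_ops())  // node is inactive: replay the log ourselves
      node.refresh_replica();   // so its localTail keeps advancing
  }
}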
localTail to right before the entries it allocated. In doing so
the combiner may find empty entries allocated by other
threads; in that case, it waits until the entry is filled (identified by a bit in the entry). Figure 3 shows two combiners
accessing the log to update their local replicas, which they
do in parallel.
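As a sketch of this wait, assuming the alternating-bit encoding described in Section 3.5 and monotonically growing log indices (taken modulo the log size on access), a combiner might check an entry as follows; the names and constants are illustrative.

#include <atomic>
#include <cstddef>

constexpr std::size_t LOG_SIZE = 1 << 20;  // assumed log capacity

struct LogEntry {
  std::atomic<bool> bit{false};  // alternates on each wrap-around (Section 3.5)
  // ... operation payload ...
};

LogEntry log_[LOG_SIZE];

bool is_filled(std::size_t pos) {
  // The entry at pos is filled once its bit matches the parity of
  // pos's lap around the log; initially bit == false means empty.
  bool lap_parity = (pos / LOG_SIZE) % 2 == 0;
  return log_[pos % LOG_SIZE].bit.load(std::memory_order_acquire) == lap_parity;
}

void wait_until_filled(std::size_t pos) {
  // The entry was reserved by another thread but not written yet.
  while (!is_filled(pos)) { /* spin */ }
}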
3.5. Recycling log entries
Each log entry has a bit that alternates when the log wraps
around to indicate empty entries. An index logMin stores the
last known safe location to write; for efficiency, this index is
updated only when a thread reaches a low mark in the log,
which is max_batch entries before logMin. The thread that
reserves the low mark entry updates logMin to the smallest
localTail of all nodes; meanwhile, other threads wait for
logMin to change. This scheme is efficient: it incurs no synchronization and reads localTail rarely if the log is large. A
drawback is that a slow node becomes a bottleneck if no
thread on that node updates the localTail. This problem is
avoided by using a larger log.
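The following condensed sketch shows the low-mark scheme; the constants are assumed, and indices grow monotonically (adding one log length per lap) rather than storing wrapped positions, which is our adaptation for illustration.

#include <algorithm>
#include <atomic>
#include <cstddef>

constexpr int NUM_NODES = 4;             // assumed
constexpr std::size_t LOG_SZ = 1 << 20;  // assumed log capacity
constexpr std::size_t MAX_BATCH = 32;    // assumed per-combiner batch bound

std::atomic<std::size_t> logTail{0};
std::atomic<std::size_t> logMin{LOG_SZ};  // last known safe position to write
std::atomic<std::size_t> localTails[NUM_NODES] = {};

// Reserve n consecutive entries in the log.
std::size_t reserve(std::size_t n) {
  std::size_t start = logTail.fetch_add(n);
  const std::size_t old_min = logMin.load();
  const std::size_t low_mark = old_min - MAX_BATCH;
  if (start <= low_mark && low_mark < start + n) {
    // This reservation covers the low mark: advance logMin to the
    // smallest localTail of all nodes, plus one lap, freeing one
    // log's worth of entries for overwriting.
    std::size_t min_tail;
    do {
      min_tail = localTails[0].load();
      for (int i = 1; i < NUM_NODES; i++)
        min_tail = std::min(min_tail, localTails[i].load());
    } while (min_tail + LOG_SZ <= old_min);  // a lagging replica stalls us
    logMin.store(min_tail + LOG_SZ);
  }
  while (start + n > logMin.load()) {
    /* other threads wait here for logMin to change */
  }
  return start;
}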
3.6. Read-only operations
Threads performing read-only operations (readers) do not
reserve space in the log, because their operations do not
affect the other replicas. Moreover, a reader that is updating
from the log can return and proceed with the read if it
encounters an empty entry. Unlike flat combining, NR optimizes read-only operations by executing them directly on
the local replica using a readers-writer lock for each node.
The combiner acquires the lock in write mode when it
wishes to modify the local replica, while reader threads
acquire the lock in read mode. To avoid stale reads that violate linearizability, a reader must ensure the local replica is
fresh. However, the replica need not reflect all operations up
to logTail, only to the last operation that had completed
before the reader started. To do this, we keep a
completedTail variable, which is an index ≤ logTail that points to a log
entry after which there are no completed operations. After a
combiner refreshes its local replica, it updates
completedTail with a CAS to the end of its last batch, if completedTail is smaller. A reader reads completedTail when it starts, storing it in a local variable readTail. If the reader sees that a combiner exists, it just
waits until localTail ≥ readTail; otherwise, the reader
acquires the readers-writer lock in writer mode and refreshes
the replica itself.
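A schematic sketch of both sides of this protocol follows; the helper functions (combiner detection, the per-node lock of Section 3.8, and log replay) are assumed rather than taken from the NR code.

#include <atomic>
#include <cstddef>

constexpr int NUM_NODES = 4;  // assumed
std::atomic<std::size_t> completedTail{0};  // no completed ops past this entry
std::atomic<std::size_t> localTails[NUM_NODES] = {};  // per-node replay positions

// Assumed helpers, not shown.
bool combiner_exists(int node);
void write_lock(int node);   void write_unlock(int node);
void read_lock(int node);    void read_unlock(int node);
void replay_up_to(int node, std::size_t idx);

// Combiner side: after refreshing its replica, publish the end of its
// batch (a CAS, since combiners on other nodes may race).
void publish_completed(std::size_t batch_end) {
  std::size_t cur = completedTail.load();
  while (cur < batch_end &&
         !completedTail.compare_exchange_weak(cur, batch_end)) {
  }
}

// Reader side: snapshot completedTail, make the local replica at least
// that fresh, then run the read under the lock in read mode.
template <typename Replica, typename ReadOp>
auto read_only(Replica& replica, int node, ReadOp op) {
  std::size_t readTail = completedTail.load();
  if (combiner_exists(node)) {
    while (localTails[node].load() < readTail) { /* combiner catches up */ }
  } else {
    write_lock(node);            // no combiner: refresh the replica ourselves
    replay_up_to(node, readTail);
    write_unlock(node);
  }
  read_lock(node);
  auto result = op(replica);     // replica reflects every operation that had
  read_unlock(node);             // completed before the read started
  return result;
}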
3.7. Readers-combiner parallelism
Node Replication’s algorithm is designed for readers to execute in parallel with combiners in the same node. In early
versions of the algorithm, the combiner lock also protected
the local replica against readers, but this prevented the
desired parallelism. By separating the combiner lock and
the readers-writer lock (Section 3.6), readers can access the
replica while a combiner is reading the slots or writing the
log, before it refreshes the replica. Furthermore, to enable
parallelism, readers must wait for completedTail as
described, not logTail; otherwise, readers would block on the hole created by the local combiner, even though the readers lock is available.