practice
When a system
processes trillions
and trillions of
requests, events
that normally have
a low probability
of occurrence
are now guaranteed
to happen and
must be accounted
for upfront in
the design and
architecture
of the system.
reading from the promoted backup
will produce old, inconsistent values.
Also to support better scalable read
performance, RDBMSs have started
to provide the ability to read from the
backup, which is a classical case of
providing eventual consistency guarantees in which the inconsistency windows depend on the periodicity of the
log shipping.
On the server side we need to take
a deeper look at how updates flow
through the system to understand what
drives the different modes that the developer who uses the system can experience. Let’s establish a few definitions
before getting started:
N = The number of nodes that store
replicas of the data.
W = The number of replicas that
need to acknowledge the receipt of the
update before the update completes.
R = The number of replicas that are
contacted when a data object is accessed through a read operation.
If W+R > N, then the write set and
the read set always overlap and one
can guarantee strong consistency. In
the primary-backup RDBMS scenario,
which implements synchronous replication, N= 2, W= 2, and R= 1. No matter
from which replica the client reads, it
will always get a consistent answer. In
the asynchronous replication case with
reading from the backup enabled, N= 2,
W= 1, and R= 1. In this case R+W=N, and
consistency cannot be guaranteed.
The problems with these configurations, which are basic quorum protocols, is that when because of failures
the system cannot write to W nodes, the
write operation has to fail, marking the
unavailability of the system. With N= 3
and W= 3 and only two nodes available,
the system will have to fail the write.
In distributed storage systems that
provide high performance and high
availability the number of replicas is in
general higher than two. Systems that
focus solely on fault tolerance often
use N= 3 (with W= 2 and R= 2 configurations). Systems that must serve very
high read loads often replicate their
data beyond what is required for fault
tolerance; N can be tens or even hundreds of nodes, with R configured to
1 such that a single read will return a
result. Systems that are concerned with
consistency are set to W=N for updates,
which may decrease the probability of
the write succeeding. A common configuration for these systems that are
concerned about fault tolerance but
not consistency is to run with W= 1 to
get minimal durability of the update
and then rely on a lazy (epidemic) technique to update the other replicas.
How to configure N, W, and R depends on what the common case is and
which performance path needs to be
optimized. In R= 1 and N=W we optimize for the read case, and in W= 1 and
R=N we optimize for a very fast write.
Of course in the latter case, durability
is not guaranteed in the presence of
failures, and if W < (N+ 1)/2, there is the
possibility of conflicting writes when
the write sets do not overlap.
Weak/eventual consistency arises
when W+R <= N, meaning that there is
a possibility that the read and write set
will not overlap. If this is a deliberate
configuration and not based on a failure case, then it hardly makes sense to
set R to anything but 1. This happens
in two very common cases: the first is
the massive replication for read scaling
mentioned earlier; the second is where
data access is more complicated. In
a simple key-value model it is easy to
compare versions to determine the latest value written to the system, but in