is N+0 if there is no spare capacity. A
system can also be designed to be N+2
redundant, which would permit the
system to survive two dead replicas,
and so on.
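As a rough sketch of the N+0, N+1, N+2 idea, the snippet below counts how many replicas could die before the survivors are overloaded. It reuses the article's running example of four replicas at roughly 100QPS each. The function names are made up for illustration, and the sketch assumes the load balancer spreads traffic evenly across healthy replicas.

```python
import math


def spare_replicas(arriving_qps: float, replicas: int, capacity_qps: float) -> int:
    """Number of replicas that can fail while the rest still carry the load.

    Assumes every replica has the same capacity and traffic is spread evenly.
    A negative result means the system is already overloaded.
    """
    needed = math.ceil(arriving_qps / capacity_qps)  # replicas required for the load
    return replicas - needed


def redundancy_label(arriving_qps: float, replicas: int, capacity_qps: float) -> str:
    """Describe the redundancy level as N+k, or flag an overloaded system."""
    spare = spare_replicas(arriving_qps, replicas, capacity_qps)
    return f"N+{spare}" if spare >= 0 else "overloaded"


# Four replicas at about 100QPS each, as in the article's example.
for load in (200, 300, 400, 450):
    print(load, "QPS ->", redundancy_label(load, replicas=4, capacity_qps=100))
```

The label depends on the arriving workload, not only on how many replicas were deployed, so the same four machines can be N+2 at a quiet hour and N+0 at peak.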
Three Ways to Do It Wrong
Now that we understand two different
ways a load balancer can be used, let’s
examine how most teams fail.
Level 1: The Team Disagrees
Ask members of the team whether the
load balancer is being used to add capacity or improve resiliency. If different people on the team give different
answers, you’re load balancing wrong.
If the team disagrees, then different members of the team will be making different engineering decisions. At
best, this leads to confusion. At worst,
it leads to suffering.
You would be surprised at how
many teams are at this level.
Level 2: Capacity Undefined
Another likely mistake is not agreeing on how to measure the capacity of the
system. Without this definition, you do
not know if this system is N+0 or N+1. In
other words, you might have agreement
that the load balancing is for capacity or resilience, but you do not know
whether or not you are using it that way.
To know for sure, you have to know
the actual capacity of each replica. In
an ideal world, you would know how
many QPS each replica can handle.
The math to calculate the N+1 threshold (or high-water mark) would be simple arithmetic. Sadly, the world is not so simple.
You can’t simply look at the source
code and know how much time and resources each request will require and
determine the capacity of a replica.
Even if you did know the theoretical
capacity of a replica, you would need to
verify it experimentally. We are scientists, not barbarians!
Capacity is best determined by
benchmarks. Queries are generated
and sent to the system at different rates,
with the response times measured. Suppose
you consider a 200ms response
time to be sufficient. You can start by
generating queries at 50 per second and
slowly increase the rate until the system
is overloaded and responds slower than
200ms. The last QPS rate that resulted
Individual machines fail, but the system
should continue to provide service.
All machines eventually fail;
that’s physics. Even if a replica had
near-perfect uptime, you would still
need resiliency mechanisms because
of other externalities such as software
upgrades or the need to physically
move a machine.
A load balancer can be used to
achieve resiliency by leaving enough
spare capacity that a single replica can
fail and the remaining replicas can
handle the incoming requests.
Continuing the example, suppose
four replicas have been deployed to
achieve 400QPS of capacity. If you are
currently receiving 300QPS, each replica will receive approximately 75QPS
(one-quarter of the workload). What
will happen if a single replica fails?
The load balancer will quickly see the
outage and shift traffic such that each
replica receives about 100QPS. That
means each replica is running at maximum capacity. That’s cutting it close,
but it is acceptable.
What if the system had been receiving 400QPS? Under normal operation,
each of the four replicas would receive
approximately 100QPS. If a single
replica died, however, the remaining
replicas would receive approximately
133QPS each. Since each replica can
process about 100QPS, this means
each one of them is overloaded by a
third. The system might slow to a crawl
and become unusable. It might crash.
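The arithmetic in the last two paragraphs can be checked with a few lines of code. This is only a sketch: it assumes traffic is split perfectly evenly across healthy replicas, and the helper name `per_replica_qps` is invented for the example.

```python
def per_replica_qps(total_qps: float, healthy_replicas: int) -> float:
    """Load each healthy replica sees when traffic is spread evenly."""
    if healthy_replicas <= 0:
        raise ValueError("at least one healthy replica is required")
    return total_qps / healthy_replicas


REPLICAS = 4        # replicas deployed in the example
CAPACITY_QPS = 100  # approximate capacity of one replica in the example

for arriving in (300, 400):
    before = per_replica_qps(arriving, REPLICAS)
    after = per_replica_qps(arriving, REPLICAS - 1)  # one replica has failed
    # Fraction by which each survivor exceeds its capacity (0 if it does not).
    overload = max(0.0, after / CAPACITY_QPS - 1)
    print(f"{arriving}QPS arriving: {before:.0f}QPS each before the failure, "
          f"{after:.0f}QPS each after, overload {overload:.0%}")
```

At 300QPS the survivors land exactly at capacity. At 400QPS each survivor is asked for about a third more than it can deliver.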
The determining factor in how the
load balancer was used is whether the
arriving workload exceeded 300QPS,
the most that three surviving replicas can handle. If 300 or fewer QPS were
arriving, this would be a load balancer
used for resiliency. If 301 or more QPS
were arriving, this would be a load balancer for increased capacity.
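The 300QPS cutoff is simply the capacity that remains after one replica is lost. A minimal sketch of that rule follows, assuming identical replicas and using hypothetical function names chosen for this illustration.

```python
def resiliency_threshold(replicas: int, capacity_qps: float) -> float:
    """Highest arriving load that still survives the loss of one replica."""
    return (replicas - 1) * capacity_qps


def load_balancer_role(arriving_qps: float, replicas: int, capacity_qps: float) -> str:
    """Classify how the load balancer is being used at the current load."""
    if arriving_qps <= resiliency_threshold(replicas, capacity_qps):
        return "resiliency"
    return "capacity"


# Four replicas at about 100QPS each: the threshold is 3 * 100 = 300QPS.
print(resiliency_threshold(4, 100))
print(load_balancer_role(300, 4, 100))
print(load_balancer_role(301, 4, 100))
```

Nothing in the load balancer's configuration appears in this calculation. The answer changes only when the arriving workload or the measured per-replica capacity changes.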
The difference between using a
load balancer to increase capacity and
using one to improve resiliency is an operational
difference, not a configuration difference. Both use cases configure
the hardware and network (or virtual
hardware and virtual network) the
same, and configure the load balancer
with the same settings.
The term N+1 redundancy refers to a
system that is configured such that if a
single replica dies, enough capacity is
left over in the remaining N replicas for
the system to work properly. A system