Anyone can use a load balancer. Using it properly is much more difficult.
Some questions to ask:
1. Is this load balancer used to increase capacity (N+0) or to improve resiliency (N+1)?
2. How do you measure the capacity of each replica? How do you create
benchmark input? How do you process
the benchmark results to arrive at the
threshold between good and bad?
3. Are you monitoring whether you
are compliant with your N+M configuration? Are you alerting in a way that
provides enough time to add capacity
so that you stay compliant?
If the answer to any of these questions is “I don’t know” or “No,” then
you’re doing it wrong.
The number of queries per second (QPS) that a replica can answer while maintaining sufficiently fast response times determines the capacity of that replica.
How do you quantify response time
when measuring thousands or millions of queries? Not all queries run
in the same amount of time. You can’t
take the average, as a single long-running request could result in a misleading statistic. Averages also obscure
bimodal distributions. (For more on
this, see chapter 17, Monitoring Architecture and Practice, of The Practice of
Cloud System Administration, Volume 2,
by T. Limoncelli, S.R. Chalup, and C.J.
Hogan; Addison-Wesley, 2015).
Since a simple average is insufficient, most sites use a percentile.
For example, the requirement might
be that the 90th percentile response
time must be 200ms or better. This is an easy way to toss out the most extreme outliers. Many sites are starting to use MAD (median absolute deviation), which is explained in a 2015 paper by David Goldberg and Yinan Shan, “The Importance of Features for Statistical Anomaly Detection” (https://www.usenix.org).
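For illustration, here is a minimal sketch of both statistics, assuming response times have been collected in milliseconds (the sample data is hypothetical). Note how a single 220ms outlier barely moves the percentile or the MAD, while it would drag the average up to about 35ms:

```python
import statistics

def p90(samples):
    """90th-percentile response time (simple nearest-rank estimate)."""
    ordered = sorted(samples)
    return ordered[int(0.9 * (len(ordered) - 1))]

def mad(samples):
    """Median absolute deviation: the median distance from the median."""
    med = statistics.median(samples)
    return statistics.median(abs(s - med) for s in samples)

# Hypothetical latencies in ms; one 220ms outlier.
latencies = [12, 13, 14, 14, 15, 15, 16, 17, 18, 220]
print(p90(latencies))  # 18  -- the outlier does not dominate
print(mad(latencies))  # 1.5 -- a robust measure of spread
```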
Generating synthetic queries to use in such benchmarks is another challenge. Not all queries take the same amount of time: there are short and long requests, and a replica rated at 100QPS on a mixed workload might sustain only 80QPS of long queries but 120QPS of short ones. The benchmark must use a mix that reflects the real world.
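As a sketch of such a driver (the 80/20 mix and the idea of keying query templates by kind are assumptions for illustration, not from the article), a benchmark might draw queries from weighted categories so the mix matches production:

```python
import random

# Hypothetical mix, as measured from production logs: 80% short, 20% long.
QUERY_MIX = {"short": 0.80, "long": 0.20}

def synthetic_queries(templates, n, seed=42):
    """Yield n benchmark queries drawn in real-world proportions.
    `templates` maps each kind to a list of example queries."""
    rng = random.Random(seed)
    kinds = list(QUERY_MIX)
    weights = [QUERY_MIX[k] for k in kinds]
    for _ in range(n):
        kind = rng.choices(kinds, weights=weights)[0]
        yield rng.choice(templates[kind])

# e.g. synthetic_queries({"short": [...], "long": [...]}, n=1_000_000)
```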
If all queries are read-only or do not
mutate the system, you can simply record an hour’s worth of actual queries
and replay them during the benchmark. At a previous employer, we had
a dataset of 11 billion search queries
used for benchmarking our service. We
would send the first 1 billion queries to
the system to warm up the cache. We
recorded measurements during the remaining queries to gauge performance.
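A minimal replay harness along these lines might look like the following; the function names and the 10% warm-up split are illustrative (the article's own split was roughly 1 billion of 11 billion queries):

```python
import time

def replay_benchmark(queries, send, warmup_fraction=0.1):
    """Replay recorded queries; warm the cache first, then measure.

    `send` is assumed to be a callable that issues one query to the
    system under test and blocks until the response arrives.
    """
    split = int(len(queries) * warmup_fraction)
    for q in queries[:split]:          # warm-up: results discarded
        send(q)
    latencies = []
    for q in queries[split:]:          # measured portion
        start = time.perf_counter()
        send(q)
        latencies.append((time.perf_counter() - start) * 1000.0)  # ms
    return latencies
```

The resulting latencies can then be summarized with the percentile or MAD statistics described earlier.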
Not all workloads are read-only. If a mixture of read and write queries is required, the benchmark dataset and process are much more complex. It is important that the mixture of read and write queries reflects real-world scenarios.
Sadly, the mix of query types can change over time as a result of the introduction of new features or unanticipated changes in user-access patterns. A system that is rated at 200QPS today may be rated at 50QPS tomorrow when an old feature gains new popularity.
Software performance can change
with every release. Each release should
be benchmarked to verify that capacity
assumptions have not changed.
If this benchmarking is done manually, there’s a good chance it will be
done only on major releases or rarely.
If the benchmarking is automated,
then it can be integrated into your
continuous integration (CI) system. It
should fail any release that is significantly slower than the release running
in production. Such automation not only improves engineering productivity by eliminating a manual task, but also means you immediately know the exact change that caused a regression. If benchmarks are run only occasionally, finding a performance regression involves hours or days of searching for which change caused the problem.
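A sketch of such a CI gate, assuming a hypothetical 10% regression tolerance and a production p90 baseline supplied by monitoring:

```python
import sys

TOLERANCE = 1.10   # fail the build if >10% slower than production

def benchmark_gate(candidate_p90_ms, production_p90_ms):
    """Return True if the release candidate passes the benchmark gate."""
    ok = candidate_p90_ms <= production_p90_ms * TOLERANCE
    if not ok:
        print(f"FAIL: candidate p90 {candidate_p90_ms:.0f}ms vs "
              f"production {production_p90_ms:.0f}ms")
    return ok

if __name__ == "__main__":
    # Numbers would come from replay_benchmark() and live monitoring.
    if not benchmark_gate(candidate_p90_ms=230.0, production_p90_ms=190.0):
        sys.exit(1)   # a nonzero exit fails the CI step
```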
Ideally, the benchmarks are validated by also measuring live performance in production. The two statistics should match up. If they don’t, you
must true up the benchmarks.
Another reason benchmarks are so complicated is caching. Caches have
unexpected side effects. For example,
intuitively you would expect that a system should get faster as replicas are
added. Many hands make light work.
Some applications get slower with
more replicas, however, because cache
utilization goes down. If a replica has
a local cache, it is more likely to have a
cache hit if the replica is highly utilized.
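A toy simulation can demonstrate the effect. Assuming TTL-style cache expiry and a skewed key popularity distribution, both of which are modeling assumptions rather than details from the article, spreading the same request stream across more replicas leaves each local cache colder:

```python
import random

def hit_rate(n_replicas, n_requests=200_000, ttl=500, n_keys=10_000, seed=1):
    """Replay one request stream round-robin over N replicas, each with
    a local cache whose entries expire after `ttl` time units. One request
    arrives per time unit overall, so each replica sees 1/N of the traffic."""
    rng = random.Random(seed)
    caches = [{} for _ in range(n_replicas)]   # key -> time last cached
    hits = 0
    for t in range(n_requests):
        key = int(n_keys * rng.random() ** 4)  # skewed: a few hot keys
        cache = caches[t % n_replicas]
        if key in cache and t - cache[key] < ttl:
            hits += 1                          # entry still fresh: hit
        cache[key] = t                         # (re)fill the entry
    return hits / n_requests

# More replicas -> each key visits each cache less often -> more expiries.
for n in (1, 4, 16):
    print(f"{n:2d} replicas: hit rate {hit_rate(n):.2f}")
```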
Level 3: Definition But No Monitoring
Another mistake a team is likely to make is to have all these definitions agreed upon, but no monitoring to detect whether the system is in compliance.
Suppose the team has determined that the load balancer is for improving both capacity and resilience, has defined an algorithm for measuring the capacity of a replica, and has done the benchmarks to ascertain the capacity of each replica.
The next step is to monitor the system to determine whether it is N+1, or whatever the desired state is. The system should not only monitor utilization and alert the operations team when the system is out of compliance, but also alert the team when the system is nearing that state. Ideally, if it takes T minutes to add capacity, the system must send the alert at least T minutes before that capacity is needed. Cloud platforms such as Amazon Web Services (AWS) can add capacity on demand. If you run your own hardware, provisioning new capacity may take weeks or months. If adding capacity always requires a visit to the CFO to sign a purchase order, you are not living in the dynamic, fast-paced, high-tech world you think you are.
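As a sketch of such monitoring (the N+M arithmetic follows from the definitions above; the growth projection, thresholds, and numbers are hypothetical):

```python
import math

def check_compliance(replicas_up, capacity_qps, peak_qps,
                     m_spares=1, growth_qps_per_min=0.0, lead_time_min=30):
    """Hypothetical N+M compliance check.

    capacity_qps : per-replica capacity, taken from the benchmarks
    lead_time_min: T, how long it takes to add a replica
    """
    needed = math.ceil(peak_qps / capacity_qps)   # N for today's load
    if replicas_up < needed + m_spares:
        return f"ALERT: out of N+{m_spares} compliance"
    # Alert early: project the load T minutes ahead, since adding
    # capacity takes that long.
    projected = peak_qps + growth_qps_per_min * lead_time_min
    if replicas_up < math.ceil(projected / capacity_qps) + m_spares:
        return (f"WARNING: projected to violate N+{m_spares} "
                f"within {lead_time_min} minutes")
    return "OK"

# Example: 4 replicas at 120QPS each, 300QPS peak, load growing 5QPS/min.
print(check_compliance(replicas_up=4, capacity_qps=120, peak_qps=300,
                       growth_qps_per_min=5.0, lead_time_min=30))
```

Running the example prints a warning: today's load is N+1 compliant, but the projected load 30 minutes out is not, which is exactly when the alert must fire if adding capacity takes 30 minutes.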