and a bit more time before sending
the third RPC.)
Failover and fallback. Pursue software rollouts and migrations that fail
safe and are automatically isolated
should a problem arise. The basic principle at work here is that by the time
you bring a human online to trigger
a failover, you have likely already exceeded your error budget.
Where concurrency/voting is not
possible, automate failover and
fallback. Again, if the issue needs a human to check what the problem is, the
chances of meeting your SLO are slim.
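As a rough illustration, the following Go sketch shows the pattern of failing over to a secondary replica automatically when the primary is slow or unavailable; the backend functions and the 200-millisecond deadline are hypothetical placeholders, not an actual Google API.

package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

type backend func(ctx context.Context) (string, error)

// fetchWithFailover tries the primary backend under a short deadline and
// automatically falls back to the secondary on error or timeout, with no
// human in the loop.
func fetchWithFailover(ctx context.Context, primary, fallback backend) (string, error) {
    pctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
    defer cancel()
    if v, err := primary(pctx); err == nil {
        return v, nil
    }
    // Primary failed or timed out: fail over immediately.
    return fallback(ctx)
}

func main() {
    primary := func(ctx context.Context) (string, error) { return "", errors.New("primary unavailable") }
    fallback := func(ctx context.Context) (string, error) { return "answer from fallback replica", nil }
    v, err := fetchWithFailover(context.Background(), primary, fallback)
    fmt.Println(v, err)
}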
Asynchronicity. Design dependencies to be asynchronous rather than
synchronous where possible so that
they don’t accidentally become critical. If a service waits for an RPC response from one of its noncritical
dependencies and this dependency
has a spike in latency, the spike will
unnecessarily hurt the latency of the
parent service. By making the RPC
call to a noncritical dependency asynchronous, you can decouple the latency of the parent service from the
latency of the dependency. While
asynchronicity may complicate code
and infrastructure, this trade-off will often be worthwhile.
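To make the idea concrete, here is a minimal Go sketch, with hypothetical names, of moving a noncritical RPC off the request path so that a latency spike in the dependency never reaches the caller.

package main

import (
    "context"
    "fmt"
    "log"
    "time"
)

// computeResponse stands in for the critical work of the parent service.
func computeResponse(id string) string { return "ok:" + id }

// recordAudit stands in for an RPC to a noncritical dependency that is
// having a latency spike.
func recordAudit(ctx context.Context, id string) error {
    select {
    case <-time.After(2 * time.Second): // simulated slow dependency
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

// handleRequest returns as soon as the critical work is done; the noncritical
// RPC runs in the background with its own deadline, so its latency never
// reaches the caller.
func handleRequest(id string) string {
    resp := computeResponse(id)
    go func() {
        bctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
        defer cancel()
        if err := recordAudit(bctx, id); err != nil {
            log.Printf("audit record failed (noncritical): %v", err)
        }
    }()
    return resp
}

func main() {
    fmt.Println(handleRequest("req-1")) // prints immediately
    time.Sleep(time.Second)             // demo only: let the background call finish
}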
Capacity planning. Make sure that
every dependency is correctly provisioned. When in doubt, overprovision
if the cost is acceptable.
Configuration. When possible,
standardize configuration of your dependencies to limit inconsistencies
among subsystems and avoid one-off failure modes.
Detection and troubleshooting. Make
detecting, troubleshooting, and diagnosing issues as simple as possible.
Effective monitoring is a crucial component of being able to detect issues in
a timely fashion. Diagnosing a system
with deeply nested dependencies is difficult. Always have an answer for mitigating failures that doesn’t require an
operator to investigate deeply.
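One simple aid, sketched below in Go with placeholder dependency names, is a status endpoint that probes each direct dependency and reports its health in one place, so the failing dependency is obvious before anyone has to dig through nested systems.

package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    "time"
)

// dependencyProbes maps each direct dependency to a cheap health check; the
// entries here are placeholders.
var dependencyProbes = map[string]func(ctx context.Context) error{
    "user-db":  func(ctx context.Context) error { return nil },
    "auth-rpc": func(ctx context.Context) error { return nil },
}

// statusHandler reports the health of every dependency at a glance, rather
// than requiring an operator to trace the dependency tree by hand.
func statusHandler(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), time.Second)
    defer cancel()
    for name, probe := range dependencyProbes {
        if err := probe(ctx); err != nil {
            fmt.Fprintf(w, "%s: FAILING (%v)\n", name, err)
            continue
        }
        fmt.Fprintf(w, "%s: ok\n", name)
    }
}

func main() {
    http.HandleFunc("/statusz", statusHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}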
Fast and reliable rollback. Introducing humans into a mitigation plan substantially increases the risk of missing a tight SLO. Build systems that are easy, fast, and reliable to roll back. As your system matures and you gain confidence in your monitoring to detect problems, you can lower MTTR by engineering the system to automatically trigger safe rollbacks.
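A minimal Go sketch of such an automatic trigger follows; errorRatio and rollbackToPreviousVersion are hypothetical hooks into your monitoring and deployment systems, not real APIs.

package main

import (
    "log"
    "time"
)

const (
    maxErrorRatio = 0.001 // hypothetical threshold tied to a 99.9% SLO
    watchWindow   = 10 * time.Minute
    pollInterval  = 30 * time.Second
)

func errorRatio() float64               { return 0.0 } // stub: query your monitoring system
func rollbackToPreviousVersion() error  { return nil } // stub: call your deployment system

// watchRollout polls the error ratio for a fixed window after a release and
// rolls back automatically if the ratio breaches the threshold, paging a
// human only if the rollback itself fails.
func watchRollout() {
    deadline := time.Now().Add(watchWindow)
    for time.Now().Before(deadline) {
        if errorRatio() > maxErrorRatio {
            log.Printf("error ratio above %.4f, rolling back", maxErrorRatio)
            if err := rollbackToPreviousVersion(); err != nil {
                log.Printf("automatic rollback failed, paging a human: %v", err)
            }
            return
        }
        time.Sleep(pollInterval)
    }
    log.Print("rollout window passed without breaching the error budget")
}

func main() { watchRollout() }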
Systematically examine all possible
failure modes. Examine each component and dependency and identify the
impact of its failure. Ask yourself the following questions:
˲ Can the service continue serving in degraded mode if one of its dependencies fails? In other words, design for graceful degradation (a minimal sketch follows this list).
˲ How do you deal with unavailability of a dependency in different scenarios? Upon startup of the service? During runtime?
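As a concrete example of degraded-mode serving, the Go sketch below, with a hypothetical recommendation dependency, answers from stale cached defaults when the dependency is down rather than failing the request.

package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// staleCache holds the last known (or default) results per page.
var staleCache = map[string][]string{"home": {"default-item-1", "default-item-2"}}

// fetchRecommendations stands in for an RPC to a dependency that may fail.
func fetchRecommendations(ctx context.Context, page string) ([]string, error) {
    return nil, errors.New("recommendation service unavailable")
}

// recommendationsFor degrades gracefully: fresh results when possible,
// otherwise stale defaults, never a user-visible error.
func recommendationsFor(page string) []string {
    ctx, cancel := context.WithTimeout(context.Background(), 150*time.Millisecond)
    defer cancel()
    if recs, err := fetchRecommendations(ctx, page); err == nil {
        staleCache[page] = recs // refresh the fallback data
        return recs
    }
    return staleCache[page] // degraded but still serving
}

func main() { fmt.Println(recommendationsFor("home")) }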
Conduct thorough testing. Design
and implement a robust testing environment that ensures each dependency has its own test coverage, with tests
that specifically address use cases that
other parts of the environment expect.
Here are a few recommended strategies for such testing:
˲ Use integration testing to perform fault injection: verify that your system can survive failure of any of its dependencies (a sample test sketch follows this list).
˲ Conduct disaster testing to identify weaknesses or hidden/unexpected
dependencies. Document follow-up
actions to rectify the flaws you uncover.
˲ Don’t just load test. Deliberately overload your system to see how it degrades. One way or another, your system’s response to overload will be tested; better to perform these tests yourself than to leave load testing to your users.
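The following Go test sketch illustrates the fault-injection idea: the dependency is replaced by a stub that always fails, and the test asserts that the service still answers in degraded mode. All names are illustrative.

package service

import (
    "context"
    "errors"
    "testing"
)

type recommender interface {
    Recommend(ctx context.Context, page string) ([]string, error)
}

// failingRecommender is the injected fault: every call fails.
type failingRecommender struct{}

func (failingRecommender) Recommend(ctx context.Context, page string) ([]string, error) {
    return nil, errors.New("injected failure")
}

// servePage is a stand-in for the code under test; it must tolerate a failing
// dependency by falling back to defaults.
func servePage(ctx context.Context, r recommender, page string) ([]string, error) {
    recs, err := r.Recommend(ctx, page)
    if err != nil {
        return []string{"default"}, nil // degraded mode
    }
    return recs, nil
}

func TestSurvivesRecommenderOutage(t *testing.T) {
    recs, err := servePage(context.Background(), failingRecommender{}, "home")
    if err != nil {
        t.Fatalf("service failed when dependency was down: %v", err)
    }
    if len(recs) == 0 {
        t.Fatal("expected degraded-mode results, got none")
    }
}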
Plan for the future. Expect changes
that come with scale: a service that begins as a relatively simple binary on a
single machine may grow to have many
obvious and nonobvious dependencies when deployed at a larger scale.
Every order of magnitude in scale will
reveal new bottlenecks—not just for
your service, but for your dependencies
as well. Consider what happens if your
dependencies cannot scale as fast as
you need them to.
Also be aware that system dependencies evolve over time and that your
list of dependencies may very well
grow over time. When it comes to infrastructure, Google’s typical design
guideline is to build a system that will
scale to 10 times the initial target load
without significant design changes.
While readers are likely familiar with some or many of the concepts this article has covered, assembling this information and putting it into concrete terms may make the concepts easier to understand and teach. Its recommendations are uncomfortable but not unattainable. A number of Google services have consistently delivered better than four 9s of availability, not by superhuman effort or intelligence, but by thorough application of principles and best practices collected and refined over the years (see SRE's Appendix B: A Collection of Best Practices for Production Services1).
Thank you to Ben Lutch, Dave Rensin,
Miki Habryn, Randall Bosetti, and Patrick Bernier for their input.
Related articles on queue.acm.org

There’s Just No Getting Around It: You’re Building a Distributed System

Eventual Consistency Today: Limitations, Extensions, and Beyond
Peter Bailis and Ali Ghodsi

A Conversation with Wayne Rosing
David J. Brown
References
1. Beyer, B., Jones, C., Petoff, J., Murphy, N.R. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, 2016; https://landing.google.
Ben Treynor started programming at age six and
joined Oracle as a software engineer at age 17. He has
also worked in engineering management at E.piphany,
SEVEN, and Google (2003-present). His current team
of approximately 4,200 at Google is responsible for Site
Reliability Engineering, networking, and datacenters.
Mike Dahlin is a distinguished engineer at Google, where
he has worked on Google’s Cloud Platform since 2013.
Prior to joining Google, he was a professor of computer
science at the University of Texas at Austin.
Vivek Rau is an SRE manager at Google and a founding
member of the Launch Coordination Engineering sub-team
of SRE. Prior to joining Google, he worked at Citicorp
Software, Versant, and E.piphany. He currently manages
various SRE teams tasked with tracking and improving the
reliability of Google’s Cloud Platform.
Betsy Beyer is a technical writer for Google, specializing
in Site Reliability Engineering. She has previously written
documentation for Google’s Data Center and Hardware
Operations Teams. She was formerly a lecturer on
technical writing at Stanford University.