analysis accuracy—led to widespread
CAS adoption. CAS has proven useful
even for basic cases that do not need
configuration, and has significantly
improved Google’s rollout reliability.
Impact analysis shows that CAS has
likely prevented hundreds of postmortem-worthy outages, and the rate of
postmortems among groups that do
not use CAS is noticeably higher.
CAS is evolving as its developers
work to expand its scope and improve analysis quality.
A great number of people contributed
key components of this work. Thanks
to Alexander Malmberg, Alex Rodriguez, Brian O’Leary, Chong Su, Cody
Smith, Eduardo Blanco, Eric Waters,
Jarrod Todd, Konstantin Stepanyuk,
Mike Ulrich, Nina Gonova, Sabrina
Farmer, Sergey Kondratyev, among
others. Also, thanks to Brian O’Leary
and Chris Jones for technical review.
Štěpán Davidovič is a Site Reliability Engineer at Google,
where he works on internal infrastructure for automatic
monitoring. In previous Google SRE roles, he developed
the Canary Analysis Service and worked on AdSense and
many shared infrastructure projects.
Betsy Beyer is a technical writer for Google Site Reliability
Engineering in New York, NY, USA, and the editor of Site
Reliability Engineering: How Google Runs Production
Systems. She has previously written documentation for
Google’s Data Center and Hardware Operations teams and
lectured on technical writing at Stanford University.
Copyright held by owners/authors.
Publication rights licensed to ACM.
in the extreme, its impact on a service’s
overall SLO can be small. Therefore, key
metrics need to be identified (or introduced) for each component.
It’s tempting to feed a computer
all the metrics exported by a service. While Google systems offer vast
amounts of telemetry, much of it is
useful only for debugging narrow
problems. For example, many Bigtable client library metrics are not
a direct indication that a system is
healthy. In practice, using only weakly relevant metrics leads to poor results. Some teams at Google have performed analysis that justifies using a
large number of metrics, but unless
you perform similarly detailed data
analysis, using only a few key metrics
yields much better results.
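To make this concrete, a canary configuration in the spirit described here might select only a handful of SLO-relevant metrics rather than everything the service exports. The following is a minimal Python sketch; the metric names, the direction labels, and the select_canary_metrics helper are hypothetical illustrations, not part of CAS.

# Hypothetical sketch: pick a few SLO-relevant key metrics for canary
# analysis instead of feeding in every exported time series.

ALL_EXPORTED_METRICS = [
    "http/requests/error_rate",         # directly tied to the service SLO
    "http/requests/latency_p99",        # directly tied to the service SLO
    "process/crash_count",              # strong signal of a bad release
    "bigtable/client/retries",          # debugging detail, weak health signal
    "rpc/internal/queue_depth_shard7",  # debugging detail, weak health signal
    # ... hundreds more debugging-oriented metrics ...
]

# Key metrics chosen by a human who understands the service, with the
# direction in which a regression would move them.
KEY_CANARY_METRICS = {
    "http/requests/error_rate":  "increase_is_bad",
    "http/requests/latency_p99": "increase_is_bad",
    "process/crash_count":       "increase_is_bad",
}

def select_canary_metrics(exported):
    """Keep only the metrics deliberately chosen as canary signals."""
    return [m for m in exported if m in KEY_CANARY_METRICS]

print(select_canary_metrics(ALL_EXPORTED_METRICS))
# -> ['http/requests/error_rate', 'http/requests/latency_p99', 'process/crash_count']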
Perfect is the enemy of good. Canarying is a very useful method of increasing production safety, but it is not a
panacea. It should not replace unit testing, integration testing, or monitoring.
Attempting a “perfectly accurate”
canary setup can lead to a rigid configuration, which blocks releases that have
acceptable changes in behavior. When
a system inherently does not lend itself
to a sophisticated canary, it’s tempting
to forgo canarying altogether.
Attempts at hyper-accurate canary
setups often fail because the rigid configuration causes too much toil during
regular releases. While some systems
do not canary easily, they are rarely
impossible to canary, though the impact
of having a canary process for such a
system may be lower. In both cases, a
strategy of gradually onboarding canarying, starting with
the low-hanging fruit, will help.
Impact analysis is very hard. Early
on, the CAS team asked, “Is providing
a centralized automatic canarying system worth it?” and struggled to find
an answer. If CAS actually prevents an
outage, how do you know the impact of
the outage and, therefore, the impact of CAS?
The team attempted to perform a
heuristic analysis of production changes, but the diverse rollout procedures
made this exercise too inaccurate to
be practical. They considered an A/B
approach in which failures for a subset
of evaluations would be ignored, letting
them pass in order to measure impact. Given
the many factors that influence the
magnitude of an outage, however, this
approach would not be expected to provide a clear signal. (Postmortem documents often include a section such as
“where we got lucky,” highlighting that
many elements contribute to the severity of the outage.)
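To make the considered (and ultimately rejected) A/B approach concrete, a sketch might let a random subset of failing canary evaluations pass so that their downstream impact can later be compared with enforced failures. The function below and the 10% holdback rate are hypothetical, not CAS behavior.

import random

HOLDBACK_FRACTION = 0.10  # hypothetical: ignore 10% of failing verdicts

def final_verdict(canary_failed):
    """Return (release_may_proceed, in_holdback_group).

    With probability HOLDBACK_FRACTION, a failing canary evaluation is
    allowed to proceed anyway, so that outcomes of enforced and ignored
    failures can be compared later. As noted above, the many factors
    that shape an outage's magnitude make this signal noisy.
    """
    if canary_failed and random.random() < HOLDBACK_FRACTION:
        return True, True    # ignore the failure to measure its impact
    return not canary_failed, False

proceed, held_back = final_verdict(canary_failed=True)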
Ultimately, the team settled upon
what they call near-miss analysis: looking at large postmortems at Google
and identifying outages that CAS could
have prevented, but did not prevent. If
CAS did not prevent an outage because
of missing features, those features
were identified and typically implemented. For example, if an additional
feature would have allowed CAS to prevent a $10M postmortem,
implementing that feature demonstrates $10M of
value from CAS. This problem space continues to evolve as we attempt other
kinds of analyses. Most recently, the
team has performed analysis over a
(more homogeneous) portion of the
company to identify trends in outages and postmortems, and has found
some coarse signal.
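One way to picture near-miss analysis in code is to tally, per missing CAS feature, the estimated cost of the postmortems that feature would have prevented. The records, feature names, and dollar figures in this sketch are invented for illustration.

from collections import defaultdict

# Hypothetical near-miss records: postmortems CAS could have prevented,
# keyed by the CAS feature that was missing at the time.
NEAR_MISSES = [
    {"postmortem": "q3-serving-outage", "missing_feature": "config-push canarying", "est_cost_usd": 10_000_000},
    {"postmortem": "q4-quota-outage",   "missing_feature": "config-push canarying", "est_cost_usd": 2_000_000},
    {"postmortem": "q1-index-outage",   "missing_feature": "data-push canarying",   "est_cost_usd": 4_000_000},
]

def value_by_feature(near_misses):
    """Sum the estimated cost of preventable outages per missing feature."""
    totals = defaultdict(int)
    for miss in near_misses:
        totals[miss["missing_feature"]] += miss["est_cost_usd"]
    return dict(totals)

# By this accounting, implementing "config-push canarying" would have
# been worth $12M in prevented outages.
print(value_by_feature(NEAR_MISSES))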
Reusability of CAS data is limited.
CAS’s immense amount of information about system behaviors could potentially be put to other uses. Such extensions may be tempting at face value,
but are also dangerous because of the
way CAS operates (and needs to operate at the product level).
For example, the CAS team could
observe where canaries behave best
and recommend that a user select only
that geographical location. While the
recommended location may be optimal now, if a user followed the advice
to canary only in that location, the
team’s ability to provide further advice
would lessen. CAS data is limited to its
observations, so behavior at a local optimum might be quite different from
the global optimum.
Automated canarying has repeatedly
proven to improve development velocity and production safety. CAS helps
prevent outages with major monetary
impact caused by binary changes, configuration changes, and data pushes.
It is unreasonable to expect engineers working on product development or reliability to have statistical
knowledge; removing this hurdle—
even at the expense of potentially lower