In 1913, Scottish physiologist John Scott Haldane
proposed the idea of bringing a caged canary into a
mine to detect dangerous gases. More than 100 years
later, Haldane’s canary-in-the-coal-mine approach is
also applied in software testing.
In this article, the term canarying refers to a partial
and time-limited deployment of a change in a service,
followed by an evaluation of whether the service
change is safe. The production change process may
then roll forward, roll back, alert a human, or do
something else. Effective canarying involves many
decisions—for example, how to deploy the partial
service change or choose meaningful metrics—and
deserves a separate discussion.
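The deploy, evaluate, and decide sequence described above can be sketched in a few lines of Python. This is only an illustration: the `Verdict` and `Action` names, the `deploy_partial` and `evaluate` callables, and the policy that maps an inconclusive result to alerting a human are all assumptions made for the example, not part of any real rollout tool.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    """Hypothetical outcomes of evaluating a canary."""
    PASS = "pass"
    FAIL = "fail"
    INCONCLUSIVE = "inconclusive"


class Action(Enum):
    """Hypothetical next steps for the production change process."""
    ROLL_FORWARD = "roll_forward"
    ROLL_BACK = "roll_back"
    ALERT_HUMAN = "alert_human"


def next_action(verdict: Verdict) -> Action:
    """Map an evaluation verdict to an action (an assumed policy)."""
    if verdict is Verdict.PASS:
        return Action.ROLL_FORWARD
    if verdict is Verdict.FAIL:
        return Action.ROLL_BACK
    # Anything unclear is escalated rather than decided automatically.
    return Action.ALERT_HUMAN


def run_canary(deploy_partial: Callable[[], None],
               evaluate: Callable[[], Verdict]) -> Action:
    """Deploy the change to a subset, evaluate it, and pick the next step."""
    deploy_partial()
    return next_action(evaluate())
```

For instance, `run_canary(lambda: None, lambda: Verdict.FAIL)` yields `Action.ROLL_BACK` under this assumed policy.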
Google has deployed a shared centralized service called Canary Analysis
Service (CAS) that offers automatic (and
often autoconfigured) analysis of key
metrics during a production change.
CAS is used to analyze new versions of
binaries, configuration changes, dataset changes, and other production
changes. CAS evaluates hundreds of
thousands of production changes every day at Google.
CAS requires a very strict separation between modifying and analyzing production. It is a purely passive
observer: it never changes any part of
the production system. Related tasks
such as canary setup are performed
outside of CAS.
In a typical CAS workflow (shown in
Figure 1), the rollout tool responsible
for the production change deploys
a change to a certain subset of a service. It may perform some basic health
checks of its own. For example, if pushing a new version of an HTTP server
causes a process restart, the rollout
tool might wait until the server marks
itself as able to serve before proceeding. (This may also inform the deployment speed of the production change.
This rollout tool behavior is not specific to canarying.)
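As an illustration of the basic health check a rollout tool might perform before proceeding, the following sketch polls a readiness probe until it succeeds or a deadline passes. The function name `wait_until_serving`, its parameters, and the polling strategy are hypothetical; the article does not describe how any particular rollout tool implements this step.

```python
import time
from typing import Callable


def wait_until_serving(
    is_serving: Callable[[], bool],
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `is_serving` until it returns True or `timeout_s` elapses.

    `is_serving` stands in for whatever readiness signal the server
    exposes (an assumption of this sketch). Returns True if the server
    became ready in time, False otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        if is_serving():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

Passing the clock and sleep functions as parameters is a design choice that keeps the helper easy to test without real delays.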
This subset of production now
constitutes the canary population. By
conducting an A/B test compared to a
control population, CAS answers the
question, “Is the canary meaningfully
worse?” The control population is a
(possibly strict) subset of the remainder of the service. Importantly, CAS is
not trying to establish absolute health.
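The difference between a relative comparison and an absolute health check can be made concrete with a deliberately naive sketch. It assumes a metric where higher is worse (for example, an error ratio) and flags the canary only when its mean exceeds the control's mean by more than a relative tolerance. The function name, the tolerance value, and the use of a simple mean are assumptions for illustration; this passage does not describe the statistical method CAS actually uses.

```python
from statistics import mean
from typing import Sequence


def canary_is_worse(
    canary: Sequence[float],
    control: Sequence[float],
    tolerance: float = 0.10,
) -> bool:
    """Return True if the canary looks meaningfully worse than the control.

    Assumes higher metric values are worse. "Meaningfully" is modeled
    here as exceeding the control mean by more than `tolerance`
    (relative); a real analysis would likely use a statistical test.
    """
    if not canary or not control:
        raise ValueError("both populations need at least one sample")
    baseline = mean(control)
    return mean(canary) > baseline * (1.0 + tolerance)
```

Note that if both populations report an equally high error ratio, this function returns False: the service may be unhealthy in absolute terms, but the canary is not worse than the control.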
The population should be as fine-grained as possible. For example, an application update can use a global identifier of that particular process, which
at Google would be a BNS (Borg Naming Service) path. A BNS path is structured as /bns/<cluster>/<user>/<job name>/<task number>. The job name
is a logical name of the application, and
the task number is the identifier of a particular instance [2]. For a kernel update,
the identifier might be the machine hostname: clearly, multiple processes can
Automated canarying quickens
development, improves production
safety, and helps prevent outages.
BY ŠTĚPÁN DAVIDOVIČ AND BETSY BEYER