Configuration server. The configuration server looks up and fully expands a configuration that matches an incoming evaluation request.
When the configuration is explicitly referenced in a request, lookup
is trivial. If the configuration is not
explicitly referenced, a set of automatic lookup rules search for the
user’s default config. These lookup
rules are based on features such as
who owns the canaried service.
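As a rough Python sketch of that fallback, with hypothetical request fields, store methods, and rule matching (the real CAS interfaces differ):

def lookup_config(request, config_store):
    # Explicit reference: lookup is trivial.
    if request.config_name:
        return config_store.get(request.config_name)
    # Otherwise, automatic lookup rules based on features such as the
    # owner of the canaried service find the user's default config.
    for rule in config_store.lookup_rules:
        if rule.matches(service_owner=request.service_owner):
            return config_store.get(rule.default_config_name)
    raise LookupError("no configuration matches the request")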
The configuration submitted to CAS is generic: it might say something like
“Fetch HTTP error rate,” without specifying where to fetch the error rate. In
the typical flow, the rollout tool identifies the current canary and passes this
information along to CAS when the
evaluation is requested. As a result, the
configuration author cannot necessarily predict the canary population.
To support this flexibility, the configuration server expands configuration and canary/control population
definitions to specify exactly what
data is requested. For example, the
user’s “Fetch HTTP error rate” becomes “Fetch HTTP error rate from
these three processes for canary data,
and from these ten processes for
control data.” From a user’s point of
view, after configuring the generic
variant, the “right thing” happens
automatically, removing any need to
define a dedicated canary setup before canarying (although users can
define such a setup if they have other
reasons to do so).
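As an illustration with invented field names, the expansion can be pictured as a Python sketch:

# Generic check as submitted: "Fetch HTTP error rate", with no populations.
generic_check = {"metric": "http_error_rate"}

# Populations supplied by the rollout tool at evaluation time.
populations = {
    "canary":  ["proc-1", "proc-2", "proc-3"],           # three canaried processes
    "control": ["proc-%d" % i for i in range(4, 14)],    # ten control processes
}

def expand(check, populations):
    """Produce one fully specified query per population."""
    return {group: {"metric": check["metric"], "processes": procs}
            for group, procs in populations.items()}

expanded = expand(generic_check, populations)
# expanded["canary"]:  fetch http_error_rate from the three canary processes
# expanded["control"]: fetch http_error_rate from the ten control processes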
Besides evaluations, the configuration server also receives configuration
updates, validates updates for correctness and ACLs (access control lists), and
stores these updates in the database.
Evaluator. The evaluator receives a
fully defined configuration (after the
expansion already mentioned) for each
check, with each check in a separate
RPC. The evaluator then:
1. Fetches time series for both canary and control data from the appropriate time series store.
2. Runs statistical tests to turn the
resulting pair of sets of time series
into a single PASS/FAIL verdict for each
statistical test (pair because of canary/
control; sets because it is possible, for
example, to have a time series per running process and have many processes
in the canary or control groups).
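As a simplified sketch of step 2, assuming the configured statistical test is a Mann-Whitney U test (one of many possible choices) and ignoring windowing and data hygiene:

import numpy as np
from scipy.stats import mannwhitneyu

def evaluate_check(canary_series, control_series, significance=0.05):
    """Each argument is a set of time series, one per process in that group.

    Pools the samples for each group, compares the two groups, and returns
    a single PASS/FAIL verdict for this statistical test.
    """
    canary = np.concatenate([np.asarray(s) for s in canary_series])
    control = np.concatenate([np.asarray(s) for s in control_series])
    _, p_value = mannwhitneyu(canary, control, alternative="two-sided")
    return "FAIL" if p_value < significance else "PASS"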
If a user configures a statistical
test, then the evaluator runs only that
test. If the user opts for autoconfiguration, however, the evaluator may run
dozens of tests with various parameters, which generate data that feeds
into the model server.
The evaluator returns the data from
tests and any potential metadata (such
as errors talking to time-series stores)
to the coordinator.
Model server. The model server performs automatic data analysis. After
evaluation, the coordinator asks the
model server for predictions. The request contains information about the
evaluation and all observed verdicts
from the evaluator.
For each observed verdict, the model server returns its expected verdict for
that particular evaluation. It returns
this information to the coordinator,
which ignores results of statistical
functions for predicted failures when
deciding the overall verdict. If the model server predicts failure because said
failure is typical behavior, this behavior is deemed a property of the evaluated system and not a failure of this
particular canary evaluation.
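A much-simplified sketch of that combination step, with invented verdict structures (the real coordinator and model-server interfaces are richer):

def overall_verdict(observed, predicted):
    """observed, predicted: dicts mapping each statistical test to "PASS"/"FAIL".

    A failure the model server also predicted is treated as typical behavior
    of the evaluated system, so it does not fail this canary evaluation.
    """
    for test, verdict in observed.items():
        if verdict == "FAIL" and predicted.get(test) != "FAIL":
            return "FAIL"   # an unexpected failure flags the rollout
    return "PASS"           # all failures, if any, were predicted

# A predicted failure is ignored, so the evaluation still passes:
overall_verdict({"error_rate_test": "FAIL"}, {"error_rate_test": "FAIL"})  # "PASS"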
Canarying properly is a complex process, as the user needs to accomplish
these nuanced tasks:
˲ Correctly identify a meaningful canary deployment that creates a representative canary population with respect to the evaluation metrics.
˲ Choose appropriate evaluation metrics.
˲ Decide how to evaluate canaries as passing or failing.
CAS eases the burden by removing
the most daunting of these tasks: evaluating what it means for a time-series
pair to pass or fail. CAS builds upon
the underlying argument that running
reliable systems should not require in-depth knowledge of statistics or constant tuning of statistical functions' parameters.
CAS uses behavior learning that
is slightly different from the general
problem of anomaly detection for
monitoring. In the CAS scenario, you
already know that a service is being
changed, and exactly where and when
that change takes place; there is also a
running control population to use as a
baseline for analysis. Whereas anomaly detection for monitoring triggers user alerts (possibly at 4 A.M.), bad CAS-related rollouts are far less intrusive, typically resulting in a pause or rollback of the release.
Users can opt out of autoconfiguration by specifying a test and its parameters manually.
Online behavior learning. In the
simplest terms, we want to determine
the typical behavior of the system being evaluated during similar production changes. The high-level assumption is that bad behavior is rare.
This process takes place online,
since it must be possible to adapt
quickly: if a behavior is anomalous but
desirable, CAS initially fails the rollout; when
the push is retried, CAS needs to adapt.
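A toy sketch of this idea, treating the expected verdict as simply the historically most common verdict for a given test and grouping (the production model is considerably more sophisticated):

from collections import Counter, defaultdict

class BehaviorModel:
    """Online learning of typical verdicts, assuming bad behavior is rare."""

    def __init__(self):
        self.history = defaultdict(Counter)   # (grouping, test) -> verdict counts

    def predict(self, grouping, test):
        counts = self.history[(grouping, test)]
        return counts.most_common(1)[0][0] if counts else "PASS"

    def observe(self, grouping, test, verdict):
        # Updating after every evaluation is what lets CAS adapt when a push
        # is retried and the "anomalous" behavior turns out to be normal.
        self.history[(grouping, test)][verdict] += 1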
Adaptive behavior poses a risk if a
user keeps retrying a push when an
anomaly is actually dangerous: CAS
eventually starts treating this risky behavior as the new norm and no longer
flags it as problematic. This risk becomes less severe as the automation
becomes more mature and reliable, as
users are less inclined to blindly retry (on the assumption that the evaluation is incorrect) and
more inclined to actually debug when
CAS reports a failure.
Offline supporting processes can
supplement the standard online learning.
Breakdown of observations.
Intuitively, you know that comparing
the same metrics across different
binaries may yield different results.
Even if you look at the same metric
(RPC latency, for example), a stateful
service such as Bigtable may behave
quite differently from a stateless Web
search back end. Depending on the
binary being evaluated, you may want
to choose different parameters from
the statistical tests, or even different
statistical tests altogether.
Rather than attempting to perform in-depth discovery of potential
functional dependencies, CAS breaks
down observations across dimensions based upon past experiences
with running production systems.
You may well discover other relevant
dimensions over time.
Currently, the system groups observations by the following factors:
˲ Data source. Are you observing
process crash rate, RPC latency, or
something else? Each data source is