relatively few unexpected risks. It still
greatly improves automated canarying. The developers are actively working on improvements.
Time series aggregate models. While
the meta-analysis of the results of hard-coded statistical functions has worked
well for the initial launch of automatic
configuration, this approach is crude
and inflexible. Rather than storing results of statistical tests without any
knowledge about the time series that
caused them, CAS could store data
about the time series.
Each statistical function that CAS
supports requires different data from
the time series. We could attempt to extract constant-size aggregate views on
this data, one for each statistical test.
For example, a Student’s t-test view on
the time series could be the mean value
for both populations, the population
sizes, and a variance estimate for each.
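As a sketch of what such a constant-size view could look like (the names and exact fields here are illustrative assumptions, not the CAS implementation), a Welch's t-statistic can be computed from two aggregate views without revisiting the raw time series points:

```python
import math
from dataclasses import dataclass

@dataclass
class TTestView:
    """Constant-size aggregate of one population's time series."""
    mean: float      # sample mean of the observed values
    n: int           # population size (number of observations)
    variance: float  # sample variance estimate

def welch_t(canary: TTestView, control: TTestView) -> float:
    """Welch's t-statistic computed purely from the aggregate views,
    with no access to the underlying time series points."""
    se = math.sqrt(canary.variance / canary.n + control.variance / control.n)
    return (canary.mean - control.mean) / se
```

Each supported statistical function would define its own such view; the point is that the view is fixed-size no matter how many past observations contributed to it.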
This aggregated view from many
past observations would allow synthesizing a single test for each statistical function, with the correct parameters chosen based on past data
and some policy.
This work would essentially replace half of the current autoconfiguration system.
Observation breakdowns turned out
to be the biggest contribution of the
model server to CAS as a whole, so the
development team plans to expand this
feature. Adding more breakdowns entails additional computational and storage
costs and, therefore, needs to be undertaken carefully given CAS’s large scale.
While CAS currently has breakdowns based on the object of evaluation, this could be expanded to
breakdowns by type of canarying.
Anecdotally, there have been major
differences in canary behavior when
observed using before/after tests
versus simultaneous tests of two
populations. The size of the canary
population in relation to the control
population and the absolute sizes
of the populations can also provide
useful signals.
Future work could determine if
these additional breakdowns are
worthwhile, and at what granularity to
perform them. Automatically generated decision trees may also be an option.
˲ Evaluated configuration. Each is assigned a unique identifier by fingerprinting the configuration and some
minor heuristics to remove common
sources of unimportant differences.
˲ Statistical function and parameters.
This could mean, for example, a t-test
with significance level of 0.05. Each
distinct statistical function and parameter set is assigned a unique identifier.
˲ Application binary.
˲ Geographical location. This refers
to locations of the canary and control.
˲ Process age. Has the process recently
restarted? This helps distinguish
a configuration push (which might
not restart the process) from a binary
update (which likely would).
˲ Additional breakdowns, such as different RPC methods. For example, reading a row in BigTable may behave very
differently from deleting the entire
table. This breakdown depends on the data source.
˲ Time of observation. This is kept at
daily granularity for system efficiency.
These factors combine with the
count of each observed verdict to make
a model. A model knows only
identifiers; it has no understanding of the
data sources, statistical functions, or configurations behind those identifiers.
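A model along these lines might be sketched as follows; the field names, the fingerprinting scheme, and the omission of the real deduplication heuristics are all assumptions for illustration:

```python
import hashlib
import json
from collections import Counter
from dataclasses import dataclass, field

def config_fingerprint(config: dict) -> str:
    """Stable identifier for a configuration: hash a canonical
    serialization. (The real heuristics for removing unimportant
    differences are not shown.)"""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

@dataclass
class Model:
    """Identifying factors plus verdict counts. The model holds only
    opaque identifiers, with no knowledge of what they represent."""
    config_id: str
    stat_fn_id: str            # statistical function + parameter set
    binary: str
    location: str              # canary/control geographical location
    recently_restarted: bool   # process age signal
    extra_breakdowns: tuple    # e.g. RPC method, depending on data source
    day: str                   # time of observation, daily granularity
    verdicts: Counter = field(default_factory=Counter)  # PASS/FAIL/NONE counts
```

Because the fingerprint is computed over a canonical serialization, two configurations that differ only in field order map to the same identifier.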
Prediction selection. All models
pertaining to a particular binary are
fetched across all statistical functions
for which there is an observation, and
across all data sources.
For each statistical function and
each data source, the weighted sum
of the previously observed behaviors
is calculated for each possible result.
Similarity is weighted both by heuristic similarity of features (process
age and geographical location) and
by the age of the model. Because additional breakdowns such as RPC
methods do not have a usable similarity metric, the additional matching breakdowns are simply filtered
in, with no further weighting.
For a single statistical function and
a single data source, we generate a
score for each possible verdict (PASS,
FAIL, or NONE). We calculate this
score from a weighted sum of past observations. Weighting is based upon
factors like age of the observation
and similarity of the observation to
the current situation (for example, do
both observations pertain to the same geographical location).
Each statistical function has a minimum pass ratio. The ratio sum[PASS] /
(sum[PASS] + sum[FAIL] + sum[NONE])
must be greater than the minimum for
a PASS prediction. Otherwise, the prediction is FAIL.
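The scoring and thresholding described above can be sketched as follows; `weight_fn` stands in for the heuristic similarity-and-age weighting, and all names are hypothetical rather than the CAS implementation:

```python
def predict(observations, weight_fn, min_pass_ratio):
    """observations: iterable of (verdict, features) pairs drawn from
    past models for one statistical function and one data source.
    weight_fn scores each observation by similarity and recency.
    Returns PASS only if the weighted pass ratio clears the threshold."""
    score = {"PASS": 0.0, "FAIL": 0.0, "NONE": 0.0}
    for verdict, features in observations:
        score[verdict] += weight_fn(features)
    total = score["PASS"] + score["FAIL"] + score["NONE"]
    if total == 0:
        # No usable history: fall back to the most generous behavior.
        return "PASS"
    return "PASS" if score["PASS"] / total > min_pass_ratio else "FAIL"
```

Raising `min_pass_ratio` makes a function stricter: the same history of occasional failures that passes under a lenient threshold is flagged under a strict one.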
This ratio allows CAS to impose a
notion of strictness on various functions, while being tolerant of “normal” volatile behavior. For example,
consider two statistical functions:
one that tolerates only 1% deviation
between canary and control, and one
that tolerates 10%. The former can
be given a very high minimum pass
ratio, and the latter a lower one. If
the metric fluctuates more than 1%
in normal operation, CAS quickly
learns that behavior and stops flagging it. If that fluctuation is a one-off, CAS flags it, the system recovers, and over time CAS relearns that
normal behavior includes only deviations under 1%. CAS intentionally
takes longer to learn normal behavior for larger tolerated fluctuations,
so in this example, CAS will learn at a
slower rate for the 10% case.
Bootstrapping. When a user initially
submits a configuration that evaluates
a metric, no past behavior exists to
use for prediction. To bootstrap such
cases, CAS looks for past evaluations
that could have used this config and
runs those evaluations to collect observations for the model server. With
enough recent evaluations, CAS will
already have useful data the first time a
user requests an evaluation.
If such bootstrapping is not possible, the model server reverts to the
most generous behavior possible.
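The bootstrapping flow might be sketched like this; the matching rule, the callback names, and the evaluation interface are all stand-in assumptions:

```python
def bootstrap(new_config, past_evaluations, run_evaluation, store_observation):
    """Seed models for a never-before-seen configuration by replaying
    past evaluations that could have used it, recording each verdict
    as an observation for the model server."""
    seeded = 0
    for ev in past_evaluations:
        # Stand-in match rule: the past evaluation covers the same metric.
        if ev.get("metric") == new_config.get("metric"):
            verdict = run_evaluation(new_config, ev["data"])
            store_observation(new_config, verdict)
            seeded += 1
    return seeded
```

If enough recent evaluations match, the model server already has usable verdict counts by the time the user first asks for a prediction.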
Arbitrary input analysis. The behavior-prediction mechanism is also the
first attempt at arbitrary input analysis,
which allows modeling behavior for
tests when there is no prior knowledge
of what they are about.
When a user configures canarying
on RPC error ratio, CAS knows in advance that the values are between 0.0
and 1.0, and that higher is worse. For a
user-supplied query against the monitoring data, CAS has no such knowledge and can only apply a battery of
tests and observe the differences.
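A battery of this kind could be sketched as follows; the particular comparisons are illustrative stand-ins, chosen only to show that none of them assumes a value range or a "higher is worse" direction:

```python
import statistics

def battery(canary, control):
    """Apply generic comparisons to two series whose semantics are
    unknown; CAS can only observe differences, not interpret them."""
    results = {}
    # Shift in central tendency, sign uninterpreted.
    results["mean_shift"] = statistics.mean(canary) - statistics.mean(control)
    # Relative spread (epsilon guards against constant series).
    results["spread_ratio"] = (statistics.pstdev(canary) + 1e-9) / (
        statistics.pstdev(control) + 1e-9)
    # Do the observed value ranges overlap at all?
    results["range_overlap"] = min(max(canary), max(control)) >= max(
        min(canary), min(control))
    return results
```

Behavior prediction then operates on the observed outcomes of these tests, with no built-in knowledge of what the underlying metric means.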
Despite some significant issues, as
we will discuss, the CAS development
team chose this approach because
they were confident that it would have