Priming with steady-state data. CAS
sees only production changes. Currently, it does not learn that a particular metric is erratic even in steady state.
Data about metric behavior outside
of production changes could be used
to define the typical noise in the data.
CAS would fail a canary only if the deviation is above this typical noise level.
The noise data could come from analyzing only the control population for
every evaluation, because the control
population is expected to have no production changes.
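This priming idea can be sketched in a few lines. The sketch below is illustrative only, not CAS's actual mechanism; the `k`-standard-deviation noise band and all function names are assumptions:

```python
import statistics

def noise_threshold(control_history, k=3.0):
    """Estimate typical steady-state noise as k sample standard
    deviations around the control population's historical mean."""
    mean = statistics.fmean(control_history)
    return mean, k * statistics.stdev(control_history)

def canary_fails(canary_value, control_history, k=3.0):
    """Fail the canary only if its deviation from the control mean
    exceeds the typical noise observed in steady state."""
    mean, band = noise_threshold(control_history, k)
    return abs(canary_value - mean) > band
```

Because the control population carries no production changes, `control_history` could be accumulated from every past evaluation, giving a picture of each metric's steady-state behavior.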
Same environment overfitting. CAS
autoconfiguration’s most significant
issue is overfitting data when there is
already a rich history of past observations in exactly the same environment.
In this scenario, only the historical
data of that environment is used.
This behavior has some caveats.
Consider a rollout of a new version
of a system that takes twice as long to
handle each RPC call but does a significantly better job. CAS would flag the
longer RPC handling time as anomalous behavior for each geographical
location of the rollout, causing the
release owner undue hardship. The mitigation is to carefully adjust the heuristics for selecting relevant environments so that they include data beyond the same environment.
User mistrust. CAS is useful but far
from perfect. It has experienced incidents when users disregarded a canary
failure and pushed a broken release.
User mistrust in complex automation
is at the root of many of these issues.
The CAS developers are tackling
this mistrust by explicitly explaining,
in human-friendly terms that do not require knowledge of statistics, why CAS
reaches a particular conclusion. This includes both a textual explanation and a graphical presentation of the data behind the decision.
Relative comparisons only.
Because the model server stores only
the outcomes of statistical functions
without knowing the input values,
CAS does not know the typical values
for a time series.
Not knowing the semantics of the
data implies that the tests being run
are purely relative comparisons, such
as a t-test with the null hypothesis
that the metric did not increase by
more than 5%. While relative compari-
sons are easy to reason about, they be-
have extremely poorly if the provided
time series value is typically zero, or if
a large relative change occurs in abso-
lute numbers too small to be impor-
tant to the service owner.
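One way to picture such a relative test is a one-sided Welch t-statistic against a control population inflated by the allowed increase. This is a sketch under assumed details, not the actual CAS implementation; the 5% threshold comes from the text, but `welch_t` and its exact shape are made up:

```python
import math
from statistics import fmean, stdev

def welch_t(canary, control, allowed_increase=0.05):
    """Welch t-statistic for the null hypothesis that the canary
    metric did not increase by more than allowed_increase (5%)
    relative to the control population."""
    baseline = [x * (1.0 + allowed_increase) for x in control]
    m_c, m_b = fmean(canary), fmean(baseline)
    v_c, v_b = stdev(canary) ** 2, stdev(baseline) ** 2
    # A large positive statistic is evidence against the null.
    return (m_c - m_b) / math.sqrt(v_c / len(canary) + v_b / len(baseline))
```

The zero-baseline pathology is visible directly: if both populations sit constantly at zero, the variance terms vanish and the statistic divides by zero.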
This is a significant limitation of
the mechanism. While it has not had
much practical impact in real-world
operation, especially given existing
trivial workarounds, it merits improve-
ment. Numerous improvements can
be made to this mechanism, some
quite simple. In addition to the fu-
ture work mentioned previously, can-
didates include standard deviation
analysis and looking at past observed
behavior of the metric.
Scale limitations on the input values.
As CAS uses only a hard-coded set of
statistical functions and their param-
eters, the system is somewhat inflex-
ible about analyzing inputs outside of
the expected input scale. For example,
if the range of 1% through 100% differ-
ence is covered, what about the sys-
tems and metrics where a difference
of 200% is normal? What if even a 1%
difference is unacceptable?
The CAS developers did not expect this to be a significant limitation in practice, and experience has thankfully borne that out.
Most metrics meriting canary analysis
turn out to contain some noise; conversely, most of our A/B testing hopes
to see little difference between the two
populations, so large differences are
unexpected and therefore noticed.
Good health metrics are surprisingly
rare. The best way to use CAS is to employ a few high-quality metrics that
are clear indicators of system health:
suitable metrics are stable when healthy, and they change drastically when the system misbehaves.
Often, the best canarying strategy is to choose metrics tied to SLOs (service-level objectives). CAS automatically integrates with an SLO tracking system
to apply servicewide SLOs and some
heuristics to scale them appropriately
to the canary size.
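One plausible shape for such a scaling heuristic, purely as a sketch (the uniform-distribution assumption and the `floor` guard are assumptions here, not CAS's documented behavior):

```python
def scaled_slo_threshold(service_slo_errors, total_tasks, canary_tasks,
                         floor=1.0):
    """Scale a service-wide error budget down to a canary population,
    assuming errors distribute uniformly across tasks."""
    fraction = canary_tasks / total_tasks
    # Guard against impossibly strict budgets for tiny canaries.
    return max(service_slo_errors * fraction, floor)
```

For example, a budget of 1,000 errors across 500 tasks scales to 10 for a 5-task canary, while the floor keeps a 1-task canary from being held to a fractional error budget.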
Setting an SLO is a complex process
connected to business needs, and SLOs
often cover an entire service rather than
individual components. Even if a canary of a single component misbehaves, the service as a whole may still be within its SLO.