Evaluate() call is sent. Instead, analysis starts on GetResult(), since that
indicates someone is interested in the
result. As an optimization, the analysis
actually starts on Evaluate(), but in
order to set appropriate expectations
with the client, this optimization is not
part of the API definition.
GetResult() takes one parameter:
the Evaluation ID. This RPC blocks until
the analysis process is finished, which
can take between a few seconds and a
few minutes after the end time of the
request. GetResult() is idempotent.
For the sake of reliability, CAS developers designed the system with two
calls. This setup allows the system to
resume processing a request without
requiring complex client cooperation.
This reliability strategy played out in
practice when a bug in a library made
all CAS processes crash every five to 10
minutes. CAS was still able to serve all
user requests, thanks to the robust API.
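The two-call pattern can be sketched from the client's side. This is a minimal illustration, not the CAS client: `FakeCasStub`, `Unavailable`, and `wait_for_verdict` are hypothetical names, and the stub simulates server restarts in memory. Because GetResult() is idempotent, the client only needs to repeat that one call with the same Evaluation ID.

```python
import itertools


class Unavailable(Exception):
    """Raised by the hypothetical stub when the serving process restarted."""


class FakeCasStub:
    """In-memory stand-in for a CAS server; not the real API surface."""

    def __init__(self, crashes_before_success=2):
        self._ids = itertools.count(1)
        self._verdicts = {}
        self._crashes_left = crashes_before_success

    def evaluate(self, request):
        # Returns an ID right away; the fake stores a fixed verdict for it.
        evaluation_id = f"eval-{next(self._ids)}"
        self._verdicts[evaluation_id] = "PASS"
        return evaluation_id

    def get_result(self, evaluation_id):
        # Simulate the serving process crashing mid-call a few times.
        if self._crashes_left > 0:
            self._crashes_left -= 1
            raise Unavailable("server restarted")
        return self._verdicts[evaluation_id]


def wait_for_verdict(stub, evaluation_id, max_attempts=5):
    """Retry only the idempotent GetResult() call, never Evaluate()."""
    for attempt in range(1, max_attempts + 1):
        try:
            return stub.get_result(evaluation_id), attempt
        except Unavailable:
            continue
    raise TimeoutError(f"no verdict after {max_attempts} attempts")


stub = FakeCasStub(crashes_before_success=2)
evaluation_id = stub.evaluate({"trials": []})
verdict, attempts = wait_for_verdict(stub, evaluation_id)
print(verdict, attempts)
```

With a single long-running RPC, each simulated crash would instead force the client to resubmit the whole request.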
There are some obvious alternatives
to this design. CAS developers decided
against using a single long-running
RPC: since these calls are fundamentally point-to-point connections between two Unix processes, disruption
(for example, because one process restarted) would lead to a full retry from
the client side. The original design doc
included a large number of options,
each with trade-offs tied to nuanced
properties of Google’s infrastructure.
Evaluation structure. While the RPC
returns only a simple PASS/FAIL verdict, the underlying analysis consists
of several components.
The lowest-level unit is a check, a
combination of time series from the
canary population, time series from
the control population, and a statistical function that turns both time series
into an unambiguous PASS/FAIL verdict. Some example checks might be:
˲ Crash rate of the canary is not significantly greater than the control.
˲ RPC error ratio is not significantly
greater than the control.
˲ Size of dataset loaded in memory
is similar between canary and control.
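A check as described above can be sketched as two time series plus a function that compares them. The comparison used here (canary mean no more than 10% above the control mean) is only a placeholder assumption, not the statistical function CAS uses, and the `Check` class and sample values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence

Series = Sequence[float]


def not_much_greater(canary: Series, control: Series, tolerance: float = 0.10) -> bool:
    """Placeholder comparison: canary mean at most `tolerance` above control mean.

    A real check would use a proper significance test; this is only a stand-in.
    """
    return mean(canary) <= mean(control) * (1 + tolerance)


@dataclass
class Check:
    name: str
    canary: Series
    control: Series
    passes: Callable[[Series, Series], bool]

    def verdict(self) -> str:
        # Every check collapses to an unambiguous PASS or FAIL.
        return "PASS" if self.passes(self.canary, self.control) else "FAIL"


crash_rate = Check(
    "crash_rate", [0.010, 0.012, 0.011], [0.010, 0.011, 0.010], not_much_greater
)
error_ratio = Check(
    "rpc_error_ratio", [0.09, 0.11, 0.10], [0.02, 0.03, 0.02], not_much_greater
)
print(crash_rate.verdict(), error_ratio.verdict())
```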
As mentioned in the API description, each evaluation request can
define multiple trials (that is, pairs of
canary and control populations). Evaluation
of each trial results in a collection
of checks. If any check in any trial
fails, the entire evaluation is declared a
failure, and FAIL is returned.
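The aggregation rule is simple enough to state in a few lines. The sketch below assumes each trial has already been reduced to a list of per-check verdicts; the function name and the trial names are hypothetical.

```python
from typing import Dict, List


def overall_verdict(trials: Dict[str, List[str]]) -> str:
    """Map of trial name -> per-check verdicts; any single FAIL fails everything."""
    all_passed = all(
        verdict == "PASS"
        for check_verdicts in trials.values()
        for verdict in check_verdicts
    )
    return "PASS" if all_passed else "FAIL"


healthy = {"frontend": ["PASS", "PASS"], "backend": ["PASS"]}
one_bad_check = {"frontend": ["PASS", "PASS"], "backend": ["PASS", "FAIL"]}
print(overall_verdict(healthy), overall_verdict(one_bad_check))
```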
Currently, trials are implemented
to be fairly independent, though a
given evaluation request might have
multiple trials if they look at two
related but different components. For
example, consider an application with
a front end and a back end. Changes
on the front end can trigger bad
behavior on the back end, so you need to
compare:
˲ The canary front end to the
production front end.
˲ The back end receiving traffic
from the canary front end to a back
end receiving traffic from the
production front end.
These are different populations,
possibly with different metrics, but
failure on either side is a potential
problem.
Configuration structure. What exactly
does a user-defined configuration
entail? While the design phase of CAS
involved lengthy philosophical discussion
about the nature of configuration,
the primary aim was simplicity. The
CAS developers did not want to force
users to learn implementation details
to encode their high-level goals into a
configuration. The intent was to ask users
only a few questions, as close to the
user’s view of the world as possible.
The individual checks that should
be executed for each matching trial
define what information is needed. For
each check, the user specifies:
˲ What it should be called.
˲ How to get the time series for the
canary and control populations.
˲ How to turn these time series into
a PASS/FAIL verdict.
The user can also include optional
pieces of information, such as a
long-form description.
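The per-check questions map naturally onto a small record type. The sketch below is an assumption about shape only: the field names, the `comparison` identifier, and the uniqueness rule in `validate` are hypothetical and do not reflect the actual CAS configuration schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CheckConfig:
    name: str                          # what the check should be called
    query: str                         # abstract query; no population named yet
    comparison: str                    # which function turns the series into a verdict
    description: Optional[str] = None  # optional extra information


def validate(checks: List[CheckConfig]) -> None:
    """Reject configurations with missing or duplicate check names."""
    seen = set()
    for check in checks:
        if not check.name:
            raise ValueError("every check needs a name")
        if check.name in seen:
            raise ValueError(f"duplicate check name: {check.name}")
        seen.add(check.name)


config = [
    CheckConfig("crash_rate", "Get crash rate", "not_significantly_greater"),
    CheckConfig(
        "cpu_usage",
        "Get CPU usage rate",
        "similar",
        description="CPU usage should not shift between canary and control.",
    ),
]
validate(config)
```

Note that the query stays abstract at configuration time; the populations are supplied later, in the evaluation request.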
Monarch is the typical source of
monitoring data for time series. 1 The
user specifies an abstract query, and
the canary and control populations are
determined at runtime in the RPC that
requests evaluation. CAS has a flexible
automatic query rewrite mechanism:
at runtime, it rewrites an abstract query
to specialize it to fetch data only for
a particular population. Say a user
configures a query, “Get CPU usage rate.”
At runtime, CAS rewrites that query as
“Get CPU usage rate for job foo-server
replicas 0, 1, 2.” This rewrite happens
under the hood.
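The rewrite can be pictured as filling in the population on an otherwise abstract query. The sketch below uses a toy `Query` record rather than any real Monarch query syntax; `Population`, `Query`, and `specialize` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Population:
    job: str
    replicas: Tuple[int, ...]


@dataclass(frozen=True)
class Query:
    metric: str
    job: Optional[str] = None                   # unset while the query is abstract
    replicas: Optional[Tuple[int, ...]] = None

    def describe(self) -> str:
        text = f"Get {self.metric}"
        if self.job is not None:
            ids = ", ".join(str(r) for r in self.replicas or ())
            text += f" for job {self.job} replicas {ids}"
        return text


def specialize(query: Query, population: Population) -> Query:
    """Return a copy of the abstract query restricted to one population."""
    return replace(query, job=population.job, replicas=population.replicas)


abstract = Query("CPU usage rate")
canary = Population("foo-server", (0, 1, 2))
specialized = specialize(abstract, canary)
print(abstract.describe())
print(specialized.describe())
```

The same abstract query can be specialized once for the canary population and once for the control population, which is what lets the user configure it without naming either.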