until the evaluation is complete. If the
coordinator is not tracking the evaluation (for example, if a restart resulted
in lost state) or if no coordinator is assigned, the RPC front end chooses a
coordinator, stores that information in
the database, and calls AwaitEvaluation(). These retries are limited.
If the evaluation has already finished, the RPC front end does not contact the coordinator and immediately
returns the results from the database
to the caller.
Handling parallel GetResult() calls
is very cheap for the RPC front end.
Selecting one coordinator avoids duplication of expensive work unless the
client requests two duplicate and independent evaluations.
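The GetResult() handling described here can be sketched roughly as follows. This is a minimal illustration, not the CAS implementation: the dictionary standing in for the database, the `pick_coordinator` callback, the `CoordinatorUnavailable` exception, and the `max_attempts` value are all assumptions made for the example.

```python
class CoordinatorUnavailable(Exception):
    """Hypothetical signal: the coordinator is not tracking the evaluation or pushed back."""


def get_result(db, evaluation_id, coordinators, pick_coordinator, max_attempts=3):
    """Return the evaluation result, contacting a coordinator only if needed.

    db is a dict standing in for the database: evaluation_id -> row dict with
    "result" and "coordinator" keys. coordinators maps a name to an object
    exposing await_evaluation(evaluation_id).
    """
    row = db[evaluation_id]
    # Finished evaluations are served straight from the database.
    if row.get("result") is not None:
        return row["result"]
    for _ in range(max_attempts):  # retries are limited
        name = row.get("coordinator")
        if name is None:
            # No coordinator assigned: choose one and record the choice.
            name = pick_coordinator()
            row["coordinator"] = name
        try:
            # Blocks until the evaluation is complete.
            return coordinators[name].await_evaluation(evaluation_id)
        except CoordinatorUnavailable:
            # If the database still names the same coordinator, assume it lost
            # state and clear the assignment so a new one is chosen next time.
            if row.get("coordinator") == name:
                row["coordinator"] = None
    raise RuntimeError("retry limit reached for evaluation " + str(evaluation_id))
```

Because a finished evaluation never reaches a coordinator, repeated or parallel calls for the same identifier cost only a database read in this sketch.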
Coordinator. The coordinator keeps
all evaluations it is currently processing
in memory. Upon AwaitEvaluation(),
the coordinator checks whether the
evaluation is being processed. If so, the
coordinator simply adds this RPC to the
set of RPCs awaiting the result.
If the evaluation is not being processed, the coordinator transactionally
takes ownership of the evaluation in
the database. This transaction can fail
if another coordinator independently takes ownership (for example, because of a race condition), in which
case the coordinator pushes back to
the RPC front end, which then contacts
the new canonical coordinator.
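As a rough sketch of the ownership handoff just described, the example below uses a lock-protected compare-and-set as a stand-in for the database transaction. The `Database` and `Coordinator` classes, the return strings, and the rule that a claim succeeds only when the evaluation is unowned or already owned by the claimant are assumptions for illustration; the source does not specify the transaction's exact conditions.

```python
import threading


class Database:
    """Hypothetical stand-in for the shared database's ownership column."""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = {}  # evaluation_id -> coordinator name

    def take_ownership(self, evaluation_id, claimant):
        """Atomically claim the evaluation; fail if another coordinator owns it."""
        with self._lock:
            current = self.owner.get(evaluation_id)
            if current not in (None, claimant):
                return False
            self.owner[evaluation_id] = claimant
            return True


class Coordinator:
    def __init__(self, name, db):
        self.name = name
        self.db = db
        # Evaluations currently being processed, kept in memory:
        # evaluation_id -> list of RPCs awaiting the result.
        self.in_progress = {}

    def await_evaluation(self, evaluation_id, rpc):
        waiters = self.in_progress.get(evaluation_id)
        if waiters is not None:
            # Already processing: just add this RPC to the waiting set.
            waiters.append(rpc)
            return "joined"
        if not self.db.take_ownership(evaluation_id, self.name):
            # Another coordinator won; push back to the RPC front end.
            return "pushback"
        self.in_progress[evaluation_id] = [rpc]
        return "started"
```

In this sketch a pushback carries no payload; the front end would reread the database to find the coordinator that now owns the evaluation.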
Upon receiving a new evaluation,
the coordinator does the following:
1. Retrieves fully qualified and unambiguous expanded configuration
from the config server. The coordinator
now has the full set of all checks to run.
2. Fans out each check to evaluators.
3. Calls the model server to obtain
predicted behavior for checks, simultaneously reporting the results of the
checks in the current evaluation.
4. Responds to all waiting AwaitEvaluation() RPCs with the final
The coordinator checkpoints progress to the database throughout.
Checkpoints occur after a coordinator receives a fully qualified configuration and asynchronously as evaluators return check-evaluation requests.
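The checkpointing pattern above might look something like the following sketch, which shows how recorded progress lets interrupted work be picked up without repeating completed checks. It is deliberately simplified: checks run sequentially rather than being fanned out asynchronously, the model-server step is omitted, and the `checkpoints` dictionary, `fetch_config`, and `run_check` are hypothetical stand-ins for the database, the config server, and the evaluators.

```python
def run_evaluation(evaluation_id, checkpoints, fetch_config, run_check):
    """Run all checks for an evaluation, recording progress after each step.

    checkpoints is a dict standing in for the database. fetch_config returns
    the expanded list of check names; run_check evaluates a single check.
    """
    state = checkpoints.setdefault(evaluation_id, {"checks": None, "results": {}})

    # First checkpoint: the fully qualified configuration.
    if state["checks"] is None:
        state["checks"] = list(fetch_config(evaluation_id))

    for check in state["checks"]:
        if check in state["results"]:
            continue  # already checkpointed, possibly by an earlier coordinator
        # Checkpoint each check result as soon as it is available.
        state["results"][check] = run_check(check)

    return state["results"]
```

Calling `run_evaluation` again with the same `checkpoints` store after a failure re-runs only the checks whose results were never recorded.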
If the coordinator dies, a new one
takes over, reads progress from the
database, and continues from the last
for both the canary population and the
control, resulting in two queries.
It is possible, although uncommon, to specify different queries
for the canary and the control. The
queries are still subject to rewriting,
which guarantees that they will fetch
data only for the objects that are actually being evaluated.
To simplify configuration, there are
also common queries. These are canned
queries curated by the CAS team, such
as crash rate, RPC server error ratio,
and CPU utilization. These offer known
semantics, for which CAS can provide
better quality analysis.
Finally, there needs to be a way to
turn the time series (possibly multiple streams) obtained by running the
Monarch query for canary and control
populations into an unambiguous verdict. The user can choose from a family
of tests. Some tests (such as Student’s
t-test) have a clear statistical origin,
while others contain custom heuristics
that attempt to mimic how a human
would evaluate two graphs.
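To make the statistical end of that family concrete, here is a minimal sketch of a pooled two-sample Student's t-test applied to canary and control samples. How CAS actually maps a test onto a verdict is not described here, so the `verdict` function, its PASS/FAIL labels, and the caller-supplied `critical_value` (taken from a t-table for the chosen significance level) are assumptions for illustration.

```python
import math
import statistics


def student_t_statistic(canary, control):
    """Return (t, degrees_of_freedom) for a pooled two-sample Student's t-test."""
    n1, n2 = len(canary), len(control)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two points")
    mean1, mean2 = statistics.fmean(canary), statistics.fmean(control)
    var1, var2 = statistics.variance(canary), statistics.variance(control)
    dof = n1 + n2 - 2
    # Pooled variance assumes both populations share the same variance.
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / dof
    if pooled == 0:
        raise ValueError("both samples have zero variance")
    t = (mean1 - mean2) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, dof


def verdict(canary, control, critical_value):
    """Hypothetical verdict: FAIL when the canary differs significantly from control."""
    t, _ = student_t_statistic(canary, control)
    return "FAIL" if abs(t) > critical_value else "PASS"
```

For example, with three points per population (four degrees of freedom) the two-sided 5% critical value is about 2.776, so `verdict([10, 11, 12], [1, 2, 3], 2.776)` reports a significant difference while `verdict([2, 3, 4], [1, 2, 3], 2.776)` does not.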
As discussed later, automatic analyses are applied when a user chooses the default configuration, and also to user-supplied queries for which the user does not
specify a statistical test.
and Request Flow
Figure 2 illustrates the components of
the CAS system. This section describes
the role of each component and the
CAS request flow.
Spanner database. The Spanner
database is a shared synchronization
point for the evaluation flow; almost all
components write to it. It is the canonical storage for evaluation progress and
RPC front end. The rollout tool
sends Evaluate() calls to the RPC
front end, which is intentionally very
simple. The front end generates a
unique identifier for the evaluation,
stores the entire evaluation request
in the database (with the unique
identifier as primary key), and returns the identifier.
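The Evaluate() path is simple enough to sketch in a few lines. The example below is an illustration under stated assumptions: `InMemoryDatabase` is a hypothetical stand-in for the Spanner database, the row layout is invented for the example, and a random UUID is just one possible way to generate a unique identifier.

```python
import uuid


class InMemoryDatabase:
    """Hypothetical stand-in for the database, keyed by evaluation identifier."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, row):
        # Mimic a primary-key constraint.
        if key in self.rows:
            raise KeyError("duplicate primary key: " + key)
        self.rows[key] = row


def evaluate(db, request):
    """Store the whole request under a fresh identifier and return the identifier."""
    evaluation_id = uuid.uuid4().hex
    db.insert(evaluation_id, {"request": request, "coordinator": None, "result": None})
    return evaluation_id
```

Note that no analysis work starts here; in this sketch, as in the description above, the call only records the request and hands back a handle for later GetResult() calls.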
GetResult() calls also land on the
RPC front end, which queries the database to see if a coordinator is already
working on the evaluation. If so, the RPC
front end sends an AwaitEvaluation()
RPC to the coordinator, which blocks