The online part of RC is a REST
(Representational State Transfer) service within which the models execute
to produce predictions. RC’s clients
(for example, the container scheduler)
call the service passing as input the
model name and information about
the container(s) for which they want
predictions, for example, the subscription identifier. The model may
require historical feature data as additional inputs, which RC fetches from
Azure Storage. As an example of feature data, the lifetime model requires
information on historical lifetimes
(for example, percentage of short-lived
and long-lived containers to date) for
the same subscription from the store.
Each prediction result is a predicted
value and a score. The score reflects
the model’s confidence in the predicted value. The client may choose to ignore a prediction when the score is too
low. It may also ignore (or not wait for)
a prediction if it thinks that RC is misbehaving (or unavailable).
RC relies heavily on caching, as clients may have stringent performance
requirements. It caches prediction
results, models, and feature data
fetched from the store in memory.
Current ML models. RC acts as a framework for offline training of ML models
and serving predictions from them online; RC is agnostic to the specific modeling approach data analysts select. In
our current implementation, analysts
can select models from a large repository that runs on Cosmos. The three leftmost columns of Table 1 list some of the
container behaviors we predict and the
modeling approach we currently use:
Gradient Boosting Trees (GBTs).18 We
are also experimenting with deep neural
networks and plan to start using them in
the next version of RC.
For classifying numeric behaviors,
we divide the space of possible values
into buckets (for example, 0%–24%,
25%–49%, and so on) and then predict a
bucket. (As we will discuss, this approach
has been more accurate for our datasets
than using regression and then bucketizing the result.) When the prediction
must be converted to a number, the client can assume the highest, middle, or
lowest value for the predicted bucket.
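The bucketization scheme above can be sketched in a few lines. The 25-point bucket width follows the example in the text; the exact boundaries used by real RC models are an assumption here.

```python
# Numeric behaviors (for example, utilization percentages) are mapped to
# buckets, which become the class labels; a predicted bucket is mapped
# back to a number when the client needs one.
BUCKETS = [(0, 24), (25, 49), (50, 74), (75, 100)]

def to_bucket(value):
    """Map a numeric value in [0, 100] to its bucket index (the class label)."""
    for i, (lo, hi) in enumerate(BUCKETS):
        if lo <= value <= hi:
            return i
    raise ValueError(f"value {value} outside bucket range")

def from_bucket(index, mode="middle"):
    """Convert a predicted bucket back to a number, as a client might."""
    lo, hi = BUCKETS[index]
    if mode == "lowest":
        return lo
    if mode == "highest":
        return hi
    return (lo + hi) / 2  # middle of the bucket
```

The client's choice of `lowest`, `middle`, or `highest` corresponds to how conservative it wants to be with the prediction.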
Feature engineering. Each model
takes many features as input. RC uses
customer, container, and/or server
features to identify correlations that
managers can leverage in
their decision-making. The managers
query RC with a subset of the features,
expecting to receive predictions for the
others. For example, the scheduler may
query RC while providing the customer
name and type, deployment type and
time, and container role name. RC will
then predict how large the deployment
by this customer may become and how
high these containers’ resource utilization may get over time.
Architecture. Design rationale. Our
design for RC follows several basic
principles related to the dimensions
we discussed previously and our ability to operate, maintain, and extend it:
1. Since application-level performance data is rarely available, RC
should learn from low-level counters.
2. For generality, modularity, and
debuggability, RC should be oblivious to the management policies and,
instead, provide workload and infrastructure behavior predictions. It
should also provide an API that is general enough for many managers to use.
3. For performance and availability,
RC should be an independent system
that is off the critical performance and
availability paths of the managers that
use it whenever possible.
4. Since workload characteristics
and server behaviors change slowly,
RC can learn offline and serve predictions online. For availability, these two
components should be able to operate
independently of each other.
5. For maintainability, it should be
simple and rely on existing, well-supported infrastructure.
6. For usability, it should require
minimal modifications to the resource managers.
Design. Figure 2 illustrates how we
designed RC based on these principles.
The offline workflow consists of data
extraction, cleanup, aggregation, feature data generation, training, validation, and ML model generation. RC
does these tasks on Cosmos,7 a massive
data processing system that collects all
the container and server telemetry from
the fabric. RC orchestrates these phases, sanity-checks the models and feature data, and publishes them to Azure
storage, a highly available store. RC currently retrains models once a day.
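The offline workflow can be skeletonized as below. Every function name and body here is a placeholder assumption (the real phases run as Cosmos jobs over fabric telemetry); the sketch only shows the phase ordering and the sanity check that gates publication to storage.

```python
# Hypothetical skeleton of RC's daily offline workflow: extraction and
# cleanup, feature generation, training, validation, and publication.

def extract_and_clean():
    # Stand-in for pulling container/server telemetry from Cosmos.
    return [{"subscription": "s1", "lifetime_min": 8, "label": 0}]

def generate_features(rows):
    # Stand-in for aggregation and feature-data generation
    # (e.g., share of short-lived containers per subscription).
    return rows

def train(rows):
    # Stand-in for fitting a model such as gradient boosting trees.
    return {"kind": "GBT", "trained_on": len(rows)}

def validate(model):
    # Sanity-check the model before it can be published.
    return model["trained_on"] > 0

def publish(model):
    # Stand-in for uploading the model and feature data to storage.
    return f"published {model['kind']} model"

def daily_retrain():
    rows = generate_features(extract_and_clean())
    model = train(rows)
    if not validate(model):
        return "kept previous model"  # a failed check leaves the old model serving
    return publish(model)
```

Because publication is gated by validation, the online serving side keeps operating on the last good model even when a retraining run fails, which matches the principle that the two components operate independently.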