Integration vs. separation. Related
to this dimension is the question
of whether the ML should be fully integrated or completely separate from
managers. When the ML outputs actions, the fully integrated design is a
natural one, as in Figure 1 (bottom).
For insights, both integration and
separation are viable options. However, for generality and maintainability, cleanly separating the ML and the
managers via well-defined APIs is beneficial: multiple managers can use the
same ML implementation, which the
platform can maintain independently
of the managers.
Immediate vs. delayed feedback. A
final dimension is whether the ML is
able to observe the result of its previous
outputs or the manager actions within
a short time. Designs that produce ML
models offline will likely observe these
effects only at a coarse time granularity
(for example, daily). Such granularity is
a good match when the input feature
characteristics also change slowly. However, techniques such as reinforcement
learning and bandit learning often
benefit from actions being observable
much sooner. For such techniques, offline model learning may not be ideal.
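To make the contrast concrete, the sketch below shows a minimal epsilon-greedy bandit, in which each action's reward is observed immediately and folded into the value estimates on the very next round. The reward function and arm values are invented for illustration; this is not RC's learning machinery.

```python
import random

def epsilon_greedy_bandit(reward_fn, n_arms, rounds, epsilon=0.1, seed=0):
    """Illustrative epsilon-greedy bandit: every action's reward is
    observed immediately, so estimates improve each round -- the kind
    of fast feedback that offline model learning cannot provide."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = reward_fn(arm)                                    # immediate feedback
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
    return estimates

# Hypothetical deterministic rewards: arm 2 is best and should be discovered.
est = epsilon_greedy_bandit(lambda a: [0.2, 0.5, 0.8][a], n_arms=3, rounds=2000)
```

With delayed (for example, daily) feedback, each of these 2,000 update steps would instead take a full reporting period, which is why such techniques favor designs that observe actions quickly.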
RC is one point in this multidimensional space. We built it as a general ML
and prediction-serving system into the
Azure Compute fabric. RC9 learns from
low-level counters from all containers
and servers, produces various behavioral models offline, and provides predictions online to multiple managers via a
simple REST API.
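To illustrate what a clean ML/manager separation over a simple REST API looks like, the following self-contained sketch stands up a toy prediction endpoint and queries it from a "manager." The URL path, request fields, bucket labels, and confidence score are all hypothetical; RC's actual interface is not reproduced here.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

# Stub standing in for an offline-trained behavioral model (hypothetical).
MODEL = {"avg_cpu_utilization": "25-50%"}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        behavior = body.get("behavior")
        resp = {"behavior": behavior,
                "bucket": MODEL.get(behavior, "unknown"),
                "confidence": 0.87}  # illustrative score, not a real output
        payload = json.dumps(resp).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the example quiet
        pass

# The prediction service runs independently of any manager...
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# ...and any manager can query it with a plain HTTP POST.
req = Request(f"http://127.0.0.1:{server.server_port}/predict",
              data=json.dumps({"subscription_id": "demo",
                               "behavior": "avg_cpu_utilization"}).encode(),
              headers={"Content-Type": "application/json"})
prediction = json.loads(urlopen(req).read())
server.shutdown()
```

Because the managers see only this API, the platform can retrain, replace, or maintain the models behind it without touching any manager code.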
RC leverages a wealth of historical
data to produce accurate predictions.
For example, from the perspective of
each Azure subscription, many containers exhibit peak CPU utilizations
in consistent ranges across executions;
containers that execute user-facing
workloads consistently do so across
executions; tenant deployment sizes
are unlikely to vary widely across executions, and so on. 9 In all these cases,
prior behavior is a good predictor for
workload and infrastructure behaviors,
whereas leaving this responsibility to
managers may produce policies that
are unnecessarily general. Producing
actions may also be the only alternative when it is impractical to collect labeled training data (for example, when
fast management decisions must be
made at the servers themselves, based
on fine-grained performance data). On
the other hand, leveraging ML for insights simplifies the managers, making
them easier to understand and debug.
In fact, relying on insights is less likely
to cause negative feedback loops that
could potentially degrade customer
experience. Insights may also inform
multiple managers (for example, container scheduler and power manager),
whereas actions are manager-specific.
Figure 2. RC architecture comprising offline and online components.
Table 2. Behaviors and their buckets.

Behavior              Bucket 1   Bucket 2           Bucket 3          Bucket 4
Avg CPU utilization   0–25%      25%–50%            50%–75%           75%–100%
Deployment size       1          >1 and ≤10         >10 and ≤100      >100
Lifetime              ≤15 mins   >15 and ≤60 mins   >1 and ≤24 hrs    >24 hrs
Blackout time         ≤0.1 s     >0.1 and ≤1 s      >1 and ≤3 s       >3 s
Table 1. Behavior, ML modeling approaches, model and full feature dataset sizes.

Behavior              Approach                 #Features   Model size   Feature data size
Avg CPU utilization   Gradient Boosting Tree   247         414 KB       416 MB
Deployment size       Gradient Boosting Tree   41          351 KB       296 MB
Lifetime              Gradient Boosting Tree   247         438 KB       416 MB
Blackout time         Gradient Boosting Tree   998         290 KB       4.5 MB
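The bucketed behaviors of Table 2 can be computed with a small helper that maps a raw measurement to its bucket index given the upper boundaries of the first three buckets; the boundary lists below transcribe Table 2's rows (CPU in percent, lifetime in minutes, blackout time in seconds).

```python
def bucketize(value, boundaries):
    """Map a measurement to its 1-based bucket index, where
    `boundaries` holds the inclusive upper bound of each bucket
    except the last (which is open-ended)."""
    for i, upper in enumerate(boundaries):
        if value <= upper:
            return i + 1
    return len(boundaries) + 1

# Boundaries transcribed from Table 2.
avg_cpu_bucket  = bucketize(42.0, [25, 50, 75])       # percent -> bucket 2
lifetime_bucket = bucketize(90,   [15, 60, 24 * 60])  # minutes -> bucket 3
blackout_bucket = bucketize(0.5,  [0.1, 1, 3])        # seconds -> bucket 2
```

Coarse buckets like these keep the prediction task a small classification problem, which matches the modest model sizes reported in Table 1.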