live-migrate active containers onto a subset of the servers. A migration manager can also live-migrate (for example, low-priority) containers to alleviate any unexpected server resource contention or interference. It can use application-level performance information (when it is available) or ML techniques on lower-level performance counters to identify these behaviors. The manager can use predictions of the containers' expected lifetimes and blackout times to live-migrate only those that will likely remain active for a substantial amount of time and not incur a noticeable blackout time if migrated.
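As a rough illustration, a migration manager might select candidates as in the following minimal sketch. The predictor callables, the thresholds, and the Container fields are assumptions made for the sketch, not details of any particular platform.

from dataclasses import dataclass

@dataclass
class Container:
    id: str
    priority: int  # lower value = lower priority (hypothetical field)

def select_migration_candidates(containers, predict_lifetime_s, predict_blackout_ms,
                                min_remaining_s=600, max_blackout_ms=50):
    # Keep only containers predicted to stay active long enough for the move to
    # pay off, and whose predicted blackout (downtime during live migration) is
    # negligible; consider low-priority containers first.
    candidates = []
    for c in sorted(containers, key=lambda c: c.priority):
        if (predict_lifetime_s(c) >= min_remaining_s and
                predict_blackout_ms(c) <= max_blackout_ms):
            candidates.append(c)
    return candidates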
Power capping manager. This manager ensures the capacity of the (oversubscribed) power delivery system is not exceeded, using CPU speed scaling. To tackle a power emergency (the power draw is about to exceed a circuit breaker limit), this manager can use predictions of the performance impact of speed scaling on different workloads to guide its apportioning of the available power budget. Similarly, it can use predictions of workload interactivity as a guide. Ideally, containers executing interactive or highly sensitive workloads should receive all the power they want, to the detriment of containers running batch and background tasks. In this context, the container scheduler can use predictions of interactivity to smartly schedule interactive and delay-insensitive workloads across servers.
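One possible apportioning rule is sketched below, assuming an interactivity classifier and per-container power-demand estimates; the function names, the safety floor, and the watt values are illustrative, not part of any real power-capping system.

def apportion_power(containers, budget_watts, is_interactive, demand_watts,
                    floor_watts=5.0):
    # containers: iterable of container identifiers (for example, strings).
    # Interactive or highly sensitive containers receive their full demand;
    # batch and background containers split whatever budget remains, never
    # dropping below a small safety floor (enforced via CPU speed scaling).
    interactive = [c for c in containers if is_interactive(c)]
    batch = [c for c in containers if not is_interactive(c)]
    alloc = {c: demand_watts(c) for c in interactive}
    remaining = max(0.0, budget_watts - sum(alloc.values()))
    share = remaining / len(batch) if batch else 0.0
    for c in batch:
        alloc[c] = max(floor_watts, min(demand_watts(c), share))
    return alloc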
Server health manager. This manager monitors hardware health and takes faulty servers out of rotation for maintenance. When a server starts to misbehave, this manager can use predictions of the lifetime of the containers running on the server. Using these predictions, it can determine when maintenance can be scheduled, and whether containers need to be live-migrated to prevent unavailability.
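A possible decision rule, again with an assumed per-container lifetime predictor and an arbitrary waiting threshold, is sketched here.

def plan_maintenance(containers, predict_remaining_lifetime_s, max_wait_s=4 * 3600):
    # If every container on the misbehaving server is predicted to finish soon,
    # stop placing new work on it and wait for it to drain; otherwise,
    # live-migrate the long-lived containers so maintenance is not delayed
    # and their availability is preserved.
    longest = max((predict_remaining_lifetime_s(c) for c in containers), default=0)
    if longest <= max_wait_s:
        return {"action": "wait_then_repair", "drain_time_s": longest}
    long_lived = [c for c in containers if predict_remaining_lifetime_s(c) > max_wait_s]
    return {"action": "migrate_then_repair", "migrate": long_lived}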
This is only a partial list of opportunities for ML-based resource management. The challenge is determining
the best system designs for exploiting
these opportunities.
Potential Designs for ML-Centric Clouds
When deciding how to exploit ML in cloud resource management, we must consider: the ML techniques and their inputs and outputs, and the managers and their mechanisms (management actions) and policies. We must also consider many questions: Can we use application-level performance data for learning? How should the ML and the managers interact? Should the ML produce behavioral insights/predictions or actual management actions? How tightly integrated with the managers should the ML be? How quickly does the ML need to observe the effect of the management actions? Is it possible to create general frameworks/APIs that can apply to many types of resource management? Next, we discuss our thoughts along these dimensions.
Application performance vs. counters. Managers must optimize resource usage without noticeably hurting end-to-end application performance. Thus, having direct data on application performance enables precise management with or without ML. Some application metrics are easier to obtain than others. For example, VM lifetimes are "visible" to the platform, whereas request latencies within VMs implementing a service often are not. When containers are opaque to the platform, the way to obtain application performance data is for developers to instrument their code with monitoring calls into the platform (for example, using AWS's CloudWatch [3] or Azure's Monitor [21]). When they do not, lower-level counters (for example, resource utilization, CPU performance counters) must be used as an imperfect proxy for application performance. Given that most workloads are not instrumented, we expect that ML techniques will most often use counters. Nevertheless, providers also run first-party workloads, which can potentially be instrumented.
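For instance, a service owner might report request latency as a custom metric. The following is a minimal sketch using AWS CloudWatch custom metrics via boto3 (the namespace and metric name are made up; Azure Monitor offers an analogous custom-metrics path).

import boto3

cloudwatch = boto3.client("cloudwatch")

def report_request_latency(latency_ms: float) -> None:
    # Emit an application-level latency sample that the platform (and its ML)
    # can consume directly, instead of inferring performance from counters.
    cloudwatch.put_metric_data(
        Namespace="MyService",              # hypothetical namespace
        MetricData=[{
            "MetricName": "RequestLatency",  # hypothetical metric name
            "Value": latency_ms,
            "Unit": "Milliseconds",
        }],
    )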
Predictions vs. actions. Another dimension concerns the role of the ML techniques. One approach is for them to produce insights (for example, performance, load, or container lifetime predictions) that managers can leverage to improve decisions, as in Figure 1 (top). This approach gives managers sole control and understanding of the management policies. Another approach is for the ML to produce actual management actions (for example, migrate this container, change this resource allocation) to be taken by managers. In this case, the ML embodies a deeper understanding of the policies (or may itself define the policies). Targeting the ML at producing actions may lead to policies that more easily adapt to the actual system and workload behaviors.

Figure 1. Two designs. Top: the ML component answers prediction requests from a separate resource manager (RM), which observes counters/application info from the managed system and issues management actions. Bottom: an integrated ML and RM (ML + RM) observes counters/application info and issues actions directly.
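The contrast between the two designs in Figure 1 can be made concrete with a pair of purely illustrative interfaces: in the first, the ML exposes only a prediction call and the RM keeps the policy; in the second, the ML/RM combination maps observations straight to actions. All names, fields, and thresholds below are assumptions made for the sketch.

from typing import Any, Dict, List, Protocol

class PredictionService(Protocol):
    # Figure 1 (top): the ML only answers prediction requests; the resource
    # manager (RM) retains full control of the management policy.
    def predict(self, request: Dict[str, Any]) -> Dict[str, Any]: ...

class ActionPolicy(Protocol):
    # Figure 1 (bottom): the integrated ML + RM maps observations directly to
    # management actions (for example, "live-migrate container X").
    def decide(self, counters: Dict[str, Any]) -> List[Dict[str, Any]]: ...

class ThresholdRM:
    # An RM for the top design: it queries the ML for a lifetime prediction
    # and applies its own (here trivial) threshold policy.
    def __init__(self, predictor: PredictionService):
        self.predictor = predictor

    def maybe_migrate(self, container_id: str,
                      counters: Dict[str, Any]) -> List[Dict[str, Any]]:
        reply = self.predictor.predict({"container": container_id,
                                        "counters": counters})
        if reply.get("expected_lifetime_s", 0) > 600:
            return [{"action": "live_migrate", "container": container_id}]
        return []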