ference enables the scheduler to select
placements that do not require migrations [12]. In the Resource Central section,
we mention our earlier results on the
benefit of predictive container scheduling. In fact, non-predictive policies (for
example, based on feedback control)
are not even acceptable in some cases.
For instance, blindly live migrating
containers that will incur long blackout
times is certain to annoy customers.
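The predictive alternative sketched here can be gated on the model's output before any migration happens. The function names, the 500 ms bound, and the per-GB blackout model below are all illustrative assumptions for the sketch, not values or interfaces from the platform:

```python
# Hypothetical sketch of predictive migration gating: skip containers whose
# predicted live-migration blackout would be long enough to annoy customers.
# The 500 ms bound and the per-GB blackout rule are illustrative assumptions.

def pick_migration_candidates(containers, predict_blackout_ms, bound_ms=500.0):
    """Return containers deemed safe to migrate, least disruptive first."""
    safe = [c for c in containers if predict_blackout_ms(c) < bound_ms]
    return sorted(safe, key=predict_blackout_ms)

# Stand-in predictor: assume blackout grows with the container's memory footprint.
predict = lambda c: c["mem_gb"] * 40.0
vms = [{"id": "a", "mem_gb": 2}, {"id": "b", "mem_gb": 32}, {"id": "c", "mem_gb": 8}]
print([c["id"] for c in pick_migration_candidates(vms, predict)])  # → ['a', 'c']
```

A feedback-control policy would instead migrate first and discover the blackout cost afterward; the prediction moves that cost check before the action.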
ML has been shown to produce more
accurate predictions for cloud resource
management than more traditional
methods, such as regressions or time-series analysis. For example, Cao [6] and
Chen [8] demonstrate that ML techniques
produce more accurate resource utilization predictions than time-series
models. Our results quantitatively compare some ML and non-ML methods.
Opportunities for ML
in Cloud Platforms
Cloud platforms involve a variety of
resource managers, such as the container scheduler and the server health
management system. Here, we discuss
some of the ways in which managers
can benefit from ML.
Container scheduler. The scheduler
selects the server on which a container will run. It can use ML to identify
(and avoid) container placements that
would lead to performance interference, or to adjust its configuration parameters (for example, how tightly to
pack containers on each server). It can
also use ML-derived predictions of the
containers’ resource utilizations to balance the disk access load, or to reduce
the likelihood of physical resource exhaustion in oversubscribed servers.
Predictions of server health also let
it stop assigning containers to servers that are likely to fail soon.
Finally, it can use predictions of container lifetime when considering servers that will undergo planned maintenance or software updates. We have
used lifetime predictions to match
batch workloads to latency-sensitive
services with enough idle capacity for
the container [27].
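The placement filters this paragraph describes can be sketched in a few lines. Here, `predict_util`, `predict_failure_prob`, and both caps are hypothetical stand-ins for the platform's learned models and tuning, not real scheduler interfaces:

```python
# Sketch of ML-informed placement filtering (hypothetical interfaces):
# drop servers predicted to fail soon, then drop placements likely to
# exhaust oversubscribed resources.

def eligible_servers(servers, cpu_demand, predict_util, predict_failure_prob,
                     util_cap=0.9, fail_cap=0.05):
    """Return servers that pass the health and resource-exhaustion filters."""
    keep = []
    for s in servers:
        if predict_failure_prob(s) >= fail_cap:
            continue                    # likely to fail soon: stop assigning to it
        if predict_util(s) + cpu_demand > util_cap:
            continue                    # likely to exhaust CPU if oversubscribed
        keep.append(s)
    return keep

servers = ["s1", "s2", "s3"]
util = {"s1": 0.5, "s2": 0.85, "s3": 0.3}.get      # predicted utilization
fail = {"s1": 0.01, "s2": 0.01, "s3": 0.2}.get     # predicted failure probability
print(eligible_servers(servers, 0.2, util, fail))  # → ['s1']
```

The surviving set would then be ranked by whatever packing objective (for example, best fit) the scheduler already uses.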
Server defragmenter/migration manager. As containers arrive/complete,
each server may be left with available
resources that are insufficient for large
containers. As a result, the server defragmentation system may decide to
another approach. We discuss these
dimensions, the possible integration
designs, and their architectural, functional, and API implications.
As one point in this multi-dimensional space, we built Resource Central
(RC) [9], a general ML and prediction-serving system for providing workload
and infrastructure insights to resource
managers in the Azure Compute fabric.
RC collects telemetry from containers
and servers, learns from their prior behaviors and, when requested, produces
predictions of their future behaviors.
We are currently using RC to accurately
predict many characteristics of the
Azure Compute workload. We present
an overview of RC and its initial uses and results, and describe the lessons from
building it.
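The collect-telemetry / learn / serve-predictions loop just described can be caricatured in a few lines. The class and method names below are invented for illustration; RC's actual design and interfaces, described in [9], are far richer:

```python
# Toy sketch of a prediction-serving loop: ingest telemetry from workloads,
# then answer prediction requests from their prior behavior. All names are
# illustrative, not RC's real API.
from collections import defaultdict

class PredictionService:
    def __init__(self):
        self.history = defaultdict(list)        # telemetry keyed by workload

    def observe(self, key, cpu_util):
        self.history[key].append(cpu_util)      # collect container telemetry

    def predict_p95_cpu(self, key, default=1.0):
        """When requested, predict future behavior from prior behavior."""
        samples = sorted(self.history.get(key, []))
        if not samples:
            return default                      # no history: be conservative
        return samples[min(len(samples) - 1, int(0.95 * len(samples)))]

rc = PredictionService()
for i in range(100):
    rc.observe("web-tier", i / 100)             # synthetic utilization samples
print(rc.predict_p95_cpu("web-tier"))           # → 0.95
```

The key property the sketch preserves is that resource managers pull predictions on request, rather than embedding their own models.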
Though RC has been successful so
far, it has limitations. For example, it
does not implement certain forms of interaction with resource managers. More
broadly, the integration of ML into real
cloud platforms in a general, maintainable, and at-scale manner is still in its
infancy. We close the article with some
open questions and challenges.
ML vs. Traditional Techniques
Resource management in cloud platforms is often implemented by static
policies that have two shortcomings.
First, they are tuned offline based
on relatively few benchmark workloads. For example, threshold-based
policies typically involve hand-tuned
thresholds that must be used for widely different workloads. In contrast,
ML-informed dynamic policies can
naturally adapt to actual production
workloads [20, 26]. Returning to the threshold example,
each server can learn different thresholds for its own resource management.
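That contrast can be made concrete: instead of one hand-tuned global value, each server derives a threshold from its own utilization history. The percentile-plus-headroom rule below is an assumption made for the sketch, not a policy from the cited work:

```python
# Sketch: a per-server learned threshold vs. one hand-tuned global value.
# The 99th-percentile-plus-headroom rule is an illustrative assumption.

STATIC_THRESHOLD = 0.8                      # one hand-tuned value for all servers

def learned_threshold(util_history, headroom=0.1):
    """Place the threshold just above this server's observed peak load."""
    s = sorted(util_history)
    peak = s[min(len(s) - 1, int(0.99 * len(s)))]
    return round(min(peak + headroom, 1.0), 2)

quiet_server = [0.10, 0.15, 0.20, 0.12]     # lightly loaded server
busy_server = [0.70, 0.85, 0.80, 0.75]      # heavily loaded server
print(learned_threshold(quiet_server))      # → 0.3
print(learned_threshold(busy_server))       # → 0.95
```

The quiet server gets a much tighter threshold than the global hand-tuned 0.8, so anomalies are flagged earlier; the busy server avoids constant false alarms.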
Second, the static policies tend to
require reactive actions, and may incur
unnecessary overheads and customer
impact. As an example, consider a common policy for scheduling containers
onto servers, such as best fit. It may
cause some co-located containers to
interfere in their use of resources (for
example, shared cache space) and require (reactive) live migrations [23]. Live
migration is expensive and may cause
a period of unavailability (aka "blackout" time). In contrast, ML techniques
enable predictive management: having
accurate predictions of container inter-