contributed articles
IMAGE BY MARCEL CLEMENS
Cloud platforms, such as Microsoft Azure, Amazon
Web Services (AWS), and Google Cloud Platform,
are tremendously complex. For example, the Azure
Compute fabric governs all the physical and virtualized
resources running in Microsoft’s datacenters. Its
main resource management systems include virtual
machine (VM) and container (hereafter we refer
to VMs and containers simply as “containers”)
scheduling, server and container health monitoring
and repairs, power and energy management, and other
management functions.
Cloud platforms are also extremely expensive to
build and operate, so providers have a strong incentive
to optimize their use. A nascent approach is to
leverage machine learning (ML) in the platforms’
resource management using supervised learning techniques, such as
gradient-boosted trees and neural networks, or reinforcement learning. We
also discuss why ML is often preferable
to traditional non-ML techniques.
Public cloud providers are starting
to explore ML-based resource management in production. 9, 14 For example,
Google uses neural networks to optimize fan speeds and other energy
knobs. 14 In academia, researchers have
proposed using collaborative filtering
(a common technique in recommender
systems) to schedule containers for
reduced within-server performance
interference. 12 Others proposed using
reinforcement learning to adjust the resources allocated to co-located VMs. 24
Later, we discuss other opportunities
for ML-based management.
Despite these prior efforts and opportunities, it is currently unclear
how best to integrate ML into cloud
resource management. In fact, prior
approaches differ in multiple dimensions. For example, in some cases,
the ML technique produces insights or
predictions about the workload or infrastructure; in others, it produces actual resource management actions. In
some cases, the ML is deeply integrated
with the resource manager; in others, it
is completely separate. In all cases, the
ML addresses a single management
problem; a different problem requires
Toward ML-Centric Cloud Platforms
DOI: 10.1145/3364684
Exploring the opportunities to use ML,
the possible designs, and our experience
with Microsoft Azure.
BY RICARDO BIANCHINI, MARCUS FONTOURA, ELI CORTEZ,
ANAND BONDE, ALEXANDRE MUZIO, ANA-MARIA CONSTANTIN,
THOMAS MOSCIBRODA, GABRIEL MAGALHAES,
GIRISH BABLANI, AND MARK RUSSINOVICH
key insights
• There are many potential uses of ML in
cloud computing platforms. The challenge
is in defining exactly how and where ML
should be infused in these platforms.
• Leveraging ML-derived predictions
has shown promise for many resource
managers in Azure Compute. Having a
general and independent ML framework/
system has been key to increasing
adoption quickly.
• Many research challenges remain
open, including how to make
action-prescribing ML general enough for wide
applicability in cloud platforms, how to
manage (potentially partial) feedback at
scale, and how to debug misbehaviors
(especially when the ML is tightly
integrated with resource managers).
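To make the second insight concrete, here is a minimal sketch of one way a resource manager could consume predictions from an ML system that is kept separate from it. This is an illustration under stated assumptions, not the design used in Azure: the names `PredictionService`, `Scheduler`, and `predict_peak_fraction`, the lookup-table "model", and the fallback to the full request when no prediction exists are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Container:
    name: str
    requested_cores: int


class PredictionService:
    """Hypothetical stand-in for an independent ML prediction system.

    A real system would serve model outputs; a lookup table is used here
    only to keep the sketch self-contained.
    """

    def __init__(self, peak_fraction_by_name: Dict[str, float]):
        self._table = peak_fraction_by_name

    def predict_peak_fraction(self, container: Container) -> Optional[float]:
        # Returns None when no prediction is available for this container.
        return self._table.get(container.name)


class Scheduler:
    """Hypothetical resource manager: ML supplies a prediction only,
    and the placement decision (the action) stays in the scheduler."""

    def __init__(self, predictions: PredictionService, server_cores: float):
        self.predictions = predictions
        self.server_cores = server_cores
        self.used_cores = 0.0

    def place(self, container: Container) -> bool:
        fraction = self.predictions.predict_peak_fraction(container)
        if fraction is None:
            # Assumption: with no prediction, plan for the full request.
            fraction = 1.0
        expected = container.requested_cores * fraction
        if self.used_cores + expected > self.server_cores:
            return False
        self.used_cores += expected
        return True


if __name__ == "__main__":
    scheduler = Scheduler(PredictionService({"web": 0.5}), server_cores=8)
    for c in (Container("web", 8), Container("batch", 4), Container("web", 2)):
        print(c.name, scheduler.place(c))
```

Because the scheduler depends only on the prediction call, the same service could in principle feed other resource managers, and the scheduler keeps working without a prediction. An action-prescribing design would instead have the ML component return the placement itself.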