works predict resource demand, resource utilization, or job/task length
for provisioning or scheduling purposes.5,6,8,15,17,19,25 For example, Cao6
recently explored Random Forests to
predict CPU, memory, and disk utilizations, whereas Chen8 used Residual
Neural Networks for predicting VM
CPU utilization. We predict a broader
set of behaviors (including container
lifetimes, maximum deployment sizes,
and blackout times) for a broader set
of purposes (including health management and power capping). Still, we do
not argue that the models we use are
necessarily the best. Instead, we show
them simply as examples of ML models that we have integrated into the RC
framework and work well in practice.
Prediction accuracy. A key requirement for RC is the ability to predict behaviors accurately. Obviously, this accuracy depends on the behavior one is
trying to predict and on the modeling
approach used. As such, the best we
can do is provide evidence from our experience with RC that many behaviors
can be predicted accurately.
For our analysis, we use one month
of data about all VMs in Azure. In this
dataset, less than 1% of the VMs are
from “new” subscriptions, that is, subscriptions that appear for the first time
in the set. We trained RC’s models with
the first three weeks and tested them
on the fourth. We provide a similar dataset at https://github.com/Azure/AzurePublicDataset.
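The chronological split described above can be sketched as follows. This is a minimal sketch, not RC's actual pipeline; the record layout and the `timestamp` field are illustrative assumptions, not the dataset's real schema:

```python
from datetime import datetime, timedelta

def time_split(records, start, train_weeks=3):
    """Chronological split: the first `train_weeks` weeks of records go to
    training, the remainder (here, the fourth week) to testing."""
    cutoff = start + timedelta(weeks=train_weeks)
    train = [r for r in records if r["timestamp"] < cutoff]
    test = [r for r in records if r["timestamp"] >= cutoff]
    return train, test
```

Splitting by time rather than at random matters here: the models must predict future behavior from past observations, so test VMs must come strictly after the training window.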
We divide the space of predictions
for each behavior into the buckets listed
in Table 2. Given these buckets, Figure
3 summarizes the RC prediction results
for each VM-utilization bucket (left)
and the most important predictive attributes (right). Figure 4 shows the overall
accuracy, precision, and recall results
(three rightmost bars in each group,
respectively) for the VM behaviors in
the tables. We measure accuracy as the
percentage of predictions that were correct, assuming the predicted bucket is
that with the highest confidence score;
precision for a bucket as the percentage of true positives in the set of predictions that named the bucket; and recall
for a bucket as the percentage of true
positives in the set of predictions that
should have named the bucket.
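These definitions can be computed directly from the predicted and true buckets. A minimal sketch (the bucket labels in the example are illustrative, not from the Azure dataset):

```python
from collections import Counter

def bucket_metrics(true_buckets, pred_buckets):
    """Overall accuracy plus per-bucket precision and recall.

    Each prediction names a single bucket (the one with the highest
    confidence score); accuracy is the fraction of correct predictions.
    """
    assert len(true_buckets) == len(pred_buckets)
    n = len(true_buckets)
    accuracy = sum(t == p for t, p in zip(true_buckets, pred_buckets)) / n

    pred_counts = Counter(pred_buckets)   # how often each bucket was predicted
    true_counts = Counter(true_buckets)   # how often each bucket actually occurred
    precision, recall = {}, {}
    for b in set(true_buckets) | set(pred_buckets):
        tp = sum(t == p == b for t, p in zip(true_buckets, pred_buckets))
        precision[b] = tp / pred_counts[b] if pred_counts[b] else 0.0
        recall[b] = tp / true_counts[b] if true_counts[b] else 0.0
    return accuracy, precision, recall
```

For instance, with true buckets `["0-25%", "0-25%", "25-50%"]` and predictions `["0-25%", "25-50%", "25-50%"]`, accuracy is 2/3, precision for "25-50%" is 0.5 (one of two predictions naming it was correct), and recall for "0-25%" is 0.5 (one of its two true instances was found).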
Figure 3 (left) shows recall between
70% and 95% across VM-utilization
buckets. When we weight each bucket's
recall by its frequency, we
find that overall VM-utilization recall
is 89%. Figure 3 (right) shows that the
most important attributes in terms of
F1-score are the percentages of VMs
from the same subscription that fell
into each bucket to date. As we discussed in the
RC paper,9 subscriptions show low (< 1)
coefficient of variation (CoV = standard
deviation divided by average) for the
behaviors we study. Thus, it is unsurprising that prior observations of the
behavior are good indicators. Still, our
results show that other attributes are
also important: service type (the name
of a top first-party subscription or “
unknown” for the others), VM type (for
example, A1, A2), number of cores, VM
class (IaaS vs PaaS), operating system,
and deployment time; their relative importance depends on the metric. VM
role names have little predictive value,
for example, IaaS VMs often have arbitrary role names that do not repeat.
Figure 4 illustrates the high accuracy of these predictions.
We extract features from the attributes available in our dataset. We split the attributes into three groups: categorical, boolean, and numerical. We model categorical attributes (for example, container type, guest operating system) as categorical features. We represent the features in a vector of pre-defined length. We concatenate the boolean attribute (for example, first deployment, production workload) values to the input feature vector. Similarly, we normalize and concatenate the numerical attribute (for example, number of cores, container memory size) values to the vector. Finally, we place the attributes that describe observed container/subscription behavior (for example, last observed container lifetime) into the buckets of Table 2 and use them as numerical features. We concatenate these features to the vector as well.
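The encoding steps above can be sketched as follows. The vocabularies, normalization constant, and bucket edges here are illustrative assumptions, not RC's actual feature set or the buckets of Table 2:

```python
import bisect

# Illustrative constants (assumptions, not RC's real configuration).
CONTAINER_TYPES = ["VM", "container", "unknown"]   # categorical vocabulary
GUEST_OSES = ["Linux", "Windows", "unknown"]
MAX_CORES = 64.0                                   # for min-max normalization
LIFETIME_BUCKETS = [900, 3600, 86400]              # bucket edges, in seconds

def one_hot(value, vocab):
    """Fixed-length one-hot encoding; unseen values map to 'unknown'."""
    vec = [0.0] * len(vocab)
    vec[vocab.index(value if value in vocab else "unknown")] = 1.0
    return vec

def encode(attrs):
    """Build one input feature vector from a container's attributes."""
    features = []
    # Categorical attributes become fixed-length one-hot features.
    features += one_hot(attrs["container_type"], CONTAINER_TYPES)
    features += one_hot(attrs["guest_os"], GUEST_OSES)
    # Boolean attribute values are concatenated directly.
    features.append(1.0 if attrs["first_deployment"] else 0.0)
    features.append(1.0 if attrs["production"] else 0.0)
    # Numerical attributes are normalized, then concatenated.
    features.append(min(attrs["num_cores"] / MAX_CORES, 1.0))
    # Observed behaviors are bucketized and used as numerical features.
    features.append(float(bisect.bisect_left(LIFETIME_BUCKETS,
                                             attrs["last_lifetime_s"])))
    return features
```

Because the vocabularies and bucket edges are fixed ahead of time, every container maps to a vector of the same pre-defined length, as the text requires.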
Comparison to other systems and
techniques. As it focuses on producing predictions, RC fundamentally
differs from action-prescribing systems, for example, Agarwal et al.2 and
Moritz et al.22 RC currently produces
its predictions using TLC, a Microsoft-internal state-of-the-art framework
that implements many learning algorithms. However, RC can also leverage
recently proposed frameworks, such
as TensorFlow,1 for producing its ML
models. RC’s online component is
comparable to recent prediction-serving systems,10,11,16 though with a different architecture and geared toward
cloud resource management. We are
not aware of any ML and prediction-serving frameworks/systems like RC in
other real cloud platforms.
The literature on predicting workload behaviors is extensive.
Figure 3. Average CPU utilization recall per bucket (left) and attribute importance (right).
[Left: recall (0.0–1.0) for the prediction buckets 0–25%, 25–50%, 50–75%, and 75–100%. Right: attribute importance, topped by the subscription's life-to-date VM-bucket percentages (buckets 2, 4, 1, 3), followed by VM type, service type, number of cores, VM class, operating system, the subscription's last-deployment VM-bucket percentages (buckets 4, 1, 3, 2), and VM memory.]