Initial uses and results in resource management. In its first production instantiation, we have implemented RC's online component as a service in each Azure Compute cluster. A single version of its offline model-generation component runs on Cosmos. The first two
major clients to use RC were the server
defragmenter, which queries RC for
lifetime and blackout time predictions
(and VM metadata); and the container
scheduler, which queries it for lifetime predictions (and metadata). As of
March 2019, RC’s clients are directing
roughly 1.5 billion queries to it daily.
The next major clients we will productize are the power capping manager,
which will use RC’s workload interactivity predictions; and a new predictive
container rightsizing system, which
will use RC’s utilization predictions to
recommend new container sizes. Several other uses of RC are being planned.
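Clients consume RC's predictions as buckets with an associated confidence; a minimal sketch of how a client such as the server defragmenter might handle such a response follows. All names here are illustrative assumptions, not RC's actual API:

```python
from dataclasses import dataclass

# Hypothetical data model for an RC prediction response; field and
# function names are illustrative, not RC's real interface.

@dataclass
class Prediction:
    behavior: str        # e.g., "lifetime", "blackout_time"
    bucket: str          # predicted bucket label
    confidence: float    # model confidence in [0, 1]

def usable(pred: Prediction, min_confidence: float = 0.6) -> bool:
    """Clients may discard low-confidence predictions (cf. the <60%
    confidence cutoff applied in Figure 4)."""
    return pred.confidence >= min_confidence

lifetime = Prediction("lifetime", ">30 days", 0.87)
if usable(lifetime):
    pass  # e.g., defragmenter prefers not to pack long-lived VMs tightly
```

The confidence threshold lets each client trade coverage for accuracy independently.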
Our production results from the server defragmenter show that, from October 2018 to March 2019, RC enabled many tens of thousands of VM migrations, allowing more than 200 clusters (that would otherwise have been considered "full") to continue receiving new VMs. Our earlier simulation study considered the use of RC-produced VM utilization predictions for safe core oversubscription.9 It showed that an RC-informed oversubscribing VM scheduler can prevent resource exhaustion as effectively as one using an oracle predictor.
We expect the prediction quality our current models provide will be enough for most clients. For example, a VM scheduler that oversubscribes CPU cores prevents resource exhaustion as effectively with RC's VM utilization predictions as with an oracle predictor.9
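One way such an oversubscribing scheduler could use utilization predictions is as an admission test: accept a VM onto an oversubscribed host only if the predicted aggregate CPU demand stays under physical capacity. The sketch below is an illustrative assumption, not the production scheduler; `predicted_utils` stands in for RC's average-CPU-utilization predictions expressed as core-equivalents:

```python
# Illustrative admission check for CPU oversubscription.
# predicted_utils: per-VM predicted average CPU use, in core-equivalents.

def can_admit(host_cores: float, predicted_utils: list[float],
              new_vm_util: float, headroom: float = 0.10) -> bool:
    budget = host_cores * (1.0 - headroom)   # keep a safety margin
    return sum(predicted_utils) + new_vm_util <= budget

# Host with 32 physical cores; existing VMs predicted to use ~20 cores:
print(can_admit(32, [8.0, 7.5, 4.5], new_vm_util=6.0))   # fits under budget
print(can_admit(32, [8.0, 7.5, 4.5], new_vm_util=10.0))  # would risk exhaustion
```

The headroom parameter is a hypothetical knob for absorbing prediction error; its value would be tuned against observed accuracy.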
Accuracy by VM group. Interestingly, accuracy can be higher for the first VM deployments from new subscriptions than for deployments from subscriptions we have already seen in the dataset. For example, for average CPU utilization predictions, these accuracies are 92% and 81%, respectively. We conjecture that this is because users tend to experiment with their first VMs in similar ways, so feature data accounting for prior subscriptions is predictive of new ones.
We also compare the prediction accuracy for third- and first-party VMs, and for first-party production and non-production VMs. The former comparison shows that accuracy tends to be higher for third-party VMs. For example, for lifetime, the accuracies for third-party and first-party VMs are 83% and 74%, respectively, whereas for average CPU utilization they are 84% and 80%, respectively. When comparing production and non-production first-party VMs, the results are more mixed. For lifetime, accuracy is higher for production VMs (82% vs. 64%), whereas the opposite is true for average CPU utilization (79% vs. 83%). The wide diversity of production workloads makes their utilization more difficult to predict; at the same time, their lifetimes are less diverse and easier to predict, as production VMs tend to live long.
Comparison to other techniques. As baselines for comparison, we experiment with three techniques: most recent bucket (MRB), most popular bucket (MPB), and logistic regression (LR). MRB and MPB are non-ML techniques. MRB predicts the bucket that was most common for the VMs in the last deployment of the same subscription (lifetime and average CPU utilization), the same bucket as the last deployment of the same subscription (max deployment size), or the same bucket as the last VM migration of a similar size (blackout time). MPB predicts the bucket that has been most popular since the start of the subscription. LR predicts a bucket based on a non-linear probability curve computed using the maximum-likelihood method. We train the LR models with the same feature vectors we described earlier.
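The two non-ML baselines reduce to simple frequency heuristics over a subscription's history. A minimal sketch, assuming each subscription's history is a list of deployments, each a list of per-VM bucket labels (an illustrative data model, not RC's):

```python
from collections import Counter

def most_recent_bucket(history: list[list[str]]) -> str:
    """MRB (lifetime / avg CPU variant): the bucket most common
    among the VMs of the subscription's last deployment."""
    last_deployment = history[-1]
    return Counter(last_deployment).most_common(1)[0][0]

def most_popular_bucket(history: list[list[str]]) -> str:
    """MPB: the bucket most popular across all VMs since the
    start of the subscription."""
    all_vms = [b for deployment in history for b in deployment]
    return Counter(all_vms).most_common(1)[0][0]

history = [["short", "short", "short"],   # first deployment
           ["long", "long", "short"]]     # most recent deployment
print(most_recent_bucket(history))   # "long"  (majority of last deployment)
print(most_popular_bucket(history))  # "short" (majority over all history)
```

As the example shows, the two heuristics can disagree on the same history, which is why both are evaluated.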
Figure 4 shows that MRB exhibits accuracies between 54% and 81%, whereas MPB stays between 42% and 78%, and LR falls in the 62%–80% range. Clearly, these accuracies are substantially worse than our GBT results. Compared to MRB and MPB, GBT relies on many features instead of a simple heuristic, giving it a broader context that improves predictions. Compared to LR, GBT performs better on higher-dimensional data. In addition, GBT combines decision trees with different parameters to produce higher-quality results.
Comparison to regression into buckets. We also compare GBTs used as classifiers into buckets with GBTs used for numerical regression followed by bucketizing the results. We find that the former approach is substantially more accurate. The reason is that "noise" in the numerical values (for example, a few VM deployments that are exceptionally large) throws off the regression models and ultimately produces incorrect buckets.
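A toy example makes the failure mode concrete. The sketch below uses made-up bucket edges and data, and constant predictors as stand-ins for the models; it is an illustration of the effect, not RC's actual pipeline:

```python
from bisect import bisect_right
from collections import Counter

edges = [1.0, 7.0, 30.0]                 # hypothetical lifetime buckets, days
def bucketize(v: float) -> int:
    return bisect_right(edges, v)

# Ten similar deployments plus one exceptionally large outlier:
lifetimes = [2.0] * 10 + [300.0]

# A squared-loss regressor's best constant prediction is the mean,
# which the single outlier drags across a bucket boundary:
mean_pred = sum(lifetimes) / len(lifetimes)  # ~29.1 days
print(bucketize(mean_pred))                  # 2 -> wrong bucket for 10 of 11 VMs

# A classifier trained on bucket labels minimizes misclassification,
# so its best constant guess is the majority bucket:
majority = Counter(bucketize(v) for v in lifetimes).most_common(1)[0][0]
print(majority)                              # 1 -> correct for 10 of 11 VMs
```

Classifying into buckets directly lets the loss ignore how far the outliers are, while squared-error regression pays for every unit of numerical distance.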
Figure 4. Accuracy, precision, and recall for all behaviors.
Three leftmost bars of each behavior represent the accuracy for most recent bucket
(MRB), most popular bucket (MPB), and logistic regression (LR). Two rightmost bars
represent precision and recall with gradient boosting tree (GBT), when predictions
with <60% confidence are excluded.
[Chart legend: Accuracy for MRB, MPB, LR, and GBT; Precision* and Recall* for GBT; behaviors on the x-axis include CPU Avg and Deployment.]