Using these predictions, the scheduler can accommodate more VMs, while producing 6× fewer cases of physical resource exhaustion than a baseline oversubscribing scheduler that does not consider utilization predictions.
Lessons learned. In thinking about ML-centric cloud platforms and through our experience with RC, we learned several lessons.
Separation of concerns. Keeping predictions and management policies separate has worked well in RC. This separation makes policies easier to debug and their results easier to reproduce, as they are not obscured by complex ML techniques. In Azure’s current managers, policies tend to be rule-based and use RC predictions as rule attributes (for example, if the expected lifetime is short, then place the container in one of these servers). The rule-based organization has made it easier to integrate predictions into existing managers.
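To make the pattern concrete, here is a minimal sketch of a rule-based placement policy that uses a lifetime prediction as a rule attribute; the function names, threshold, and server pools are hypothetical, not RC’s actual interface.

```python
# Sketch of a rule-based placement policy that consumes an ML prediction
# as a rule attribute. All names (predict_lifetime, pools, the threshold)
# are hypothetical illustrations, not RC's real interface.

SHORT_LIFETIME_THRESHOLD_MIN = 30  # assumed cutoff for "short-lived"

def place_container(container_id: str, predict_lifetime, pools: dict) -> str:
    """Pick a server pool using a predicted lifetime as a rule attribute."""
    predicted_minutes = predict_lifetime(container_id)
    if predicted_minutes is not None and predicted_minutes < SHORT_LIFETIME_THRESHOLD_MIN:
        # Short-lived containers can go to densely packed servers.
        return pools["dense"]
    # Long-lived (or unpredicted) containers get more conservative placement.
    return pools["spacious"]
```

Because the ML appears only as an attribute in an otherwise ordinary rule, the policy can be debugged and reproduced without reasoning about the model itself.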
Reach and extensibility. The ML framework must act as a source of intelligence for many resource managers, not all of which will be known on day one. Thus, it is critical to be able to easily integrate new data sources, predict and understand more behaviors, include multiple models for each behavior, and implement versioning per model, among other extensions. RC’s modular design has made these extensions easy.
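As a rough illustration of per-behavior models with versioning, consider the following registry sketch; it reflects our reading of these requirements, not RC’s actual design.

```python
# Illustrative model registry supporting multiple models per behavior and
# per-model versioning. An assumption-laden sketch, not RC's code.

from collections import defaultdict

class ModelRegistry:
    def __init__(self):
        # behavior -> model name -> version -> model object
        self._models = defaultdict(lambda: defaultdict(dict))

    def register(self, behavior: str, name: str, version: int, model) -> None:
        self._models[behavior][name][version] = model

    def latest(self, behavior: str, name: str):
        versions = self._models[behavior][name]
        return versions[max(versions)] if versions else None

registry = ModelRegistry()
registry.register("vm_lifetime", "gbt", 1, object())  # placeholder model
registry.register("vm_lifetime", "gbt", 2, object())  # newer version
model = registry.latest("vm_lifetime", "gbt")          # picks version 2
```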
Model updates. We designed RC to
produce models and feature data offline,
and then serve predictions and feature
data online until it produces (in the background) an updated version of them.
However, one resource manager we have
come across requires models to be updated online, so that each prediction accounts for the effect of the previous one.
As this scenario seems to be rare, we opted for RC’s more general design.
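A minimal sketch of this offline-produce, online-serve pattern might look as follows; the retraining cadence, locking scheme, and names are illustrative assumptions.

```python
# Sketch: serve predictions from the current model while a background
# thread periodically swaps in a freshly trained one. Names and the
# refresh interval are illustrative assumptions, not RC internals.

import threading
import time

class ServingStore:
    def __init__(self, train_fn, refresh_seconds: int = 3600):
        self._train_fn = train_fn
        self._model = train_fn()            # initial offline-trained model
        self._lock = threading.Lock()
        t = threading.Thread(target=self._refresh_loop,
                             args=(refresh_seconds,), daemon=True)
        t.start()

    def _refresh_loop(self, interval: int) -> None:
        while True:
            time.sleep(interval)
            new_model = self._train_fn()    # retrain in the background
            with self._lock:
                self._model = new_model     # atomic swap for online serving

    def predict(self, features):
        with self._lock:
            return self._model(features)
```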
Performance. Many of the managers
do not require extremely fast predictions.
For example, the server defragmenter can
easily deal with slow predictions. However, other managers require substantially higher performance. For example,
the entire time budget for the container
scheduler is less than 100 milliseconds.
In these scenarios, prediction result,
model, and feature data caching can be
critical in prediction-serving, especially
if models are large and complex. Caching
also helps maintain operation even when
the data store is unavailable.
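One plausible shape for such caching is sketched below: a TTL cache consulted before the back-end store that also serves stale entries when the store is unreachable. The names, TTL, and error type are assumptions for illustration.

```python
# Sketch: TTL cache in front of a prediction store. On store failure, fall
# back to the (possibly stale) cached entry so the manager keeps operating.
# Everything here (names, TTL, store API) is an illustrative assumption.

import time

class PredictionCache:
    def __init__(self, fetch_fn, ttl_seconds: float = 60.0):
        self._fetch_fn = fetch_fn          # talks to the back-end store
        self._ttl = ttl_seconds
        self._entries = {}                 # key -> (value, timestamp)

    def get(self, key):
        entry = self._entries.get(key)
        if entry and time.time() - entry[1] < self._ttl:
            return entry[0]                # fresh hit: no store round-trip
        try:
            value = self._fetch_fn(key)
            self._entries[key] = (value, time.time())
            return value
        except ConnectionError:
            # Store unavailable: serve the stale entry if we have one.
            if entry:
                return entry[0]
            raise
```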
Integration with clients. We explored
two versions of RC’s online component:
one implemented as a runtime library to be linked with clients, and another as an independent service (as described earlier). Interestingly, there is still no consensus on which implementation is ideal for Azure. Some teams like the ability to get predictions without leaving the client’s machine, which the library approach enables via RC’s caches. Other teams prefer the standard and higher-level interface of a service, and do not want to manage an additional library. Ultimately, we expect to build multiple online component implementations that will consume models and feature data from the same back-end source.
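As a rough illustration of the two integration styles, the following sketch defines a single client-facing interface with a library-backed implementation (answering locally) and a service-backed one (going over the network); the interface, names, and endpoint are hypothetical, not RC’s actual API.

```python
# Sketch: one prediction interface, two integration styles. The names and
# the HTTP endpoint are hypothetical; they are not RC's actual API.

from abc import ABC, abstractmethod
import json
import urllib.request

class PredictionClient(ABC):
    @abstractmethod
    def predict(self, behavior: str, key: str) -> float: ...

class LibraryClient(PredictionClient):
    """Linked into the client process; answers from local models/caches."""
    def __init__(self, local_models: dict):
        self._models = local_models

    def predict(self, behavior: str, key: str) -> float:
        return self._models[behavior](key)   # no network hop

class ServiceClient(PredictionClient):
    """Talks to an independent prediction service over HTTP."""
    def __init__(self, base_url: str):
        self._base_url = base_url

    def predict(self, behavior: str, key: str) -> float:
        with urllib.request.urlopen(
                f"{self._base_url}/predict/{behavior}/{key}") as resp:
            return json.load(resp)["prediction"]
```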
Challenges and Research Avenues
As should be clear by now, RC is by no means the only possible approach to ML-centric platforms. In fact, RC cannot currently accommodate certain types of ML integration that could potentially be useful. Moreover, there are additional areas for ML integration that no one has explored yet. Clearly, there is a need for more research on this topic. The following paragraphs identify some research challenges and avenues going forward.

Broadly using application-level performance data. As mentioned, low-level counters are an indirect measure of application performance. For resource management without performance loss, extracting high-level information from applications is key. Today’s extraction methods require effort from developers, who do not always have a strong incentive to provide the data. The challenge the cloud provider faces is creating stronger incentives, or extraction methods that are automatic, privacy-preserving, and non-intrusive.

Using action-prescribing ML while being general. Increasingly popular ML techniques such as reinforcement and bandit learning prescribe actions. In the resource management context, this means the ML must understand the acceptable management mechanisms and policies (these techniques could define the policies themselves, but this would make manager debugging very difficult), and be adjusted for every manager that can benefit. Moreover, it must be safe and cheap to explore the space of available actions. The challenge is creating general designs for these ML techniques, perhaps via frameworks/APIs that take the acceptable mechanisms and policies as inputs.
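To make this concrete, here is a minimal sketch of an epsilon-greedy bandit constrained to a manager-supplied set of acceptable actions, one plausible shape for such a framework/API; all names and the reward plumbing are our assumptions.

```python
# Sketch: epsilon-greedy bandit that only explores actions a manager has
# declared acceptable. The "framework" shape here is an assumption meant
# to illustrate taking mechanisms/policies as inputs, not a real API.

import random
from collections import defaultdict

class ConstrainedBandit:
    def __init__(self, acceptable_actions, epsilon: float = 0.1):
        self._actions = list(acceptable_actions)  # manager-supplied policy
        self._epsilon = epsilon
        self._value = defaultdict(float)   # running mean reward per action
        self._count = defaultdict(int)

    def choose(self):
        if random.random() < self._epsilon:
            return random.choice(self._actions)   # explore safe actions only
        return max(self._actions, key=lambda a: self._value[a])

    def update(self, action, reward: float) -> None:
        self._count[action] += 1
        n = self._count[action]
        self._value[action] += (reward - self._value[action]) / n

# Usage: the manager enumerates its acceptable mechanisms as actions.
bandit = ConstrainedBandit(["pack_dense", "pack_spread"])
action = bandit.choose()
bandit.update(action, reward=1.0)   # reward from the observed outcome
```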