provides process isolation between different models and ensures that a single model failure does not affect the availability of the rest of the system. Finally, this disaggregated design provides a convenient mechanism for horizontally and independently scaling each model via replication to increase throughput.
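To make the shape of this design concrete, here is a minimal sketch of a disaggregated, replicated serving tier. It is an illustration of the idea rather than Clipper’s actual implementation, and all names (fake_model, model_worker, the replica counts) are hypothetical: each model runs in its own isolated worker processes behind its own request queue, so a crash in one model’s replica cannot take down other models, and a heavily loaded model can be scaled out simply by starting more replicas of its worker.

# A minimal sketch of a disaggregated, replicated serving tier.
# Hypothetical names throughout; an illustration of the idea, not
# Clipper's implementation.
import multiprocessing as mp

def fake_model(name, x):
    # Stand-in for framework-specific inference (TensorFlow, PyTorch, ...).
    return f"{name}({x})"

def model_worker(name, requests, responses):
    # Each replica is an isolated process: if it crashes, only this
    # model's capacity is reduced; other models keep serving.
    while True:
        req_id, x = requests.get()
        responses.put((req_id, fake_model(name, x)))

if __name__ == "__main__":
    responses = mp.Queue()
    # One request queue per model; replicate hot models independently.
    replicas = {"fraud_detector": 3, "recommender": 1}
    queues = {}
    for name, n in replicas.items():
        q = mp.Queue()
        queues[name] = q
        for _ in range(n):
            mp.Process(target=model_worker, args=(name, q, responses),
                       daemon=True).start()

    # The frontend routes each query to the appropriate model's queue.
    queues["fraud_detector"].put((0, [1.0, 2.0]))
    queues["recommender"].put((1, "user-42"))
    for _ in range(2):
        print(responses.get())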
Clipper also introduces latency-aware batching to leverage hardware-accelerated inference. Batching prediction requests can significantly improve performance. Batching helps amortize the cost of system overheads (for example, remote procedure calls and feature-method invocation) and improves throughput by enabling models to exploit their internal parallelism. For example, many machine-learning frameworks are optimized for batch-oriented model training and are therefore capable of using SIMD (single instruction, multiple data) instructions and GPU accelerators to speed up computation on large input batches. While batching increases throughput, it also increases inference latency, because the entire batch must be completed before a single prediction is returned. Clipper employs a latency-aware batching mechanism that automatically sets the optimal batch size on a per-model basis in order to maximize throughput while still meeting latency constraints in the form of user-specified service-level objectives.
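The batching trade-off can be illustrated with a small sketch. The loop below is not Clipper’s exact algorithm; it is a simplified additive-increase/multiplicative-decrease controller under stated assumptions: grow the batch size while the measured batch latency stays under the service-level objective, and back off when the objective is violated. The latency model and constants are hypothetical stand-ins for timing a real batched model invocation.

# A sketch of latency-aware batch sizing: additively grow the batch while
# the measured latency stays under the SLO, and back off multiplicatively
# when the SLO is violated. Names and the latency model are hypothetical.
import random

SLO_MS = 20.0          # user-specified service-level objective
ADDITIVE_STEP = 4      # batch-size increase while under the SLO
BACKOFF = 0.75         # multiplicative decrease when over the SLO

def measure_batch_latency(batch_size):
    # Stand-in for timing a real batched inference call: a fixed per-call
    # overhead plus a per-item cost, with some noise.
    overhead_ms, per_item_ms = 5.0, 0.4
    return overhead_ms + per_item_ms * batch_size * random.uniform(0.9, 1.2)

def tune_batch_size(rounds=50):
    batch_size = 1
    for _ in range(rounds):
        latency = measure_batch_latency(batch_size)
        if latency <= SLO_MS:
            batch_size += ADDITIVE_STEP                     # under budget: grow
        else:
            batch_size = max(1, int(batch_size * BACKOFF))  # violated: shrink
    return batch_size

if __name__ == "__main__":
    print("batch size settled near:", tune_batch_size())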
To improve prediction accuracy, Clipper introduces a set of selection policies that enable the prediction-serving system to adapt to feedback and perform online learning on top of black-box models. The selection policy uses reward feedback to choose between, and even combine, multiple candidate models for a given prediction request. By selecting the optimal model or set of models on a per-query basis, Clipper makes machine-learning applications more robust to dynamic environments and allows applications to react in real time to degrading or failing models. The selection-policy interface is designed to support ensemble methods (https://bit.ly/2a7aB8N) and explore/exploit techniques, and it can express a wide range of such methods, including multiarmed bandit techniques and the Thompson sampling algorithm used by LASER.
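As a heavily simplified illustration of such a selection policy, the sketch below treats each deployed model as an arm of a Bernoulli bandit and uses Thompson sampling over Beta posteriors to pick which model serves each query; reward feedback (1 if the prediction later proved useful, 0 otherwise) updates the chosen model’s posterior. The models, their accuracies, and the reward signal are all hypothetical.

# A simplified selection policy: Thompson sampling over black-box models.
# Each model is an arm of a Bernoulli bandit with a Beta posterior over
# its probability of producing a "good" prediction. Hypothetical setup.
import random

class ThompsonSelector:
    def __init__(self, model_names):
        # Beta(1, 1) prior (uniform) for every model.
        self.params = {name: [1, 1] for name in model_names}

    def select(self):
        # Sample a plausible accuracy for each model; pick the best draw.
        draws = {name: random.betavariate(a, b)
                 for name, (a, b) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, name, reward):
        # reward is 1 for a correct/useful prediction, 0 otherwise.
        a, b = self.params[name]
        self.params[name] = [a + reward, b + (1 - reward)]

if __name__ == "__main__":
    # Two hypothetical deployed models with different true accuracies.
    true_accuracy = {"model_a": 0.70, "model_b": 0.85}
    selector = ThompsonSelector(true_accuracy)
    for _ in range(2000):
        chosen = selector.select()
        reward = 1 if random.random() < true_accuracy[chosen] else 0
        selector.update(chosen, reward)
    print(selector.params)  # model_b should accumulate most of the traffic

In Clipper, the analogous policies can also combine the candidate models’ outputs into an ensemble prediction rather than always routing a query to a single model.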
There are two key takeaways from this paper: the first is the introduction of a modular prediction-serving architecture capable of serving models trained in any machine-learning framework and providing the ability to scale each model independently; the second is the exploitation of the computational structure of inference (as opposed to the mathematical structure that several of the previous papers exploit) to improve performance. Clipper exploits this structure through batching, but there is potential for exploiting other kinds of structure, particularly in approaches that take more of a gray- or white-box approach to model serving and thus have more fine-grained performance information.
Emerging Systems and Technologies
Machine learning in general, and prediction serving in particular, is an exciting and fast-moving field. Along with the research described in this article, commercial systems are actively being developed for low-latency prediction serving. TensorFlow Serving (https://www.tensorflow.org/serving/) is a prediction-serving system developed by Google to serve models trained in TensorFlow. The Microsoft Custom Decision Service (https://bit.ly/2JHp1v2), with an accompanying paper (https://arxiv.org/abs/1606.03966), provides a cloud-based service for optimizing decisions using multiarmed bandit algorithms and reinforcement learning, with the same kinds of explore/exploit algorithms as the Thompson sampling used in LASER or the selection policies of Clipper. Finally, Nvidia’s TensorRT (https://developer.nvidia.com/tensorrt) is a deep-learning optimizer and runtime for accelerating deep-learning inference on Nvidia GPUs.
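To give a sense of what calling one of these systems looks like from an application, the snippet below queries TensorFlow Serving’s REST predict endpoint. It assumes a SavedModel exported under the name my_model and served with its REST API on port 8501 (as in the TensorFlow Serving documentation examples); the model name, host, and input values are placeholders.

# Querying a TensorFlow Serving instance over its REST API.
# Assumes a SavedModel exported as "my_model" and served on localhost:8501;
# the model name, host, and input are placeholders for illustration.
import json
import urllib.request

url = "http://localhost:8501/v1/models/my_model:predict"
payload = json.dumps({"instances": [[1.0, 2.0, 5.0]]}).encode("utf-8")

request = urllib.request.Request(
    url, data=payload, headers={"Content-Type": "application/json"})
with urllib.request.urlopen(request) as response:
    print(json.load(response)["predictions"])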
While the focus of this article is on systems for prediction serving, there have also been exciting developments around new hardware for machine learning. Google has now created two versions of its TPU (Tensor Processing Unit) custom ASIC. The first version, announced in 2016, was developed specifically to increase the speed and decrease the power consumption of its deep-learning inference workloads. The TPUv2, announced in 2017, supports both training and inference workloads and is available as part of Google’s cloud offering. Project Brainwave (https://bit.ly/2iotXMQ) from Microsoft Research is exploring the use of FPGAs (field-programmable gate arrays) to perform hardware-based prediction serving and has already achieved some exciting results, demonstrating simultaneously high-throughput and low-latency deep-learning inference on a variety of model architectures. Finally, both Intel’s Nervana chips and Nvidia’s Volta GPUs are new machine-learning-focused architectures for improving the performance and efficiency of machine-learning workloads at both training and inference time.

As machine learning matures from an academic discipline to a widely deployed engineering discipline, we anticipate that the focus will shift from model development to prediction serving. As a consequence, we are eager to see how the next generation of machine-learning systems can build on the ideas pioneered in these papers to drive further advances in prediction-serving systems.

Dan Crankshaw is a Ph.D. student in the UC Berkeley CS department working in the RISELab. His current research interests include systems and techniques for serving and deploying machine learning, with a particular emphasis on low-latency and interactive applications.

Joseph Gonzalez is an assistant professor at UC Berkeley and co-director of the UC Berkeley RISELab, where he studies the design of algorithms, abstractions, and systems for scalable machine learning. Before joining UC Berkeley, he co-founded Turi Inc. (formerly GraphLab) to develop AI tools for data scientists and later sold Turi to Apple. He also developed the GraphX framework in Apache Spark.

Copyright held by owners/authors. Publication rights licensed to ACM. $15.00