while maintaining the same distributed architecture for scalability and fault
tolerance. One paper, however, looks
at whether Sparrow-style wide distribution is really required for scalability.
Can We Have Quality and Speed?

Gog, I., et al. Firmament: Fast, centralized cluster scheduling at scale. In Proceedings of the USENIX Conference on Operating Systems Design and Implementation, 2016, 99–115; https://www.
Distributed decisions improve scalability and fault tolerance, but schedulers
must make them in the presence of reduced (and only statistically sampled)
information about cluster state. By contrast, centralized schedulers, which
make all decisions in the same place,
have the information to apply more sophisticated algorithms—for example, to
avoid overloaded machines. Of course, this applies both to cluster-level and to application-level schedulers.
This paper sets out to investigate
whether—fault-tolerance benefits notwithstanding—distribution is indeed
necessary for scalability. It notes that it is
crucial to amortize the cost of decisions
over many tasks, especially if the scheduler supports features that require reconsidering the existing placements, such as
task preemption. Think about it this way: If the scheduler picks a task off a queue, looks at a whole bunch of machines, and does some complex calculations just to decide where to put this single task, it cannot scale well to many tasks.
Firmament generalizes the Quincy
scheduler, a cool—and sometimes-overlooked—system that uses a min-cost, max-flow constraint solver to
schedule batch workloads. The constraint solver always schedules the entire
cluster workload, not just waiting tasks.
Because min-cost, max-flow solvers are
highly optimized, their algorithms amortize the work well over many tasks.
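To make the flow-based formulation concrete, the following sketch (in Python, using the networkx library) frames a toy placement problem as min-cost, max-flow: each task supplies one unit of flow, each machine drains up to its free slots into a sink, and arc costs encode placement preferences. The tasks, machines, slot counts, and costs are invented for illustration; Quincy and Firmament use far richer flow networks and purpose-built solvers.

```python
# Toy min-cost, max-flow formulation of task placement, in the spirit of
# Quincy/Firmament (illustrative only). Requires the networkx package.
import networkx as nx

tasks = ["t0", "t1", "t2"]
machines = {"m0": 2, "m1": 1}            # machine -> free task slots (made up)
cost = {                                  # placement cost, e.g. load or locality
    ("t0", "m0"): 1, ("t0", "m1"): 3,
    ("t1", "m0"): 2, ("t1", "m1"): 1,
    ("t2", "m0"): 5, ("t2", "m1"): 2,
}

G = nx.DiGraph()
for t in tasks:
    G.add_node(t, demand=-1)              # each task supplies one unit of flow
G.add_node("sink", demand=len(tasks))     # all flow must drain into the sink
for m, slots in machines.items():
    G.add_edge(m, "sink", capacity=slots, weight=0)
for (t, m), c in cost.items():
    G.add_edge(t, m, capacity=1, weight=c)

flow = nx.min_cost_flow(G)                # one global solve places *all* tasks
placement = {t: m for t in tasks for m in machines if flow[t].get(m, 0) == 1}
print(placement)          # {'t0': 'm0', 't1': 'm0', 't2': 'm1'} for these costs
```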
Applied naively, however, the Quincy approach cannot scale to large workloads over thousands of machines—the constraint-solver runtime, which dominates scheduling latency, would be unacceptably long. To fix this, Firmament concurrently runs several min-cost, max-flow algorithms with different properties and solves the optimization problem incrementally if possible, refining a previous solution rather than starting over.
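The control structure for running several algorithms on the same problem concurrently might look roughly like the sketch below; the solver callables are hypothetical stand-ins, and Firmament's actual logic for choosing among algorithms (and for incremental re-solving) is considerably more involved than simply taking the first finisher.

```python
# Rough sketch of racing several solver algorithms over the same problem and
# taking whichever finishes first. The algorithm callables are hypothetical
# stand-ins; a real scheduler would run native solvers and also reuse the
# previous solution to solve incrementally.
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def solve_first(problem, algorithms):
    pool = ThreadPoolExecutor(max_workers=len(algorithms))
    futures = [pool.submit(algo, problem) for algo in algorithms]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False, cancel_futures=True)  # Python 3.9+; best effort
    return next(iter(done)).result()                # first solution wins
```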
With some additional tricks, Firmament achieves subsecond decision times even when scheduling a Google-scale cluster running hundreds of thousands of tasks. This allows Sparrow-style application-level tasks to be placed within hundreds of milliseconds in a centralized way even on thousands of machines. The paper also shows there is no scalability-driven need for distributed cluster-level schedulers, as Firmament runs a 250-times-accelerated Google cluster trace with median task runtimes of only four seconds, while still making subsecond decisions in the common case. The simulator and Firmament itself are open source, and there is a plug-in that allows Kubernetes to use Firmament as a scheduler.
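As a practical aside, Kubernetes lets individual pods opt in to an alternative scheduler through the spec.schedulerName field, which is how such a plug-in is wired up. The sketch below uses the official Python client; the scheduler name shown is an assumption, so check the plug-in's documentation for the name it actually registers.

```python
# Minimal sketch: submit a pod that asks to be placed by a custom scheduler
# rather than the default one. Assumes a working kubeconfig; the scheduler
# name "firmament-scheduler" is illustrative, not necessarily what the
# plug-in registers.
from kubernetes import client, config

config.load_kube_config()
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="demo"),
    spec=client.V1PodSpec(
        scheduler_name="firmament-scheduler",   # field: spec.schedulerName
        containers=[client.V1Container(name="app", image="nginx")],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```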
The Firmament paper suggests we
need not compromise on decision
quality to solve a perceived scalability
problem in cluster scheduling. Nevertheless, Sparrow-style distributed
schedulers are useful: for example, a
fault-tolerant application-level load
balancer that serves a fixed set of equally powerful workers might well wish to
use Sparrow’s architecture.
Research to Practice
What does all this mean for you, the
reader? For one, you are almost certainly already using applications that
run on Borg and other cluster managers every day. Moreover, if you are running a business that computes on large
data or runs web applications (or, especially, both!), you will probably want
the automation of a cluster manager.
Many companies already do so and
run their own installations of Mesos or
Kubernetes on clusters of VMs provisioned on the cloud or on machines on
their own premises.
The problem of scaling cluster
managers and their schedulers to very
large clusters, however, is one that
most readers won’t have to face: only
a few dozen companies run such large
clusters, and buying resources on AWS
(Amazon Web Services) or Google’s or
Microsoft’s clouds is the easiest way to
scale. In some cases, scheduler scalability can also be an issue in smaller
clusters, however: if you are running
interactive analytics workloads with
short tasks, a scalable scheduler may
give you better resource utilization.
Another important aspect of the
scheduling problem is that cluster workloads are quite diverse, and
scheduling policies in practice often
require substantial hand-tuning using
placement constraints. Indeed, this is
what makes cluster scheduling different from the multiprocessor scheduling that your kernel does: while most
applications are fine with the general-purpose policies the kernel applies
to assign processes to cores, current
cluster-level placement policies often
do not work well for all workload mixes
without some manual operator help.
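To give a flavor of what such manual help looks like, the sketch below filters candidate machines for a task by hand-specified constraints (required machine attributes and anti-affinity with already-placed jobs). The attributes and job names are made up, and real cluster managers express this through richer mechanisms such as labels, taints, and affinity rules.

```python
# Illustrative sketch of hand-specified placement constraints: a task may
# require machine attributes (e.g., an SSD or a GPU) and refuse co-location
# with certain other jobs. Names and attributes are invented.
def candidate_machines(task, machines, placements):
    for m in machines:
        if not task["requires"] <= m["attributes"]:
            continue                      # machine lacks a required attribute
        if task["anti_affinity"] & placements.get(m["name"], set()):
            continue                      # avoid co-location with these jobs
        yield m["name"]

machines = [
    {"name": "m0", "attributes": {"ssd", "gpu"}},
    {"name": "m1", "attributes": {"ssd"}},
]
placements = {"m0": {"web-frontend"}}     # jobs already running per machine
task = {"requires": {"ssd"}, "anti_affinity": {"web-frontend"}}
print(list(candidate_machines(task, machines, placements)))   # ['m1']
```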
Since it is challenging for humans to
develop scheduling policies and good
placement heuristics that suit all (or
even most) workloads, research on
policies that help in specific settings is
guaranteed to continue.
Another approach, however, may
also be viable. Recent early results
suggest that the abundant metrics
data and the feedback loops of cluster-scheduling decisions are a good fit for
modern machine-learning techniques:
they allow training neural networks to
automatically learn custom heuristics
tailored to the workload. For example,
reinforcement learning can effectively
learn packing algorithms that match or
outperform existing, human-specified
heuristics, and a neural network outperforms human planners in TensorFlow’s
application-level operator scheduling.
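A bare-bones illustration of that reinforcement-learning framing, not taken from any of those systems: a softmax policy over machines is trained with a simple policy-gradient (REINFORCE-style) update, with a reward that favors balanced machine loads; the machine count, task count, and hyperparameters are arbitrary.

```python
# Toy policy-gradient sketch of learned task packing (illustrative only).
# A linear softmax policy maps the current per-machine load to a choice of
# machine; the reward penalizes load imbalance.
import numpy as np

rng = np.random.default_rng(0)
N, EPISODES, LR = 4, 2000, 0.01
W = np.zeros((N, N))                        # policy weights: logits = W @ load

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(EPISODES):
    load = np.zeros(N)
    grads, rewards = [], []
    for _task in range(16):                 # 16 unit-sized tasks per episode
        probs = softmax(W @ load)
        a = rng.choice(N, p=probs)
        grads.append(np.outer(np.eye(N)[a] - probs, load))  # d log pi / dW
        load[a] += 1.0
        rewards.append(-load.std())         # reward: keep machines balanced
    returns = np.cumsum(rewards[::-1])[::-1]                 # reward-to-go
    W += LR * sum(g * r for g, r in zip(grads, returns))
```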
Therefore, future research may
raise the level of automation in cluster
management even further: perhaps the
cluster scheduler will someday learn
its own algorithm.
Editor’s Note: To read this article with embedded
hyperlinks, visit https://queue.acm.org/detail.
Malte Schwarzkopf is a post-doctoral associate in the
Parallel and Distributed Operating Systems (PDOS) Group
at MIT, Cambridge, MA, USA.
Copyright held by owner/author.
Publication rights licensed to ACM. $15.00