This installment of Research for Practice features
a curated selection from Malte Schwarzkopf, who
takes us on a tour of distributed cluster scheduling,
from research to practice, and back again. With
the rise of elastic compute resources, cluster
management has become an increasingly hot topic
in systems R&D, and a number of competing cluster
managers including Kubernetes, Mesos, and Docker
are currently jockeying for the crown in this space.
Interested in the foundations behind these systems,
and how to achieve fast, flexible, and fair scheduling?
Malte’s got you covered!
Peter Bailis is an assistant professor of computer science at Stanford University.
His research in the Future Data Systems group (futuredata.stanford.edu) focuses
on the design and implementation of next-generation data-intensive systems.
Increasingly, many applications and websites
rely on distributed back-ends running in cloud
datacenters. In these
datacenters, clusters of
hundreds or thousands of machines
run workloads ranging from fault-tolerant, load-balanced Web servers to
batch data-processing pipelines and
distributed storage stacks.
A cluster manager is special
“orchestration” software that manages
the machines and applications in
such a datacenter automatically:
some widely known examples are Kubernetes, Mesos, and Docker Swarm.
Why are cluster managers needed?
Most obviously because managing
systems at this scale is beyond the capabilities of human administrators.
Just as importantly, however, automation and smart resource management
save real money. This is true both at
large scale—Google estimates that its
cluster-management software helped
avoid building several billion-dollar
datacenters—and at the scale of a
startup’s cloud deployment, where
wasting hundreds of dollars a month
on underutilized virtual machines
may burn precious runway.
As few academic researchers have
access to real, large-scale deployments,
academic papers on cluster management largely focus on scheduling workloads efficiently, given limited resources, rather than on more operational
aspects of the problem. Scheduling is
an optimization problem with many
possible answers whose relative goodness depends on the workload and
the operator’s goals. Thinking about
solutions to the scheduling problem,
however, has also given rise to a vigorous debate about the right architecture
for scalable schedulers for ever larger
clusters and increasingly demanding workloads.
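To make the point about operator goals concrete, here is a minimal sketch, not taken from any of the systems named above. It places the same task under two hypothetical scoring policies: one packs tasks tightly so that idle machines can be freed, the other spreads them out to leave headroom. The names Machine, pack, and spread, and the CPU counts, are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    free_cpus: int


def feasible(machines, cpus_needed):
    """Return the machines that have enough free CPUs for the task."""
    return [m for m in machines if m.free_cpus >= cpus_needed]


def pack(machines, cpus_needed):
    """Packing goal: pick the feasible machine left with the LEAST spare CPUs."""
    candidates = feasible(machines, cpus_needed)
    return min(candidates, key=lambda m: m.free_cpus - cpus_needed, default=None)


def spread(machines, cpus_needed):
    """Spreading goal: pick the feasible machine left with the MOST spare CPUs."""
    candidates = feasible(machines, cpus_needed)
    return max(candidates, key=lambda m: m.free_cpus - cpus_needed, default=None)


# Hypothetical three-machine cluster and a task that needs two CPUs.
machines = [Machine("m1", 2), Machine("m2", 8), Machine("m3", 4)]
packed_choice = pack(machines, 2)
spread_choice = spread(machines, 2)
print(packed_choice.name, spread_choice.name)
```

Both placements are valid, yet the two policies choose different machines for the same task. Which one is "better" depends on whether the operator cares more about utilization or about headroom.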
Let’s start by looking at a paper
that nicely summarizes the many
facets of a full-fledged industry cluster manager, and then dive into the
scheduler architecture debate.
BY MALTE SCHWARZKOPF