while helping keep tenant systems as stable as possible while their owners troubleshoot.
The cutting-edge nature of this field of engineering makes it one of the most exciting areas in which to work, enabling workload mobility, uniform scalability, and self-healing systems to
Related articles

Hadoop Superlinear Scalability
Neil Gunther, Paul Puglia, and Kristofer Tomasette

A Conversation with Phil Smoot

The Network is Reliable
Peter Bailis and Kyle Kingsbury
References

1. Apache Mesos; http://mesos.apache.org/.
2. Envoy; https://www.envoyproxy.io/.
3. Fluentd; https://www.fluentd.org/.
4. Gog, I., Schwarzkopf, M., Gleave, A., Watson, R.N.M. and Hand, S. Firmament: Fast, centralized cluster scheduling at scale. In Proceedings of the 12th Usenix Symposium on Operating Systems Design and Implementation, 2016; https://research.google.com/
5. Isard, M., Prabhakaran, V., Currey, J., Wieder, U., Talwar, K. and Goldberg, A. Quincy: Fair scheduling for distributed computing clusters. In Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles, 2009, 261–276; https://dl.acm.org/citation.
6. Krasnov, V. On the dangers of Intel's frequency scaling. Cloudflare, 2017; https://blog.cloudflare.com/
7. Lo, D., Cheng, L., Govindaraju, R., Ranganathan, P. and Kozyrakis, C. Improving resource efficiency at scale with Heracles. ACM Trans. Computer Systems 34, 2.
8. Microsoft System Center. Methods and formula used to determine server capacity. TechNet Library.
9. Rosen, R. Understanding the new control groups API. LWN.net, 2016; https://lwn.net/Articles/679786/.
10. Schwarzkopf, M., Konwinski, A., Abd-El-Malek, M. and Wilkes, J. Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the SIGOPS 2013 European Conference on Computer Systems;
11. Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E. and Wilkes, J. Large-scale cluster management at Google with Borg. In Proceedings of the 10th European Conference on Computer Systems,
Diptanu Gon Choudhury (@diptanu) works at Facebook on large-scale distributed systems. He is one of the maintainers of the Nomad open source cluster scheduler and previously worked on cluster schedulers on top of Apache Mesos at Netflix.

Timothy Perrett (@timperrett) is an infrastructure engineering veteran, author, and speaker who has led engineering teams at a range of blue-chip companies. He is primarily interested in scheduling systems, programming language theory, security systems, and changing the way industry approaches

© 2018 ACM 0001-0782/18/6 $15.00.
spend varying amounts of time serving jobs from different teams, depending on the constraints and volume of the tasks and on churn in the cluster. Monolithic schedulers, which centralize all the scheduling work, are more prone to these kinds of problems than two-level schedulers such as Mesos, where operators can run multiple frameworks so that each scheduler serves a single purpose and does not share scheduling time with anything else.
With monolithic schedulers it is important to develop concepts such as quotas for various types of jobs or teams. Another way to scale schedulers is parallel scheduling, in the manner of Nomad: operators run many scheduler instances that work in parallel and decide how many scheduler processes to run for a certain job type.
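The parallel-scheduling idea can be sketched with a small, hypothetical in-memory model (real systems such as Nomad persist this state in a replicated store): each scheduler worker plans against a snapshot of the cluster and commits its placement optimistically, retrying if another worker changed the node in the meantime. All names here are illustrative, not any scheduler's actual API.

```python
import threading

# Hypothetical in-memory cluster state shared by parallel scheduler workers.
class ClusterState:
    def __init__(self, nodes):
        self.lock = threading.Lock()
        self.free_cpu = dict(nodes)            # node -> free CPU shares
        self.version = {n: 0 for n in nodes}   # per-node modification counter

    def snapshot(self):
        with self.lock:
            return dict(self.free_cpu), dict(self.version)

    def commit(self, node, cpu, seen_version):
        # Optimistic concurrency: the placement commits only if the node
        # has not changed since this worker took its snapshot.
        with self.lock:
            if self.version[node] != seen_version or self.free_cpu[node] < cpu:
                return False                   # conflict: the worker retries
            self.free_cpu[node] -= cpu
            self.version[node] += 1
            return True

def schedule(state, task_cpu):
    # One scheduler worker; many of these can run in parallel,
    # for example one per job type (service, batch, system).
    while True:
        free, versions = state.snapshot()
        node = max(free, key=free.get)         # simple bin-packing stand-in
        if free[node] < task_cpu:
            return None                        # cluster is full
        if state.commit(node, task_cpu, versions[node]):
            return node
```

Because conflicts are detected at commit time rather than prevented by a global lock, adding more workers increases throughput for independent job types without serializing all scheduling decisions.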
Recovering cluster state from failures. Scheduler operators want AP (available and partition-tolerant, in CAP terms) systems in practice, because they prefer availability and operability over consistency. Convergence of the cluster state must eventually be guaranteed, either after all the cluster events have been processed or by some form of reconciliation mechanism. Most real-world schedulers, however, are built on top of highly consistent coordination systems such as ZooKeeper or etcd, because building and reasoning about such distributed systems is easier when the data store provides guarantees of linearizability. It is not unheard of for schedulers to lose their entire database for a few hours. One such instance occurred when AWS had a Dynamo outage and a large scheduler running on top of AWS was using Dynamo to store cluster state. Not much can be done in such situations, but scheduler developers have to consider this scenario and design for the least impact on services running on the cluster.
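The appeal of a linearizable store can be made concrete with a minimal sketch: state updates go through compare-and-swap, so two scheduler instances cannot both believe they committed the same placement. The key-value interface below is hypothetical (etcd exposes comparable revision-based transactions), and is only meant to illustrate the reasoning benefit.

```python
# Minimal sketch of a linearizable key-value store with compare-and-swap.
# Real schedulers delegate this to ZooKeeper or etcd; the point is that
# every write is conditional on the revision the writer last observed.
class LinearizableKV:
    def __init__(self):
        self.data = {}                 # key -> (value, revision)

    def get(self, key):
        return self.data.get(key, (None, 0))

    def cas(self, key, expected_rev, value):
        _, rev = self.get(key)
        if rev != expected_rev:
            return False               # another writer won; caller re-reads
        self.data[key] = (value, rev + 1)
        return True

def assign_task(kv, task_id, node):
    # Record a placement exactly once, even with competing scheduler
    # instances: the second writer's CAS fails instead of overwriting.
    value, rev = kv.get("task/" + task_id)
    if value is not None:
        return False                   # already placed elsewhere
    return kv.cas("task/" + task_id, rev, node)
```

With this discipline, a split-brain pair of schedulers cannot silently double-place a task; the losing writer observes the conflict and re-reads the state.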
Some schedulers, such as Mesos, allow operators to configure a duration after which an agent that is disconnected from the scheduler starts killing the tasks running on its machine. Usually this is done on the assumption that the scheduler is disconnected from the node because of failures such as network partitions; since the scheduler thinks the node is offline, it has already restarted that machine's tasks somewhere else in the cluster. This does not work when schedulers are experiencing outages or have failed in an unrecoverable manner. It is better to design scheduler agents that do not kill tasks when the agent disconnects from the scheduler, but instead allow the tasks to run and even restart them if a long-running service fails. Once the agents rejoin the cluster, the reconciliation mechanisms should converge the state of the cluster to an expected state.
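That agent-side behavior might be sketched as follows; the class and its callbacks are hypothetical, not any real agent's API. The essential point is that disconnection triggers nothing destructive: services are supervised locally, and the agent's full view of its tasks is reported back for reconciliation on rejoin.

```python
# Hypothetical agent-side supervision: keep tasks alive while partitioned
# from the scheduler, restart failed long-running services locally, and
# report state for reconciliation when the agent rejoins the cluster.
class Agent:
    def __init__(self):
        self.tasks = {}        # task_id -> {"service": bool, "state": str}
        self.connected = True

    def launch(self, task_id, service=True):
        self.tasks[task_id] = {"service": service, "state": "running"}

    def on_task_exit(self, task_id):
        task = self.tasks[task_id]
        if task["service"]:
            # Long-running services are restarted locally, even while
            # the scheduler is unreachable.
            task["state"] = "running"
        else:
            task["state"] = "complete"

    def on_disconnect(self):
        # Deliberately do nothing destructive: the scheduler may be the
        # failed component; the local tasks themselves are healthy.
        self.connected = False

    def on_rejoin(self):
        self.connected = True
        # Report everything this agent knows so the scheduler's
        # reconciliation can converge cluster state (for example,
        # garbage-collecting placements duplicated during the partition).
        return {tid: t["state"] for tid, t in self.tasks.items()}
```

The trade-off is that a task the scheduler rescheduled elsewhere may briefly run twice; reconciliation, not the agent's kill timer, is what resolves that duplication.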
The process of restoring a cluster's state when a scheduler loses all its data is complicated, and the design depends largely on the architecture and data model of the scheduler. On Apache Mesos, the scheduler frameworks11 can query the statuses of tasks for known task IDs, and the Mesos master responds with the current state of the nonterminal tasks. On Nomad, the cluster state is captured in the raft stores of the schedulers, and there is no good way to back up the cluster state and restore from a snapshot; users are expected to resubmit the jobs. Nomad can then reconcile the cluster state, which creates a lot of churn in services.
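The Mesos-style recovery path can be reduced to a small sketch, assuming a hypothetical `query_master` callback standing in for the master's task-status reconciliation: the scheduler asks about every task ID it still knows, treats anything unreported as lost, and reschedules only those, which limits churn.

```python
# Hypothetical reconciliation after a scheduler loses its database.
# query_master stands in for asking the cluster (as a Mesos framework
# can ask the master) for the status of known task IDs; it returns
# {task_id: state} for nonterminal tasks only.
def reconcile(known_task_ids, query_master):
    reported = query_master(known_task_ids)
    running = {tid for tid, state in reported.items() if state == "RUNNING"}
    # Tasks the master no longer reports must be treated as lost and
    # rescheduled; everything still reported is simply re-adopted,
    # avoiding needless restarts of healthy services.
    lost = set(known_task_ids) - set(reported)
    return running, lost
```

The contrast with wholesale job resubmission is that re-adopting reported tasks leaves healthy services untouched; only the genuinely lost remainder generates new placements.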
Designing for failures in all aspects of a distributed cluster scheduler is a must for operational stability and reliability. Scheduler agents should be developed with the understanding that only finite amounts of resources exist on a given system. Processes can leak resources or consume more resources than they were intended to, resulting in unexpected performance degradation caused by resource contention. These scheduler agents must also be able to converge on a good state by using robust reconciliation mechanisms during a given failure (or set of failures), even when particular failure modes could inundate the scheduler with cluster events (for example, the loss of many agent systems caused by a power failure).
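An agent's finite-resource discipline can be sketched as a simple accounting layer; the class below is illustrative only (real agents enforce limits through mechanisms such as cgroups). It admits a task only if its reservation fits in the remaining capacity, and flags tasks whose observed usage has grown past their reservation, the leak scenario described above.

```python
# Hypothetical admission and usage accounting on a scheduler agent.
# Resources are finite: a task whose reservation does not fit is
# rejected outright, and tasks leaking past their reservation are
# surfaced as candidates for OOM-kill or rescheduling.
class ResourceAccountant:
    def __init__(self, total_mem_mb):
        self.total = total_mem_mb
        self.reserved = {}             # task_id -> memory reservation (MB)

    def admit(self, task_id, mem_mb):
        if sum(self.reserved.values()) + mem_mb > self.total:
            return False               # no capacity left: reject, don't queue
        self.reserved[task_id] = mem_mb
        return True

    def over_limit(self, usage):
        # usage: task_id -> observed memory (MB). Returns tasks consuming
        # more than they reserved, i.e., sources of resource contention.
        return [tid for tid, used in usage.items()
                if used > self.reserved.get(tid, 0)]
```

Rejecting at admission time keeps contention from ever starting, while the over-limit check catches the slower failure mode of a task that leaks after it was correctly placed.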
Engineers looking to build scheduling systems should consider all the failure modes of the underlying infrastructure they use, as well as how operators of scheduling systems can configure remediation strategies,