New instances must have proven to be healthy, and to be working as expected from the perspective of key metrics, before a rollout proceeds. Schedulers such as Nomad and Kubernetes come with such provisions: they move on to deploying newer versions of software only when the current set of tasks passes health checks, and they stop a deployment if tasks start encountering failures.
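As a rough illustration, the health-gated rolling update just described might be sketched as follows; the `deploy_new` and `is_healthy` callbacks are hypothetical stand-ins for whatever the platform provides, not a real Nomad or Kubernetes API:

```python
import time

def rolling_update(instances, deploy_new, is_healthy,
                   batch_size=1, grace_seconds=30):
    """Incrementally replace instances, gating each batch on health checks.

    A sketch of the strategy only: `deploy_new` and `is_healthy` are assumed
    callbacks supplied by the platform, not a real scheduler API.
    """
    updated = []
    for i in range(0, len(instances), batch_size):
        batch = instances[i:i + batch_size]
        fresh = [deploy_new(inst) for inst in batch]
        time.sleep(grace_seconds)  # give the new tasks time to warm up
        if not all(is_healthy(inst) for inst in fresh):
            # Halt the rollout; the rest of the fleet keeps the old version.
            return updated, False
        updated.extend(fresh)
    return updated, True
```

In Kubernetes, the analogous knobs are a Deployment's rolling-update parameters combined with readiness probes; Nomad's update stanza plays a similar role.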
System software failure. System software, such as the Docker daemon and monitoring and logging software, is an integral part of the scheduler infrastructure and thus contributes to the health of the cluster. New versions of such software have often been deployed, only to be found some time later to be causing failures in a cluster. In one instance, a specific Auto Scaling Group on AWS (Amazon Web Services) started misbehaving a few days after the cluster joined the scheduler infrastructure; it turned out that a new version of Docker with functional regressions had been rolled out.
In most cases, the best strategy for dealing with such failures is to disable scheduling on those machines and drain the assigned work, forcing the workload to relocate elsewhere in the datacenter. Alternatively, you could introduce additional resource capacity with a working configuration of all system software, so that pending workloads can be scheduled successfully.
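The cordon-and-drain response can be sketched with a toy in-memory cluster model; the data structures here are assumptions for illustration, though real schedulers expose equivalent operations (cordoning and draining nodes in Kubernetes, node drain in Nomad):

```python
def cordon_and_drain(cluster, bad_nodes):
    """Disable scheduling on suspect machines and relocate their work.

    `cluster` is a hypothetical in-memory model mapping node name to a dict
    with a `schedulable` flag and a `tasks` list.
    """
    # Cordon first, so no new work lands on the suspect machines.
    for node in bad_nodes:
        cluster[node]["schedulable"] = False
    # Drain: collect the assigned work off the suspect machines.
    pending = []
    for node in bad_nodes:
        pending.extend(cluster[node]["tasks"])
        cluster[node]["tasks"] = []
    # Relocate onto the remaining schedulable nodes; if none exist, the
    # operator must add capacity, as the text notes.
    healthy = [n for n, state in cluster.items() if state["schedulable"]]
    if pending and not healthy:
        raise RuntimeError("no schedulable capacity left; add nodes")
    for i, task in enumerate(pending):
        cluster[healthy[i % len(healthy)]]["tasks"].append(task)
```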
Such failures affect the tasks of all
jobs in a specific cluster or resource
pool; hence, schedulers should have
good mechanisms for dealing with
them. A robust scheduler design
should ideally be able to detect an issue with a given cluster or resource pool and proactively stop scheduling workloads there. Such remediation events should be included in the telemetry emitted by the scheduler, so that on-call engineers can further debug and resolve the underlying problem.
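A pool-level circuit breaker of this kind might look like the following sketch, in which the pool counters and the emitted telemetry fields are illustrative assumptions:

```python
import time

def check_pools(pools, failure_threshold=0.5):
    """Disable scheduling into pools whose tasks are failing en masse.

    A sketch of a proactive pool-level circuit breaker: `pools` maps a pool
    name to hypothetical health counters, and the returned records stand in
    for the telemetry events the scheduler should emit for on-call engineers.
    """
    events = []
    for name, stats in pools.items():
        total = stats["healthy"] + stats["failed"]
        if total and stats["failed"] / total >= failure_threshold:
            stats["accepting"] = False  # stop scheduling new work into this pool
            events.append({
                "ts": time.time(),
                "event": "pool_disabled",
                "pool": name,
                "failed_ratio": round(stats["failed"] / total, 3),
            })
    return events
```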
Shared resource failures. Modes of failure at the infrastructure level include fiber cuts; faulty power distribution for a machine, rack, or ToR (top-of-rack) switch; and many other environmental possibilities. In such cases, other than moving affected workloads to unaffected systems, a scheduler can do very little to mitigate the problem.
Schedulers typically also provide node-level remediation for unhealthy tasks, such as restarts or the eviction of lower-priority tasks from a node to reduce resource contention.
Clusterwide failures indicate a problem far bigger than local node-related problems that can be solved with local remediation.
Such failures usually require paging on-call engineers for remediation; the scheduler, however, can also play a role during such failures. The authors have written and deployed schedulers that include clusterwide failure detectors and that prevent nodes from continuously restarting tasks locally. These schedulers also allow operators to define remediation strategies, such as reverting to a last known good version, decreasing the frequency of restarts, or stopping the eviction of other tasks, while the operators debug the possible causes of failure.
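An operator-defined remediation policy of this shape can be sketched as a small decision function; the task fields and action names are illustrative, not any particular scheduler's schema:

```python
def next_action(task, max_restarts=3, base_delay=10):
    """Choose a remediation step for a repeatedly failing task.

    A sketch of an operator-defined policy: restart with exponential
    backoff, then fall back to the last known good version rather than
    restarting forever. Field names are illustrative assumptions.
    """
    if task["restarts"] < max_restarts:
        # Decrease restart frequency: the delay doubles with every attempt.
        return ("restart", base_delay * 2 ** task["restarts"])
    if task.get("last_good_version"):
        # Give up on the current build and revert to a known good one.
        return ("revert", task["last_good_version"])
    # Nothing automatic left to try; page the on-call engineers.
    return ("page_oncall", None)
```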
Such failure-detection algorithms
usually take into consideration the
health of tasks cotenant on the same
machine to differentiate service-level
failures from other forms of infrastructure-related failures.
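One way to use cotenant health as that differentiating signal is a simple majority heuristic; the `(service, healthy)` data model here is an assumption for illustration:

```python
def classify_failure(machine_tasks, failing_service):
    """Guess whether a failing task points at its service or at the machine.

    `machine_tasks` lists (service, healthy) pairs for every task cotenant
    on one machine. If tasks of *other* services are also unhealthy, the
    machine or wider infrastructure is the likelier culprit; if only this
    service suffers, suspect a service-level failure. A heuristic sketch only.
    """
    others = [healthy for service, healthy in machine_tasks
              if service != failing_service]
    unhealthy_others = sum(1 for healthy in others if not healthy)
    if others and unhealthy_others >= len(others) / 2:
        return "infrastructure"
    return "service"
```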
Clusterwide failures should be
taken seriously by scheduler developers; the authors have encountered
failures that have generated so many
cluster events that they saturated the
scheduler’s ability to react to failures.
Therefore, sophisticated measures
must be taken to ensure the events are
sampled without losing the context
of the nature and magnitude of the
underlying issues. Depending on the magnitude of the failure, the saturation of events often brings operations to a standstill unless it is quickly mitigated. Here, we cover some of the most frequently used techniques for mitigating cluster-level failures.
Bad software push. Most cluster-level job failures are the result of bad software pushes or configuration changes. It can often be useful to track the start time of such failures and correlate it with cluster events such as job submissions, updates, and configuration changes. Another common yet simple technique for reducing the likelihood of clusterwide failures in the face of bad software pushes is a rolling update strategy that incrementally deploys new versions of software only after the new instances have proven to be healthy.