In some cluster schedulers, the default behavior when nodes become disconnected from the network is to begin killing tasks. Operationally, this can cause significant challenges when nodes return to a healthy state. In most cases, it is preferable to delegate the complexity of guaranteeing a fixed number of currently operational tasks to the application itself. This typically makes the scheduling system easier to operate and allows the application to get precisely the consistency and failure semantics it desires. Tasks should be allowed to rejoin the cluster gracefully once the failures are mitigated. Such measures reduce churn in the cluster and allow it to recover faster.
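The sketch below, in Go, illustrates that alternative behavior; the type names, states, and grace period are illustrative assumptions and mirror no particular scheduler. Tasks on an unreachable node are marked unknown rather than killed, and are simply re-adopted when the node heartbeats again.

```go
// A minimal sketch, not any particular scheduler's API, of leaving tasks on
// a disconnected node alone and re-adopting them when the node reconnects.
// All names and the grace period are illustrative assumptions.
package main

import (
	"fmt"
	"time"
)

type TaskState string

const (
	Running TaskState = "running"
	Unknown TaskState = "unknown" // node unreachable; task may still be alive
	Lost    TaskState = "lost"    // node gone longer than the grace period
)

type Node struct {
	Name      string
	Reachable bool
	LastSeen  time.Time
	Tasks     map[string]TaskState
}

// reconcileNode never kills tasks on a network partition; it only downgrades
// their state, and re-adopts them as soon as the node heartbeats again.
func reconcileNode(n *Node, now time.Time, grace time.Duration) {
	for id := range n.Tasks {
		switch {
		case n.Reachable:
			n.Tasks[id] = Running // graceful rejoin: keep the task
		case now.Sub(n.LastSeen) > grace:
			n.Tasks[id] = Lost // leave replacement policy to the application
		default:
			n.Tasks[id] = Unknown
		}
	}
}

func main() {
	n := &Node{Name: "node-1", Reachable: false,
		LastSeen: time.Now().Add(-2 * time.Minute),
		Tasks:    map[string]TaskState{"web-0": Running}}
	reconcileNode(n, time.Now(), 5*time.Minute)
	fmt.Println(n.Tasks) // map[web-0:unknown]
}
```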
Depletion of global resources. In addition to node resources, global resources such as aggregate bandwidth or power usage within the cluster should be tracked by the scheduler's resource allocator. Failure to track these global resources could result in placement decisions oversubscribing cluster resources, causing bottlenecks that create hotspots within the datacenter, thereby reducing the efficiency of the provisioned hardware.
For example, bin packing too many network I/O-intensive tasks in a single rack might saturate the links to the datacenter's backbone, creating contention even though the network links at the node level are not saturated. Bin packing workloads very tightly in a specific part of the datacenter can also have unexpected side effects on power consumption, putting pressure on the available cooling capacity.
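As a rough illustration, the Go sketch below accounts for one global resource, per-rack uplink bandwidth, when placing a task; the rack capacities, task demands, and function names are assumptions for the example, not a real allocator.

```go
// A minimal sketch of accounting for a global resource (per-rack uplink
// bandwidth) in addition to per-node resources when placing tasks.
package main

import (
	"errors"
	"fmt"
)

type Rack struct {
	UplinkGbps    float64 // capacity of the rack's links to the backbone
	CommittedGbps float64 // bandwidth already promised to placed tasks
}

type Task struct {
	Name        string
	NetworkGbps float64
}

// place admits a task onto a rack only if the rack's aggregate uplink would
// not be oversubscribed, even if individual nodes still have NIC headroom.
func place(racks map[string]*Rack, t Task) (string, error) {
	for name, r := range racks {
		if r.CommittedGbps+t.NetworkGbps <= r.UplinkGbps {
			r.CommittedGbps += t.NetworkGbps
			return name, nil
		}
	}
	return "", errors.New("no rack with enough aggregate uplink bandwidth")
}

func main() {
	racks := map[string]*Rack{
		"rack-a": {UplinkGbps: 40},
		"rack-b": {UplinkGbps: 40},
	}
	for i := 0; i < 10; i++ {
		rack, err := place(racks, Task{Name: fmt.Sprintf("etl-%d", i), NetworkGbps: 9})
		fmt.Println(rack, err)
	}
}
```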
Software-distribution mechanisms. It is very important to understand the bottlenecks of software-distribution mechanisms. For example, if the aggregate capacity of a distribution mechanism is 5 Gbps, launching a job with tens of thousands of tasks could easily saturate the distribution mechanism or even the shared backbone. This could have detrimental effects on the entire cluster and/or the running services. Parallel deployments of other services can often be affected by such a failure mode; hence, the parallelism of task launches must be capped to ensure no additional bottlenecks are created when tasks are deployed or updated.
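A counting semaphore is one simple way to enforce such a cap. The Go sketch below limits in-flight task launches to an assumed 50 at a time; launchTask is a stand-in for fetching artifacts and starting the task, and the numbers are illustrative.

```go
// A minimal sketch of capping the parallelism of task launches so a large
// job cannot saturate the software-distribution mechanism.
package main

import (
	"fmt"
	"sync"
)

func launchTask(id int) {
	// Placeholder for fetching artifacts/images and starting the task.
	fmt.Printf("launching task %d\n", id)
}

func main() {
	const maxConcurrentLaunches = 50
	sem := make(chan struct{}, maxConcurrentLaunches) // counting semaphore
	var wg sync.WaitGroup

	for id := 0; id < 20000; id++ {
		wg.Add(1)
		sem <- struct{}{} // blocks once 50 launches are in flight
		go func(id int) {
			defer wg.Done()
			defer func() { <-sem }()
			launchTask(id)
		}(id)
	}
	wg.Wait()
}
```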
Keep in mind that distribution mechanisms that are centralized in nature, such as the Docker registry, are part of the availability equation. When these centralized systems fail, job submission or update requests fail as well, thereby putting services at risk of becoming unavailable if they, too, are updated. Extensive caching of artifacts on local nodes to reduce pressure on centralized distribution mechanisms can be an effective mitigation strategy against centralized distribution outages. In some instances, peer-to-peer distribution technologies such as BitTorrent can further increase the availability and robustness of such systems.
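The caching idea can be sketched as follows: a node checks a local artifact cache before touching the central registry, so a registry outage affects only artifacts that were never cached. The cache path and fetchFromRegistry are hypothetical stand-ins, not a real registry client.

```go
// A minimal sketch of consulting a node-local artifact cache before going
// to a centralized distribution mechanism.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

var cacheDir = "/var/cache/artifacts" // assumed node-local cache location

// fetchArtifact returns a local path for the artifact, preferring the cache.
func fetchArtifact(name string) (string, error) {
	local := filepath.Join(cacheDir, name)
	if _, err := os.Stat(local); err == nil {
		return local, nil // cache hit: no load on the central registry
	}
	data, err := fetchFromRegistry(name) // hypothetical registry pull
	if err != nil {
		return "", fmt.Errorf("registry unavailable and %q not cached: %w", name, err)
	}
	if err := os.MkdirAll(cacheDir, 0o755); err != nil {
		return "", err
	}
	if err := os.WriteFile(local, data, 0o644); err != nil {
		return "", err
	}
	return local, nil
}

func fetchFromRegistry(name string) ([]byte, error) {
	// Stand-in for a pull from a centralized registry.
	return []byte("artifact: " + name), nil
}

func main() {
	path, err := fetchArtifact("web-v42.tar.gz")
	fmt.Println(path, err)
}
```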
Back-off strategies for remediation actions. Certain workloads might not perform well on any node in the cluster and might be adversely affecting the health of the services and the nodes. In such cases, schedulers must detect the trend and back off while reallocating workloads or bringing up additional capacity, to ensure they do not deplete global resources, such as the API call limits of cloud providers, or adversely affect co-tenant workloads, thereby causing cascading failures.
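One common way to implement such a back-off is an exponential delay between remediation attempts, capped at some maximum. The Go sketch below uses an assumed base of 5 seconds and a 5-minute cap; both numbers are illustrative, not recommendations.

```go
// A minimal sketch of backing off rescheduling attempts for a workload that
// keeps failing everywhere, so the scheduler does not burn global resources
// such as cloud-provider API call limits.
package main

import (
	"fmt"
	"time"
)

// backoff returns how long to wait before the next rescheduling attempt:
// exponential in the number of consecutive failures, capped at maxDelay.
func backoff(failures int, base, maxDelay time.Duration) time.Duration {
	d := base << uint(failures) // base * 2^failures
	if d > maxDelay || d <= 0 { // also guards against overflow
		return maxDelay
	}
	return d
}

func main() {
	for failures := 0; failures < 8; failures++ {
		fmt.Println(failures, backoff(failures, 5*time.Second, 5*time.Minute))
	}
}
```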
Control-Plane Failure Considerations
Control planes within schedulers have a different set of failure considerations than compute nodes and clusters, as the control plane must react to changes in the cluster as they happen, including various forms of failure. Software engineers writing such systems should understand the user interaction, scale, and SLA (service-level agreement) for workloads, and then derive an appropriate design that encompasses handling failures in the control plane. Here, we look at some of the important design considerations for control-plane developers.
Reliable cluster state reconciliation. At the end of the day, most schedulers are just managing cluster state, supervising tasks running on the cluster, and ensuring QoS for them. Schedulers usually track the cluster state and maintain an internal finite-state machine for all the cluster objects they manage, such as clusters, nodes, jobs, and tasks. The two main approaches to cluster state reconciliation are level-triggered and edge-triggered mechanisms. The former is employed by schedulers such as Kubernetes, which periodically look for unplaced work and try to schedule it. These kinds of schedulers often suffer from a fixed baseline latency for reconciliation.
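In code, level-triggered reconciliation boils down to a periodic loop that compares desired and observed state and fixes whatever gap it finds, regardless of which event caused it. The Go sketch below is a toy version with assumed data structures and an arbitrary scan interval.

```go
// A minimal sketch of level-triggered reconciliation: every tick, re-read
// desired and observed state and schedule whatever is still unplaced.
package main

import (
	"fmt"
	"time"
)

type ClusterState struct {
	Desired map[string]int // job -> desired task count
	Placed  map[string]int // job -> tasks currently placed
}

func reconcile(s *ClusterState) {
	for job, want := range s.Desired {
		for s.Placed[job] < want {
			// Placeholder for finding a node and binding the task to it.
			s.Placed[job]++
			fmt.Printf("placed task %d/%d of job %s\n", s.Placed[job], want, job)
		}
	}
}

func main() {
	s := &ClusterState{
		Desired: map[string]int{"web": 3},
		Placed:  map[string]int{"web": 1},
	}
	for i := 0; i < 3; i++ { // the fixed period sets a baseline latency
		reconcile(s)
		time.Sleep(10 * time.Millisecond) // stands in for the scan interval
	}
}
```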
Edge-triggered scheduling is more common. Most schedulers, such as Mesos and Nomad, work on this model. Events are generated when something changes in the cluster infrastructure, such as a task failing, a node failing, or a node joining. Schedulers must react to these events, updating the finite-state machine of the cluster objects and modifying the cluster state accordingly. For example, when a task fails in Mesos, the framework gets a TASK_LOST message from the master and reacts to that event based on certain rules, such as restarting the task elsewhere on the cluster or marking a job as dead or complete. Nomad is similar: it invokes a scheduler based on the type of the allocation that died, and the scheduler then decides whether the allocation needs to be replaced.
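A toy edge-triggered loop looks quite different from the periodic scan: each incoming event is handled as it arrives and drives the state machine forward. The event types and handlers below are illustrative and do not reproduce the Mesos or Nomad APIs.

```go
// A minimal sketch of edge-triggered scheduling: react to each cluster
// event as it arrives and update the scheduler's view of the cluster.
package main

import "fmt"

type EventType int

const (
	TaskFailed EventType = iota
	NodeFailed
	NodeJoined
)

type Event struct {
	Type EventType
	ID   string // task or node identifier
}

func handle(ev Event) {
	switch ev.Type {
	case TaskFailed:
		fmt.Println("rescheduling task elsewhere:", ev.ID)
	case NodeFailed:
		fmt.Println("marking node down and replacing its tasks:", ev.ID)
	case NodeJoined:
		fmt.Println("adding node to the pool of placement candidates:", ev.ID)
	}
}

func main() {
	events := make(chan Event, 2)
	events <- Event{TaskFailed, "web-2"}
	events <- Event{NodeJoined, "node-7"}
	close(events)
	for ev := range events { // react to each edge as it arrives
		handle(ev)
	}
}
```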
While event-driven schedulers are faster and more responsive in practice, guaranteeing correctness can be harder, since the schedulers have no room to drop or miss the processing of an event. Dropping cluster events will result in the cluster not converging to the right state; jobs might not be in their expected state or have the right number of tasks running. Schedulers usually deal with such problems by making the agents or the source of the cluster event resend the event until they receive an acknowledgment from the consumer that the event has been persisted.
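The resend-until-acknowledged pattern can be sketched as follows: the agent keeps retransmitting an event until the consumer confirms it has been persisted. Transport and persistence are stubbed out here and purely illustrative.

```go
// A minimal sketch of at-least-once event delivery: the agent resends a
// cluster event until the scheduler acknowledges that it has persisted it.
package main

import (
	"fmt"
	"time"
)

type Event struct {
	Seq  int
	Kind string
}

// sendEvent is the agent side: retry with a delay until the consumer acks.
func sendEvent(ev Event, deliver func(Event) bool) {
	for !deliver(ev) {
		time.Sleep(100 * time.Millisecond) // back off before resending
	}
}

func main() {
	persisted := map[int]Event{}
	attempts := 0

	// deliver stands in for the scheduler's receive path: it only acks
	// (returns true) once the event has actually been persisted.
	deliver := func(ev Event) bool {
		attempts++
		if attempts < 3 {
			return false // simulate a dropped or failed delivery
		}
		persisted[ev.Seq] = ev // persist before acknowledging
		return true
	}

	sendEvent(Event{Seq: 42, Kind: "task_failed"}, deliver)
	fmt.Println("attempts:", attempts, "persisted:", persisted)
}
```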
Quotas for schedulers. Schedulers are usually offered to various teams in an organization for consuming compute infrastructure in the datacenter. Schedulers usually implement quotas, which ensure that various jobs get the right amount of resources on the clusters during resource contention. Besides quotas for compute resources on the compute clusters, scheduler developers must also consider how much time the scheduler spends doing scheduling work per job. For example, the time it takes to schedule a batch job with 15,000 tasks is much greater than for a job with 10 tasks. Alternatively, a job might have a few tasks but very rigorous placement constraints. The scheduler might