within the available pool and reschedule that job for processing within the constraints imposed by the user. It is exactly at this moment of failure that things become interesting. Why did the workload fail? Was there an application problem? Or a machine-specific problem? Or was there perhaps a clusterwide or otherwise environmental problem? More importantly, how does the architecture of a scheduler impact the ability and timeline of a workload to recover? The answers to these questions directly affect and dictate how effective a scheduler can be in recovering that failed workload.
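To make this concrete, the sketch below shows how the answer to the "why did it fail?" question might map onto different recovery actions: an application bug should not simply be retried elsewhere, a machine-local fault usually can be, and a clusterwide fault calls for backing off. The type and function names are illustrative and not taken from any particular scheduler.

```go
// failuredomain.go: a minimal, hypothetical sketch of classifying a task
// failure before deciding how to recover it.
package main

import "fmt"

// FailureDomain captures the scope of a failure as discussed in the text.
type FailureDomain int

const (
	ApplicationFailure FailureDomain = iota // bug or bad input in the workload itself
	MachineFailure                          // a problem local to the host that ran the task
	ClusterFailure                          // a clusterwide or environmental problem
)

// RecoveryAction is what the scheduler decides to do with the failed task.
type RecoveryAction string

// decideRecovery maps a failure domain to a recovery action. Retrying an
// application bug on another node wastes capacity; retrying a machine-local
// fault elsewhere usually succeeds; a clusterwide fault calls for backing off.
func decideRecovery(d FailureDomain) RecoveryAction {
	switch d {
	case ApplicationFailure:
		return "fail the job and surface the error to the user"
	case MachineFailure:
		return "reschedule on a different, healthy node"
	case ClusterFailure:
		return "back off and retry once the environment recovers"
	default:
		return "hold for operator inspection"
	}
}

func main() {
	for _, d := range []FailureDomain{ApplicationFailure, MachineFailure, ClusterFailure} {
		fmt.Println(decideRecovery(d))
	}
}
```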
One of the responsibilities of a cluster scheduler is to supervise an individual unit of work, and the most primitive form of remediation is to move that workload to a different, healthy node; doing so will frequently resolve a given failure scenario. When using any kind of shared infrastructure, however, you must carefully evaluate the bulkheading options applied to that shared infrastructure and objectively assess the opportunity for cascading failure. For example, if an I/O-intensive job is relocated to a node that already hosts another I/O-intensive job, the two could saturate the network links in the absence of any bulkheading of IOPS (I/O operations per second), resulting in degraded QoS (quality of service) for other tenants on the node.
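A minimal sketch of such a bulkheading check, assuming hypothetical Job and Node types rather than any real scheduler's API, might reject a candidate node that already hosts another I/O-intensive tenant and enforces no per-tenant IOPS limits:

```go
// placement.go: a hypothetical sketch of the bulkheading check described above.
package main

import (
	"errors"
	"fmt"
)

// Job describes a unit of work; IOIntensive flags heavy disk or network I/O.
type Job struct {
	Name        string
	IOIntensive bool
}

// Node describes a candidate host. IOPSBulkheaded reports whether per-tenant
// I/O limits (for example, cgroup or QoS throttling) are enforced on the node.
type Node struct {
	Name           string
	Healthy        bool
	IOPSBulkheaded bool
	Running        []Job
}

// hostsIOIntensiveJob reports whether the node already runs an I/O-heavy tenant.
func (n Node) hostsIOIntensiveJob() bool {
	for _, j := range n.Running {
		if j.IOIntensive {
			return true
		}
	}
	return false
}

// pickNode returns the first healthy node that avoids the cascading failure
// described in the text: two unbulkheaded I/O-intensive jobs sharing a host.
func pickNode(job Job, pool []Node) (Node, error) {
	for _, n := range pool {
		if !n.Healthy {
			continue
		}
		if job.IOIntensive && n.hostsIOIntensiveJob() && !n.IOPSBulkheaded {
			continue // placing here could saturate the node's links and degrade QoS
		}
		return n, nil
	}
	return Node{}, errors.New("no suitable node in the pool")
}

func main() {
	pool := []Node{
		{Name: "node-a", Healthy: true, Running: []Job{{Name: "etl", IOIntensive: true}}},
		{Name: "node-b", Healthy: true, IOPSBulkheaded: true, Running: []Job{{Name: "etl", IOIntensive: true}}},
	}
	n, err := pickNode(Job{Name: "backup", IOIntensive: true}, pool)
	fmt.Println(n.Name, err) // node-b: bulkheading makes co-location acceptable
}
```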
In this article we highlight the various failure domains within scheduling systems and touch upon some of the practical problems operators encounter with machine, scheduler, environmental, and clusterwide failures. In addition, we offer some approaches to dealing with these failures.
Considerations for Machine Failures
Failures at the machine level are probably the most common. They have a variety of causes: hardware failures such
as disks crashing; faults in network
interfaces; software failures such as
excessive logging and monitoring; and
problems with containerizer daemons.
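As a rough illustration, the following sketch shows how a node agent might express these failure classes as named health checks and mark a node unschedulable when any of them fires. The probe stubs and names are hypothetical; a real agent would query the operating system and the container runtime.

```go
// nodehealth.go: a hypothetical sketch of per-node health probes covering the
// machine-level failure classes listed above.
package main

import "fmt"

// Check is a named probe that returns a detail string when unhealthy.
type Check struct {
	Name  string
	Probe func() (healthy bool, detail string)
}

// nodeChecks mirrors the causes enumerated in the text.
var nodeChecks = []Check{
	{"disk", func() (bool, string) { return true, "" }},              // e.g., SMART status, mount errors
	{"network-interface", func() (bool, string) { return true, "" }}, // e.g., link state, error counters
	{"log-volume", func() (bool, string) { return false, "disk filling from excessive logging" }},
	{"containerizer-daemon", func() (bool, string) { return true, "" }}, // e.g., liveness of the container runtime
}

// evaluateNode runs all probes and reports whether the node should stay schedulable.
func evaluateNode(checks []Check) (schedulable bool, reasons []string) {
	schedulable = true
	for _, c := range checks {
		if ok, detail := c.Probe(); !ok {
			schedulable = false
			reasons = append(reasons, c.Name+": "+detail)
		}
	}
	return
}

func main() {
	ok, reasons := evaluateNode(nodeChecks)
	fmt.Println("schedulable:", ok, reasons)
}
```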