All these failures result in degraded application performance or a potentially incoherent cluster state. While all possible system failure modes are too numerous to mention in one article, within the realm of scheduling there are a handful of important factors to consider. Here, we cover details about the mechanics of the modern Linux operating system and how to mitigate the effects of typical failure modes encountered in the field.
Capacity planning. Regardless of where compute capacity is housed—public cloud or private datacenter—at some point capacity planning will be necessary to figure out how many machines are needed. Traditional methods⁸ for capacity planning assume that compute resources are entirely dedicated, with a given machine having a sole tenant. While this is a common industry practice, it is often ineffective: application authors tend to be either overly optimistic about their runtime performance (resulting in insufficient capacity and potential outages at runtime) or overly cautious about resource consumption, leading to a high cost of ownership with a large amount of waste⁵ when operating at scale.

Assuming an application has been benchmarked for performance and resource consumption, running that application on a scheduling system introduces additional challenges for capacity planning: how will shared components handle multitenancy? Common configurations have per-node utilities for routing traffic, monitoring, and logging (among other tasks); will these impact those lab performance numbers? The probability is very high that they will have a (negative) impact.

Ensure the capacity plan includes headroom for the operating system, file systems, logging agents, and anything else that will run as a shared component. Critically, anything that is a shared component should have well-defined limits (where possible) on compute resources. Not provisioning an adequate amount of resources for system services inadvertently surfaces as busy-neighbor problems. Many schedulers allow operators to reserve resources for running system components, and correctly configuring these resource reservations can dramatically improve the predictability of application performance.
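To make the arithmetic concrete, here is a minimal sketch of how a node's allocatable capacity might be derived once reservations for shared components are carved out (the same idea behind, for example, the kubelet's --system-reserved and --kube-reserved settings in Kubernetes). The figures are illustrative, not recommendations:

```python
#!/usr/bin/env python3
"""Minimal sketch: derive a node's allocatable capacity by carving
reservations for shared components out of raw machine capacity.
All figures are illustrative, not recommendations."""

def allocatable(capacity: dict, *reservations: dict) -> dict:
    """Return what is left for scheduled tasks after all reservations."""
    remaining = dict(capacity)
    for reservation in reservations:
        for resource, amount in reservation.items():
            remaining[resource] -= amount
            if remaining[resource] < 0:
                raise ValueError(f"node is over-reserved on {resource}")
    return remaining

machine = {"cpu_millicores": 16000, "memory_mib": 65536}
system_reserved = {"cpu_millicores": 500, "memory_mib": 1024}  # OS, sshd, ...
agent_reserved = {"cpu_millicores": 250, "memory_mib": 512}    # node agent, log shipper

print(allocatable(machine, system_reserved, agent_reserved))
# -> {'cpu_millicores': 15250, 'memory_mib': 64000}
```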
˲ File systems. Learn about the overhead and resource usage of file systems. This is useful, for example, when using ZFS, for limiting the ARC (adaptive replacement cache) to an acceptable size, or, when planning to turn on deduplication or compression, for accounting for the CPU cycles that ZFS itself is going to use. Consider another example: two containers doing a lot of file-system I/O with a very limited cache would end up invalidating each other's caches, resulting in poor I/O performance. Limiting file-system IOPS is not straightforward in Linux, since the block I/O and memory controllers cannot interact with each other to limit writeback I/O under traditional cgroup v1. The next version of cgroup can properly limit I/O (see the sketch below), but a few controllers—such as the CPU controller—have not yet been merged.
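As an illustration of the knobs involved, the following sketch caps the ZFS ARC through the OpenZFS zfs_arc_max module parameter and throttles one container's writeback through the cgroup v2 io.max interface. It must run as root, and the device numbers, cgroup name, and limits are illustrative:

```python
#!/usr/bin/env python3
"""Sketch: bound file-system cache and writeback so tenants cannot
starve one another. Requires root; device numbers, cgroup name, and
limits are illustrative."""

GIB = 1024 ** 3

# Cap the ZFS ARC at 4 GiB (OpenZFS module parameter, in bytes), so
# the cache leaves predictable headroom for application memory.
with open("/sys/module/zfs/parameters/zfs_arc_max", "w") as f:
    f.write(str(4 * GIB))

# With cgroup v2, writeback I/O can be throttled per cgroup: limit
# one container's writes on block device 8:16 to 10 MiB/s.
# (Assumes the io controller is enabled for this cgroup.)
with open("/sys/fs/cgroup/alloc-42/io.max", "w") as f:
    f.write("8:16 wbps=10485760")
```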
˲ Sidecars. Logging and monitoring agents, or service meshes such as Envoy,² can potentially use a considerable amount of resources, and this needs to be accounted for. For example, if a logging agent such as Fluentd³ is forwarding logs to a remote sink, then the network bandwidth for that process should be limited so that containers can get their expected share of network resources for application traffic. Fair sharing of such resources is difficult, so it is sometimes easier to run sidecars for every allocation on a node, rather than sharing them, so that their resources can be accounted for under the cgroup hierarchy of the allocation, as sketched below.
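One way to realize this per-allocation accounting under cgroup v2 is to place the sidecar in a child cgroup of the allocation. In the sketch below, the cgroup paths, names, limit, and PID are hypothetical:

```python
#!/usr/bin/env python3
"""Sketch: account a per-allocation sidecar under the allocation's
cgroup sub-tree (cgroup v2). Paths, names, limit, and PID are
hypothetical; real schedulers keep the allocation's own processes in
sibling leaf cgroups, per the v2 no-internal-process rule."""
import os

ALLOC = "/sys/fs/cgroup/alloc-42"          # the allocation's cgroup
SIDE = os.path.join(ALLOC, "log-shipper")  # leaf cgroup for the sidecar

os.makedirs(SIDE, exist_ok=True)

# Cap the sidecar so a misbehaving log shipper is charged to this
# allocation and contained within it, rather than hurting the node.
with open(os.path.join(SIDE, "memory.max"), "w") as f:
    f.write(str(256 * 1024 * 1024))        # 256 MiB

# Move the (already running) sidecar process into its leaf cgroup.
sidecar_pid = 4242                         # hypothetical PID
with open(os.path.join(SIDE, "cgroup.procs"), "w") as f:
    f.write(str(sidecar_pid))
```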
˲ Administration. Policies for system or component configurations—such as garbage collection—should be based on the units that the underlying resource understands. For example, log-retention policies based on a number of days are not effective on a node where storage is limited by a number of bytes: rotating logs every three days is useless if the available bytes are consumed within a matter of hours (see the sketch below). Systems administrators often apply the same types of policies that they write for clusterwide log-aggregation services to local nodes. This can have disastrous consequences at the cluster level, where services are designed to scale out horizontally and a workload might be spread across many nodes that have the same—or similar—hardware configuration.
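For example, with Python's standard logging module a retention policy can be expressed directly in the unit the node actually runs out of, bytes rather than days. The path and sizes here are illustrative:

```python
#!/usr/bin/env python3
"""Sketch: size-based rather than time-based log retention.
Path and sizes are illustrative."""
import logging
from logging.handlers import RotatingFileHandler

# Rotate whenever the file reaches 64 MiB and keep 4 old files, so
# this log can never consume more than ~320 MiB of node-local
# storage, no matter how bursty the workload is.
handler = RotatingFileHandler(
    "/var/log/myapp/app.log",
    maxBytes=64 * 1024 * 1024,
    backupCount=4,
)
logging.getLogger("myapp").addHandler(handler)
```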
These are some of the key elements to consider for capacity-planning purposes, but this is by no means an exhaustive set. Be sure to consider any environment-specific capacity bounds that might apply to your case, always basing your plan on real data about actual usage collected in the field.
The OOM (out-of-memory) killer in Linux steps in under extremely low memory conditions and kills processes to recover memory, based on a set of rules and heuristics. The decisions made by the OOM killer are driven by a so-called oom_score, which changes over time according to certain rules and is not deterministic in most situations.
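The score is visible, and can be biased, through /proc. The following sketch reads a process's current oom_score and lowers its oom_score_adj so that the kernel prefers other victims; the adjustment value is illustrative, and negative values require root or CAP_SYS_RESOURCE:

```python
#!/usr/bin/env python3
"""Sketch: inspect and bias OOM victim selection through /proc.
The adjustment value is illustrative; negative values require root
(or CAP_SYS_RESOURCE)."""
import os

pid = os.getpid()

# The kernel's current badness score for this process (read-only).
# Processes with higher scores are selected for killing first.
with open(f"/proc/{pid}/oom_score") as f:
    print("oom_score:", f.read().strip())

# oom_score_adj ranges from -1000 (never kill) to +1000 (kill first).
# A node agent might protect itself while leaving batch tasks exposed.
with open(f"/proc/{pid}/oom_score_adj", "w") as f:
    f.write("-500")
```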
The OOM killer is an important system component to keep in mind when designing schedulers that allow for oversubscription,¹ since oversubscription places more tasks on a machine than there are actual resources. In such situations, it is important to design a good QoS (quality-of-service) module that actively tracks the resource usage of tasks and kills them proactively. If tasks consume more memory than they are allocated, the scheduler should kill them before overall resource utilization forces an invocation of the system OOM killer. For example, a QoS module could implement its own strategy for releasing memory by listening for kernel notifications indicating memory pressure and then killing lower-priority tasks, preventing the kernel from invoking the OOM killer at all.
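A minimal sketch of such a loop, using the kernel's PSI (pressure stall information) interface available since Linux 4.20, might look as follows; the task table, priorities, and threshold are hypothetical:

```python
#!/usr/bin/env python3
"""Sketch of a proactive QoS loop: watch the kernel's memory-pressure
signal (PSI, /proc/pressure/memory, Linux >= 4.20) and evict the
lowest-priority task before the system OOM killer has to act.
The task table, priorities, and threshold are hypothetical."""
import os, signal, time

# Hypothetical bookkeeping: pid -> priority (lower = less important).
revocable_tasks = {4242: 0, 4243: 1}
AVG10_LIMIT = 10.0  # percent of time stalled on memory over the last 10s

def memory_pressure_avg10() -> float:
    """Parse the 'some avg10=...' field of /proc/pressure/memory."""
    with open("/proc/pressure/memory") as f:
        fields = f.readline().split()  # "some avg10=... avg60=... ..."
    return float(fields[1].split("=")[1])

while True:
    if memory_pressure_avg10() > AVG10_LIMIT and revocable_tasks:
        victim = min(revocable_tasks, key=revocable_tasks.get)
        os.kill(victim, signal.SIGKILL)  # evict before the OOM killer fires
        del revocable_tasks[victim]
    time.sleep(1)
```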
Having scheduler agents kill tasks allows for deterministic behavior and makes debugging and troubleshooting easier. For example, in the Mesos cluster manager, the Mesos agent runs a QoS controller that continuously monitors tasks running with revocable resources and kills them if they interfere with normal tasks.
Leaking container resources. Since its introduction to the Linux kernel a decade ago, container technology has improved immensely. It is, however, still an imperfect world, and tools built atop these foundations have added more complexity over time, opening the door to interesting and tricky-to-solve problems. One of the common runtime issues operators will encounter is the leaking of associated resources.