resources. For example, say you boot a
container in bridged networking mode; under the hood, a virtual Ethernet adapter is created. If the application crashes unexpectedly, rather than being killed by an external agent, the container daemon can leak these virtual interfaces over time. Once enough interfaces have accumulated, new applications attempting to boot on that machine fail, as they are unable to create virtual network adapters.
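As a concrete illustration, the following Python sketch flags such a leak by comparing the number of host-side veth interfaces against the number of running containers. The Docker-style veth naming and the one-interface-per-container assumption are simplifications for illustration, not a production check.

# Minimal sketch: flag leaked virtual Ethernet adapters on a Linux host.
# Assumes Docker-style "veth*" naming and one host-side veth per running
# container; both are simplifications for illustration.
import os
import subprocess

def count_veth_interfaces() -> int:
    # Count host-side virtual Ethernet adapters.
    return sum(1 for dev in os.listdir("/sys/class/net")
               if dev.startswith("veth"))

def count_running_containers() -> int:
    # Ask the container daemon how many containers are running.
    out = subprocess.run(["docker", "ps", "-q"],
                         capture_output=True, text=True, check=True)
    return len(out.stdout.split())

if __name__ == "__main__":
    leaked = count_veth_interfaces() - count_running_containers()
    if leaked > 0:
        print(f"possible leak: {leaked} orphaned veth interface(s)")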
Remediating these types of failures can be difficult. The resources being created and garbage collected must first be monitored over time to ensure that leaks are either kept to a minimum or effectively mitigated. Operators often find themselves writing agents that disable scheduling on a node until resources become available, making sure the node is not running under pressure, or that preemptively redistribute work before the issue manifests as an outage. Even when such automated mitigations are in place, it is best to surface these problems to operators, since they are usually the result of bugs in the underlying container runtime.
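A minimal sketch of such an agent follows, assuming a Kubernetes-style cluster where kubectl cordon disables scheduling on a node; the probe, threshold, and node name are all hypothetical stand-ins.

# Sketch of the kind of agent described above: watch a leaked-resource
# gauge and cordon the node before the leak causes boot failures.
import os
import subprocess
import time

LEAK_THRESHOLD = 50   # illustrative limit; tune per fleet
NODE = "node-1"       # hypothetical node name

def leaked_interface_count() -> int:
    # Placeholder probe: in practice, diff the interfaces created
    # against those garbage collected over time.
    return sum(1 for dev in os.listdir("/sys/class/net")
               if dev.startswith("veth"))

def alert(message: str) -> None:
    print(f"ALERT: {message}")   # stand-in for a real paging system

while True:
    leaked = leaked_interface_count()
    if leaked > LEAK_THRESHOLD:
        # Stop new placements, but still page a human: the root cause
        # is usually a bug in the underlying container runtime.
        subprocess.run(["kubectl", "cordon", NODE], check=True)
        alert(f"{NODE} cordoned: {leaked} leaked virtual interfaces")
    time.sleep(60)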
Oversubscription of attached resources. Schedulers usually choose placements or bin-packing strategies for tasks based on node-local resources such as CPU, memory, disk, and I/O-subsystem capacity. It is important, however, to consider the shared resources attached to a node, such as network storage or the aggregate link-layer bandwidth attached to the top-of-rack (ToR) switch, to ensure such resources are allocated within a reasonable limit or judiciously oversubscribed. Naive scheduler policies might undersubscribe node-local resources while oversubscribing aggregate resources such as bandwidth. In such situations, optimizing for cluster-level efficiency is better than local optimization strategies such as bin packing, as the sketch below illustrates.
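The following sketch makes the point with a hypothetical placement filter that checks rack-level uplink bandwidth in addition to node-local CPU and memory; the Node and Task types and the 40-Gbps uplink figure are assumptions for illustration.

# Illustrative placement filter: fit node-local resources, but also
# refuse placements that would oversubscribe the shared ToR uplink.
from dataclasses import dataclass

RACK_UPLINK_GBPS = 40.0   # assumed ToR uplink capacity

@dataclass
class Node:
    rack: str
    free_cpu: float
    free_mem_gb: float

@dataclass
class Task:
    cpu: float
    mem_gb: float
    net_gbps: float   # expected steady-state bandwidth

def fits(task: Task, node: Node, rack_alloc_gbps: dict) -> bool:
    node_ok = task.cpu <= node.free_cpu and task.mem_gb <= node.free_mem_gb
    # Aggregate check: bin packing on CPU and memory alone can pile
    # network-heavy tasks into one rack and saturate its uplink.
    rack_ok = (rack_alloc_gbps.get(node.rack, 0.0) + task.net_gbps
               <= RACK_UPLINK_GBPS)
    return node_ok and rack_ok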
Multitenancy is one of the most difficult challenges for performance engineers to solve in an elastic, shared infrastructure. A cluster shared by many different services with varying resource-usage patterns often exhibits so-called busy-neighbor problems: the performance of a service degrades because of the presence of other cotenant services. For example, on Linux, imposing QoS on the network can be complicated, so operators sometimes forgo traffic-shaping mechanisms for controlling the throughput and bandwidth of network I/O in containers. If two network I/O-intensive applications run on the same node, they will adversely affect each other's performance.
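For illustration, here is one way such shaping can be applied: a sketch that uses the Linux tc utility to cap egress on a container's host-side veth with an HTB qdisc. The device name and rate are assumptions, and the commands require root.

# Sketch: cap a container's egress bandwidth via tc (requires root).
import subprocess

def shape_egress(dev: str = "veth0", rate: str = "100mbit") -> None:
    # Root qdisc with a default class, then a single rate-limited class.
    subprocess.run(["tc", "qdisc", "add", "dev", dev,
                    "root", "handle", "1:", "htb", "default", "10"],
                   check=True)
    subprocess.run(["tc", "class", "add", "dev", dev, "parent", "1:",
                    "classid", "1:10", "htb", "rate", rate, "ceil", rate],
                   check=True)

if __name__ == "__main__":
    shape_egress()   # assumed device "veth0", 100-Mbit cap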
Other common problems with multitenancy include cgroup controllers not accounting for certain resources correctly, such as VFS IOPs: services that are heavily disk I/O-intensive suffer degraded performance when colocated with similar services. Work has been ongoing in this area for the past five to six years to design new cgroup controllers9 on Linux that do better accounting, but not all of these controllers have yet been put into production. When workloads use SIMD (single instruction, multiple data) instructions such as those from Intel's AVX-512 instruction set, processors throttle the CPU clock speed to reduce power consumption, thereby slowing workloads running non-SIMD instructions on the same CPU cores.6
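As an illustration of the newer accounting and limiting knobs, the following sketch caps a service's disk IOPs through the cgroup v2 io controller's io.max interface; the cgroup path and device major:minor numbers are assumptions.

# Sketch: cap read/write IOPs for one cgroup via the cgroup v2 io
# controller, so a disk-heavy service cannot starve its cotenants.
CGROUP = "/sys/fs/cgroup/myservice"   # hypothetical cgroup path
DEVICE = "8:0"                        # e.g., /dev/sda

def cap_iops(riops: int, wiops: int) -> None:
    # io.max takes lines of the form "MAJ:MIN riops=N wiops=N".
    with open(f"{CGROUP}/io.max", "w") as f:
        f.write(f"{DEVICE} riops={riops} wiops={wiops}\n")

cap_iops(1000, 1000)   # illustrative limits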
Fair sharing of resources is the most common approach offered by schedulers, and shares are often expressed as scalar values. Scalar values are easy to comprehend from an end-user perspective, but in practice they do not always work well because of interference.7 For example, if 100 units of IOPs are allocated to two workloads running on the same machine, the one doing sequential I/O may get far more throughput than the one performing random I/O.
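A back-of-the-envelope calculation makes the gap concrete; the request sizes here are illustrative assumptions.

# Equal IOPs allocations can yield wildly different throughput.
IOPS = 100
SEQ_IO_BYTES = 1 << 20    # 1-MiB sequential requests
RAND_IO_BYTES = 4 << 10   # 4-KiB random requests

seq_mb_s = IOPS * SEQ_IO_BYTES / 1e6
rand_mb_s = IOPS * RAND_IO_BYTES / 1e6
print(f"sequential: {seq_mb_s:.1f} MB/s, random: {rand_mb_s:.2f} MB/s")
# -> sequential: 104.9 MB/s, random: 0.41 MB/s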
Considerations for
Cluster-Level Failures
Most of the failures that wake up operators in the middle of the night affect entire clusters or racks of servers in a fleet. Cluster-level failures are usually triggered by bad configuration changes, bad software deployments, or, in some cases, cascading failures in certain services that result in resource contention in a multitenant environment. Most schedulers come with remediation steps