lating, only to cause larger problems
when the right failure mode occurs.
With failure induction, the need to model changes in the data, information flow, and deployment architecture in a test environment is minimized, leaving less opportunity to miss problems.
Before going further, let’s discuss
what is meant by resilience and how to
increase it.
Resilience is an attribute of a system that enables it to deal with failure
in a way that does not cause the entire
system to fail. It could involve minimizing the blast radius when a failure
occurs or changing the user experience to work around a failing component. For example, if a movie recommendation service fails, the user can
be presented with a nonpersonalized
list of popular titles. A complex system is constantly undergoing varying
degrees of failure. Resilience is the measure of how well it can recover from, or be insulated against, failure, both current and future.7
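To make that fallback concrete, here is a minimal sketch in Python of the degraded recommendation path just described. The client object, its method, and the precomputed list of popular titles are hypothetical stand-ins, not Netflix code.

POPULAR_TITLES = ["Title A", "Title B", "Title C"]   # precomputed, nonpersonalized fallback list

def get_recommendations(user_id, recommendation_client, timeout_seconds=0.2):
    """Return personalized titles, degrading to popular titles on failure."""
    try:
        # recommendation_client is a hypothetical stand-in for the real service client.
        return recommendation_client.get_personalized_titles(user_id, timeout=timeout_seconds)
    except Exception:
        # The recommendation service failed or timed out; keep the page working
        # by serving the nonpersonalized list instead of surfacing an error.
        return POPULAR_TITLES

The important property is that failure of the dependency changes only the content of the response, not the availability of the page.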
There are two ways of increasing the
resilience of a system:
˲ Build your application with redundancy and fault tolerance. In a service-oriented architecture, components
are encapsulated in services. Services
are made up of redundant execution
units (instances) that protect clients
from single- or multiple-unit failure.
When an entire service fails, clients of
that service must implement fault tolerance to localize the failure and continue to function.
˲ Reduce uncertainty by regularly inducing failure. Increasing the frequency of failure reduces its uncertainty
and the likelihood of an inappropriate
or unexpected response. Each unique
failure can be induced while observing the application. For each undesirable response to an induced failure,
the first approach can be applied to
prevent its recurrence. Although in
practice it is not feasible to induce
every possible failure, the exercise of
enumerating possible failures and
prioritizing them (sketched below) helps in understanding tolerable operating conditions and
classifying failures when they fall outside those bounds.
The first item is well covered in other literature. The remainder of this article will focus on the second.
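As a rough illustration of that enumeration-and-prioritization exercise, the sketch below ranks a handful of failure modes by a simple frequency-times-impact score. The failure modes and numbers are invented for illustration, not measured data.

failure_modes = [
    # (description, assumed occurrences per year, assumed impact on a 1-5 scale)
    ("single instance terminated",       200, 1),
    ("dependency latency spike",          25, 3),
    ("entire service unavailable",         4, 4),
    ("network partition between zones",    2, 5),
]

# Rank by an expected-impact score to decide which failures are worth
# inducing first and which only need to be recognized and classified.
for description, per_year, impact in sorted(
        failure_modes, key=lambda f: f[1] * f[2], reverse=True):
    print(f"{description}: score {per_year * impact}")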
The Simian Army
Once you have accepted the idea of
inducing failure regularly, there are
a few choices on how to proceed. One
option is GameDays,1 a set of scheduled exercises where failure is manually introduced or simulated to mirror
real-world failure with the goal of both
identifying the results and practicing
the response—a fire drill of sorts. Used
by the likes of Amazon and Google,
GameDays are a great way to induce
failure on a regular basis, validate assumptions about system behavior, and
improve organizational response.
But what if you want a solution that
is more scalable and automated—one
that does not run once per quarter but
rather once per week or even per day?
You do not want failure to be a fire drill.
You want it to be a nonevent—
something that happens all the time in the
background so that when a real failure
occurs, it will simply blend in without
any impact.
One way of achieving this is to engineer failure to occur in the live environment. This is how the idea for
“monkeys” (autonomous agents really,
but monkeys inspire the imagination)
came to Netflix to wreak havoc and induce failure. Later the monkeys were
assembled together and labeled the
Simian Army.5 A description of each
resilience-related monkey follows.
Chaos Monkey. The failure of a virtual instance is the most common type
of failure encountered in a typical public cloud environment. It can be caused
by a power outage in the hosting rack, a
disk failure, or a network partition that
cuts off access. Regardless of the cause,
the result is the same: the instance becomes unavailable. Inducing such failures helps ensure services do not rely
on any on-instance state, instance affinity, or persistent connections.
To address this need, Netflix created
its first monkey: Chaos Monkey, which
randomly terminates virtual instances
in a production environment—
instances that are serving live customer traffic.3
Chaos Monkey starts by looking
into a service registry to find all the
services that are running. In Netflix’s
case, this is done through a combination of Asgard6 and Edda.2 Each service
can override the default Chaos Monkey
configuration to change termination
probability or opt out entirely. Each
hour, Chaos Monkey wakes up, rolls
the dice, and terminates the affected
instances using Amazon Web Services
(AWS) APIs.
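The loop just described can be sketched in a few lines of Python. This is not the actual Chaos Monkey implementation; the service registry and per-service configuration objects are simplified stand-ins for what Asgard and Edda provide, and boto3 is used for the AWS termination call.

import random
import time

import boto3

def run_once(registry, ec2_client):
    """One pass: roll the dice for each registered service and terminate the losers."""
    for service in registry.list_services():        # hypothetical registry, e.g. fed by Edda
        config = service.chaos_config                # per-service override or the defaults
        if config.opted_out:
            continue
        for instance_id in service.instance_ids:
            if random.random() < config.termination_probability:
                ec2_client.terminate_instances(InstanceIds=[instance_id])

def main(registry):
    ec2_client = boto3.client("ec2")
    while True:
        run_once(registry, ec2_client)
        time.sleep(3600)                             # wake up each hour, as described above

The essential behavior is that terminations are routine, randomized, and scoped by per-service configuration rather than scheduled as special events.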