Latency Monkey. Once Chaos Monkey is running and individual instance
failure no longer has any impact, a new
class of failures emerges. Dealing with
instance failure is relatively easy: just
terminate the bad instances and let
new healthy instances take their places. Detecting when instances become
unhealthy, but are still working, is
more difficult, and having resilience to
this failure mode is harder still. Error
rates could become elevated, but the
service could occasionally return success. The service could reply with successful responses, but latency could
increase, causing timeouts.
What Netflix needed was a way of
inducing failure that simulated partially healthy instances. Hence came
the genesis of Latency Monkey, which
induces artificial delays in the RESTful client-server communication layer
to simulate service degradation and
measures whether upstream services respond appropriately. In addition, by creating very large delays, the downtime of a node, or even of an entire service, can be simulated without physically bringing instances or services down. This can be particularly useful
when testing the fault tolerance of a
new service by simulating the failure
of its dependencies, without making
these dependencies unavailable to the
rest of the system.
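As a rough sketch of the idea (and not Netflix's actual implementation), the following Java example wraps an HTTP client so that every outbound call is delayed by a random amount. A caller that depends on this client must rely on its own timeouts and fallbacks, which is exactly the behavior Latency Monkey is designed to exercise; the class name, delay bounds, and URL here are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Illustrative only: wraps an HTTP client so every outbound call is
 * delayed by a random amount, simulating a degraded dependency.
 * The class and parameter names are hypothetical, not Netflix's code.
 */
public class LatencyInjectingClient {
    private final HttpClient delegate = HttpClient.newHttpClient();
    private final long minDelayMs;
    private final long maxDelayMs;

    public LatencyInjectingClient(long minDelayMs, long maxDelayMs) {
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    public HttpResponse<String> get(String url) throws IOException, InterruptedException {
        // Inject an artificial delay before the real call; a very large
        // delay approximates the dependency being down without stopping it.
        long delay = ThreadLocalRandom.current().nextLong(minDelayMs, maxDelayMs + 1);
        Thread.sleep(delay);

        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(2)) // the caller's own timeout still governs the real network call
                .build();
        return delegate.send(request, HttpResponse.BodyHandlers.ofString());
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: 500ms to 3s of added latency on every call.
        LatencyInjectingClient client = new LatencyInjectingClient(500, 3000);
        System.out.println(client.get("https://example.com").statusCode());
    }
}
```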
The remaining army. The rest of
the Simian Army, including Janitor
Monkey, takes care of upkeep and
other miscellaneous tasks not directly
related to availability. (For details, see http://techblog.netflix.com/2011/07/netflix-simian-army.html.)
Monkey Training at Netflix
While the Simian Army is a novel concept and may require a shift in perspective, it is not as difficult to implement
as it initially appears. Understanding
what Netflix went through is illustrative for others interested in following
such a path.
Netflix is known for being bold in
its rapid pursuit of innovation and
high availability, but not to the point
of callousness. It is careful to avoid any
noticeable impact on customers from
these failure-induction exercises. To
minimize risk, Netflix takes the following steps when introducing a monkey:
1. With the new monkey in the test
environment, engineers observe the
user experience. The goal is to have
negligible or zero impact on the customer. If the engineers see any adverse
results, then they make the necessary
code changes to prevent recurrence.
This step is repeated as many times as
necessary until no adverse user experience is observed.
2. Once no adverse results are observed in the test environment, the
new monkey is enabled in the production environment. Initially, the
new monkey is run in opt-in mode.
One or more services are selected as targets for the new monkey, which has already been vetted in the test environment. The new monkey runs for a few months in this mode, opting in new services over time.
3. After many services have opted in,
the new monkey graduates to opt-out
mode, in which all services are potential targets for the new monkey. If a service is placed on the opt-out list, the monkey avoids it (a minimal sketch of this target-selection logic appears after this list).
4. The opt-out list is periodically
reviewed for each monkey, and service owners are encouraged to remove
their services from the list. The platform and monkey are improved to increase adoption and address reasons
for opting out.
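The opt-in and opt-out modes in steps 2 and 3 boil down to a simple targeting rule. The following sketch shows one plausible way to express it; the class, mode names, and service lists are hypothetical and are not the Simian Army's actual configuration.

```java
import java.util.List;
import java.util.Set;

/**
 * Illustrative only: how a monkey might choose its targets in opt-in
 * versus opt-out mode. Names (Mode, isEligible, optInList, optOutList)
 * are hypothetical, not the Simian Army's actual configuration.
 */
public class MonkeyTargeting {
    enum Mode { OPT_IN, OPT_OUT }

    private final Mode mode;
    private final Set<String> optInList;   // services that asked to participate
    private final Set<String> optOutList;  // services that asked to be skipped

    public MonkeyTargeting(Mode mode, Set<String> optInList, Set<String> optOutList) {
        this.mode = mode;
        this.optInList = optInList;
        this.optOutList = optOutList;
    }

    /** A service is a target only if the current mode allows it. */
    public boolean isEligible(String service) {
        return switch (mode) {
            case OPT_IN  -> optInList.contains(service);    // step 2: only volunteers
            case OPT_OUT -> !optOutList.contains(service);  // step 3: everyone except the opt-out list
        };
    }

    public static void main(String[] args) {
        MonkeyTargeting targeting = new MonkeyTargeting(
                Mode.OPT_OUT, Set.of(), Set.of("billing"));
        for (String svc : List.of("api", "billing", "playback")) {
            System.out.println(svc + " eligible: " + targeting.isEligible(svc));
        }
    }
}
```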
The Importance of Observability
No discussion of resilience would be
complete without highlighting the important role of monitoring. Monitoring
here means the ability to observe and,
optionally, signal an alarm on the external and internal states of the system
and its components. In the context of
failure induction and resilience, monitoring is important for two reasons:
• During a real, nonsimulated customer-impacting event, it is important
to stabilize the system and eliminate
customer impact as quickly as possible. Any automation that causes additional failure must be stopped during
this time. Failing to do so can cause
Chaos Monkey, Latency Monkey, and
the other simians to further weaken
an already unhealthy system, causing
even greater adverse end-user impact.
The ability to observe and detect customer-impacting service degradation
is an important prerequisite to building and enabling automation that
causes failure.
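As an illustration of that prerequisite, the following sketch shows one way such a safeguard might look: before acting, the monkey consults the monitoring system and stands down whenever a customer-impacting alarm is active. The guard class and alarm check are hypothetical, not Netflix's implementation.

```java
import java.util.function.BooleanSupplier;

/**
 * Illustrative only: a guard that keeps failure-inducing automation
 * from running while a customer-impacting alarm is active. The alarm
 * check is supplied by the monitoring system; everything here is a
 * hypothetical sketch, not Netflix's implementation.
 */
public class ChaosGuard {
    private final BooleanSupplier customerImpactAlarmActive;

    public ChaosGuard(BooleanSupplier customerImpactAlarmActive) {
        this.customerImpactAlarmActive = customerImpactAlarmActive;
    }

    /** Run the failure-inducing action only when the system is healthy. */
    public void runIfSafe(Runnable failureInducingAction) {
        if (customerImpactAlarmActive.getAsBoolean()) {
            // During a real outage, adding more failure would deepen it,
            // so the monkey stands down until the alarm clears.
            System.out.println("Alarm active; skipping failure injection.");
            return;
        }
        failureInducingAction.run();
    }

    public static void main(String[] args) {
        // Hypothetical monitoring hook: pretend no alarm is firing.
        ChaosGuard guard = new ChaosGuard(() -> false);
        guard.runIfSafe(() -> System.out.println("Terminating one instance..."));
    }
}
```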