DOI: 10.1145/2347736.2347751
Article development led by queue.acm.org
Making the case for resilience testing.
By John Allspaw
Fault Injection in Production
When we build Web infrastructures at Etsy, we aim to make them resilient. This means designing them carefully so they can sustain their (increasingly critical) operations in the face of failure. Thankfully, there have been a couple of decades and reams of paper spent on researching how fault tolerance and graceful degradation can be brought to computer systems. That helps the cause.
To make sure the resilience built into Etsy systems is sound and that the systems behave as expected, we have to see the failures being tolerated in production.

Why production? Why not simulate this in a QA or staging environment? The reason is that any differences in those environments bring uncertainty to the exercise; and because the risk of not recovering there carries no consequences, unforeseen assumptions can creep into the fault-tolerance design and into recovery. The goal is to reduce uncertainty, not increase it.
Forcing failures to happen, or even designing systems to fail on their own, generally is not easily sold to management. Engineers are not conditioned to embrace their ability to respond to emergency situations; they aim to avoid them altogether. Taking a detailed look at how to respond better to failure is essentially accepting that failure will happen, which you might think is counter to what you want in engineering, or in business.
Take, for example, what you would normally think of as a simple case: the provisioning of a server or cloud instance from zero to production:

1. Bare metal (or a cloud-compute instance) is made available.
2. The base operating system is installed via PXE (preboot execution environment) or a machine image.
3. Operating-system-level configurations are put into place (via configuration management or machine image).
4. Application-level configurations are put into place (via configuration management, app deployment, or machine image).
5. Application code is put into place and underlying services are started correctly (via configuration management, app deployment, or machine image).
6. Systems integration takes place in the network (load balancers, VLANs, routing, switching, DNS, among others).
This is probably an oversimplification, and each step or layer is likely to represent a multitude of CPU cycles; disk, network, and/or memory operations; and various software mechanisms. All of these come together to bring a node into production.
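To make the shape of that flow concrete, here is a minimal sketch of the steps above as an ordered pipeline with a verification gate after each one. Every step and helper name is a hypothetical placeholder, not Etsy's actual tooling; a real pipeline would call into PXE tooling, a configuration-management system, and a deployment tool.

from typing import Callable, List, Tuple

def install_base_os() -> bool:
    # Placeholder: kick off a PXE or machine-image install and report success.
    return True

def apply_os_config() -> bool:
    # Placeholder: run configuration management for OS-level settings.
    return True

def apply_app_config() -> bool:
    # Placeholder: push application-level configuration.
    return True

def deploy_application() -> bool:
    # Placeholder: deploy application code and start underlying services.
    return True

def integrate_into_network() -> bool:
    # Placeholder: add the node to load-balancer pools, DNS, VLANs, routing.
    return True

PROVISIONING_STEPS: List[Tuple[str, Callable[[], bool]]] = [
    ("install base OS (PXE or image)", install_base_os),
    ("apply OS-level configuration", apply_os_config),
    ("apply application-level config", apply_app_config),
    ("deploy application code", deploy_application),
    ("integrate into the network", integrate_into_network),
]

def provision(hostname: str) -> bool:
    # Run each step in order; a node that fails any step stays out of production.
    for description, step in PROVISIONING_STEPS:
        print(f"[{hostname}] {description}...")
        if not step():
            print(f"[{hostname}] step failed: {description}; keeping node out of rotation")
            return False
    print(f"[{hostname}] provisioned; ready to join the cluster and serve traffic")
    return True

if __name__ == "__main__":
    provision("web0042.example.com")

The gate after each step is the point: confidence that a node can take live traffic comes from verifying every layer beneath it rather than assuming those layers came up as expected.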
Operability means that you can have confidence in this node coming into production, possibly joining a cluster, and serving live traffic seamlessly every time it happens. Furthermore, you want and expect to have confidence that if the underlying power, configuration, application, or compute resources (CPU, disk, memory, network, and so on) experience a fault, then you can survive such a fault by some means: allowing the application to degrade gracefully, rebuild itself, take itself out of production, and alert on the specifics of the fault.
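As an illustration of that behavior, the following is a minimal sketch, under assumed interfaces, of a node-local health check that detects a fault in an underlying resource, reports its specifics, and marks the node unhealthy so a load balancer can remove it from rotation. The thresholds, the hostname db.internal, and the alerting action are assumptions made for the sketch, not Etsy's implementation.

import shutil
import socket
from typing import List, Optional

def check_disk(path: str = "/", min_free_bytes: int = 1 << 30) -> Optional[str]:
    # Return a fault description if free disk space falls below the threshold.
    usage = shutil.disk_usage(path)
    if usage.free < min_free_bytes:
        return f"low disk on {path}: only {usage.free} bytes free"
    return None

def check_dependency(host: str = "db.internal", port: int = 3306) -> Optional[str]:
    # Return a fault description if a backing service is unreachable.
    try:
        with socket.create_connection((host, port), timeout=1):
            return None
    except OSError as exc:
        return f"cannot reach {host}:{port}: {exc}"

def health_check() -> List[str]:
    # Collect the specifics of every detected fault.
    return [fault for fault in (check_disk(), check_dependency()) if fault]

def handle_faults(faults: List[str]) -> None:
    # Hypothetical actions: a real system would flip its load-balancer health
    # endpoint to "unhealthy" and page through its alerting pipeline.
    for fault in faults:
        print(f"ALERT: {fault}")
    if faults:
        print("marking node unhealthy; the load balancer will take it out of rotation")

if __name__ == "__main__":
    handle_faults(health_check())

Seeing a path like this exercised against real faults, rather than assuming it works, is exactly what injecting faults in production is meant to confirm.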
Building this confidence typically comes in a number of ways: