times referred to as GameDay. The
goal is to make these faults happen in
production in order to anticipate similar behaviors in the future, understand
the effects of failures on the underlying
systems, and ultimately gain insight
into the risks they pose to the business.
Causing failures to happen in complex systems is not a new concept. Organizations such as fire departments
have been running full-scale disaster
drills for decades. Web engineering
has an advantage over these types of
drills in that the systems engineers can
gather a massive amount of detail on
any fault at an extremely high resolution, wield a very large amount of control over the intricate mechanisms of
failures, and learn how to recover very
quickly from them.
Constructing a GameDay exercise at
Etsy follows this pattern:
1. Imagine a possible untoward
event in your infrastructure.
2. Figure out what is needed to prevent that event from affecting your
business, and implement that.
3. Cause the event to happen in production, ultimately to prove the non-effect of the event and gain confidence.
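The three steps above can be sketched as a small harness. This is only an illustrative sketch, not Etsy's tooling: the names GameDayScenario, run_game_day, and the toy servers dictionary are hypothetical stand-ins for real infrastructure and real business-level checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GameDayScenario:
    """One exercise: a fault to inject and the business check it must not break."""
    name: str
    inject_fault: Callable[[], None]      # step 3: cause the event
    restore: Callable[[], None]           # undo the fault afterwards
    business_healthy: Callable[[], bool]  # step 2's goal: the business is unaffected


def run_game_day(scenario: GameDayScenario) -> bool:
    """Inject the fault, observe the business-level check, and always restore."""
    if not scenario.business_healthy():
        # Do not inject faults into a system that is already degraded.
        raise RuntimeError(f"{scenario.name}: system unhealthy before injection")
    scenario.inject_fault()
    try:
        return scenario.business_healthy()
    finally:
        scenario.restore()


# Hypothetical toy stand-in for real infrastructure: two app servers in a pool.
servers = {"app1": True, "app2": True}

scenario = GameDayScenario(
    name="one app server dies",
    inject_fault=lambda: servers.update(app1=False),
    restore=lambda: servers.update(app1=True),
    business_healthy=lambda: any(servers.values()),
)

survived = run_game_day(scenario)
print(f"{scenario.name}: business unaffected = {survived}")
```

The design choice worth noting is that the check is phrased in business terms (can the site still serve?) rather than in component terms, and that the fault is always reverted, even if the check itself raises.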
The greatest advantage of a GameDay exercise is figuring out how to
prevent a failure from affecting the
business. It is difficult to overstate the
importance of steps 1 and 2. The idea
is to get a group of engineers together
to brainstorm the various failure scenarios that a particular application,
service, or infrastructure could experience. This will help remove complacency about the safety of the overall
system. Complacency is an enemy of
resilience. If a system has a period of
little or no degradation, then there is
a real risk of it drifting toward failure
on multiple levels, because engineers
can be convinced—falsely—that the
system is experiencing no surprising
events because it is inherently safe.
Imagining failure scenarios and
asking, “What if…?” can help combat
this thinking and bring a constant
sense of unease to the organization.
This is a hallmark characteristic of
high-reliability organizations. Think of
it as continuously deploying a business
continuity plan (BCP).
In theory, the idea of GameDay exercises may seem sound: you make
an explicit effort to anticipate failure
scenarios, prepare for handling them
gracefully, and then confirm this behavior by purposely injecting those
failures into production. In practice,
this idea may not seem appealing to
the business: it brings risk to the forefront; and without context, the concept
of making failures happen on purpose
may seem crazy. What if something goes wrong?
The traditional view of failure in
production is avoidance at all costs.
The assumption is that failure is entirely preventable, and if it does happen,
then find the persons responsible
(usually those most proximate to the code
or systems) and fire them, in the belief
that getting rid of “bad apples” is how
you bring safety to an organization.
This perspective is, of course, ludicrous. Fault injection and GameDay
scenarios can turn this view into a
more pragmatic and realistic one.
When approaching Etsy’s executive
team with the idea of GameDay exercises, I explained that it is not that we want
to cause failures out of some perverse
need to watch infrastructure crumble;
it is because we know that parts of the
system will inevitably fail, and we need
to gain confidence that the system is
resilient enough to handle it gracefully.
The concept, I explained to the executives, is that building resilient systems requires experience with failure,
and that we want to anticipate and
confirm our expectations surrounding
failure more often, not less often. Shying away from the effects of failure in
a misguided attempt to reduce risk will
result in poor designs, stale recovery
skills, and a false sense of safety.
In other words, it is better to prepare
for and cause failures to happen in production while we are watching, instead
of relying on a strategy of hoping the
system will behave correctly when we
are not watching. The worst-case scenario with a GameDay exercise is that
something will go wrong during the
exercise. In that case, an entire team
of engineers is ready to respond to the
surprises, and the system will become
stronger as a result.
The worst-case scenario in the absence
of a GameDay exercise is that
something in production will fail that
was not anticipated or prepared for,
and it will happen when the team is not
expecting or watching closely for it.
Case: Payments System
Earlier this year Etsy rolled out a new
payment system (http://www.etsy.com/checkout/)
to provide more flexibility
and reliability for buyers and sellers
on the site. Obviously, resilience was
of paramount importance to the success of the project. As with many Etsy
features, the rollout to production was
done in a gradual ramp-up. Sellers interested in allowing this new payment
method could opt in, and Etsy would
turn the functionality on for buckets of
sellers at a time.
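A gradual ramp-up of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not Etsy's actual mechanism: the bucket count, the hashing scheme, and the names bucket_for and payments_enabled are all hypothetical.

```python
import hashlib

NUM_BUCKETS = 100  # assumed bucket count, purely illustrative


def bucket_for(seller_id: str) -> int:
    """Map a seller to a stable bucket in [0, NUM_BUCKETS)."""
    digest = hashlib.sha256(seller_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def payments_enabled(seller_id: str, opted_in: bool, enabled_buckets: int) -> bool:
    """A seller gets the feature only if opted in and in an enabled bucket."""
    return opted_in and bucket_for(seller_id) < enabled_buckets


# Raising enabled_buckets widens the rollout without removing anyone already in it.
sellers = [f"seller-{n}" for n in range(1000)]
for enabled in (0, 10, 50, 100):
    count = sum(payments_enabled(s, True, enabled) for s in sellers)
    print(f"{enabled:3d} buckets enabled -> {count} of {len(sellers)} sellers")
```

Because the bucket is derived from a hash of the seller identifier, a given seller always lands in the same bucket, so increasing the number of enabled buckets only ever adds sellers to the rollout.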
As you might imagine, the payment
system is not particularly simple. It has
fraud-detection components, audit
trails, security mechanisms, processing-state machines, and other components that need to interact with each
other. Thus, Etsy has a mission-critical
system with a significant amount of
complexity and very high expectations
for resilience.
To confirm its ability to withstand
failures gracefully, Etsy put together a
list of reasonable scenarios to prepare
for, develop against, and test in production, including the following:
˲ One of the app servers dies (power
cable yanked out);
˲ All of the app servers leave the
˲ One of the app servers gets wiped
clean and needs to be fully rebuilt from
˲ Database dies (power cable yanked