cannot be viewed as comprehensive enough to give perfect coverage
of the system’s safety. While any increase in confidence in the system’s resilience is positive,
it is still just that: an increase, not the
attainment of perfect confidence.
Any complex system can (and will)
fail in surprising ways, no matter how
many different types of faults you inject and recover from.
Some have suggested that continually introducing failures automatically
is a more efficient way to gain confidence in the adaptability of the system
than manually running GameDay exercises as an engineering-team event.
Both approaches have the same limitation mentioned here, in that they
result in an increase in confidence but
cannot be used to achieve sufficient
Automated fault injection can
carry with it a paradox. If the faults
that are injected (even at random) are
handled in a transparent and graceful
way, then they can go unnoticed. You
would think this was the goal: for failures not to matter whatsoever when
they occur. This masking of failures,
however, can result in the very complacency that fault injection is intended (or at least should be
intended) to decrease. In other words,
when you have randomly generated
and/or continual fault injection and
recovery happening successfully, care
must be taken to raise the detailed
awareness that this is happening—
when, how, where. Otherwise, the
failures themselves become another
component that increases complexity
in the system while still having limitations to their functionality (because
they are still contrived and therefore
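One way to keep gracefully handled faults from going unnoticed is to record every injection, whether or not recovery succeeds. The following is a minimal sketch under stated assumptions, not a description of any particular tool: the names `FaultInjector`, `FaultRecord`, and `audit_log` are hypothetical, and the "fault" is simulated simply by skipping the primary code path and calling a fallback instead.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable, List, TypeVar

T = TypeVar("T")


@dataclass
class FaultRecord:
    """One injected fault: when, where, and how it happened (hypothetical schema)."""
    when: float        # Unix timestamp of the injection
    where: str         # name of the component the fault was injected into
    how: str           # description of the kind of fault
    recovered: bool    # True once the fallback has completed successfully


@dataclass
class FaultInjector:
    """Randomly replaces a primary call with its fallback and logs every injection."""
    rate: float                                    # probability of injecting a fault
    rng: random.Random = field(default_factory=random.Random)
    audit_log: List[FaultRecord] = field(default_factory=list)

    def call(self, where: str, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.rng.random() >= self.rate:
            return primary()                       # no fault injected this time
        # Record the injection before attempting recovery, so that even a
        # transparent, successful recovery leaves a visible trace.
        record = FaultRecord(
            when=time.time(),
            where=where,
            how="simulated outage of primary path",
            recovered=False,
        )
        self.audit_log.append(record)
        result = fallback()                        # if this raises, recovered stays False
        record.recovered = True
        return result


if __name__ == "__main__":
    injector = FaultInjector(rate=1.0)             # always inject, for demonstration
    value = injector.call("profile-cache", lambda: "fresh", lambda: "stale")
    for rec in injector.audit_log:
        print(rec.where, rec.how, "recovered" if rec.recovered else "NOT recovered")
```

The design choice worth noting is that the log entry is written before the fallback runs. A recovery that works perfectly still produces a record that can be reviewed, which is the point: success should not make the injection invisible.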
A lot of what I am proposing should
simply be an extension of the confidence-building tools that organizations already have. Automated quality
assurance, fault tolerance, redundancy, and A/B testing are all in the same
category as GameDay scenarios, although likely with less drama.
Should everything have an associated GameDay exercise? Maybe, or maybe not, depending on the level of confidence you have in the components,
interactions, and levels of complexity
found in your application and infrastructure. Even if your business does
not think that GameDay exercises are
warranted, however, they ought to have
a place in your engineering toolkit.
Why would you introduce faults into
an otherwise well-behaved production
system? Why would that be useful?
First, these failure-inducing exercises can serve as “vaccines” to improve the safety of a system: a small
amount of failure injected to help the
system learn to recover. They also keep a
concern for failure alive in the culture of
engineering teams, and they keep complacency at bay.
They gather groups of people who
might not normally get together to
share in experiencing failures and to
build fault tolerance. They can also help
bring the concept of operability in
production closer to developers who
might not be used to it.
At a high level, production fault injection should be considered one of
many approaches used to gain confidence in the safety and resilience of a
system. Similar to unit testing, functional testing, and code review, this approach is limited as to which surprising events it can prevent, but it also has
benefits, many of which are cultural.
We certainly cannot imagine working
John Allspaw (firstname.lastname@example.org) is senior vice
president of tech operations at Etsy. He has worked in
systems operations for more than 14 years in biotech,
government, and online media. He built the backing
infrastructures at Salon, InfoWorld, Friendster, and Flickr.