DOI: 10.1145/2492007.2492022
Article development led by queue.acm.org
Embracing failure to improve resilience and maximize availability.

By Ariel Tseitlin

The Antifragile Organization

Failure is inevitable. Disks fail. Software bugs lie dormant, waiting for just the right conditions to bite.
People make mistakes. Data centers are built on
farms of unreliable commodity hardware. If you are
running in a cloud environment, then many of these
factors are outside of your control. To compound
the problem, failure is not predictable and does not
occur with uniform probability and frequency. The
lack of a uniform frequency increases uncertainty and
risk in the system. In the face of such inevitable and
unpredictable failure, how can you build a reliable
service that provides the high level of availability your
users can depend on?
A naive approach could attempt to prove the
correctness of a system through rigorous analysis.
It could model all different types of failures and
deduce the proper workings of the system through
a simulation or another theoretical framework that
emulates or analyzes the real operating environment.
Unfortunately, the state of the art of static analysis and testing in the industry has not reached those capabilities.4
A different approach could attempt to create exhaustive test suites
to simulate all failure modes in a separate test environment. The goal of
each test suite would be to validate the proper functioning of each component, as well as of the entire system, when individual components fail.
Most software systems use this approach in one form or another, with a
combination of unit and integration
tests. More advanced usage includes
measuring the coverage surface of
tests to indicate completeness.
While this approach does improve the quality of the system and can prevent a large class of failures, it is insufficient for maintaining resilience in a large-scale distributed system. A distributed system must address the challenges posed by data and information flow. The complexity of designing and executing tests that properly capture the behavior of the target system is greater than that of building the system itself. Layer on top of that the attribute of large scale, and it becomes infeasible, with current means, to achieve this in practice while maintaining a high velocity of innovation and feature delivery.
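
To make the idea concrete, here is a minimal, hypothetical sketch in Python of what one such failure-mode test might look like. The RecommendationService, its ratings dependency, and the cached fallback are invented for illustration; they are not part of any system described in this article.

import unittest
from unittest import mock


class DependencyUnavailable(Exception):
    """Raised when a downstream dependency cannot be reached."""


class RecommendationService:
    """Toy service that degrades to a cached default when its dependency fails."""

    def __init__(self, ratings_client, fallback=("popular-1", "popular-2")):
        self.ratings_client = ratings_client
        self.fallback = list(fallback)

    def recommend(self, user_id):
        try:
            return self.ratings_client.top_rated(user_id)
        except DependencyUnavailable:
            # Degrade gracefully instead of propagating the failure.
            return self.fallback


class RecommendationFallbackTest(unittest.TestCase):
    def test_falls_back_when_dependency_fails(self):
        # Simulate the failure mode: the ratings dependency is down,
        # so every call to it raises an exception.
        ratings_client = mock.Mock()
        ratings_client.top_rated.side_effect = DependencyUnavailable()

        service = RecommendationService(ratings_client)

        # The component should keep functioning, serving cached defaults.
        self.assertEqual(service.recommend("user-42"), ["popular-1", "popular-2"])


if __name__ == "__main__":
    unittest.main()

Even a large suite of such tests, however, exercises only the failure modes its authors thought to simulate, which is exactly the limitation the next approach addresses.
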
Yet another approach, advocated
in this article, is to induce failures in
the system to empirically demonstrate
resilience and validate intended behavior. Given that the system was designed to be resilient to failures, inducing those failures, within the original design parameters, validates that the system behaves as expected. Because this approach uses the actual live system, any
resilience gaps that emerge are identified and caught quickly as the system
evolves and changes. In the second
approach just described, many complex issues are not caught in the test
environment and manifest themselves
in unique and infrequent ways only
in the live environment. This, in turn,
increases the likelihood of latent bugs
remaining undiscovered and accumu-