Abstracting the Geniuses Away from Failure Testing

BY PETER ALVARO AND SEVERINE TYMON

DOI: 10.1145/3152483
Article development led by queue.acm.org
The heterogeneity, complexity, and scale of cloud applications make verification of their fault tolerance properties challenging. Companies are moving away from formal methods and toward large-scale testing in which components are deliberately compromised to identify weaknesses in the software. For example, techniques such as Jepsen apply fault-injection testing to distributed data stores, and Chaos Engineering performs fault-injection experiments on production systems, often on live traffic. Both approaches have captured the attention of industry and academia alike.
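To make the shape of such testing concrete, the following is a minimal sketch of a single fault-injection experiment: drive a workload, deliberately compromise a component mid-run, and then check whether the recorded history still satisfies the system's correctness contract. The cluster harness, workload, and checker objects here are hypothetical placeholders, not the API of Jepsen or any chaos tool.

```python
# A sketch of one fault-injection experiment. The cluster, workload,
# and checker abstractions are illustrative placeholders.
import random

def run_experiment(cluster, workload, checker, fault_rate=0.1):
    """Drive a workload, inject faults mid-run, and check whether the
    recorded history satisfies the system's correctness contract
    (e.g., linearizability)."""
    cluster.start()
    history = []
    for op in workload:
        # Record every invocation and response for later analysis.
        history.append(cluster.client().execute(op))
        # Deliberately compromise a component: here, partition a
        # randomly chosen node away from its peers.
        if random.random() < fault_rate:
            cluster.partition(random.choice(cluster.nodes))
    cluster.heal()  # lift all injected faults before checking
    return checker.check(history)
```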
Unfortunately, the search space of distinct fault combinations that an infrastructure can test is intractably large. Existing failure-testing solutions require skilled and intelligent users who can supply the faults to inject. These superusers, known as Chaos Engineers and Jepsen experts, must study the systems under test, observe system executions, and then formulate hypotheses about which faults are most likely to expose real system-design flaws. This approach is fundamentally unscalable and unprincipled. It relies on the superuser's ability to interpret how a distributed system employs redundancy to mask or ameliorate faults and, moreover, the ability to recognize the insufficiencies in those redundancies—in other words, human genius.
This article presents a call to arms for the distributed systems research community to improve the state of the art in fault-tolerance testing. Ordinary users need tools that automate the selection of custom-tailored faults to inject. We conjecture that the process by which superusers select experiments—observing executions, constructing models of system redundancy, and identifying weaknesses in the models—can be effectively modeled in software. The article describes a prototype validating this conjecture, presents early results from the lab and the field, and identifies new research directions that can make this vision a reality.
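As a hint of what such automation might look like, consider a minimal, illustrative sketch loosely in the spirit of lineage-driven approaches: given the redundant "supports" (sets of components) observed to produce a good outcome, enumerate the minimal fault sets that disable at least one component in every support. The representation and helper below are hypothetical simplifications, not the prototype's actual implementation.

```python
# Turn an observed model of redundancy into fault-injection
# hypotheses. Illustrative and simplified, not a real tool's API.
from itertools import combinations

def candidate_faults(supports, max_faults=3):
    """Given the redundant 'supports' (sets of components observed to
    produce a good outcome), return the minimal fault sets that hit
    every support -- exactly the experiments that could plausibly
    falsify the outcome."""
    components = sorted(set().union(*supports))
    found = []
    for k in range(1, max_faults + 1):
        for faults in combinations(components, k):
            fault_set = set(faults)
            # Skip non-minimal hypotheses already covered by a smaller one.
            if any(prev <= fault_set for prev in found):
                continue
            # To threaten the outcome, a hypothesis must intersect
            # every observed support.
            if all(fault_set & support for support in supports):
                found.append(fault_set)
    return found

# Example: a write succeeds if either replica chain {A, B} or {A, C}
# survives. The model suggests two experiments: crash A alone, or
# crash B and C together.
print(candidate_faults([{"A", "B"}, {"A", "C"}]))  # [{'A'}, {'B', 'C'}]
```

Observing more executions refines the supports, and each experiment that fails to break the system enriches the model with a form of redundancy it had not yet seen.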
The Future Is Disorder
Providing an "always-on" experience for users and customers means that distributed software must be fault tolerant—that is to say, it must be written to anticipate, detect, and either mask or gracefully handle the effects of fault events such as hardware failures and network partitions. Writing fault-tolerant software—whether for distributed data management systems involving the interaction of a handful of physical machines, or for Web applications involving the cooperation of tens of thousands—remains extremely difficult. While the state of the art in verification and program analysis continues to evolve in the academic world, the industry is moving very much in the opposite direction: away from formal methods (however, with some noteworthy exceptions [41]) and toward