are simply too large, too heterogeneous, and too dynamic for these
classic approaches to software quality to take root. In reaction, practitioners increasingly rely on resiliency
techniques based on testing and fault
injection.6,14,19,23,27,35 These “black box”
approaches (which perturb and observe the complete system, rather
than its components) are (arguably)
better suited for testing an end-to-end property such as fault tolerance.
Instead of deriving guarantees from
understanding how a system works
on the inside, testers of the system
observe its behavior from the outside,
building confidence that it functions
correctly under stress.
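To make this concrete, the sketch below shows the shape of such a perturb-and-observe experiment in Python. The Cluster class is a toy in-memory stand-in for real orchestration and client hooks; every name in it is invented for illustration rather than drawn from any particular tool.

    import random

    # Toy stand-in for the system under test. In practice, kill/restore/probe
    # would call real orchestration and client APIs; all names here are invented.
    class Cluster:
        def __init__(self, nodes=3):
            self.all_nodes = set(range(nodes))
            self.up = set(self.all_nodes)

        def kill_random_node(self):
            if self.up:
                self.up.discard(random.choice(sorted(self.up)))

        def restore_all(self):
            self.up = set(self.all_nodes)

        def probe(self):
            # Externally visible check: the toy service "answers" while a
            # majority of nodes is up.
            return len(self.up) > len(self.all_nodes) // 2

    def run_experiment(cluster, trials=20):
        """Perturb the system, then observe its behavior from the outside."""
        ok = 0
        for _ in range(trials):
            cluster.kill_random_node()   # inject a fault
            ok += cluster.probe()        # observe only external behavior
            cluster.restore_all()        # heal before the next trial
        return ok / trials

    print(run_experiment(Cluster()))     # 1.0: the toy survives single-node faults

Nothing in this loop inspects the system's internals; confidence comes only from injected faults and externally observed responses.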
Two giants have recently emerged
in this space: Chaos Engineering6 and
Jepsen testing.24 Chaos Engineering,
the practice of actively perturbing production systems to increase overall site
resiliency, was pioneered by Netflix,6
but since then LinkedIn,52 Microsoft,38
Uber,47 and PagerDuty5 have developed
Chaos-based infrastructures. Jepsen
performs black box testing and fault
injection on unmodified distributed
data management systems, in search
of correctness violations (for example,
counterexamples that show an execution was not linearizable).
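For a sense of what such a counterexample looks like, here is a minimal brute-force linearizability check for a single read/write register, written in Python purely for illustration (Jepsen's actual checkers are far more sophisticated and efficient). The two-operation history below admits no legal ordering: a read that begins after a write has been acknowledged still returns the old value.

    from itertools import permutations

    # Each operation is (invoke_time, return_time, kind, value).
    def linearizable(history, initial=None):
        for order in permutations(history):
            # Real-time constraint: an operation that completed before another
            # was invoked must come first in the linearization.
            if any(b[1] < a[0] for i, a in enumerate(order) for b in order[i + 1:]):
                continue
            # Register semantics: every read returns the latest preceding write.
            value, legal = initial, True
            for _inv, _ret, kind, v in order:
                if kind == "write":
                    value = v
                elif v != value:
                    legal = False
                    break
            if legal:
                return True
        return False

    history = [
        (0, 1, "write", 1),    # write(1) is acknowledged at time 1
        (2, 3, "read", None),  # a later read still returns the initial value
    ]
    print(linearizable(history))  # False: a counterexample to linearizability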
Both approaches are pragmatic and
empirical. Each builds an understanding of how a system operates under
faults by running the system and observing its behavior. Both approaches offer
a pay-as-you-go method to resiliency:
the initial cost of integration is low,
and the more experiments that are
performed, the higher the confidence
that the system under test is robust.
Because these approaches represent
a straightforward enrichment of existing best practices in testing with well-understood fault injection techniques,
they are easy to adopt. Finally, and
perhaps most importantly, both approaches have been shown to be effective at identifying bugs.
Unfortunately, both techniques
also have a fatal flaw: they are manual
processes that require an extremely
sophisticated operator. Chaos Engineers
are a highly specialized subclass
of site reliability engineers. To devise
a custom fault injection strategy, a
Chaos Engineer typically meets with
different service teams to build an
understanding of the idiosyncrasies
of various components and their interactions.
The Chaos Engineer then
targets those services and interactions
that seem likely to have latent fault tolerance
weaknesses. Not only is this approach
difficult to scale, since it must
be repeated for every new composition
of services, but its critical currency,
a mental model of the system under
study, is hidden away in a person’s
brain. These points are reminiscent
of a bigger (and more worrying) trend
in industry toward reliability priesthoods,7
complete with icons (dashboards)
and rituals (playbooks).
Jepsen is in principle a framework
that anyone can use, but to the best of
our knowledge all of the reported bugs
discovered by Jepsen to date were discovered by its inventor, Kyle Kingsbury,
who currently operates a “distributed
systems safety research” consultancy.24
Applying Jepsen to a storage system
requires that the superuser carefully read
the system documentation, generate
workloads, and observe the externally
visible behaviors of the system under
test. It is then up to the operator to
choose, from the massive combinatorial space of “nemeses” (including
machine crashes and network partitions), those fault schedules that are
likely to drive the system into returning
incorrect responses.
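The sketch below, in Python with invented action names (not Jepsen's API), shows what sampling that space uniformly at random looks like; the operator's expertise lies precisely in biasing this choice toward schedules likely to expose bugs.

    import random

    # Illustrative nemesis actions; these names are not Jepsen's API.
    NEMESES = ["crash-node", "restart-node", "partition-halves",
               "partition-random-node", "clock-skew", "heal"]

    def random_schedule(steps=8, seed=None):
        """Draw one fault schedule (a list of (seconds, action) pairs)
        uniformly from the space of nemesis sequences."""
        rng = random.Random(seed)
        schedule, t = [], 0
        for _ in range(steps):
            t += rng.randint(1, 30)          # gap before the next fault
            schedule.append((t, rng.choice(NEMESES)))
        return schedule

    print(random_schedule(seed=1))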
A human in the loop is the kiss of
death for systems that need to keep up
with software evolution. Human attention should always be targeted at tasks
that computers cannot do! Moreover,
the specialists that Chaos and Jepsen
testing require are expensive and rare.
Here, we show how geniuses can be abstracted away from the process of failure testing.
We Don’t Need Another Hero
Rapidly changing assumptions about
our visibility into distributed system
internals have made obsolete many
if not all of the classic approaches to
software quality, while emerging
“chaos-based” approaches are fragile and
unscalable because of their genius-in-the-loop requirement.
We present our vision of automated
failure testing by looking at how the
same changing environments that hastened the demise of time-tested resiliency techniques can enable new ones.
We argue that the best way to automate the
experts out of the failure-testing loop is
to imitate their best practices in software, and we show how the emergence of
sophisticated observability infrastructure makes this possible.
The order is rapidly fadin’. For large-scale distributed systems, the three
fundamental assumptions of traditional approaches to software quality
are quickly fading in the rearview mirror. The first to go was the belief that
you could rely on experts to solve the
hardest problems in the domain. Second was the assumption that a formal
specification of the system is available.
Finally, any program analysis (broadly
defined) that requires access to source code
must be taken off the table. The erosion of these assumptions
helps explain the move away from classic academic approaches to resiliency
in favor of the black box approaches
described earlier.
What hope is there of understanding the behavior of complex systems
in this new reality? Luckily, the fact
that it is more difficult than ever to
understand distributed systems from
the inside has led to the rapid evolution of tools that allow us to understand them from the outside. Call-graph logging was first described by
Google;51 similar systems are in use
at Twitter,4 Netflix,1 and Uber,50 and
the technique has since been standardized.43 It is reasonable to assume
that a modern microservice-based
Internet enterprise will already have
instrumented its systems to collect
call-graph traces. A number of startups that focus on observability have
recently emerged.21,34 Meanwhile,
provenance collection techniques
for data processing systems11,22,42 are
becoming mature, as are operating
system-level provenance tools.44
Recent work12,55 has attempted to infer
causal and communication structure
of distributed computations from
raw logs, bringing high-level explanations of outcomes within reach even
for uninstrumented systems.
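As a rough illustration of the raw material these tracing systems provide, the Python sketch below rebuilds a per-request call graph from span records in the Dapper lineage; the field names, ids, and services are invented for the example.

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class Span:
        trace_id: str   # one id per end-to-end request
        span_id: str    # this particular call
        parent_id: str  # the call that caused it ("" marks the root)
        service: str
        start_us: int
        duration_us: int

    def call_graph(spans):
        """Group spans by (trace, parent) to recover caller -> callee edges."""
        edges = defaultdict(list)
        for s in spans:
            edges[(s.trace_id, s.parent_id)].append(s)
        return edges

    spans = [
        Span("t1", "a", "",  "api-gateway", 0,   900),
        Span("t1", "b", "a", "user-svc",    100, 300),
        Span("t1", "c", "a", "cart-svc",    450, 350),
    ]
    for (trace, parent), callees in call_graph(spans).items():
        print(trace, parent or "(root)", "->", [s.service for s in callees])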
Regarding testing distributed systems: Chaos Monkey, like they mention, is awesome, and I also highly recommend getting Kyle to run Jepsen tests.
—Commentator on HackerRumor