observability infrastructure and fault injection infrastructure, consuming the
former, maintaining a model of system
redundancy, and using it to parameterize the latter. Explanations of system outcomes and fault injection infrastructures are already available. In
the current state of the art, the puzzle piece that fits them together (models of redundancy) is still constructed by hand. LDFI (as we will explain) shows that this component can be automated.
A Blast from the Past
In previous work, we introduced a bug-finding tool called LDFI (lineage-driven
fault injection). 2 LDFI uses data provenance collected during simulations of
distributed executions to build
derivation graphs for system outcomes. These
graphs function much like the models
of system redundancy described earlier. LDFI then converts the derivation
graphs into a Boolean formula whose
satisfying assignments correspond to
combinations of faults that invalidate
all derivations of the outcome. An experiment targeting those faults will
then either expose a bug (that is, the expected outcome fails to occur) or reveal
additional derivations (for example, after a timeout, the system fails over to a
backup) that can be used to enrich the
model and constrain future solutions.
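To make the encoding concrete, here is a minimal sketch (ours, not the LDFI implementation) in which each derivation of the expected outcome is represented simply as the set of base facts it depends on. The Boolean formula is the conjunction, over derivations, of the disjunction of faults to the facts in that derivation; rather than calling a SAT solver, the sketch enumerates small satisfying assignments (minimal hitting sets) by brute force. The fact names are invented.

```python
# A minimal sketch of hypothesis generation from derivation graphs.
# Each derivation is the set of base facts it depends on; a hypothesis
# must "hit" (break) every derivation, i.e., it must satisfy the formula
#   AND over derivations d ( OR over facts f in d : fault(f) ).
from itertools import combinations

def fault_hypotheses(derivations, max_faults=3):
    """Yield minimal sets of base-fact faults that invalidate every derivation."""
    facts = sorted({f for d in derivations for f in d})
    found = []
    for k in range(1, max_faults + 1):
        for faults in combinations(facts, k):
            fault_set = set(faults)
            # Skip supersets of hypotheses already emitted (keep them minimal).
            if any(prev <= fault_set for prev in found):
                continue
            # The hypothesis must break (intersect) every known derivation.
            if all(fault_set & set(d) for d in derivations):
                found.append(fault_set)
                yield fault_set

# Two derivations of the outcome "client receives an ack" (hypothetical facts):
derivations = [
    {"replicaA_up", "linkA_ok"},   # primary write path
    {"replicaB_up", "linkB_ok"},   # failover path revealed after a timeout
]
for hypothesis in fault_hypotheses(derivations):
    print("inject:", sorted(hypothesis))
```

An implementation at scale can hand the same formula to a SAT solver and prefer small solutions; the correspondence between satisfying assignments and fault injection experiments is unchanged.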
At its heart, LDFI reapplies well-understood techniques from data
management systems, treating fault
tolerance as a materialized view maintenance problem. 2, 13 It models a distributed system as a query, its expected outcomes as query outcomes, and
critical facts such as “replica A is up at
time t” and “there is connectivity between nodes X and Y during the interval i . . . j” as base facts. It can then ask
a how-to query: 37 What changes to base
data will cause changes to the derived
data in the view? The answers to this
query are the faults that could, according to the current model, invalidate the
expected outcomes.
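Seen this way, hypothesis generation is a brute-force how-to query over base facts. The toy below, with made-up facts and a hard-coded "query," asks which deletions of base facts flip the derived outcome from true to false; it illustrates the framing only, not any particular view-maintenance machinery.

```python
# A toy how-to query: the system is a query over base facts, the expected
# outcome is the derived view, and we ask which deletions of base facts
# change the view. All facts and names are hypothetical.
from itertools import combinations

BASE_FACTS = {
    ("up", "replicaA", "t1"),
    ("up", "replicaB", "t1"),
    ("conn", "client", "replicaA", "t1"),
    ("conn", "client", "replicaB", "t1"),
}

def outcome(facts):
    """The derived 'view': the client can reach at least one live replica."""
    return any(
        ("up", r, "t1") in facts and ("conn", "client", r, "t1") in facts
        for r in ("replicaA", "replicaB")
    )

def how_to_query(facts, max_deletions=2):
    """Which deletions of base facts invalidate the derived outcome?"""
    for k in range(1, max_deletions + 1):
        for deleted in combinations(sorted(facts), k):
            if not outcome(facts - set(deleted)):
                yield deleted

for answer in how_to_query(BASE_FACTS):
    print("faults that invalidate the outcome:", answer)
```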
The idea seems far-fetched, but the
LDFI approach shows a great deal of
promise. The initial prototype demonstrated the efficacy of the approach at
the level of protocols, identifying bugs
in replication, broadcast, and commit
protocols. 2, 46 Notably, LDFI reproduced
a bug in the replication protocol used by
the Kafka distributed log 26 that was first
(manually) identified by Kingsbury. 30
A later iteration of LDFI is deployed at
Netflix, 1 where (much like the illustration in Figure 1) it was implemented
as a microservice that consumes traces
from a call-graph repository service and
provides inputs for a fault injection service. Since its deployment, LDFI has
identified 11 critical bugs in user-facing applications at Netflix. 1
Rumors from the Future
The research presented earlier is
only the tip of the iceberg. Much work
still needs to be undertaken to realize
the vision of fully automated failure
testing for distributed systems. Here,
we highlight nascent research that
shows promise and identifies new directions that will help realize our vision.
Don’t overthink fault injection. In the
context of resiliency testing for distributed systems, attempting to enumerate and faithfully simulate every possible kind of fault is a tempting but distracting path. The problem of understanding all the causes of faults is not
directly relevant to the target, which
is to ensure that code (along with its
configuration) intended to detect and
mitigate faults performs as expected.
Consider Figure 2: The diagram on
the left shows a microservice-based
architecture; arrows represent calls
generated by a client request. The
right-hand side zooms in on a pair of
interacting services. The shaded box in the caller service represents the fault tolerance logic that is intended to detect and handle faults of the callee. Failure testing targets bugs in this logic: the injected faults affect the callee, but the bug search targets the fault tolerance code in the caller.
The common effects of all of these faults, from the perspective of the caller, are explicit error returns, corrupted responses, and (possibly infinite) delay. Of these manifestations, the first two can be adequately tested with unit tests. The last is difficult to test, leading to branches of code that are infrequently executed. If we inject only delay, and only at component boundaries, we conjecture that we can address the majority of bugs related to fault tolerance.
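As a deliberately tiny illustration of this conjecture, the sketch below wraps a hypothetical boundary call in a delay injector and uses it to exercise the caller's timeout-and-fallback branch. The service, function, and field names are all invented, and in practice delay would more likely be injected at an RPC layer or proxy than with an in-process decorator.

```python
# A minimal sketch: inject only delay, and only at a component boundary,
# to force execution of the caller's rarely taken fallback path.
import random
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from functools import wraps

def inject_delay(probability=1.0, delay_seconds=2.0):
    """Wrap a call at a component boundary so that it (sometimes) stalls."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(delay_seconds)   # delay is the only fault we inject
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_delay(probability=1.0, delay_seconds=2.0)
def call_backend(user_id):
    # Stand-in for the remote callee on the right-hand side of Figure 2.
    return {"user": user_id, "plan": "premium"}

def fetch_profile(user_id, timeout_seconds=1.0):
    """Caller-side fault tolerance logic (the shaded box): time out and fall back."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        try:
            return pool.submit(call_backend, user_id).result(timeout=timeout_seconds)
        except TimeoutError:
            # The rarely exercised branch that delay injection forces us to test.
            return {"user": user_id, "plan": "default"}

print(fetch_profile("alice"))   # returns the fallback after the injected stall
```

The point is that a single knob, added latency at the boundary, is enough to drive execution down the fallback path that ordinary unit tests rarely reach.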
Explanations everywhere. If we can provide better explanations of system outcomes, we can build better models of system redundancy.