The Old Gods. The ancient myth: Leave it to the experts. Once upon a time, distributed systems researchers and practitioners were confident that responsibility for fault tolerance could be relegated to a small priesthood of experts. Protocols for failure detection, recovery, reliable communication, consensus, and replication could be implemented once and hidden away in libraries, ready for use by the layfolk.

This was a reasonable dream. After all, abstraction is the best tool for overcoming complexity in computer science, and composing reliable systems from unreliable components is fundamental to classical system design.33 Reliability techniques such as process pairs18 and RAID45 demonstrate that partial failure can, in certain cases, be handled at the lowest levels of a system and successfully masked from applications.

Unfortunately, these approaches rely on failure detection. Perfect failure detectors are impossible to implement in a distributed system,9,15 in which it is impossible to distinguish between delay and failure. Attempts to mask the fundamental uncertainty arising from partial failure in a distributed system—for example, RPC (remote procedure calls8) and NFS (network file system49)—have famously met with difficulty. Despite the broad consensus that these are failed abstractions,28 in the absence of better ones, people continue to rely on them, to the consternation of developers, operators, and users.

In a distributed system—that is, a system of loosely coupled components interacting via messages—the failure of a component is only ever manifested as the absence of a message. The only way to detect the absence of a message is via a timeout, an ambiguous signal that means either that the message will never come or that it merely has not come yet. Timeouts are an end-to-end concern28,48 that must ultimately be managed by the application. Hence, partial failures in distributed systems bubble up the stack and frustrate any attempts at abstraction.
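As a concrete illustration of this ambiguity, consider the following Go sketch of a client issuing a debit request under a deadline (the callDebit and remoteDebit names, and the timings, are hypothetical stand-ins, not any particular system's API). When the deadline expires, the caller learns only that no reply arrived in time, not whether the debit took effect; deciding what to do next is left to the application.

    // A minimal, hypothetical sketch of why a timeout is an ambiguous signal.
    package main

    import (
        "context"
        "errors"
        "fmt"
        "time"
    )

    // callDebit stands in for any request/response interaction with a remote
    // component: it returns when a reply arrives or when the deadline expires.
    func callDebit(ctx context.Context, amount int) error {
        reply := make(chan error, 1)
        go func() {
            reply <- remoteDebit(amount) // pretend network call; may be slow, lost, or fine
        }()
        select {
        case err := <-reply:
            return err // a definite outcome: the reply arrived
        case <-ctx.Done():
            // An ambiguous outcome: the debit may or may not have been applied.
            return fmt.Errorf("debit of %d timed out: %w", amount, ctx.Err())
        }
    }

    // remoteDebit simulates a server that is merely slow, not failed.
    func remoteDebit(amount int) error {
        time.Sleep(2 * time.Second)
        return nil
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
        defer cancel()
        err := callDebit(ctx, 42)
        if errors.Is(err, context.DeadlineExceeded) {
            // Only the application can decide what to do: retry (and risk a
            // double debit), give up, or reconcile later.
            fmt.Println("no reply within the deadline; outcome unknown:", err)
        }
    }

No amount of tuning the deadline removes the ambiguity; it only trades how long the caller waits against how often a merely slow component is treated as a failed one.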
The Old Guard. The modern myth: Formally verified distributed components. If we cannot rely on geniuses to hide the specter of partial failure, the next best hope is to face it head on, armed with tools. Until quite recently, many of us (academics in particular) looked to formal methods such as model checking16,20,29,39,40,53,54 to assist "mere mortal" programmers in writing distributed code that upholds its guarantees despite pervasive uncertainty in distributed executions. It is not feasible to exhaustively search the state space of large-scale systems (one cannot, for example, model check Netflix), but the hope is that modularity and composition (the next best tools for conquering complexity) can be brought to bear. If individual distributed components could be formally verified and combined into systems in a way that preserved their guarantees, then global fault tolerance could be obtained via composition of local fault tolerance.
Unfortunately, this, too, is a pipe dream. Most model checkers require a formal specification; most real-world systems have none (or have not had one since the design phase, many versions ago). Software model checkers and other program-analysis tools require the source code of the system under study. The accessibility of source code is also an increasingly tenuous assumption. Many of the data stores targeted by tools such as Jepsen are closed source; large-scale architectures, while typically built from open-source components, are increasingly polyglot (written in a wide variety of languages).
Finally, even if you assume that specifications or source code are available, techniques such as model checking are not a viable strategy for ensuring that applications are fault tolerant because, as the discussion of timeouts above suggests, fault tolerance itself is an end-to-end property that does not necessarily hold under composition. Even if you are lucky enough to build a system out of individually verified components, it does not follow that the system is fault tolerant—you may have made a critical error in the glue that binds them.
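To make the last point concrete, consider a hypothetical Go sketch of the glue between two individually reliable components (the Queue and Store interfaces below are illustrative, not a real library): a queue that redelivers any message that is not acknowledged, and a store that reports write failures. The glue acknowledges the message before the write is known to have succeeded, so a failed write, or a crash between the two calls, silently loses data, a bug that verifying the components in isolation cannot reveal.

    // A hypothetical sketch: Queue and Store stand in for individually
    // verified, individually reliable components.
    package consumer

    type Message struct {
        ID   string
        Body []byte
    }

    // Queue redelivers any message that is never acknowledged.
    type Queue interface {
        Receive() (Message, error)
        Ack(id string) error
    }

    // Store returns an error whenever a write does not take effect.
    type Store interface {
        Write(key string, value []byte) error
    }

    // consumeOnce is the glue that moves one message from the queue to the store.
    // BUG: it acknowledges the message before the write is known to have
    // succeeded, so a failed Write (or a crash between Ack and Write) loses
    // the message even though both components kept their local guarantees.
    func consumeOnce(q Queue, s Store) error {
        msg, err := q.Receive()
        if err != nil {
            return err
        }
        if err := q.Ack(msg.ID); err != nil { // the critical error lives in this ordering
            return err
        }
        return s.Write(msg.ID, msg.Body)
    }

Swapping the order of the two calls (write first, acknowledge only on success) restores the end-to-end guarantee, but that property belongs to the composition, not to either verified component.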
The Vanguard. The emerging ethos: YOLO. While the state of the art in verification and program analysis continues to evolve in the academic world, industry is moving in the opposite direction: away from formal methods and toward approaches that combine testing with fault injection. Here, we describe the underlying causes of this trend, why it has been successful so far, and why it is doomed to fail in its current practice.