Article development led by
A practitioner’s guide to increasing
confidence in system correctness.
BY CAITIE MCCAFFREY
LESLIE LAMPORT, KNOWN for his seminal work in
distributed systems, famously said, “A distributed
system is one in which the failure of a computer
you didn’t even know existed can render your own
computer unusable.” Given this bleak outlook and the
large set of possible failures, how do you even begin to
verify and validate the distributed systems you build
are doing the right thing?
Distributed systems are difficult to build and test
for two main reasons: partial failure and asynchrony.
Asynchrony is the nondeterminism of ordering and
timing within a system; essentially, there is no now. 10
Partial failure is the idea components can fail along
the way, resulting in incomplete re-
sults or data.
These two realities of distributed
systems must be addressed to create a
correct system. Solving these problems
often leads to solutions with a high degree of complexity. This in turn leads
to an increase in the probability of human error in either design, implementation, or operation. In addition, the
interaction of asynchrony and partial
failure leads to an extremely large state
space of failure conditions. Because of
the complexity of distributed systems
and the large state space of failure conditions, testing and verifying these systems are critically important. Without
explicitly forcing a system to fail, it is
unreasonable to have any confidence it
will operate correctly in failure modes.
Formal verification is a set of methods that prove properties about a system. Once these methods have been
used, a system gets a gold star for being
Formal specification languages.
Individual systems and protocols can be
verified using formal methods such as
TLA+ and Coq. 16 These are formal specification languages that allow users to
design, model, document, and verify
the correctness of concurrent systems.
Every system under test requires the
definition of the program, an environment including what kinds of failures
can happen, and a correctness specification that defines the guarantees the
system provides. With these tools a system can be declared provably correct.
Amazon Web Services (AWS) successfully used TLA+ to verify more
than 10 core pieces of its infrastructure, including Simple Storage Service
(S3). 8 This resulted in the discovery
of several subtle bugs in the system
design and allowed AWS to make aggressive performance optimizations
without having to worry about breaking existing behavior.
The typical complaint about formal
specification languages is that learning
them is difficult and writing the specifications is tedious and time consuming.
It is true that while they can uncover