IronFleet: Proving Safety
and Liveness of Practical
By Chris Hawblitzel, Jon Howell, Manos Kapritsos, Jacob R. Lorch, Bryan Parno,
Michael L. Roberts, Srinath Setty, and Brian Zill
Distributed systems are notorious for harboring subtle bugs.
Verification can, in principle, eliminate these bugs, but it
has historically been difficult to apply at full-program scale,
much less distributed system scale. We describe a methodology for building practical and provably correct distributed systems based on a unique blend of temporal logic of
actions-style state-machine refinement and Hoare-logic
verification. We demonstrate the methodology on a complex
implementation of a Paxos-based replicated state machine
library and a lease-based sharded key-value store. We prove
that each obeys a concise safety specification as well as desirable liveness requirements. Each implementation achieves
performance competitive with a reference system. With our
methodology and lessons learned, we aim to raise the standard for distributed systems from “tested” to “correct.”
Distributed systems are notoriously hard to get right. Protocol
designers struggle to reason about concurrent execution on
multiple machines, which leads to subtle errors. Engineers
implementing such protocols face the same subtleties and,
worse, must improvise to fill in gaps between abstract protocol descriptions and practical constraints such as "real logs
cannot grow without bound." Thorough testing is considered best practice, but its efficacy is limited by distributed
systems’ combinatorially large state spaces.
In theory, formal verification can categorically eliminate
errors from distributed systems. However, due to the complexity of these systems, previous work has primarily focused
on formally specifying, 1, 8, 18 verifying, 20 or at least bug-checking9
distributed protocols, often in a simplified form, without
extending such formal reasoning to the implementations. In
principle, one can use model checking to reason about the
correctness of both protocols15 and implementations. 17 In
practice, however, model checking is incomplete—the accuracy of the results depends on the accuracy of the model—
and does not scale. 1
This paper presents IronFleet, the first methodology for
automated machine-checked verification of the safety and
liveness of nontrivial distributed system implementations.
The IronFleet methodology is practical: it supports complex, feature-rich implementations with reasonable performance, and a tolerable proof burden.
Ultimately, IronFleet guarantees that the implementa-
tion of a distributed system meets a high-level, centralized
specification. For example, a sharded key-value store acts
as a key-value store, and a replicated state machine acts as
a state machine. This guarantee categorically rules out race
conditions, violations of global invariants, integer overflow,
disagreements between packet encoding and decoding, and
bugs in rarely exercised code paths such as failure recovery.
Moreover, it not only rules out bad behavior but also tells us
exactly how the distributed system will behave at all times.
The IronFleet methodology supports proving both safety
and liveness properties of distributed system implementations. A safety property says that the system cannot perform
incorrect actions; for example, replicated-state-machine linearizability says that clients never see inconsistent results.
A liveness property says that the system eventually performs
a useful action, for example, that it responds to each client
request. In large-scale deployments, ensuring liveness is
critical, since a liveness bug may render the entire system
IronFleet takes the verification of safety properties further than prior work (Section 7), mechanically verifying two
full-featured systems. The verification applies not just to
their protocols but to actual imperative implementations
that achieve good performance. Our proofs reason all the
way down to the bytes of the UDP packets sent on the network, guaranteeing correctness despite packet drops, reor-derings, or duplications.
Regarding liveness, IronFleet breaks new ground: to our
knowledge, IronFleet is the first system to mechanically
verify liveness properties of a practical protocol, let alone an
IronFleet achieves comprehensive verification of complex
distributed systems via a methodology for structuring and
writing proofs about them, as well as a collection of generic
verified libraries useful for implementing such systems.
Structurally, IronFleet’s methodology uses a concurrency
containment strategy (Section 3) that blends two distinct
verification styles within the same automated theorem-
proving framework, preventing any semantic gaps between
them. We use temporal logic of actions (TLA)-style state-
machine refinement13 to reason about protocol-level con-
currency, ignoring implementation complexities, then use
The original version of this paper was published as “Iron-
Fleet: Proving Practical Distributed Systems Correct” in
the 25th ACM Symposium on Operating Systems Principles
(SOSP), Oct. 2015.