Certifying a File System Using
Crash Hoare Logic: Correctness
in the Presence of Crashes
By Tej Chajed, Haogang Chen, Adam Chlipala, M. Frans Kaashoek, Nickolai Zeldovich, and Daniel Ziegler
FSCQ is the first file system with a machine-checkable proof
that its implementation meets a specification, even in the
presence of fail-stop crashes. FSCQ provably avoids bugs that
have plagued previous file systems, such as performing disk
writes without sufficient barriers or forgetting to zero out
directory blocks. If a crash happens at an inopportune time,
these bugs can lead to data loss. FSCQ’s theorems prove
that, under any sequence of crashes followed by reboots,
FSCQ will recover its state correctly without losing data.
To state FSCQ’s theorems, this paper introduces the
Crash Hoare logic (CHL), which extends traditional Hoare
logic with a crash condition, a recovery procedure, and logical
address spaces for specifying disk states at different abstraction levels. CHL also reduces the proof effort for developers
through proof automation. Using CHL, we developed, specified, and proved the correctness of the FSCQ file system.
Although FSCQ’s design is relatively simple, experiments
with FSCQ as a user-level file system show that it is sufficient
to run Unix applications with usable performance. FSCQ’s
specifications and proofs required significantly more work
than the implementation, but the work was manageable
even for a small team of a few researchers.
This paper describes Crash Hoare logic (CHL), which allows
developers to write specifications for crash-safe storage systems and also prove them correct. “Correct” means that,
if a computer crashes due to a power failure or other fail-stop fault and subsequently reboots, the storage system
will recover to a state consistent with its specification (e.g.,
POSIX17). For example, after recovery, either all disk writes
from a file-system call will be on disk, or none will be. Using
CHL we write a simple specification for a subset of POSIX
and build the FSCQ certified file system, which comes with a
machine-checkable proof that its implementation matches
Proving the correctness of a file system implementation is
important, because existing file systems have a long history
of bugs both in normal operation and in handling crashes.
Reasoning about crashes is especially challenging because
it is difficult for the file-system developer to consider all
possible points where a crash could occur, both while a file-system call is running and during the execution of recovery
code. Often, a system may work correctly in many cases, but
if a crash happens at a particular point between two specific
disk writes, then a problem arises.
Most approaches to building crash-safe file systems fall
roughly into three categories (see the SOSP paper3 for a more
in-depth discussion of related work): testing, program analysis, and model checking. Although they are effective at finding bugs in practice, none of them can guarantee the absence
of crash-safety bugs in actual implementations. This paper
focuses precisely on this issue: helping developers build file
systems with machine-checkable proofs that they correctly
recover from crashes at any point.
Researchers have used theorem provers for certifying
real-world systems such as compilers,
23 small kernels,
22 kernel extensions,
35 and simple remote servers.
15 Some certification projects1, 10, 11, 18, 28, 32 have even targeted file systems,
as we do, but in each case either the file system was not
complete, executable, and ready to run on a real operating
system; or its proof did not consider crashes. Reasoning
about crash-free executions typically involves considering
the states before and after some operation. Reasoning about
crashes is more complicated because crashes can expose
intermediate states of an operation.
Building an infrastructure for reasoning about file-system crashes poses several challenges. Foremost among
those challenges is the need for a specification framework
that allows the file-system developer to formalize the system behavior under crashes. Second, it is important that
the specification framework allows for proofs to be automated, so that one can make changes to a specification and
its implementation without having to redo all of the proofs
manually. Third, the specification framework must be able
to capture important performance optimizations, such as
asynchronous disk writes, so that the implementation of a
file system has acceptable performance. Finally, the specification framework must allow modular development: developers should be able to specify and verify each component
in isolation and then compose verified components. For
instance, once a logging layer has been implemented, file-system developers should be able to prove end-to-end crash
safety in the inode layer simply by relying on the fact that
logging ensures atomicity; they should not need to consider
every possible crash point in the inode code.
CHL addresses these challenges by allowing program-
mers to specify what invariants hold in case of crashes and
The original version of this paper is entitled “Using Crash
Hoare Logic for Certifying the FSCQ File System” and was
published in the Proceedings of the 25th ACM Symposium
on Operating Systems Principles (SOSP ’ 15).