Technical Perspective
Proving File Systems Meet Expectations
By Gernot Heiser
DOI: 10.1145/3051090

File systems have been a standard part of computer systems for over 50 years. We routinely (and mostly unthinkingly) use them to store and retrieve persistent data. But how persistent are they really?

This question is far trickier to answer than it may seem at first, for several reasons. Persistent storage media are orders of magnitude slower than non-persistent memories, such as processor registers or random-access memory. They are also organized differently: persistent media are read or written in blocks whose size is on the order of kilobytes but can be up to megabytes. Bridging the speed and granularity gaps requires complex logic for data buffering, implementing efficient lookups, and reordering slow operations to obtain the best performance. Some of this is performed by the operating system, and some of it in the firmware of the storage device, with limited operating-system control.

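To make the granularity gap concrete, here is a minimal sketch (my own illustration, not code from the paper; it assumes a 4,096-byte block size and an update that stays within one existing block) of the read-modify-write cycle that changing a few bytes ultimately requires somewhere in the storage stack:

    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096   /* assumed device block size */

    /* Update len bytes at file offset off.  The update is assumed not
     * to cross a block boundary, and the block is assumed to already
     * exist in the file.  Even a tiny change forces a whole block to be
     * read, modified in memory, and written back. */
    int update_bytes(int fd, off_t off, const char *data, size_t len)
    {
        char block[BLOCK_SIZE];
        off_t block_start = off - (off % BLOCK_SIZE);

        if (pread(fd, block, BLOCK_SIZE, block_start) < 0)    /* read the whole block */
            return -1;
        memcpy(block + (off - block_start), data, len);       /* change a few bytes */
        if (pwrite(fd, block, BLOCK_SIZE, block_start) < 0)   /* rewrite the whole block */
            return -1;
        return 0;
    }

In a real system this cycle is hidden behind caches and I/O schedulers, which is exactly where the complexity, and the risk, lives.
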
The file system hides this complexity behind a simple read-write interface, with the promise that data that has been written to a persistent medium can be retrieved correctly at a later time (provided the storage medium itself is not destroyed).

What makes it tricky to fulfill this promise is the fact that computer systems are (still) not reliable: they crash. Such a crash may be the result of a physical failure (lack of electrical power), but is more frequently the result of failing software, especially the operating system, or even the file-system implementation itself. Unless great care is taken, such a failure can lead to loss or corruption of supposedly persistent data. For example, a write operation may have updated an internal buffer but not yet reached the persistent medium, resulting in data loss. Reordering may lead to causality violations, where data produced later in the computation is made persistent, while data logically produced before is not. In extreme cases, corruption of metadata, such as lookup data structures, can result in the loss of data that had been made persistent earlier.

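The first of these failure modes is easy to stumble into from ordinary application code. The sketch below is illustrative (the function name and error handling are mine): write() returning success only means the data has entered the file system's buffers, and it is the explicit barrier fsync() that forces it out to the medium; a crash in between can silently lose the record.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Append one record and make sure it actually reaches the medium. */
    int save_record(const char *path, const char *rec)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return -1;

        /* Success here usually means "copied into an in-memory buffer". */
        if (write(fd, rec, strlen(rec)) < 0) {
            close(fd);
            return -1;
        }

        /* Only this barrier forces the buffered data to the persistent
         * medium; a crash before it returns may lose the record. */
        if (fsync(fd) < 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }
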
Different file systems go to different lengths in trying to prevent such loss, from making no guarantees at all, to guaranteeing the integrity of metadata, to guaranteeing persistence and integrity of all data written before an explicit or implicit synchronization barrier. But what do those “guarantees” really mean, when they are implemented in complex software that is almost certainly buggy? In particular, what does it mean for a file system to be “crash-safe”?

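What is at stake can be seen in the classic pattern applications use to replace a file's contents atomically. The sketch below is illustrative (names and error handling are mine), and it is only as safe as the ordering and persistence guarantees the underlying file system actually provides:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Write the new contents to a temporary file, flush it to the
     * medium, then rename it over the old file. */
    int replace_file(const char *path, const char *tmp, const char *data)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        if (write(fd, data, strlen(data)) < 0 || fsync(fd) < 0) {
            close(fd);
            return -1;
        }
        close(fd);

        /* rename() is atomic in the namespace, but only a crash-safe
         * file system ensures that after a crash we see either the old
         * contents or the new contents in full, never a mixture. */
        return rename(tmp, path);
    }

Whether this pattern actually survives a crash on a given file system, and under which settings, is exactly the kind of question that is hard to answer without a precise specification.
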
The following paper presents a big step toward real-world file systems that are crash-safe in a strict sense. It develops a formalism called crash Hoare logic, which extends the traditional Hoare logic, widely used for reasoning about programs, with a crash condition. This allows the authors to reason about the consistency of file systems when crashes may occur at an arbitrary time during execution.
It can even deal with crashes that happen during recovery, when the file system attempts to reconstruct a consistent state from whatever it finds on the persistent medium after a crash, including the worst case in which no progress is made at all (a situation that could arise when the crash is triggered by a fault in the file system itself).

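Schematically, and in simplified notation of my own (the paper's actual specifications are written in Coq and also account for asynchronous writes), a crash Hoare logic specification pairs the familiar pre- and postcondition with a crash condition describing every state the persistent medium may be left in if execution is cut short. For writing a value v to a disk block a that currently holds v0, such a specification might read:

    \[
    \begin{array}{ll}
    \textbf{pre:}   & a \mapsto v_0 \\
    \textbf{code:}  & \mathit{disk\_write}(a,\, v) \\
    \textbf{post:}  & a \mapsto v \\
    \textbf{crash:} & a \mapsto v_0 \ \lor\ a \mapsto v
    \end{array}
    \]

The crash condition says that if the system stops at any point, block a holds either the old value or the new one; it is what recovery code gets to assume about the disk when it runs.
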
The authors show how this logic can be applied to actual file systems, including difficulties such as read and write operations that are asynchronous to the execution of the file-system code. They then use this formalism to specify and implement a real, fully featured file system, and produce a machine-checked and largely automated proof that this file system is indeed crash-safe.

This is truly remarkable work, combining the development of a new formalism with its application in real systems to prove very complex properties. It is exactly this kind of property for which formal proofs are most powerful, as anything that is as complex and interconnected as the crash-safety property is notoriously difficult to get right in the implementation. Experience shows it is almost never right unless it is formally proved.

Proof automation is very important to making such techniques scale to real-world, high-performance systems. In this paper, the authors have taken a first step toward reducing the manual effort, and their work has already triggered further progress in automating such proofs. Other work has demonstrated the feasibility of automating the verification of some of the underlying infrastructure, which this work assumes correct without proof. All this means that complete operating systems, with full proofs of functional correctness, may be closer than we thought only two years ago.

Gernot Heiser is a Scientia Professor and the John Lions Chair for operating systems at the University of New South Wales.

Copyright held by author.

To view the accompanying paper, visit doi.acm.org/10.1145/3051092