78 COMMUNICATIONS OF THE ACM | APRIL 2017 | VOL. 60 | NO. 4
other_blocks ∨ a 〈v, [v0]⊕vs〉 other_blocks. Moreover,
note that the crash condition specifies that disk_write
could crash in two possible states (either before making
the write or after). In all three logical conditions, other_
blocks stands for the arbitrary contents of all other disk
blocks beside a, which should be preserved by this operation, even in case of a crash.
The specification in Figure 3 captures asynchronous
writes. To do so, CHL models the disk as a (partial) function
from a block number to a tuple, 〈v, vs〉, consisting of the last-written value v and a set of previous values vs, any of which
could appear on disk after a crash. Block numbers greater
than the size of the disk do not map to anything. Reading
from a block returns the last-written value, since even if
there are previous values that might appear after a crash, in
the absence of a crash a read should return the last write.
Writing to a block makes the new value the last-written value
and adds the old last-written value to the set of previous values. Reading or writing a block number that does not exist
causes the system to “fail” (as opposed to finishing or crashing). Finally, CHL’s disk model supports a sync operation,
which forces the disk to flush pending writes to persistent
storage, modeled in the postcondition for sync by discarding all previous values.
Returning to Figure 3, the disk_write specification
asserts through the precondition that the address being
written, a, must be valid (i.e., within the disk’s size), by stating that address a points to some value 〈v0, vs〉 on disk. The
specification’s postcondition asserts that the block being
modified will contain the new value 〈v, [v0]⊕vs〉; that is,
the new last-written value is v, and v0 is added to the set of
previous values. The specification also asserts through the
crash condition that disk_write could crash in a state
that satisfies a 〈v0, vs〉 other_blocks ∨ a 〈v, [v0]⊕vs〉
other_blocks, that is, either the write did not happen
(a still has 〈v0, vs〉) or it did (a has 〈v, [v0]⊕vs〉). Finally, the
specification asserts that the rest of the disk is unaffected:
if other disk blocks satisfied some predicate other_blocks
before disk_write, they will still satisfy the same predicate afterward.
One subtlety of CHL’s crash conditions is that they
describe the state of the disk just before the crash occurs,
rather than just after. Right after a crash, CHL’s disk model
specifies that each block nondeterministically chooses one
value from the set of possible values before the crash. For
instance, the first line of Figure 3’s crash condition says that
the disk still “contains” all previous writes, represented by
〈v0, vs〉, rather than a specific value that persisted across the
crash, chosen out of [v0]⊕vs. This choice of representing
the state before the crash rather than after the crash allows
the crash condition to be similar to the pre- and postconditions. For example, in Figure 3, the state of other blocks just
before a crash matches the other_blocks predicate, as in the
pre- and postconditions. However, describing the state after
the crash would require a more complex predicate (e.g., if
other_blocks contains unsynced disk writes, the state after
the crash must choose one of the possible values). Making
crash conditions similar to pre- and postconditions is good
for proof automation.
The specification of disk_write captures two important behaviors of real disks—that the disk controller can
defer flushing pending writes to persistent storage and can
reorder them—in order to achieve good performance. CHL
could model a simpler synchronous disk by specifying that
a points to a single value (instead of a set of values) and
changing the crash condition to say that either a points to
the new value (a v) or a points to the old value (a v0).
This change would simplify proofs, but this model of a disk
would be accurate only if the disk were running in synchronous mode with no write buffering, which achieves lower
performance.
The disk_write specification states that blocks are
written atomically; that is, after a crash, a block must contain either the last-written value or one of the previous
values, and partial block writes are not allowed. This is a
common assumption made by file systems and we believe it
matches the behavior of many disks in practice. Using CHL,
we could capture the notion of partial block writes by specifying a more complicated crash condition, but the specification shown here matches the common assumption. We
leave to future work the question of how to build a certified
file system without that assumption.
Much like other Hoare-logic-based approaches, CHL
requires developers to write a complete specification for
every procedure, including internal ones (e.g., allocating
an object from a free bitmap). This requires stating precise pre- and postconditions. In CHL, developers must also
write a crash condition for every procedure. In practice,
we have found that the crash conditions are often simpler
to state than the pre- and postconditions. For example, in
FSCQ, most crash conditions in layers above the log simply state that there is an active (uncommitted) transaction;
only the top-level system-call code begins and commits
transactions.
3. 3. Logical address spaces
The above example illustrates how CHL can specify predicates about disk contents, but file systems often need to
express similar predicates at other levels of abstraction
as well. Consider the Unix pwrite system call. Its specification should be similar to disk_write, except that it
should describe offsets and values within the file’s contents, rather than block numbers and block values on disk.
Expressing this specification directly in terms of disk contents is tedious. For example, describing pwrite might
require saying that we allocated a new block from the bitmap allocator, grew the inode, perhaps allocated an indirect block, and modified some disk block that happens
to correspond to the correct offset within the file. Writing
such complex specifications is also error-prone, which
can result in significant wasted effort in trying to prove an
incorrect spec.
To capture such high-level abstractions in a concise manner, we observe that many of these abstractions deal with
logical address spaces. For example, the disk is an address
space from disk-block numbers to disk-block contents; the
inode layer is an address space from inode numbers to inode
structures; each file is a logical address space from offsets