While researchers have developed
a number of different approaches to
fault tolerance over the years, ultimately they all share a common strategy:
redundancy. While hardware systems
can employ redundancy at multiple levels, such as the central processing unit,
memory, and firmware, fault-tolerant
software design largely comes down
to creating mechanisms for consistent
data replication.
One of the most common approaches to software replication involves a method known as state machine replication. With state machine
replication, any service provided by a
computer can be described as a state
machine, which accepts commands
from other client machines that alter
the state machine. By deploying a set
of replica state machines with identical initial states, subsequent client
commands can be processed by the
replicas in a pre-determined fashion,
so that all state machines eventually
reach the same state. Thus, the failure of any one state machine can be
masked by the surviving machines.
The origins of this approach to
fault tolerance stretch back to the
1970s when researchers at SRI International began exploring the question
of how to fly mission-critical aircraft
using an assembly of computers. That
work laid the foundation for contemporary approaches to fault tolerance
by establishing the fundamental difference between timely systems, in
which network transmission times
are bounded and clocks are synchronized, and asynchronous systems, in
which communication latencies have
infinite-tail distribution (most messages arrive within a certain time limit
but, with decreasingly low probability,
messages may be delayed in transit
beyond any bound).
The SRI work also helped draw important distinctions between the various types of faults experienced in a
system, such as message omissions,
machine crashes, or arbitrary faults
due to software malfunction or other
undetected data alterations. Finally,
the SRI work helped to characterize
resilience bounds, or how many machines are needed to tolerate certain
failures.
The idea of state machine replication was given its first abstract formal-
ization by Leslie Lamport and later
surveyed by Fred Schneider. Lamport’s
work eventually led to the Paxos protocol, a descendant of which is now in
use at Google and elsewhere. Lamport
used the term “Byzantine” to describe
the array of possible faults that could
bedevil a system. The term derives
from the Byzantine Generals Problem,
a logic puzzle in which a group of generals must agree on a battle plan, even
though one or more of the generals
may be a traitor. The challenge is to
develop an effective messaging system that will outsmart the traitors and
ensure execution of the battle plan.
The solution, in a nutshell, involves
redundancy.
While Lamport’s work has proved
foundational in the subsequent development of Byzantine fault tolerance,
the basic ideas behind state machine
replication were also implemented in
other early systems. In the early 1980s,
Ken Birman pursued a related line
of work known as Virtual Synchrony
with the ISIS system. This approach
establishes rules for replication that
behave indistinguishably from a nonreplicated system running on a single,
nonfaulty node. The ISIS approach
eventually found its way into several
other systems, including the CORBA
fault-tolerance architecture.
At about the same time, Liskov developed viewstamped replication, a
protocol designed to address benign
failures, such as when a message gets
lost but there’s no malicious intent.
These pioneering efforts all laid the
foundation for an approach to state
machine replication that continues to
Practical Byzantine
fault tolerance
provides a useful
framework for
developing
fault-tolerant
Web systems.
underlie most contemporary work on
fault tolerance. However, most of these
projects involved relatively small, fixed
clusters of machines. “In this environment you only had to worry that the
machine you stored your data on might
have crashed,” Liskov recalls, “but it
wasn’t going to tell you lies.”
With the rise of the Internet in the
mid-1990s, the problem of “lies”—
or malicious hacks—rose to the fore.
Whereas once state machines could
trust each other’s messages, they now
had to support an additional layer of
confirmation to allow for the possibility that one or more of the state machines might have been hacked.
Two groups of developers began exploring ways of applying state-machine
replication techniques to cope with a
growing range of Byzantine failures.
Dahlia Malkhi and Mike Reiter introduced a data-centric approach known
as the Byzantine quorum systems principle. In contrast to active-replication
approaches like the Paxos protocol,
Byzantine quorum systems focus on
identifying a set of servers, rather than
focusing on the messages, and choosing a set of servers so that they intersect
in specific ways to ensure redundancy.
In the mid-1990s, Liskov started
her breakthrough work on practical Byzantine fault tolerance (PBFT),
an extension of her earlier work on
viewstamped replication that adapted
the Paxos replication protocol to cope
with arbitrary failures. Liskov’s approach demonstrated that Byzantine
approaches could scale cost-effectively, sparking renewed interest in the
systems research community.
While the foundational principles
of consistency and replication remain
essential, the rapid growth of Web systems is introducing important new
challenges. Many researchers are finding that PBFT provides a useful framework for developing fault-tolerant Web
systems. “I’m really excited about the
recent work Barbara and her colleagues
have done on making Byzantine Agreement into a practical tool—one that we
can use even in large-scale settings,”
says Birman, a professor of computer
science at Cornell University.
Inspired by Lamport and Liskov’s
foundational work, Hebrew University’s Dolev has been working on an approach involving polynomial solutions