News
Science | DOI: 10.1145/1538788.1538794
Alex Wright
Contemporary Approaches to Fault Tolerance
Thanks to computer scientists like Barbara Liskov, researchers are making
major progress with cost-efficient fault tolerance for Web-based systems.
As more and more data moves
into the cloud, many developers find themselves
grappling with the prospect of system failure at
ever-widening scales.
Figure source: “The Byzantine Generals Problem,” ACM TOPLAS Vol. 4, Issue 3 (July 1982), DOI: 10.1145/357172.357176
When distributed systems first
started appearing in the late 1970s and
early 1980s, they typically involved a
small, fixed number of servers running
in a carefully managed environment.
By contrast, today’s Web-based distributed systems often involve thousands
or hundreds of thousands of servers
coming online and going offline at unpredictable intervals, hosting multiple stored
objects, services, and applications that
often cross organizational boundaries
over the Internet.
“In a cloud we have relatively few
sites that are loaded with a huge number of processors,” says Danny Dolev,
a computer science professor at The
Hebrew University of Jerusalem. “Fault
tolerance needs to provide survivability
and security within a cloud and across
clouds.”
In this deeply intertwined environment, software designers have to plan
for a bewildering array of potential
failure points. Building large-scale
fault-tolerant systems inevitably involves trade-offs in terms of cost,
performance, and development time. As Web systems grow, those trade-offs
loom larger and larger. “Fault-tolerant systems have always been difficult
to build,” says University of North Carolina at Chapel Hill computer science
professor Mike Reiter. “Getting a fault-tolerant system to perform as
well as a non-fault-tolerant one is a challenge.”

[Figure: The Byzantine Generals Problem.
Figure 1, Lieutenant 2 as the Traitor: the Commander sends “Attack” to both lieutenants; Lieutenant 2 tells Lieutenant 1, “He said retreat.”
Figure 2, Commander as the Traitor: the Commander sends “Attack” to Lieutenant 1 and “Retreat” to Lieutenant 2; Lieutenant 2 tells Lieutenant 1, “He said retreat.”]
In the Byzantine Generals Problem, as defined by Leslie Lamport, Robert Shostak, and
Marshall Pease in their 1982 paper, a general must communicate his order to attack or retreat
to his lieutenants, but any number of participants, including the general, could be a traitor.
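The two scenarios in the Byzantine Generals figure can be sketched in a few lines of Python. This is an illustrative sketch only, assuming three participants and a single round of unsigned messages. The names `View`, `flip`, and `lieutenant1_view` are hypothetical and do not come from the 1982 paper. It shows why Lieutenant 1 cannot tell which participant is the traitor: what he observes is identical in both cases.

```python
from typing import NamedTuple


class View(NamedTuple):
    """What Lieutenant 1 observes (hypothetical helper type)."""
    from_commander: str   # order received directly from the commander
    relayed_by_peer: str  # what Lieutenant 2 claims the commander said


def flip(order: str) -> str:
    """Return the opposite order."""
    return "retreat" if order == "attack" else "attack"


def lieutenant1_view(to_l1: str, to_l2: str, l2_is_traitor: bool) -> View:
    # Assumption: a loyal Lieutenant 2 relays what he heard,
    # while a traitorous Lieutenant 2 reports the opposite.
    relayed = flip(to_l2) if l2_is_traitor else to_l2
    return View(from_commander=to_l1, relayed_by_peer=relayed)


# Figure 1: loyal commander orders "attack"; Lieutenant 2 is the traitor.
scenario_1 = lieutenant1_view("attack", "attack", l2_is_traitor=True)

# Figure 2: traitorous commander sends conflicting orders; Lieutenant 2 is loyal.
scenario_2 = lieutenant1_view("attack", "retreat", l2_is_traitor=False)

# Lieutenant 1 sees the same pair of messages in both scenarios.
indistinguishable = scenario_1 == scenario_2
print(scenario_1, scenario_2, indistinguishable)
```

Because both views compare equal, any rule Lieutenant 1 applies must produce the same decision in both scenarios, which is the heart of the difficulty the figure illustrates.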
Fortunately, the research community has been making major strides
in this area of late, thanks in part to
the contributions of ACM A.M. Turing
Award winner Barbara Liskov of the Massachusetts Institute of Technology,
whose breakthrough work in applying
Byzantine fault tolerance (BFT) methods to the Internet has helped point
the way to cost-efficient fault tolerance
for Web-based systems.