Thanks to computer scientists like Barbara Liskov, researchers are making major progress with cost-efficient fault tolerance for Web-based systems.
As more and more data moves into the cloud, many developers find themselves grappling with the prospect of system failure at ever-widening scales.
“The Byzantine Generals Problem,” ACM TOPLAS, Vol. 4, Issue 3 (July 1982), DOI: 10.1145/357172.357176
When distributed systems first started appearing in the late 1970s and early 1980s, they typically involved a small, fixed number of servers running in a carefully managed environment. By contrast, today’s Web-based distributed systems often involve thousands or hundreds of thousands of servers coming on and offline at unpredictable intervals, hosting multiple stored objects, services, and applications that often cross organizational boundaries over the Internet.
“In a cloud we have relatively few sites that are loaded with a huge number of processors,” says Danny Dolev, a computer science professor at The Hebrew University of Jerusalem. “Fault tolerance needs to provide survivability and security within a cloud and across clouds.”
In this deeply intertwined environment, software designers have to plan for a bewildering array of potential failure points. Building large-scale fault-tolerant systems inevitably involves trade-offs in terms of cost, performance, and development time.
The Byzantine Generals Problem.
Figure 1: Lieutenant 2 as the traitor. The commander sends “Attack” to both lieutenants; Lieutenant 2 tells Lieutenant 1, “He said retreat.”
Figure 2: Commander as the traitor. The commander sends “Attack” to Lieutenant 1 and “Retreat” to Lieutenant 2; Lieutenant 2 tells Lieutenant 1, “He said retreat.”
In the Byzantine Generals Problem, as defined by Leslie Lamport, Robert Shostak, and Marshall Pease in their 1982 paper, a general must communicate his order to attack or retreat to his lieutenants, but any number of participants, including the general, could be a traitor.
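The two scenarios in the Byzantine Generals figure can be sketched in a few lines of Python. This is only an illustration, not code from the 1982 paper: the function name `lieutenant1_view` is hypothetical, and the sketch assumes a traitorous Lieutenant 2 simply reports the opposite of the order he received. It shows why the problem is hard: Lieutenant 1 sees exactly the same messages whether the traitor is Lieutenant 2 or the commander.

```python
# Illustrative sketch only; names and structure are assumptions,
# not taken from the Lamport, Shostak, and Pease paper.
ATTACK, RETREAT = "attack", "retreat"


def lieutenant1_view(order_to_l1, order_to_l2, l2_is_traitor):
    """Return what Lieutenant 1 observes.

    The result is a pair: (order received from the commander,
    the order Lieutenant 2 claims the commander gave him).
    """
    if l2_is_traitor:
        # Assumption: a traitorous Lieutenant 2 reports the opposite order.
        relayed = RETREAT if order_to_l2 == ATTACK else ATTACK
    else:
        # A loyal Lieutenant 2 relays the order he actually received.
        relayed = order_to_l2
    return (order_to_l1, relayed)


# Figure 1: loyal commander orders attack; Lieutenant 2 is the traitor.
view_fig1 = lieutenant1_view(ATTACK, ATTACK, l2_is_traitor=True)

# Figure 2: traitorous commander sends conflicting orders; Lieutenant 2 is loyal.
view_fig2 = lieutenant1_view(ATTACK, RETREAT, l2_is_traitor=False)

# Lieutenant 1 cannot tell the two scenarios apart from these messages alone.
print(view_fig1 == view_fig2)
```

In both cases Lieutenant 1 holds an "attack" order from the commander and a "retreat" report from Lieutenant 2, so these messages by themselves do not reveal who the traitor is.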
As Web systems grow, those trade-offs loom larger and larger. “Fault-tolerant systems have always been difficult to build,” says University of North Carolina at Chapel Hill computer science professor Mike Reiter. “Getting a fault-tolerant system to perform as well as a non-fault-tolerant one is a challenge.”
Fortunately, the research community has been making major strides in this area of late, thanks in part to the contributions of ACM A.M. Turing Award winner Barbara Liskov of the Massachusetts Institute of Technology, whose breakthrough work in applying Byzantine fault tolerance (BFT) methods to the Internet has helped point the way to cost-efficient fault tolerance for Web-based systems.