to the general Byzantine agreement problem. While his early work in this area 25 years ago seemed largely theoretical, he is now finding practical applications for these approaches on the Web. “My theoretical work was ignited by Leslie [Lamport],” he says. “ Barbara’s work brought me to look again at the practicality of the solutions.”
At Microsoft, researcher Rama Kotla has proposed a new BFT replication protocol known as Zyzzyva, that strives to improve performance by using a technique called speculation to achieve low performance overheads. Kotla is also exploring a complementary technique called high throughput BFT that exploits parallelism to improve the performance of a replicated application.
Also at Microsoft, director Chandu Thekkath has been pioneering an alternative approach to fault tolerance for Microsoft’s Live Services, creating a single “configuration master” to coordinate recovery from machine failures across multiple data services. The concept of a configuration master also underlies the design of several other leading services in the live services market, such as Google’s Chubby lock server.
Lorezo Alvisi, a professor of computer science at the University of Texas at Austin, and colleagues are probing the possibilities of applying game theory techniques to fault tolerance problems, while Ittai Abraham, a professor of computer science at The Hebrew University of Jerusalem, and colleagues are incorporating security methods into distributed protocols to punish rogue participants and deter against the deviation of any collusion among them.
While these efforts are opening new research frontiers, they remain squarely rooted in the pioneering work on Byzantine fault tolerance that started more than three decades ago. Indeed, many developers are just beginning to encounter this foundational research for the first time. “Engineers are starting to discover and use these algorithms instead of writing code by the seat of their pants,” says Lamport.
Many developers still wrestle with the cost and performance trade-offs of fault tolerance, however, and a number of large sites still seem willing to accept a certain degree of system failure as a
cost of doing business on the Web.
“The reliability of a system increases with increasing number of tolerated failures,” says Kotla, “but it also increases the cost of the system.” He suggests that developers look for ways to balance costs against the need to achieve reliability in terms of mean time to failure, mean time to detect failures, and mean time to recover faulty replicas. “We need more research work in understanding and modeling faults in various settings to help system designers choose the right parameters,” Kotla says.
Further complicating matters is the rise of mobile devices that are only sporadically connected to the Internet. As people entrust more and more of their personal data to these devices— like financial transactions, messaging, and other sensitive information—the challenge of keeping all that data in sync across multiple platforms will continue to escalate. And the problem of distributed fault tolerance will only grow more, well, Byzantine.
“The Web is going live,” says Birman, who believes that the coming convergence of sensors, simulators, and mobile devices will drive the need for increasingly reliable data replication. “This is going to change the picture for replication, creating a demand from average users.” When that happens, we may just see fault tolerance coming out of the clouds and back down to earth.
Alex Wright is a writer and information architect who lives and works in New York City.
© 2009 ACM 0001-0782/09/0700 $10.00
a pair of scientists at intel research Berkeley have developed CloneCloud, which creates an identical clone of an individual’s smartphone that resides in a cloud-computing environment.
Created by intel researchers
Byung-Gon Chun and petros
Maniatis, CloneCloud uses
a smartphone’s internet
connection to communicate
with the phone’s online copy,
which contains its data and
applications, up to several
gigabits in size, in the cloud.
CloneCloud would make
smartphones significantly faster
and more powerful, enabling
them to perform processor-
heavy tasks in the cloud. For
example, Chun and Maniatis’s
CloneCloud prototype, running
on Google’s android mobile
operating system, conducted a
test application involving the
facial recognition of photos.
running the application on the
android smartphone took 100
seconds; the phone’s clone,
operating on a desktop computer
in the cloud, completed the task
in one second.
according to the researchers,
CloneCloud would also
provide improved smartphone
security, with virus scans of
a device’s entire file system
being conducted in the cloud.
Moreover, CloneCloud would
improve a smartphone’s battery
life by having cloud-based
computers handle the most
processor-intensive tasks.
the CloneCloud research
could help with intelligently
allocating tasks to the most
energy-efficient or fastest
processor in a cloud-computing
environment. “there will be a
family of heterogeneous devices,
and you would like to move the
computing job to the one that
makes most sense; from that
standpoint, it is a great idea,”
said allan knies, associate
director of intel research
Berkeley, in an interview
with Technology Review.
the CloneCloud approach
could also help create a
computing environment that
would make it easier to share
data between mobile devices
and home-based computers.
References:
Archives