to the general Byzantine agreement
problem. While his early work in this
area 25 years ago seemed largely theoretical, he is now finding practical applications for these approaches on the
Web. “My theoretical work was ignited
by Leslie [Lamport],” he says. “
Barbara’s work brought me to look again at
the practicality of the solutions.”
At Microsoft, researcher Rama
Kotla has proposed a new BFT replication protocol known as Zyzzyva, which
uses a technique called speculation to
achieve low performance overhead.
Kotla is also exploring a complementary technique called high throughput
BFT that exploits parallelism to improve the performance of a replicated
application.
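The speculation idea can be sketched roughly as follows; this is an illustrative Python fragment under assumed interfaces (the replica objects and the slow-path fallback are invented for the example), not Zyzzyva's actual code. Each replica executes a request immediately and replies, and the client accepts the result on the fast path only when every reply agrees; any disagreement forces a fallback to a slower, fully coordinated commit.

F = 1                      # number of Byzantine faults to tolerate
N = 3 * F + 1              # replicas a BFT protocol needs for F faults

def speculative_call(replicas, request, slow_path_commit):
    # Replicas execute the request right away, before agreeing on its order.
    replies = [r.execute_speculatively(request) for r in replicas]
    # Fast path: accept the result only if all 3f+1 replies match.
    if len(replies) == N and len(set(replies)) == 1:
        return replies[0]
    # Disagreement or missing replies: pay for the slower commit protocol.
    return slow_path_commit(request)

The point of speculation is that the expensive coordination happens only when something has actually gone wrong, which is what keeps the common-case overhead low.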
Also at Microsoft, director Chandu
Thekkath has been pioneering an alternative approach to fault tolerance
for Microsoft’s Live Services, creating a
single “configuration master” to coordinate recovery from machine failures
across multiple data services. The concept of a configuration master also underlies the design of several other leading services in the live services market,
such as Google’s Chubby lock server.
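In skeletal form, and with invented names rather than anything from Microsoft's or Google's systems, a configuration master is a single authority that records which machine holds which role and hands a failed machine's roles to a spare:

class ConfigurationMaster:
    def __init__(self, machines):
        self.assignments = {}          # service name -> machine serving it
        self.spares = list(machines)   # machines not currently assigned

    def assign(self, service):
        self.assignments[service] = self.spares.pop()

    def report_failure(self, failed_machine):
        # Move every service that was on the failed machine to a spare.
        for service, machine in list(self.assignments.items()):
            if machine == failed_machine and self.spares:
                self.assignments[service] = self.spares.pop()

Because recovery decisions flow through one place, the individual services never have to agree among themselves about who has failed and who takes over.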
Lorenzo Alvisi, a professor of computer science at the University of
Texas at Austin, and colleagues are
probing the possibilities of applying
game theory techniques to fault tolerance problems, while Ittai Abraham, a
professor of computer science at The
Hebrew University of Jerusalem, and
colleagues are incorporating security
methods into distributed protocols to
punish rogue participants and deter
any colluding group of them from deviating from the protocol.
While these efforts are opening new
research frontiers, they remain squarely rooted in the pioneering work on
Byzantine fault tolerance that started
more than three decades ago. Indeed,
many developers are just beginning to
encounter this foundational research. “Engineers are starting to discover and use these algorithms instead of writing code by the
seat of their pants,” says Lamport.
Many developers still wrestle with
the cost and performance trade-offs of
fault tolerance, however, and a number
of large sites still seem willing to accept
a certain degree of system failure as a
cost of doing business on the Web.
“The reliability of a system increases with an increasing number of tolerated failures,” says Kotla, “but it also
increases the cost of the system.” He
suggests that developers look for ways
to balance costs against the need to
achieve reliability in terms of mean
time to failure, mean time to detect failures, and mean time to recover faulty
replicas. “We need more research work
in understanding and modeling faults
in various settings to help system designers choose the right parameters,”
Kotla says.
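To make the trade-off concrete, here is a back-of-the-envelope Python sketch with purely illustrative numbers: tolerating more Byzantine faults multiplies the machines (and cost) a replicated service needs, while shortening detection and recovery times raises availability without adding hardware.

def replicas_needed(f):
    # Byzantine fault tolerance requires 3f + 1 replicas to tolerate f faults.
    return 3 * f + 1

def availability(mttf_hours, mttr_hours):
    # Steady-state availability from mean time to failure and mean time to recover.
    return mttf_hours / (mttf_hours + mttr_hours)

for f in (1, 2, 3):
    print(f, "fault(s) tolerated ->", replicas_needed(f), "replicas")

print(availability(mttf_hours=1000, mttr_hours=1))    # ~0.999
print(availability(mttf_hours=1000, mttr_hours=0.1))  # ~0.9999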
Further complicating matters is
the rise of mobile devices that are only
sporadically connected to the Internet.
As people entrust more and more of
their personal data to these devices—
like financial transactions, messaging,
and other sensitive information—the
challenge of keeping all that data in
sync across multiple platforms will
continue to escalate. And the problem
of distributed fault tolerance will only
grow more, well, Byzantine.
“The Web is going live,” says Birman, who believes that the coming
convergence of sensors, simulators,
and mobile devices will drive the need
for increasingly reliable data replication. “This is going to change the picture for replication, creating a demand
from average users.” When that happens, we may just see fault tolerance
coming out of the clouds and back
down to earth.
Alex Wright is a writer and information architect who
lives and works in New York City.
© 2009 ACM 0001-0782/09/0700 $10.00
Cloud Computing
Cloning Smartphones
A pair of scientists at Intel
Research Berkeley have
developed CloneCloud, which
creates an identical clone of an
individual’s smartphone that
resides in a cloud-computing
environment.
Created by Intel researchers
Byung-Gon Chun and Petros
Maniatis, CloneCloud uses
a smartphone’s Internet
connection to communicate
with the phone’s online copy,
which contains its data and
applications, up to several
gigabits in size, in the cloud.
CloneCloud would make
smartphones significantly faster
and more powerful, enabling
them to perform processor-
heavy tasks in the cloud. For
example, Chun and Maniatis’s
CloneCloud prototype, running
on Google’s Android mobile
operating system, conducted a
test application involving the
facial recognition of photos.
Running the application on the
Android smartphone took 100
seconds; the phone’s clone,
operating on a desktop computer
in the cloud, completed the task
in one second.
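The mechanism behind that speedup can be sketched in a few lines of Python; the function and attribute names here are assumptions for illustration, not the researchers’ actual interfaces. The phone runs lightweight tasks itself and ships anything processor-heavy, such as face recognition over a photo album, to its clone in the cloud, then waits for the result.

def run_task(task, clone, cost_threshold_seconds=1.0):
    # Cheap work stays on the handset to avoid the network round trip.
    if task.estimated_cpu_seconds < cost_threshold_seconds:
        return task.run_locally()
    # Heavy work goes to the cloud-hosted clone, which has far more CPU.
    return clone.execute(task)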
According to the researchers,
CloneCloud would also
provide improved smartphone
security, with virus scans of
a device’s entire file system
being conducted in the cloud.
Moreover, CloneCloud would
improve a smartphone’s battery
life by having cloud-based
computers handle the most
processor-intensive tasks.
The CloneCloud research
could help with intelligently
allocating tasks to the most
energy-efficient or fastest
processor in a cloud-computing
environment. “There will be a
family of heterogeneous devices,
and you would like to move the
computing job to the one that
makes most sense; from that
standpoint, it is a great idea,”
said Allan Knies, associate
director of Intel Research
Berkeley, in an interview
with Technology Review.
The CloneCloud approach
could also help create a
computing environment that
would make it easier to share
data between mobile devices
and home-based computers.