Technical Perspective
Safeguarding Online Information
Against Failures and Attacks
By Barbara Liskov
The Internet is increasingly the place
where both users and organizations
store their information. Storage is becoming a commodity; for example,
consider the storage offerings by companies such as Google and Amazon.
A key benefit for individual users
of commodity Internet storage is that
they can access their information from
anywhere at any time from any device.
Thus they no longer have to use their
PC to access their files or their email.
Furthermore, online information can
easily be shared with others, giving rise
to new applications based on support
for collaboration.
However, the full benefit of online
storage will be realized only if users
can access their data whenever they
want. Users need storage that is highly
reliable (their data is not lost) and highly available (accessible when needed). They
will not be satisfied with less reliability
or availability than they can get by storing their information locally. Providing
these guarantees requires replication:
by storing copies of information on
multiple computers, it is possible to
prevent loss and provide accessibility.
Replication has been the subject of
research for over 20 years. The details
of the replication protocols depend on
the failure model, and two are in common use. The first is the crash model,
in which either a computer is up and
running as required by the protocol,
or it has crashed and is doing nothing.
The second is the Byzantine model, in
which computers are allowed to fail in
arbitrary ways. The Byzantine model is
more general: in addition to crashes, it
handles failures in which a faulty computer continues to run the protocol
while misbehaving. For example, a machine might indicate it has executed a
command to update information, but
discard the new information.
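To make the distinction concrete, the following minimal sketch (the class names are hypothetical, purely for illustration) contrasts the two models: a crashed replica stops responding entirely, while a Byzantine replica keeps participating in the protocol yet misbehaves, here by acknowledging an update and then discarding the new value, as in the example above.

    class CorrectReplica:
        def __init__(self):
            self.store = {}
        def update(self, key, value):
            self.store[key] = value   # executes the command
            return "ack"              # and reports that it did

    class CrashedReplica:
        def update(self, key, value):
            return None               # crash model: no response at all

    class ByzantineReplica:
        def __init__(self):
            self.store = {}
        def update(self, key, value):
            return "ack"              # claims success but silently discards the new value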
During the 1980s there was a great
deal of research on replication protocols that handle crash failures for two
reasons: crashes were the most common
failures at the time, and it is much
easier to think about crashes than Byzantine failures. This work led to protocols that survive f failures using 2f + 1
replicas, which is the minimum needed in an asynchronous setting like the
Internet. Also, the protocols provide
good performance, close to what an
unreplicated system can provide.
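The replica-count arithmetic behind the 2f + 1 bound can be checked directly. The sketch below is an illustration of the counting argument, not code from any particular protocol: with n = 2f + 1 replicas, majority quorums of f + 1 remain available when f replicas have crashed, and any two such quorums overlap in at least one replica, so every read meets some replica that saw the latest write.

    # Crash-fault quorum arithmetic: n = 2f + 1 replicas, quorums of f + 1.
    def crash_quorums(f):
        n = 2 * f + 1              # total replicas
        q = f + 1                  # majority quorum size
        assert n - f >= q          # a full quorum survives f crashes
        assert 2 * q - n >= 1      # any two quorums share at least one replica
        return n, q

    for f in range(1, 11):
        crash_quorums(f)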
However, these protocols are unable
to handle arbitrary (Byzantine) failures,
which are becoming more common.
One source of Byzantine failures is
software errors; typically these are non-deterministic errors, because deterministic errors are much more likely than
non-deterministic ones to be removed
during testing. The second source, and
one of increasing concern today, is malicious attacks in which an adversary
manages to get control of a computer
and cause it to misbehave. To handle
these problems, researchers have developed replication protocols that provide Byzantine fault tolerance.
Prior to the late 1990s, work on Byzantine-fault-tolerant replication was
only of theoretical interest because
the protocols were so costly or worked
only in a synchronous network. This
changed with the invention of PBFT,
the first Byzantine-fault-tolerant replication protocol that could be used in
practice in an asynchronous network.
PBFT provides state machine replication; that is, it handles arbitrary operations on the service state. It requires the
minimum of 3f + 1 replicas to tolerate f
failures.
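The same counting argument (again just a sketch, not PBFT itself) shows where the extra f replicas go: with n = 3f + 1 and quorums of 2f + 1, any two quorums overlap in at least f + 1 replicas, so even if f of the overlapping replicas are faulty and lying, at least one correct replica is common to both quorums.

    # Byzantine quorum arithmetic: n = 3f + 1 replicas, quorums of 2f + 1.
    def byzantine_quorums(f):
        n = 3 * f + 1              # total replicas
        q = 2 * f + 1              # quorum size
        assert n - f >= q          # a quorum is reachable despite f faulty replicas
        overlap = 2 * q - n        # minimum overlap of any two quorums
        assert overlap >= f + 1    # at least one correct replica in the overlap
        return n, q

    for f in range(1, 11):
        byzantine_quorums(f)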
The development of PBFT led to renewed interest in Byzantine-fault-tolerant replication protocols. Researchers
have investigated a number of research
questions, including:
Level of Consistency. At one extreme
is a replication protocol like PBFT that
appears to behave as if there were just
one copy of the data. But performance
can be improved by providing weaker
guarantees.
Other Approaches. PBFT makes use
of a primary replica to direct the protocol; researchers have invented protocols that avoid the use of a primary
either completely or partially.
Failure Analysis. Replication protocols like PBFT work correctly provided
no more than f replicas are faulty. But if
that bound is exceeded, can any guarantees be made?
Performance, performance, performance. Improved protocols that have
better performance (lower latency, higher throughput) are always of interest.
The work on Zyzzyva presented here
is concerned with the last topic. Zyzzyva
achieves excellent performance when
all replicas are non-faulty. It pays for
this gain in the failure-free case with reduced performance when there are failures. Importantly, its techniques should allow
it to achieve performance that is close
to that of an unreplicated system most
of the time.
Today there are Byzantine-fault-tolerant replication protocols efficient
enough to be deployed in a real setting.
But when might this happen? Here we
can learn from the work on crash replication. Although developed in the
1980s, these protocols weren’t used in
real systems until around 2000. The
reason for this delay was a perception
that the reliability provided by these
approaches wasn’t really needed in
practice. This perception changed as
more critical state was stored online.
The concern about cost has also diminished,
since computers are now much cheaper
and networks are much faster.
I expect that someday there will be
a practical deployment that tolerates
Byzantine failures. The decision to take
this step will depend on the criticality
of the data. At some point incurring the
cost of replication will be preferable to
being liable should the data be lost or
unavailable.
Barbara Liskov (liskov@csail.mit.edu) is the Ford
Professor of Engineering at the MIT Computer Science and
Artificial Intelligence Laboratory, Cambridge, MA.