algorithms would be increased if we were to use the more
secure, but more expensive, SHA-256.
For comparison, we run Castro and Liskov's implementation of Practical Byzantine Fault Tolerance (PBFT) [4] and Cowling et al.'s implementation of hybrid quorum (HQ) [6]; we scale up HQ's measured throughput for the small request/response benchmark by 9% to account for their use of SHA-1 rather than MD5. We include published throughput measurements for Q/U [1]; we scale Q/U's reported performance up by 7.5% to account for our use of 3.0 GHz rather than 2.8 GHz machines. We also compare against the measured performance of an unreplicated server.
To stress-test Zyzzyva we use the microbenchmarks devised by Castro and Liskov [4]. In the 0/0 benchmark, clients send null requests and receive null replies. In the 4/0 benchmark, clients send 4KB requests and receive null replies. In the 0/4 benchmark, clients send null requests and receive 4KB replies. In all experiments, we configure all BFT systems to tolerate f = 1 faults; we examine performance for other configurations elsewhere [11].
In the preceding sections, we describe a simplified version of the protocol. In our extended paper [12], we detail a number of optimizations, all implemented in the prototype measured here, that (1) reduce encryption costs by replacing public-key signatures with MACs [4], (2) improve throughput by agreeing on the order of batches of requests [4], (3) reduce the impact of lost messages by caching out-of-order messages, (4) improve read performance by optimizing read-only requests [4], (5) reduce bandwidth by allowing most replicas to send hashes rather than full replies to clients [4], (6) improve the performance of Zyzzyva's two-phase case by using a commit optimization in which replicas use a client hint to initiate and complete the second phase to commit the request before they execute the request and send the response (with the committed history) back to the client, and (7) reduce overheads by including MACs only for a preferred quorum [6]. In the extended paper we also describe Zyzzyva5, a variation of the protocol that requires 5f + 1 agreement replicas but that improves performance in the presence of faulty replicas by completing in three one-way message exchanges as in Figure 1a even when up to f nonprimary replicas are faulty.
In the following experiments, unless noted otherwise, we use all of the optimizations other than preferred quorums for Zyzzyva. PBFT [4] does not implement the preferred quorum optimization, but HQ does [6]. We do not use the read-only optimization for Zyzzyva and PBFT unless we state so explicitly.
4.1. Cost model
Our evaluation focuses on three metrics that BFT replication must optimize to be practical for a broad range of
services: replication cost, throughput, and latency. Before
we dive into experimental evaluation in the following sections, Table 2 puts our results in perspective by providing
a high-level analytic model of Zyzzyva and of several other
recent BFT protocols. The table also shows lower bounds on
BFT state machine replication overheads for each of these metrics.
In the first row of the table body, replication cost refers
to the number of replicas required to construct a system
that tolerates f Byzantine faults. The importance of minimizing this metric for practical services is readily apparent. We show two values: replicas with application state indicates the number of replicas that must both participate
in the coordination protocol and also maintain application
state for executing application requests. Conversely, total
replicas indicates the total number of machines that must
participate in the protocol including, for some protocols,
“witness nodes” that do not maintain application state or
execute application requests. This distinction is important
because witness nodes may be simpler or less expensive
than nodes that must also execute requests to run the replicated service.
Zyzzyva and PBFT (with Yin et al.'s optimization for separating agreement and execution [26]) meet the replication cost lower bounds of 2f + 1 application replicas (so a majority of nodes are correct) [24] and 3f + 1 total replicas (so agreement on request order can be reached) [21].
In the next row of the table body, throughput is determined by the processing overhead per request. Our simple
model focuses on CPU intensive cryptographic operations.
All of the systems we examine use Castro's MAC authenticator construct [4] to avoid using expensive asymmetric cryptography operations.
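A MAC authenticator replaces one public-key signature with a vector of MACs, one per receiver, each computed with the pairwise key the sender shares with that receiver. The sketch below illustrates the idea with HMAC-SHA256 and placeholder keys; the hash choice and all names are ours, not drawn from the measured implementations:

```python
import hashlib
import hmac

# Sketch of a MAC authenticator: the sender appends one MAC per replica,
# each computed with the pairwise shared key. Keys here are placeholders.

def make_authenticator(message: bytes, shared_keys: dict) -> dict:
    """Return a map from replica id to that replica's MAC over the message."""
    return {i: hmac.new(k, message, hashlib.sha256).digest()
            for i, k in shared_keys.items()}

def verify_entry(message: bytes, key: bytes, tag: bytes) -> bool:
    """Replica i checks only its own entry of the authenticator."""
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

# 3f + 1 = 4 replicas for f = 1, with illustrative per-replica keys.
keys = {i: bytes([i + 1]) * 16 for i in range(4)}
auth = make_authenticator(b"request", keys)
assert all(verify_entry(b"request", keys[i], auth[i]) for i in keys)
```

Unlike a signature, an authenticator entry convinces only its intended receiver, which is why protocols built on MACs need extra mechanism when one replica must prove a message's authenticity to another.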
Table 2: Properties of state-of-the-art and optimal Byzantine fault-tolerant (BFT) service replication systems tolerating f faults, using MACs for authentication [4], assuming preferred quorum optimization, and using a batch size of b [4].

System       Total        Replicas with       MAC ops at          NW one-way latencies
             replicas     application state   bottleneck server   on critical path
PBFT         3f + 1       2f + 1 [26]         2 + (8f + 1)/b      4
Q/U          5f + 1       5f + 1              2 + 8f              2
HQ           3f + 1       3f + 1              4 + 4f              4
Zyzzyva      3f + 1       2f + 1              2 + 3f/b            3
Lower bound  3f + 1 [21]  2f + 1 [24]         2 (a)               2 or 3 (b)

Bold entries denote protocols that match known lower bounds or those with the lowest known cost.
(a) It is not clear that this trivial lower bound is achievable.
(b) The distributed systems literature typically considers three one-way latencies to be the lower bound for agreement on client requests [17]; two one-way latencies is achievable if no request contention is assumed.
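The per-request MAC-operation formulas in Table 2 can be evaluated for concrete fault thresholds and batch sizes with a short script; the function name and output formatting below are ours, while the formulas are taken directly from the table:

```python
# Sketch evaluating Table 2's MAC-operation counts at the bottleneck
# server for a fault threshold f and batch size b (formulas from the table).

def mac_ops(protocol: str, f: int, b: int = 1) -> float:
    formulas = {
        "PBFT":    lambda: 2 + (8 * f + 1) / b,
        "Q/U":     lambda: 2 + 8 * f,
        "HQ":      lambda: 4 + 4 * f,
        "Zyzzyva": lambda: 2 + 3 * f / b,
    }
    return formulas[protocol]()

# With f = 1 and no batching (b = 1), Zyzzyva already does the fewest
# MAC operations; batching widens the gap for the b-dependent protocols.
for p in ("PBFT", "Q/U", "HQ", "Zyzzyva"):
    print(p, mac_ops(p, f=1), mac_ops(p, f=1, b=10))
```

Note that batching helps only PBFT and Zyzzyva, whose bottleneck cost has a /b term; Q/U and HQ process each request independently.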