systems that return sets of objects it is
more difficult to determine what the
correct latest set should be. In most
of these systems where the write set is
smaller than the replica set, a mechanism is in place that applies the updates in a lazy manner to the remaining
nodes in the replica’s set. The period
until all replicas have been updated is
the inconsistency window discussed
before. If W+R <= N, then the system is
vulnerable to reading from nodes that
have not yet received the updates.
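The overlap condition can be sketched in a few lines (the function name and printout are illustrative, not from any real system): with N replicas, a write quorum of W and a read quorum of R are guaranteed to intersect only when W + R > N.

```python
def quorum_guarantees(n, w, r):
    """Return True when every read quorum of size r must overlap
    every write quorum of size w among n replicas, so a read is
    guaranteed to see the latest acknowledged write."""
    return w + r > n

# N=3 with W=2, R=2: quorums overlap; reads see the latest write.
print(quorum_guarantees(3, 2, 2))  # True
# N=3 with W=1, R=1: W + R <= N; stale reads are possible.
print(quorum_guarantees(3, 1, 1))  # False
```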
Whether or not read-your-write, session, and monotonic consistency can
be achieved depends in general on the
“stickiness” of clients to the server that
executes the distributed protocol for
them. If this is the same server every
time, then it is relatively easy to guarantee read-your-writes and monotonic
reads. This makes it slightly more difficult to manage load balancing and
fault tolerance, but it is a simple solution. Using sessions, which are sticky,
makes this explicit and provides an
exposure level that clients can reason about.
Sometimes the client implements
read-your-writes and monotonic reads.
By adding versions on writes, the client
discards reads of values with versions
that precede the last-seen version.
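A minimal sketch of that client-side technique, assuming the store returns a numeric version with each value (class and method names are hypothetical):

```python
class MonotonicClient:
    """Track the highest version this client has seen or written,
    and discard reads whose version precedes it. This yields
    monotonic reads; advancing the floor on writes also gives
    read-your-writes."""

    def __init__(self):
        self.last_seen = 0  # highest version observed so far

    def on_write(self, version):
        # A successful write raises the floor: our own writes
        # must be visible to our subsequent reads.
        self.last_seen = max(self.last_seen, version)

    def on_read(self, version, value):
        # Reject values older than anything already seen or written.
        if version < self.last_seen:
            return None  # stale replica; the client should retry
        self.last_seen = version
        return value
```

For example, after `on_write(5)`, a read that returns version 4 is discarded, while a read at version 6 is accepted and raises the floor to 6.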
Partitions happen when some nodes
in the system cannot reach other nodes,
but both sets are reachable by groups
of clients. If you use a classical majority
quorum approach, then the partition
that has W nodes of the replica set can
continue to take updates while the other partition becomes unavailable. The
same is true for the read set. Given that
these two sets overlap, by definition the
minority set becomes unavailable. Partitions don’t happen frequently, but
they do occur between data centers, as
well as inside data centers.
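The availability consequence for each side of a partition follows directly from the quorum sizes; as a sketch (illustrative names, no real API):

```python
def partition_availability(side_size, n, w, r):
    """For one side of a network partition holding side_size of the
    N replicas, report which quorum operations it can still serve.
    Because overlapping quorums require W + R > N, at most one side
    can keep serving both, so the minority side becomes unavailable
    by construction."""
    return {"writes": side_size >= w, "reads": side_size >= r}

# N=5, W=3, R=3: the 3-node side keeps full service,
# while the 2-node minority side can serve neither.
print(partition_availability(3, 5, 3, 3))  # {'writes': True, 'reads': True}
print(partition_availability(2, 5, 3, 3))  # {'writes': False, 'reads': False}
```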
In some applications the unavailability of any of the partitions is unacceptable, and it is important that the
clients that can reach that partition
make progress. In that case both sides
assign a new set of storage nodes to receive the data, and a merge operation is
executed when the partition heals. For
example, within Amazon the shopping
cart uses such a write-always system; in
the case of partition, a customer can
continue to put items in the cart even if
the original cart lives on the other partitions. The cart application assists the
storage system with merging the carts
once the partition has healed.
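An application-level merge of this kind can be sketched as follows (this is an illustration of the pattern, not Dynamo's or the cart application's actual code): the store hands back every divergent cart version, and the application reconciles them. Taking, say, the maximum quantity per item never loses an "add to cart," although an item deleted on one side of the partition may resurface.

```python
def merge_carts(versions):
    """Reconcile divergent shopping-cart versions after a partition
    heals by keeping, for each item, the largest quantity seen in
    any version. Additions always survive the merge."""
    merged = {}
    for cart in versions:
        for item, qty in cart.items():
            merged[item] = max(merged.get(item, 0), qty)
    return merged

# Two versions produced on opposite sides of a partition:
a = {"book": 1, "lamp": 2}
b = {"book": 2, "pen": 1}
print(merge_carts([a, b]))  # {'book': 2, 'lamp': 2, 'pen': 1}
```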
A system that has brought all of these
properties under explicit control of the
application architecture is Amazon’s
Dynamo, a key-value storage system
that is used internally in many services
that make up the Amazon e-commerce
platform, as well as Amazon Web Services. One of the design goals of Dynamo is to allow the application service
owner who creates an instance of the
Dynamo storage system—which commonly spans multiple data centers—to
make the trade-offs between consistency, durability, availability, and performance at a certain cost point.
Data inconsistency in large-scale reliable distributed systems must be tolerated for two reasons: improving read
and write performance under highly
concurrent conditions; and handling
partition cases where a majority model
would render part of the system unavailable even though the nodes are up.
Whether or not inconsistencies are
acceptable depends on the client application. In all cases the developer must
be aware that consistency guarantees
are provided by the storage systems
and must be taken into account when
developing applications. There are a
number of practical improvements to
the eventual consistency model, such
as session-level consistency and monotonic reads, which provide better tools
for the developer to work with. Many
times the application is capable of handling the eventual consistency guarantees of the storage system without any
problem. A specific popular case is a
Web site in which we can have the notion of user-perceived consistency. In
this scenario the inconsistency window
must be smaller than the time expected
for the customer to return for the next
page load. This allows for updates to
propagate through the system before
the next read is expected.
The goal of this article is to raise
awareness about the complexity of engineering systems that need to operate at a global scale and that require
careful tuning to ensure that they can
deliver the durability, availability, and
performance that their applications
require. One of the tools the system designer has is the length of the consistency window, during which the clients
of the systems are possibly exposed to
the realities of large-scale systems engineering.
1. Brewer, E.A. Towards robust distributed systems (abstract). In Proceedings of the 19th Annual ACM Symposium on Principles of Distributed Computing (July 16–19, 2000, Portland, OR), 7.
2. A conversation with Bruce Lindsay. ACM Queue 2, 8.
3. DeCandia, G. et al. Dynamo: Amazon's highly available key-value store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (Stevenson, WA, Oct. 2007).
4. Gilbert, S. and Lynch, N. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant Web services. ACM SIGACT News 33, 2 (2002).
5. Lindsay, B.G. et al. Notes on distributed databases. Distributed Data Bases. I.W. Draffan and F. Poole, eds. Cambridge University Press, Cambridge, MA, 1980, 247–284. Also available as IBM Research Report RJ2517, San Jose, CA (July 1979).
Werner Vogels is vice president and chief technology officer at Amazon.com, where he is responsible for driving the company's technology vision of continuously enhancing innovation on behalf of Amazon's customers at a global scale.
A previous version of this article appeared in the October 2008 issue of ACM Queue.
© 2009 ACM 0001-0782/09/0100 $5.00