systems that return sets of objects it is more difficult to determine what the correct latest set should be. In most of these systems where the write set is smaller than the replica set, a mechanism is in place that applies the updates in a lazy manner to the remaining nodes in the replica’s set. The period until all replicas have been updated is the inconsistency window discussed before. If W+R <= N, then the system is vulnerable to reading from nodes that have not yet received the updates.

Whether or not read-your-write, session, and monotonic consistency can be achieved depends in general on the “stickiness” of clients to the server that executes the distributed protocol for them. If this is the same server every time, then it is relatively easy to guarantee read-your-writes and monotonic reads. This makes it slightly more difficult to manage load balancing and fault tolerance, but it is a simple solution. Using sessions, which are sticky, makes this explicit and provides an exposure level that clients can reason about.

Sometimes the client implements read-your-writes and monotonic reads. By adding versions on writes, the client discards reads of values with versions that precede the last-seen version.

Partitions happen when some nodes in the system cannot reach other nodes, but both sets are reachable by groups of clients. If you use a classical majority quorum approach, then the partition that has W nodes of the replica set can continue to take updates while the other partition becomes unavailable. The same is true for the read set. Given that these two sets overlap, by definition the minority set becomes unavailable. Partitions don’t happen frequently, but they do occur between data centers, as

well as inside data centers.

In some applications the unavailability of any of the partitions is unacceptable, and it is important that the clients that can reach that partition make progress. In that case both sides assign a new set of storage nodes to receive the data, and a merge operation is executed when the partition heals. For example, within Amazon the shopping cart uses such a write-always system; in the case of partition, a customer can continue to put items in the cart even if the original cart lives on the other partitions. The cart application assists the storage system with merging the carts once the partition has healed.

amazon’s Dynamo

A system that has brought all of these properties under explicit control of the application architecture is Amazon’s Dynamo, a key-value storage system that is used internally in many services that make up the Amazon e-commerce platform, as well as Amazon’s Web Services. One of the design goals of Dynamo is to allow the application service owner who creates an instance of the Dynamo storage system—which commonly spans multiple data centers—to make the trade-offs between consistency, durability, availability, and performance at a certain cost point.

3

summary

Data inconsistency in large-scale reliable distributed systems must be tolerated for two reasons: improving read and write performance under highly concurrent conditions; and handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running.

Whether or not inconsistencies are acceptable depends on the client application. In all cases the developer must be aware that consistency guarantees are provided by the storage systems and must be taken into account when developing applications. There are a number of practical improvements to the eventual consistency model, such as session-level consistency and monotonic reads, which provide better tools for the developer to work with. Many times the application is capable of handling the eventual consistency guarantees of the storage system without any

problem. A specific popular case is a Web site in which we can have the notion of user-perceived consistency. In this scenario the inconsistency window must be smaller than the time expected for the customer to return for the next page load. This allows for updates to propagate through the system before the next read is expected.

The goal of this article is to raise awareness about the complexity of engineering systems that need to operate at a global scale and that require careful tuning to ensure that they can deliver the durability, availability, and performance that their applications require. One of the tools the system designer has is the length of the consistency window, during which the clients of the systems are possibly exposed to the realities of large-scale systems engineering.

 

References

1. brewer, e.a. towards robust distributed systems (abstract). in Proceedings of the 19th Annual ACM Symposium on Principles of Distributed Computing (July 16–19, 2000, Portland, or), 7.

2. conversation with bruce lindsay. ACM Queue 2, 8 (2004), 22–33.

3. Decandia, g., et. al. Dynamo: amazon’s highly available key-value store. in Proceedings of the 21st ACM Symposium on Operating Systems Principles (stevenson, wa, oct. 2007).

4. gilbert, s. and lynch, n. brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33, 2 (2002).

5. lindsay, b.g. et al. notes on distributed databases. Distributed Data Bases. i.w. Draffan and f. Poole, eds. cambridge university Press, cambridge, ma, 1980, 247–284. also available as ibm research report rJ2517, san Jose, ca (July 1979).

 

Werner Vogels is vice president and chief technology officer at amazon.com, where he is responsible for driving the company’s technology vision of continuously enhancing innovation on behalf of amazon’s customers at a global scale.

a previous version of this article appeared in the october 2008 issue of ACM Queue.

© 2009 acm 0001-0782/09/0100 $5.00

References:

http://Amazon.com

Archives