Doi: 10.1145/1435417.1435432
Building reliable distributed systems at a worldwide scale demands trade-offs between consistency and availability.
at the FouNdatIoN
of Amazon’s cloud computing are
infrastructure services such as Amazon’s s3 (simple
storage service), simpleDB, and EC2 (Elastic Compute
Cloud) that provide the resources for constructing
Internet-scale computing platforms and a great variety
of applications. The requirements placed on these
infrastructure services are very strict; they need to
score high marks in the areas of security, scalability,
availability, performance, and cost-effectiveness, and
they need to meet these requirements while serving
millions of customers around the globe, continuously.
Under the covers these services are massive distributed systems that operate on a worldwide scale. This scale creates additional challenges, because when a system processes trillions and trillions of requests, events that normally have a low probability of occurrence are now guaranteed to happen and must be accounted for upfront in the design and architecture of the system. Given the worldwide scope of these systems, we use replication techniques ubiquitously to guarantee consistent performance and high availability. Although replication brings us closer to our goals, it cannot achieve them in a perfectly
transparent manner; under a number of conditions the customers of these services will be confronted with the consequences of using replication techniques inside the services.
One of the ways in which this manifests itself is in the type of data consistency that is provided, particularly when many widespread distributed systems provide an eventual consistency model in the context of data replication. When designing these large-scale systems at Amazon, we use a set of guiding principles and abstractions related to large-scale data replication and focus on the trade-offs between high availability and data consistency. Here, I present some of the relevant background that has informed our approach to delivering reliable distributed systems that must operate on a global scale. (An earlier version of this article appeared as a posting on the “All Things Distributed” Weblog and was greatly improved with the help of its readers.)
In an ideal world there would be only one consistency model: when an update is made all observers would see that update. The first time this surfaced as difficult to achieve was in the database systems of the late 1970s. The best “period piece” on this topic is “Notes on Distributed Databases” by Bruce Lindsay et al. It lays out the
5
fundamental principles for database replication and discusses a number of techniques that deal with achieving consistency. Many of these techniques try to achieve distribution transparency—that is, to the user of the system it appears as if there is only one system instead of a number of collaborating systems. Many systems during this time took the approach that it was better to fail the complete system than to break this transparency. 2
In the mid-1990s, with the rise of larger Internet systems, these practices were revisited. At that time people began to consider the idea that availability was perhaps the most impor-
References:
Archives