The challenge is to find a good system
invariant, one strong enough to ensure no safety properties are violated.
Formal methods help engineers find
strong invariants, so formal methods
can help improve assertions that help
improve the quality of code.
While we would like to verify that
executable code correctly implements the high-level specification or
even generate the code from the specification, we are not aware of any such
tools that can handle distributed systems as large and complex as those
being built at Amazon. We do routinely use conventional static analysis tools, but they are largely limited
to finding “local” issues in the code,
and are unable to verify compliance
with a high-level specification.
We have seen research on using the
TLC model checker to find “edge cases” in the design on which to test the
21 an approach that seems promising. However, Tasiran et al.
hardware design, and we have not yet
tried to apply the method to software.
Alternatives to TLA+
There are many formal specification methods. We evaluated several
and published our findings in Newcombe,
19 listing the requirements
we think are important for a formal
method to be successful in our industry segment. When we found TLA+ met
those requirements, we stopped evaluating methods, as our goal was always
practical engineering rather than an
We find relatively little published literature on using high-level formal specification for verifying the design of
complex distributed systems in industry. The Farsite project6 is complex but
somewhat different from the types of
systems we describe here and apparently never launched commercially.
Abrial1 cited applications in commercial safety-critical control systems,
but they seem less complex than our
problem domain. Lu et al.
post-facto verification of a well-known
algorithm for a fault-tolerant distributed hash table, and Zave22 described
another such algorithm, but we do not
know if these algorithms have been
used in commercial products.
Formal methods are a big success at
AWS, helping us prevent subtle but serious bugs from reaching production,
bugs we would not have found through
any other technique. They have helped
us devise aggressive optimizations to
complex algorithms without sacrificing quality. At the time of this writing,
seven Amazon teams have used TLA+,
all finding value in doing so, and more
Amazon teams are starting to use it.
Using TLA+ will improve both time-to-market and quality of our systems.
Executive management actively encourages teams to write TLA+ specs
for new features and other significant
design changes. In annual planning,
managers now allocate engineering
time to TLA+.
While our results are encouraging, some important caveats remain.
Formal methods deal with models of
systems, not the systems themselves,
so the adage “All models are wrong,
some are useful” applies. The designer must ensure the model captures the
significant aspects of the real system.
Achieving it is a special skill, the acquisition of which requires thoughtful practice. Also, we were solely
concerned with obtaining practical
benefits in our particular problem domain and have not attempted a comprehensive survey. Therefore, mileage
may vary with other tools or in other
1. Abrial, J. Formal methods in industry: Achievements,
problems, future. In Proceedings of the 28th
International Conference on Software Engineering
(Shanghai, China, 2006), 761–768.
2. Amazon.com. Supported Operations in DynamoDB:
Strongly Consistent Reads. System documentation;
3. Barr, J. Amazon S3: The first trillion objects. Amazon
Web Services Blog, June 2012; http://aws.typepad.
4. Barr, J. Amazon S3: Two trillion objects, 1. 1 million
requests per second. Amazon Web Services Blog, Mar.
5. Batson, B. and Lamport, L. High-level specifications:
Lessons from industry. In Formal Methods for
Components and Objects, Lecture Notes in Computer
Science Number 2852, F.S. de Boer, M. Bonsangue,
S. Graf, and W.-P. de Roever, Eds. Springer, 2003,
6. Bolosky, W., Douceur, J., and Howell, J. The Farsite
Project: A retrospective. ACM SIGOPS Operating
Systems Review: Systems Work at Microsoft Research
41, 2 (Apr. 2007), 17–26.
7. Brooker, M. Exploring TLA+ with two-phase commit.
Personal blog, Jan. 2013; http://brooker.co.za/
8. Holloway, C. Michael Why you should read accident
reports. Presented at the Software and Complex
Electronic Hardware Standardization Conference
(Norfolk, VA, July 2005); http://klabs.org/richcontent/
9. Joshi, R., Lamport, L. et al. Checking cache-coherence
protocols with TLA+. Formal Methods in System
Design 22, 2 (Mar, 2003) 125–131.
10. Kudrjavets, G., Nagappan, N., and Ball, T. Assessing
the relationship between software assertions
and code quality: An empirical investigation. In
Proceedings of the 17th International Symposium on
Software Reliability Engineering (Raleigh, NC, Nov.
11. Lamport, L. The TLA Home Page; http://research.
12. Lamport, L. Fast Paxos. Distributed Computing 19, 2
(Oct. 2006), 79–103.
13. Lamport, L. The Wildfire Challenge Problem; http://
14. Lamport, L. Checking a multithreaded algorithm with
+CAL. In Distributed Computing: 20th International
Conference, S. Dolev, Ed. Springer-Verlag, 2006, 11–163.
15. Lamport, L. and Merz, S. Specifying and verifying fault-tolerant systems. In Formal Techniques in Real-Time
and Fault-Tolerant Systems, Lecture Notes in Computer
Science, Number 863, H. Langmaack, W.-P. de Roever,
and J. Vytopil, Eds. Springer-Verlag, Sept. 1994, 41–76.
16. Lamport, L., Sharma, M., Tuttle, M., and Yu, Y.
The Wildfire Challenge Problem. Jan. 2001;
17. Lu, T., Merz, S., and Weidenbach, C. Towards
verification of the Pastry Protocol using TLA+. In
Proceedings of Joint 13th IFIP WG 6. 1 International
Conference and 30th IFIP WG 6. 1 International
Conference Lecture Notes in Computer Science
Volume 6722 (Reykjavik, Iceland, June 6–9). Springer-Verlag, 2011, 244 –258.
18. Newcombe, C. Debugging Designs. Presented at the
14th International Workshop on High-Performance
Transaction Systems (Monterey, CA, Oct. 2011); http://
pdf and associated specifications http://hpts.ws/
19. Newcombe, C. Why Amazon chose TLA+. In
Proceedings of the Fourth International Conference
Lecture Notes in Computer Science Volume 8477, Y.A.
Ameur and K.-D. Schewe, Eds. (Toulouse, France, June
2–6). Springer, 2014, 25–39.
20. Patterson, D., Fox, A. et al. The Berkeley/Stanford
Recovery-Oriented Computing Project. University of
California, Berkeley; http://roc.cs.berkeley.edu/
21. Tasiran, S., Yu, Y., Batson, B., and Kreider, S. Using
formal specifications to monitor and guide simulation:
Verifying the cache coherence engine of the Alpha
21364 microprocessor. In Proceedings of the Third
IEEE International Workshop on Microprocessor Test
and Verification (Austin, TX, June). IEEE Computer
22. Zave, P. Using lightweight modeling to understand
Chord. ACM SIGCOMM Computer Communication
Review 42, 2 (Apr. 2012), 49–57.
Chris Newcombe ( firstname.lastname@example.org) is an
architect at Oracle, Seattle, WA, and was a principal
engineer in the AWS database services group at Amazon.
com, Seattle, WA, when this article was written.
Tim Rath ( email@example.com) is a principal engineer in the
AWS database services group at Amazon.com, Seattle, WA.
Fan Zhang ( firstname.lastname@example.org) is a software
engineer and technical product and program manager at
Cyanogen, Seattle, WA, and was a software engineer for
AWS S3 at Amazon.com, Seattle, WA, when this article
Bogdan Munteanu ( email@example.com) is
a software engineer at Dropbox, and was a software
engineer in the AWS S3 Engines group at Amazon.com,
Seattle, WA, when this article was written.
Marc Brooker ( firstname.lastname@example.org) is a principal
engineer for AWS EC2 at Amazon.com, Seattle, WA.
Michael Deardeuff ( email@example.com) is a
software engineer in the AWS database services group
at Amazon.com, Seattle, WA.
Copyright held by Owners/Authors.
Publication rights licensed to ACM. $15.00