36. Matloff, N., Salzman, P.J. The Art of Debugging with
GDB, DDD, and Eclipse. No Starch Press, 2008.
37. Meliou, A., Suciu, D. Tiresias: The database oracle for
how-to queries. Proceedings of the ACM SIGMOD
International Conference on the Management of Data
38. Microsoft Azure Documentation. Introduction to the
fault analysis service, 2016; https://azure.microsoft.
39. Musuvathi, M. et al. CMC: A pragmatic approach to
model checking real code. ACM SIGOPS Operating
Systems Review. In Proceedings of the 5th Symposium
on Operating Systems Design and Implementation 36
40. Musuvathi, M. et al. Finding and reproducing
Heisenbugs in concurrent programs. In Proceedings
of the 8th Usenix Conference on Operating Systems
Design and Implementation (2008), 267–280.
41. Newcombe, C. et al. Use of formal methods at
Amazon Web Services. Technical Report, 2014; http://
42. Olston, C., Reed, B. Inspector Gadget: A framework
for custom monitoring and debugging of distributed
data flows. In Proceedings of the ACM SIGMOD
International Conference on the Management of Data
43. Open Tracing. 2016; http://opentracing.io/.
44. Pasquier, T.F. J.-M., Singh, J., Eyers, D. M., Bacon, J.
CamFlow: Managed data-sharing for cloud services,
45. Patterson, D.A., Gibson, G., Katz, R.H. A case for
redundant arrays of inexpensive disks (RAID). In
Proceedings of the 1988 ACM SIGMOD International
Conference on Management of Data, 109–116;
46. Ramasubramanian, K. et al. Growing a protocol. In
Proceedings of the 9th Usenix Workshop on Hot Topics
in Cloud Computing (2017).
47. Reinhold, E. Rewriting Uber engineering: The
opportunities microservices provide. Uber Engineering,
2016; https://eng.uber.com/building-tincup/.
48. Saltzer, J. H., Reed, D.P., Clark, D.D. End-to-end
arguments in system design. ACM Trans. Computer
Systems 2, 4 (1984): 277–288.
49. Sandberg, R. The Sun network file system: design,
implementation and experience. Technical report, Sun
Microsystems. In Proceedings of the Summer 1986
Usenix Technical Conference and Exhibition.
50. Shkuro, Y. Jaeger: Uber’s distributed tracing system.
Uber Engineering, 2017; https://uber.github.io/jaeger/.
51. Sigelman, B.H. et al. Dapper, a large-scale distributed
systems tracing infrastructure. Technical report.
Research at Google, 2010; https://research.google.
52. Shenoy, A. A deep dive into Simoorg: Our open source
failure induction framework. Linkedin Engineering,
53. Yang, J. et al. MODIST: Transparent
model checking of unmodified distributed systems.
In Proceedings of the 6th Usenix Symposium on
Networked Systems Design and Implementation
54. Yu, Y., Manolios, P., Lamport, L. Model checking TLA+
specifications. In Proceedings of the 10th IFIP WG
10.5 Advanced Research Working Conference on
Correct Hardware Design and Verification Methods
55. Zhao, X. et al. Lprof: A non-intrusive request flow
profiler for distributed systems. In Proceedings of the
11th Usenix Conference on Operating Systems Design
and Implementation (2014), 629–644.
Peter Alvaro is an assistant professor of computer
science at the University of California Santa Cruz,
where he leads the Disorderly Labs research group
Severine Tymon is a technical writer who has written
documentation for both internal and external users
of enterprise and open source software, including for
Microsoft, CNET, VMware, and Oracle.
Copyright held by owners/authors.
Publication rights licensed to ACM. $15.00.
comes, 10 then the root cause of the discrepancy would be likely to be near the
“frontier” of the difference.
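The "frontier" idea can be illustrated with a small sketch. This is not the algorithm from the cited differential provenance work; it is a simplified illustration that assumes each lineage graph is a mapping from an event to the events it was derived from. The names `difference_frontier`, `GOOD`, and `BAD`, and the event labels, are hypothetical.

```python
from typing import Dict, Set

# Hypothetical representation: each event maps to the set of events it was derived from.
Graph = Dict[str, Set[str]]


def all_nodes(graph: Graph) -> Set[str]:
    """Collect every event that appears as a key or as an input."""
    nodes = set(graph)
    for inputs in graph.values():
        nodes |= inputs
    return nodes


def difference_frontier(good: Graph, bad: Graph) -> Set[str]:
    """Return events shared by both runs that directly feed an event
    appearing only in the good run (the edge of the difference)."""
    good_nodes = all_nodes(good)
    bad_nodes = all_nodes(bad)
    only_good = good_nodes - bad_nodes
    shared = good_nodes & bad_nodes
    frontier: Set[str] = set()
    for node in only_good:
        for parent in good.get(node, set()):
            if parent in shared:
                frontier.add(parent)
    return frontier


# Assumed example: the bad run never replicates the write, so no ack is produced.
GOOD: Graph = {
    "write_a": {"request"},
    "replicate_b": {"write_a"},
    "ack": {"replicate_b"},
}
BAD: Graph = {
    "write_a": {"request"},
}

if __name__ == "__main__":
    print(difference_frontier(GOOD, BAD))
```

In this assumed example the two runs agree up to `write_a` and diverge immediately afterward, so `write_a` is where an investigation would start.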
A sea change is occurring in the techniques used to determine whether
distributed systems are fault tolerant. The emergence of fault injection
approaches such as Chaos Engineering and Jepsen is a reaction to the erosion
of the availability of expert programmers, formal specifications, and uniform
source code. For all of their promise, these new approaches are crippled by
their reliance on superusers who decide which faults to inject.

To address this critical shortcoming, we propose a way of modeling and
ultimately automating the process carried out by these superusers. The
enabling technologies for this vision are the rapidly improving observability
and fault injection infrastructures that are becoming commonplace in the
industry. While LDFI provides constructive proof that this approach is
possible and profitable, it is only the beginning. Much work remains to be
done in targeting faults at a finer grain, constructing more accurate models
of system redundancy, and providing better explanations to end users of
exactly what went wrong when bugs are identified. The distributed systems
research community is invited to join in exploring this new and promising
domain.
1. Alvaro, P. et al. Automating failure-testing research
at Internet scale. In Proceedings of the 7th ACM
Symposium on Cloud Computing (2016), 17–28.
2. Alvaro, P., Rosen, J., Hellerstein, J. M. Lineage-driven
fault injection. In Proceedings of the ACM SIGMOD
International Conference on Management of Data
3. Andrus, K. Personal communication, 2016.
4. Aniszczyk, C. Distributed systems tracing with Zipkin.
Twitter Engineering; https://blog.twitter.com/2012/
5. Barth, D. Inject failure to make your systems more
reliable. DevOps.com; http://devops.com/2014/06/03/
6. Basiri, A. et al. Chaos Engineering. IEEE Software 33, 3
7. Beyer, B., Jones, C., Petoff, J., Murphy, N. R. Site
Reliability Engineering. O’Reilly, 2016.
8. Birrell, A.D., Nelson, B.J. Implementing remote
procedure calls. ACM Trans. Computer Systems 2, 1
9. Chandra, T. D., Hadzilacos, V., Toueg, S. The weakest
failure detector for solving consensus. J. ACM 43, 4
10. Chen, A. et al. The good, the bad, and the differences:
better network diagnostics with differential
provenance. In Proceedings of the ACM SIGCOMM
Conference (2016), 115–128.
11. Chothia, Z., Liagouris, J., McSherry, F., Roscoe, T.
Explaining outputs in modern data analytics. In
Proceedings of the VLDB Endowment 9, 12 (2016):
12. Chow, M. et al. The Mystery Machine: End-to-end
performance analysis of large-scale Internet services.
In Proceedings of the 11th Usenix Conference on
Operating Systems Design and Implementation
13. Cui, Y., Widom, J., Wiener, J.L. Tracing the lineage of
view data in a warehousing environment. ACM Trans.
Database Systems 25, 2 (2000), 179–227.
14. Dawson, S., Jahanian, F., Mitton, T. ORCHESTRA: A
Fault Injection Environment for Distributed Systems.
In Proceedings of the 26th International Symposium
on Fault-tolerant Computing (1996).
15. Fischer, M.J., Lynch, N.A., Paterson, M.S. Impossibility
of distributed consensus with one faulty process.
J. ACM 32, 2 (1985): 374–382; https://groups.csail.mit.
16. Fisman, D., Kupferman, O., Lustig, Y. On verifying
fault tolerance of distributed protocols. In Tools
and Algorithms for the Construction and Analysis of
Systems, Lecture Notes in Computer Science 4963,
Springer Verlag (2008), 315–331.
17. Gopalani, N., Andrus, K., Schmaus, B. FIT: Failure
injection testing. Netflix Technology Blog; http://
18. Gray, J. Why do computers stop and what can
be done about it? Tandem Technical Report 85.7
tandem/TR-85.7.pdf.
19. Gunawi, H. S. et al. FATE and DESTINI: A framework
for cloud recovery testing. In Proceedings of the 8th
Usenix Conference on Networked Systems Design
and Implementation (2011), 238–252; http://db.cs.
20. Holzmann, G. The SPIN Model Checker: Primer and
Reference Manual. Addison-Wesley Professional, 2003.
21. Honeycomb. 2016; https://honeycomb.io/.
22. Interlandi, M. et al. Titian: Data provenance support in
Spark. In Proceedings of the VLDB Endowment 9, 33
23. Izrailevsky, Y., Tseitlin, A. The Netflix Simian Army.
Netflix Technology Blog; http://techblog.netflix.
24. Jepsen. Distributed systems safety research, 2016;
25. Jones, N. Personal communication, 2016.
26. Kafka 0.8.0. Apache, 2013; https://kafka.apache.
27. Kanawati, G. A., Kanawati, N. A., Abraham, J. A. Ferrari:
A flexible software-based fault and error injection
system. IEEE Trans. Computers 44, 2 (1995): 248–260.
28. Kendall, S.C., Waldo, J., Wollrath, A., Wyant, G. A note
on distributed computing. Technical Report, 1994. Sun
29. Killian, C. E., Anderson, J. W., Jhala, R., Vahdat, A. Life,
death, and the critical transition: Finding liveness
bugs in systems code. Networked Systems Design and
Implementation (2007), 243–256.
30. Kingsbury, K. Call me maybe: Kafka, 2013; http://
31. Kingsbury, K. Personal communication, 2016.
32. Lafeldt, M. The discipline of Chaos Engineering.
Gremlin Inc., 2017; https://blog.gremlininc.com/the-
33. Lampson, B. W. Atomic transactions. In Distributed
Systems—Architecture and Implementation, An
Advanced Course (1980), 246–265; https://link.
34. LightStep. 2016; http://lightstep.com/.
35. Marinescu, P. D., Candea, G. LFI: A practical and
general library-level fault injector. In IEEE/IFIP
International Conference on Dependable Systems and