distributed system a fault could be a dropped message, a network partition, or even the loss of an entire data center. Fault-injection testing forces these failures to occur and allows engineers to observe and measure how the system under test behaves. If failures do not occur, this does not guarantee the system is correct, since the entire state space of failures has not been exercised.
Game days. In October 2014, Stripe uncovered a bug in Redis by running a fault-injection test. The simple test of running kill -9 on the primary node in the Redis cluster resulted in all data in that cluster being lost. Stripe referred to its process of running controlled fault-injection tests in production as game day exercises.4
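The spirit of such a game-day test can be sketched in a few lines of Python. The Cluster class below is a hypothetical stand-in for a real Redis deployment, not Stripe's actual setup: it models the failure mode where killing the primary discards its state and a failover promotes whatever the replica happened to have.

```python
# Hypothetical game-day sketch: write data, "kill -9" the primary,
# fail over, and verify the acknowledged data survived. A real game
# day would do this against live processes, not an in-memory toy.
class Cluster:
    def __init__(self):
        self.primary = {}          # data held by the primary node
        self.replica = {}          # replicated copy on a secondary

    def write(self, key, value):
        self.primary[key] = value
        self.replica[key] = value  # assume replication completed

    def kill_primary(self):
        # Simulates `kill -9`: the primary's in-memory state is gone.
        self.primary = None

    def failover(self):
        # Promote the replica; data loss shows up as missing keys here.
        self.primary, self.replica = self.replica, {}

    def read(self, key):
        return self.primary.get(key)

cluster = Cluster()
cluster.write("charge:42", "captured")
cluster.kill_primary()
cluster.failover()
assert cluster.read("charge:42") == "captured", "data lost on failover!"
```

The assertion at the end is the whole point of the exercise: a bug like the one Stripe found would surface as acknowledged writes that are gone after failover.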
Jepsen. Kyle Kingsbury has written a fault-injection tool called Jepsen3 that simulates network partitions in the system under test. After the simulation, the operations and their results are analyzed to see whether data loss occurred and whether claimed consistency guarantees were upheld. Jepsen has proved to be a valuable tool, uncovering bugs in many popular systems such as MongoDB, Kafka, Elasticsearch, etcd, and Cassandra.
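Jepsen itself is written in Clojure and drives real clusters; the toy Python sketch below only illustrates the shape of such a check, using a hypothetical cluster that acknowledges writes on a majority: perform writes while one node is partitioned away, then verify that every acknowledged write is still readable after the partition heals.

```python
# A toy, Jepsen-inspired check (not the real tool): record which
# writes were acknowledged during a simulated partition, then look
# for acknowledged writes that disappeared after the partition heals.
class Node:
    def __init__(self):
        self.data = {}
        self.reachable = True

class ToyCluster:
    def __init__(self, n=3):
        self.nodes = [Node() for _ in range(n)]

    def partition(self, idx):
        self.nodes[idx].reachable = False

    def heal(self):
        for node in self.nodes:
            node.reachable = True

    def write(self, key, value):
        # Acknowledge only if a majority of nodes accepted the write.
        acked = [n for n in self.nodes if n.reachable]
        if len(acked) <= len(self.nodes) // 2:
            return False
        for node in acked:
            node.data[key] = value
        return True

    def read(self, key):
        # Read from any reachable node that has the key.
        for node in self.nodes:
            if node.reachable and key in node.data:
                return node.data[key]
        return None

cluster = ToyCluster()
cluster.partition(0)                      # isolate one node
acknowledged = {}
for i in range(5):
    if cluster.write(f"k{i}", i):         # majority still up: acked
        acknowledged[f"k{i}"] = i
cluster.heal()
lost = [k for k, v in acknowledged.items() if cluster.read(k) != v]
assert lost == [], f"acknowledged writes lost: {lost}"
```

A real Jepsen analysis checks far stronger properties (for example, linearizability of full operation histories), but the structure is the same: inject the partition, record the history, analyze it afterward.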
Netflix Simian Army. The Netflix Simian Army5 is a suite of fault-injection tools. The original tool, called Chaos Monkey, randomly terminates instances running in production, thereby injecting single-node failures into the system. Latency Monkey injects network lag, which can look like delayed messages or an entire service being unavailable. Finally, Chaos Gorilla simulates an entire Amazon availability zone going down.
As noted in Yuan et al.,15 most of the catastrophic errors in distributed systems were reproducible with three or fewer nodes. This finding demonstrates that fault-injection tests do not need to be executed in production, where they could affect customers, in order to be valuable and to discover bugs.
Once again, passing fault-injection tests does not guarantee a system is correct, but these tests do greatly increase confidence that the system will behave correctly under failure scenarios. As Netflix puts it, “With the ever-growing Netflix Simian Army by our side, constantly testing our resilience to all sorts of failures, we feel much more confident about our ability to deal with the inevitable failures that we’ll encounter in production and to minimize or eliminate their impact to our subscribers.”5
Lineage-driven fault injection. Testing distributed systems, like building them, is an incredibly challenging problem and an area of ongoing research. One example of current research is lineage-driven fault injection, described by Peter Alvaro et al.1 Instead of exhaustively exploring the failure space as a model checker would, a lineage-driven fault injector reasons backward from successful outcomes to the failures that could have prevented them. This greatly reduces the state space of failures that must be tested to prove a system is correct.
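The core idea can be illustrated with a toy hitting-set search, which is a drastic simplification of the technique in the paper: given the sets of events that each independently support a good outcome, look for the smallest fault set that disables every support.

```python
from itertools import combinations

# Toy lineage-driven sketch (a simplification, not Alvaro et al.'s
# actual system): enumerate fault sets in order of size and yield
# those that break every known derivation of the good outcome.
def candidate_fault_sets(supports, max_faults=3):
    events = sorted(set().union(*supports))
    for k in range(1, max_faults + 1):
        for faults in combinations(events, k):
            # Interesting fault sets intersect every support: each
            # way of producing the good outcome loses some event.
            if all(set(faults) & support for support in supports):
                yield set(faults)

# Hypothetical lineage: "data survives" is supported by replicating
# to node b OR to node c; only dropping both replication messages
# can break the outcome, so that pair is the first candidate.
supports = [{"repl_to_b"}, {"repl_to_c"}]
first = next(candidate_fault_sets(supports))
assert first == {"repl_to_b", "repl_to_c"}
```

Rather than injecting every combination of failures blindly, only the candidate fault sets that could actually change a successful outcome need to be tried against the real system.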
Formal methods can be used to verify that a single component is provably correct, but the composition of correct components does not necessarily yield a correct system; additional verification is needed to prove the composition is correct. Formal methods are still valuable and worth the time investment for foundational pieces of infrastructure and fault-tolerant protocols, and they should continue to find greater use outside of academic settings.
Verification in industry generally consists of unit tests, monitoring, and canaries. While this provides some confidence in a system’s correctness, it is not sufficient. More exhaustive unit and integration tests should be written. Tools such as random model checkers should be used to test a larger subset of the state space. In addition, forcing a system to fail via fault injection should be more widely used. Even tests as simple as running kill -9 on a primary node have found catastrophic bugs.
Efficiently testing distributed systems is not a solved problem, but by combining formal verification, model checking, fault injection, unit tests, canaries, and more, you can obtain higher confidence in system correctness.
Acknowledgments. Thank you to
those who provided feedback, including Peter Alvaro, Kyle Kingsbury, Chris
Meiklejohn, Flavio Percoco, Alex Rasmussen, Ines Sombra, Nathan Taylor,
and Alvaro Videla.
1. Alvaro, P., Rosen, J. and Hellerstein, J.M. Lineage-driven fault injection, 2015; http://www.cs.berkeley.
2. Aniszczyk, C. Distributed systems tracing with Zipkin;
3. Aphyr. Jepsen; https://aphyr.com/tags/Jepsen.
4. Hedlund, M. Game day exercises at Stripe: learning from “kill -9”, 2014; https://stripe.com/blog/game-day-exercises-at-stripe.
5. Izrailevsky, Y. and Tseitlin, A. The Netflix Simian Army;
6. Killian, C., Anderson, J. W., Jhala, R., Vahdat, A. Life,
death, and the critical transition: finding liveness bugs
in system code, 2006; http://www.macesystems.org/
7. Lamport, L. and Yu, Y. TLC—the TLA+ model checker,
8. Newcomb, C., Rath, T., Zhang, F., Munteanu, B.,
Brooker, M. and Deardeuff, M. How Amazon Web
Services uses formal methods. 2015. Commun.
ACM 58, 4 (Apr. 2015), 68–73; http://cacm.acm.
9. QuickCheck; https://hackage.haskell.org/package/
10. Sheehy, J. There is no now. ACM Queue 13, 33 (2015);
11. Spin; http://spinroot.com/spin/whatispin.html.
12. Sigelman, B.H., Barroso, L.A., Burrows, M., Stephenson,
P., Plakal, M., Beaver, D., Jaspan, S. and Shanbhag,
C. Dapper, a large-scale distributed systems tracing
infrastructure, 2010; http://research.google.com/pubs/
13. Thompson, A. QuickChecking poolboy for fun and
profit, 2012; http://basho.com/posts/technical/
14. Yang, J. et al. MODIST: transparent model checking of
unmodified distributed systems, 2009; https://www.
15. Yuan, D. et al. Simple testing can prevent most
critical failures: an analysis of production failures in
distributed data-intensive systems, 2014; https://www.
16. Wilcox, J.R. et al. Verdi: A framework for
implementing and formally verifying distributed
systems. In Proceedings of the ACM SIGPLAN
2015 Conference on Programming Language Design
and Implementation: 357–368; https://homes.
Caitie McCaffrey (CaitieM.com; @Caitie) is the tech
lead for observability at Twitter. Prior to that she spent
the majority of her career building services and systems
that power the entertainment industry at 343 Industries,
Microsoft Game Studios, and HBO. She has worked on
several video games including Gears of War 2 and 3 and
Halo 4 and 5.
Copyright held by author.
Publication rights licensed to ACM. $15.00