However, all is not lost! You can employ testing methods that greatly increase your confidence that the systems you build are correct. While these methods do not provide the gold star of verified provable correctness, they do provide a silver star of "seems pretty legit."

Monitoring is often cited as a means for verifying and testing distributed systems. Monitoring includes metrics, logs, distributed tracing systems such as Dapper12 and Zipkin,2 and alerts. While monitoring the system and detecting errors is an important part of running any successful service, and necessary for debugging failures, it is a wholly reactive approach for validating distributed systems: bugs can be found only once the code has made it into production and is affecting customers. All of these tools provide visibility into what your system is currently doing versus what it has done in the past. Monitoring allows you only to observe, and it should not be the sole means of verifying a distributed system.

Canarying new code is an increasingly popular way of "verifying" that the code works. It uses a deployment pattern in which new code is gradually introduced into production clusters. Instead of replacing all the nodes in the service with the new code, a few nodes are upgraded to the new version. The metrics and/or output from the canary nodes are compared with those of the nodes running the old version. If they are deemed equivalent or better, more nodes can be upgraded to the canary version. If the canary nodes behave differently or are faulty, they are rolled back to the old version.

Canarying is very powerful and greatly limits the risk of deploying new code to live clusters. It is limited in the guarantees it can provide, however. If a canary test passes, the only guarantee you have is that the canary version performs at least as well as the old version at that moment in time. If the service is not under peak load, or a network partition does not occur during the canary test, then you learn nothing about how the canary performs compared with the old version in those scenarios. Canary tests are most valuable for validating that the new version works as expected in the common case and that no regressions in this path have occurred in the service, but canarying is not sufficient for validating the system's correctness, fault tolerance, and redundancy.
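As a minimal sketch of the promote-or-roll-back step in the canary pattern described above, assuming a single error-rate metric and an arbitrary 10% tolerance (both illustrative choices, not part of any standard), the comparison might look like this in Python:

    # Illustrative canary check: compare a canary node's error rate with
    # the average error rate of baseline nodes still running the old
    # version. The metric choice and the 10% tolerance are assumptions of
    # this sketch; it also assumes at least one baseline node.
    def canary_is_healthy(canary_error_rate, baseline_error_rates,
                          tolerance=1.10):
        baseline = sum(baseline_error_rates) / len(baseline_error_rates)
        return canary_error_rate <= baseline * tolerance

    def canary_decision(canary_error_rate, baseline_error_rates):
        if canary_is_healthy(canary_error_rate, baseline_error_rates):
            return "promote"   # upgrade more nodes to the canary version
        return "rollback"      # return the canary nodes to the old version

Real canary analysis compares many metrics over a window of time; the sketch illustrates only that the decision is relative to the old version's observed behavior during the test, which is why a quiet canary window says nothing about peak load or partitions.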
Unit and integration tests. Engineers
have long included unit and integration tests in their testing repertoires.
Often, however, these tests are skipped
or not focused on in distributed systems because of the commonly held
beliefs that failures are difficult to produce offline and that creating a production-like environment for testing is
complicated and expensive.
In a 2014 study, Yuan et al.15 argue that this conventional wisdom is untrue. Notably, the study shows that:
• Three or fewer nodes are sufficient to reproduce most failures;
• Testing error-handling code could have prevented the majority of catastrophic failures; and
• Incorrect error handling of nonfatal errors is the cause of most catastrophic failures.
Unit tests can use mock-ups to exercise intrasystem dependencies and verify the interactions of various components. In addition, integration tests can reuse these same tests without the mock-ups to verify that they run correctly in an actual environment.
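As a minimal sketch in Python, assuming a hypothetical Coordinator component and its replicate call (stand-ins, not from any particular system), a unit test can mock the peer node and verify the interaction:

    from unittest import mock

    # Hypothetical component under test: a coordinator that applies a
    # write locally, then replicates it to a peer node.
    class Coordinator:
        def __init__(self, peer):
            self.peer = peer
            self.store = {}

        def put(self, key, value):
            self.store[key] = value
            self.peer.replicate(key, value)  # intrasystem dependency

    def test_put_replicates_to_peer():
        peer = mock.Mock()  # mock-up standing in for a real peer node
        Coordinator(peer).put("k", "v")
        # Verify the interaction with the dependency, not just the result.
        peer.replicate.assert_called_once_with("k", "v")

An integration test can then construct the same Coordinator with a client for a real peer process in place of the Mock, reusing the scenario in an actual environment.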
At a bare minimum, employ unit and integration tests that focus on error handling, unreachable nodes, configuration changes, and cluster membership changes. Yuan et al. argue that this testing can be done at low cost and that it greatly improves the reliability of a distributed system.
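Continuing the hypothetical sketch above, a mock's side effect makes an unreachable node cheap to simulate offline, so the error-handling path itself is exercised:

    from unittest import mock

    import pytest  # assuming pytest as the test runner

    def test_put_with_unreachable_peer():
        peer = mock.Mock()
        peer.replicate.side_effect = ConnectionError("peer unreachable")
        # The coordinator should surface the failure rather than silently
        # dropping the write; here we assert that it propagates the error.
        with pytest.raises(ConnectionError):
            Coordinator(peer).put("k", "v")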
Random model checkers. Libraries
such as QuickCheck9 aim to provide
property-based testing. QuickCheck allows users to specify properties about a
program or system. It then generates a
configurable amount of random input
and tests the system against that input.
If the properties hold for all inputs,
the system passes the test; otherwise,
a counterexample is returned. While
QuickCheck cannot declare a system
provably correct, it helps increase
confidence that a system is correct by
exploring a large portion of its state
space. QuickCheck is not designed explicitly for testing distributed systems,
but it can be used to generate input
into distributed systems, as shown by
Basho, which used it to discover and fix
bugs in its distributed database, Riak.13
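QuickCheck itself originated in Haskell, with ports to other languages. As an illustrative sketch of the same property-based style, here is a property written with Hypothesis, a QuickCheck-inspired Python library; the merge function is a hypothetical last-writer-wins resolver, not Riak's:

    from hypothesis import given, strategies as st

    def merge(a, b):
        # Hypothetical last-writer-wins merge of two replicas' key/value
        # maps: entries in b win over entries in a.
        merged = dict(a)
        merged.update(b)
        return merged

    # Property: applying the same update twice is the same as applying it
    # once (idempotence). Hypothesis generates random inputs to test it.
    @given(st.dictionaries(st.text(), st.integers()),
           st.dictionaries(st.text(), st.integers()))
    def test_merge_is_idempotent(a, b):
        assert merge(merge(a, b), b) == merge(a, b)

A failing property would cause Hypothesis, like QuickCheck, to report a shrunk counterexample rather than a proof of correctness.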
Fault-injection testing causes or
introduces a fault in the system. In a