Building resilient systems does not happen at a single point in time; it is an ongoing process of discovering weaknesses and dealing with them in an iterative learning cycle. Deep visibility is key to understanding how the system operates and in which ways it fails. Few root-cause investigations would succeed without metrics and insight into the operation of the system and its components. Monitoring provides a deep understanding of how the system behaves, especially when it fails, and makes it possible to discover weaknesses and identify anti-patterns for resilience.
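As a sketch of the kind of per-component signal such monitoring rests on, the following minimal Python class tracks request and error counters and derives an error rate; the names and structure are illustrative assumptions, not Netflix's actual tooling, and a real system would export these to a time-series backend.

```python
from collections import defaultdict

class Metrics:
    """Minimal in-process metrics store (illustrative only).

    A production system would ship these counters to a time-series
    backend rather than keep them in memory.
    """

    def __init__(self):
        self.counters = defaultdict(int)

    def incr(self, name, value=1):
        """Increment a named counter, e.g. 'api.requests' or 'api.errors'."""
        self.counters[name] += value

    def error_rate(self, component):
        """Fraction of failed requests for a component: a basic failure
        signal a root-cause investigation can start from."""
        total = self.counters[f"{component}.requests"]
        errors = self.counters[f"{component}.errors"]
        return errors / total if total else 0.0
```

Even a signal this simple, recorded per component, lets an investigation ask where failure is concentrated rather than guessing.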
One of the most important first questions to ask during a customer-impacting event is, "What changed?" Another key aspect of monitoring and observability, therefore, is the ability to record changes to the state of the system. Whether it is a new code deployment, a change in runtime configuration, or a state change in an externally used service, the change must be recorded for easy retrieval later. Netflix built a system, known internally as Chronos, for this purpose. Any event that changes the state of the system is recorded in Chronos and can be quickly queried to aid in causality attribution.
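Chronos is internal to Netflix and its API is not public; the following minimal Python sketch only illustrates the underlying idea of recording change events and querying them by time window.

```python
import time
from dataclasses import dataclass, field

@dataclass
class ChangeEvent:
    """One change to system state: a deploy, a config flip, a dependency change."""
    source: str          # e.g. "deploy", "runtime-config", "external-service"
    description: str
    timestamp: float = field(default_factory=time.time)

class ChangeLog:
    """In-memory change log, queryable by time window for causality attribution."""

    def __init__(self):
        self._events = []

    def record(self, event):
        self._events.append(event)

    def events_between(self, start, end):
        """Answer 'what changed?' in a window around an incident."""
        return [e for e in self._events if start <= e.timestamp <= end]
```

During an incident at time t, querying events_between(t - 3600, t) surfaces every recorded change in the preceding hour, turning "what changed?" from guesswork into a lookup.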
The Antifragile Organization
Resilience to failure is a lofty goal: it enables a system to survive and withstand failure. There is an even higher peak to strive for, however: making the system stronger and better with each failure. In Nassim Taleb's parlance, it can become antifragile, growing stronger from each successive stressor, disturbance, and failure.[8]
Netflix has taken the following steps to create a more antifragile system and organization:
1. Every engineer is an operator of the service. This is sometimes referred to in jest as "no ops," though it is really more "distributed ops." Separating development and operations creates a division of responsibilities that can lead to a number of challenges, including negative externalities and misaligned incentives. Negative externalities arise when operators feel the pain of problems that developers introduce. Misaligned incentives result from operators wanting stability while developers desire velocity. The DevOps movement was started in response to this divide. Instead of separating development and operations, developers should operate their own services: they deploy their code to production, and they are the ones awakened in the middle of the night if any part of it breaks and impacts customers. By combining development and operations, each engineer can respond to failure by altering the service to be more resilient and fault tolerant in the face of future failures.
Conclusion
The more frequently failure occurs, the more prepared the system and organization become to deal with it in a transparent and predictable manner. Inducing failure is the best way of ensuring both system and organizational resilience. The goal is to maximize availability, insulating users of a service from failure and delivering a consistent and available user experience. Resilience can be improved by increasing the frequency and variety of failure and evolving the system to deal better with each newfound failure, thereby increasing antifragility. Focusing on learning and fostering a blameless culture are essential organizational elements in creating proper feedback in the system.
Related articles
on queue.acm.org
Automating Software Failure Reporting
Brendan Murphy
http://queue.acm.org/detail.cfm?id=1036498
Keeping Bits Safe: How Hard Can It Be?
David S. H. Rosenthal
http://queue.acm.org/detail.cfm?id=1866298
Monitoring, at Your Service
Bill Hoffman
http://queue.acm.org/detail.cfm?id=1113335
References
1. ACM. Resilience engineering: learning to embrace failure. Commun. ACM 55, 11 (Nov. 2012), 40–47; http://dx.doi.org/10.1145/2366316.2366331.
2. Bennett, C. Edda—learn the stories of your cloud deployments. The Netflix Tech Blog; http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html.
3. Bennett, C. and Tseitlin, A. Chaos Monkey released into the wild. The Netflix Tech Blog; http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html.
4. Chandra, T.D., Griesemer, R. and Redstone, J. Paxos made live: an engineering perspective. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing (2007), 398–407; http://labs.google.com/papers/paxos_made_live.pdf.
5. Izrailevsky, Y. and Tseitlin, A. The Netflix Simian Army. The Netflix Tech Blog; http://techblog.netflix.com/2011/07/netflix-simian-army.html.
6. Sondow, J. Asgard: Web-based cloud management and deployment. The Netflix Tech Blog; http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html.
7. Strigini, L. Fault tolerance and resilience: meanings, measures and assessment. Centre for Software Reliability, City University London, U.K., 2009; http://www.csr.city.ac.uk/projects/amber/resilienceftmeasurementv06.pdf.
8. Taleb, N. Antifragile: Things That Gain from Disorder. Random House, 2012.
Ariel Tseitlin is director of cloud solutions at Netflix, where he manages the Netflix cloud and is responsible for cloud tooling, monitoring, performance and scalability, and cloud operations and reliability engineering. He is also interested in resilience and highly available distributed systems. Prior to joining Netflix, he was most recently VP of technology and products at Sungevity, and before that was the founder and CEO of CTO Works.