• Datacenters must embrace failure and automatically place services onto the hardware as needed.
• Operations must evolve from manual to automated to autonomous. Humans should set the goals and handle major exceptions; the system does the rest.
• Software must embrace failures with pools of stateless servers and special storage servers designed to cope with the loss of replicas.
• Integrated development, test, and deployment are built deeply into the system, with controlled and automated deployment of software.
References
1. Chaiken, R., Jenkins, B., Larson, P.-A., Ramsey, B., Shakib, D., Weaver, S., Zhou, J. SCOPE: Easy and efficient parallel processing of massive data sets. In Proceedings of ACM VLDB, 2008; http://www.vldb.org/
2. Ghemawat, S., Gobioff, H., Leung, S.-T. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, 2003.
3. Shvachko, K., Kuang, H., Radia, S., Chansler, R. The Hadoop Distributed File System. In Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies, 2010, 1–10.
Pat Helland has been implementing transaction systems,
databases, application platforms, distributed systems,
fault-tolerant systems, and messaging systems since
1978. For recreation, he occasionally writes technical
papers. He currently works at Salesforce.
Ed Harris leads the Infrastructure Compute team at Salesforce.com, building the compute, network, and storage substrate for the world’s most trusted enterprise cloud. Prior to Salesforce, Ed worked in the Bing Infrastructure group at Microsoft on the Cosmos big data platform.
Simon Weaver has been devising and building distributed and autonomous systems for 18 years, solving problems in the fields of big data, lights-out infrastructure, and artificial intelligence. Simon currently works at Salesforce.
Copyright held by authors. Publication rights licensed to ACM.
Microsoft Bing’s Cosmos distributed storage and computation system1 and others leveraging replicated storage, such as the Google File System2 and the Hadoop Distributed File System,3 all work this way.
Large datacenters need to consider
two essential aspects of this approach.
First, software-driven robotic recovery
is essential for high availability. Humans take too long and would be overwhelmed by managing the failures.
Second, the larger the cluster, the more
storage servers are possible. The more
storage servers, the smaller the pieces
smeared around the cluster and the
faster the recovery time.
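To make that arithmetic concrete, here is a minimal back-of-the-envelope model in Python; the 16TB server capacity and 50MB/s per-server rebuild rate are illustrative assumptions, not measurements from any real fleet.

# Illustrative model: re-replication after one storage server fails.
# The failed server's data is smeared across the cluster in small pieces,
# so every surviving server rebuilds a small share in parallel.
def recovery_time_hours(cluster_size, server_capacity_tb, rebuild_mb_per_sec):
    survivors = cluster_size - 1
    share_tb = server_capacity_tb / survivors    # each survivor rebuilds this much
    share_mb = share_tb * 1_000_000              # TB to MB
    return share_mb / rebuild_mb_per_sec / 3600  # seconds to hours

# Same failed server (16 TB) and per-server rebuild rate (50 MB/s):
for n in (10, 100, 1000):
    print(f"{n:5d} servers: {recovery_time_hours(n, 16, 50):7.2f} hours")
# 10 servers take roughly 10 hours; 1,000 servers finish in minutes.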
Development, Test, and
Deployment at Web Scale
Development and testing in large-scale Web environments work best when they share resources with production.
Successfully developing, testing, and
deploying software presents many
challenges. To top it off, you are pretty
much constantly working to keep up
with the changing environment.
The isolation ward. It’s important
to concentrate datacenter resources.
Separating datacenters for development and test from production may
be tempting, but that ends up creating problems. Managing demand and
resources is difficult when they are
separate. It is also hard to ensure that
production is in sync with dev/test and
that you are testing what you will run in production.
Developing and testing using production datacenters means sharing resources. You must separate the resources using containers or VMs (virtual machines). Your customers will be using the same servers as your testers, so you must isolate the production data. In general, development and test personnel must not have access to customer data, to satisfy the security team.
While ensuring safe isolation of
workloads presents many challenges,
the benefits outweigh the hassles. You
don’t need dedicated datacenters for
dev/test, and resources can ebb and flow.
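As a sketch of what that isolation might look like at placement time, assuming a hypothetical Workload type and mount-point layout (this is not an actual Salesforce or Bing API):

# Hypothetical placement check for a shared production/dev-test fleet.
# Test workloads run on the same servers as production, but are never
# granted mounts that would expose customer data.
from dataclasses import dataclass, field

@dataclass
class Workload:
    name: str
    kind: str                                    # "production" or "test"
    requested_mounts: set = field(default_factory=set)

CUSTOMER_DATA_MOUNTS = {"/data/customer", "/data/billing"}  # assumed layout

def may_place(workload: Workload) -> bool:
    """Deny any test workload that asks for customer-data mounts."""
    if workload.kind == "test":
        return not (workload.requested_mounts & CUSTOMER_DATA_MOUNTS)
    return True

assert may_place(Workload("prod-db", "production", {"/data/customer"}))
assert not may_place(Workload("load-test", "test", {"/data/customer"}))
assert may_place(Workload("unit-tests", "test", {"/scratch"}))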
Sign-off, rollout, canaries, and
rollback. The management of software change in a large-scale environment has complexities, too. Formalized approvals, rollout via a secure
path, automated watchdogs looking
for problems, and automated rollback are all essential.
Formal approvals involve release
rules. There will be code review and automated test suites. The official sign-off
typically involves at least two people.
Rollout to production includes code
signing whereby a cryptographic signature is created to verify the integrity of
the software. The software is released
to the needed servers. Sometimes, tens
of thousands of nodes may be receiving
the new version. Each node is told the
code signature, and it verifies that the
correct bits are there. The old version
of the software will be kept side-by-side
with the new one.
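A node-side integrity check might look like the following sketch. A pinned SHA-256 digest stands in for the code signature here; real release pipelines verify an asymmetric signature over the release manifest rather than a bare hash.

# Sketch: a node verifies that the bits it received match the digest the
# rollout system announced before activating the new version.
import hashlib

def verify_release(artifact_path: str, expected_sha256: str) -> bool:
    h = hashlib.sha256()
    with open(artifact_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
            h.update(chunk)
    return h.hexdigest() == expected_sha256

# Only if the digest matches does the node switch to the new version,
# keeping the old one side-by-side for rollback.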
A few servers are assigned to be
canaries that will try the new version before all the others. Like the little birds
taken into underground mines to
check for dangerous gases (the canaries would die of the gas before the miners, who would then skedaddle to safety), in software releases, the automated
robot tries a few servers first, and only
when successful, rolls out more.
Rollback is the automated mechanism to undo the deployment of a new
version. Each server keeps the old version of the software and can pop back
to it when directed.
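Putting canaries and rollback together, the control loop might reduce to something like this sketch. The deploy, healthy, and rollback hooks and the batch-doubling policy are assumptions for illustration, not a description of any particular rollout system.

# Sketch of a staged rollout: try a few canaries first, widen in batches,
# and automatically roll every touched server back on any failure.
def staged_rollout(servers, deploy, healthy, rollback, canary_count=3):
    done = []
    batch, rest = servers[:canary_count], servers[canary_count:]
    while batch:
        for server in batch:
            deploy(server)                     # old version stays on disk
            done.append(server)
        if not all(healthy(s) for s in done):  # automated watchdog check
            for s in done:
                rollback(s)                    # pop back to the old version
            return False
        batch, rest = rest[:len(done) * 2], rest[len(done) * 2:]
    return True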
Running large datacenters requires
some fundamental changes in approach. Everything must be simpler, be
automated, and expect failure.
A simple model for server and service behavior. Web-scale systems run
on three important premises:
• Expect failure. Any component may fail, and the system continues automatically without human intervention.
• Minimal levers. The underlying system provides support for deployment and rollback. When a server is sick, it’s better to kill the whole server than to deal with partial failures (see the sketch after this list).
• Software control. If it can’t be controlled by software, don’t let it happen. Use version control for everything.
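The minimal-levers premise can be read as a watchdog loop like the following sketch, where probe, decommission, and reschedule are hypothetical hooks supplied by the fleet-management system.

# Sketch: treat any sick server as dead. Rather than diagnosing partial
# failures, kill the whole server and let placement move its services.
def watchdog_pass(fleet, probe, decommission, reschedule):
    for server in fleet:
        if not probe(server):     # any failed health probe
            decommission(server)  # kill the whole server...
            reschedule(server)    # ...and re-place its services elsewhere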
Reliability is in the service, not the server.
Embrace failure so it doesn’t embrace you. Running hundreds of thousands of servers requires a different
approach. It requires consistent hardware (servers and network) with minimal variety. Datacenters must have
simple and predictable configurations.