creating catch-22 situations that had to
be resolved in the heat of the moment.
This 10-hour ordeal was the result
of big batches. Because failovers happened rarely, there was an accumulation of infrastructure skew, dependencies, and stale code. There was also an
accumulation of ignorance: new hires
had never experienced the process;
others had fallen out of practice.
To fix this problem, the team decided to do more failovers. The batch size was the accumulation of changes, skew, and staleness that built up between failovers and caused problems during one. Rather
than let the batch size grow and grow,
the team decided to keep it small.
Rather than wait for the next real disaster to exercise the failover process, they would intentionally introduce disasters.
The concept of activating the
failover procedure on a system that was
working perfectly may seem odd, but
it is better to discover bugs and other
problems in a controlled situation
than during an emergency. Discovering
a bug during an emergency at 4 a.m. is
troublesome because those who can fix
it may be unavailable—and if they are
available, they are certainly unhappy
to be awakened. In other words, it is
better to discover a problem on Saturday at 10 a.m. when everyone is awake,
available, and presumably sober.
If schoolchildren can do fire drills
once a month, certainly system administrators can practice failovers a few
times a year. The team began doing
failover drills every two months until
the process was perfected.
Each drill surfaced problems with
code, documentation, and procedures. Each issue was filed as a bug
and was fixed before the next drill. The
next failover took five hours, then two
hours, then eventually the drills could
be done in an hour with zero user-visible downtime.
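That feedback loop is simple to enforce in tooling: every finding becomes a bug, and the next drill does not start until the previous drill's bugs are closed. Here is a minimal sketch of that gate in Python; the tracker and the finding text are hypothetical stand-ins for whatever issue tracker the team actually uses.

    from dataclasses import dataclass, field

    @dataclass
    class DrillTracker:
        open_findings: set[str] = field(default_factory=set)

        def start_drill(self) -> None:
            # Refuse to start a new drill while findings from the last one are open.
            if self.open_findings:
                raise RuntimeError(
                    f"{len(self.open_findings)} finding(s) from the last drill are still open")

        def record(self, finding: str) -> None:
            self.open_findings.add(finding)      # file a bug for this finding

        def resolve(self, finding: str) -> None:
            self.open_findings.discard(finding)  # the bug was fixed

    tracker = DrillTracker()
    tracker.start_drill()
    tracker.record("search service has no standby configuration in Oregon")
    tracker.resolve("search service has no standby configuration in Oregon")
    tracker.start_drill()                        # allowed again: nothing is open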
The process found infrastructure
changes that had not been replicated in
Oregon and code that did not fail over
properly. It identified new services that
had not been engineered for smooth
failover. It discovered a process that
could be done only by one particular
engineer. If he was on vacation or unavailable, the company would be in
trouble. He was a single point of failure.
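Findings such as the unreplicated infrastructure changes lend themselves to automated checking between drills. The following is a minimal sketch, not Stack Overflow's tooling: it assumes each site can export an inventory of what is configured, and the item names and version numbers are invented for illustration.

    def drift(primary: dict[str, str], standby: dict[str, str]) -> list[str]:
        """Return human-readable differences between two site inventories."""
        findings = []
        for item in sorted(primary.keys() | standby.keys()):
            a, b = primary.get(item), standby.get(item)
            if a != b:
                findings.append(f"{item}: NYC={a!r} Oregon={b!r}")
        return findings

    if __name__ == "__main__":
        # Invented inventories: each maps an item (package, DNS record, cron job, ...)
        # to its configured value at that site.
        nyc = {"haproxy": "1.8.25", "elasticsearch": "7.10", "backup-cron": "daily"}
        oregon = {"haproxy": "1.8.19", "elasticsearch": "7.10"}   # skew in the standby site
        for finding in drift(nyc, oregon):
            print(finding)   # each line is a bug to file and fix before the next drill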
Over the course of a year all of these issues were fixed.
The new release schedule required coordination among engineering, marketing, sales, customer support, and other groups. That said, all of these teams loved the transition from an unreliable, mostly every-six-months schedule to a reliable monthly schedule. Soon these teams started initiatives to attempt weekly releases, with
hopes of moving to daily releases. In
the new small-batch world, the following benefits were observed:
˲ Features arrived faster. While in the past a new feature took up to six months to reach production, now it could go from idea to production in a fraction of that time.
˲ Hell month was eliminated. After
hundreds of trouble-free pushes to
beta, pushing to production was easy.
˲ The operations team could focus
on higher-priority projects. The team
was no longer directly involved in software releases other than fixing the automation, which was rare. This freed up
the team for more important projects.
˲ There were fewer impediments to
fixing bugs. The first step in fixing a
bug is to identify which code change
was responsible. Big-batch releases
had hundreds or thousands of changes
to sort through to identify the guilty party. With small batches, it was usually quite obvious where to find the bug; the short sketch after this list makes the difference concrete.
˲ Bugs were fixed in less time.
Fixing a bug in code that was written six
months ago is much more difficult
than fixing a bug in code while it is
still fresh in your mind. Small batches
meant bugs were reported soon after
the code was written, which meant developers could fix them more expertly
in a shorter amount of time.
˲ Developers experienced instant
gratification. Waiting six months to
see the results of your efforts is demoralizing. Seeing your code help people
shortly after it was written is addictive.
˲ Most importantly, the operations
team could finally take long vacations,
the kind that require advance planning
and scheduling, thus giving them a way
to reset and live healthier lives.
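To make the "guilty party" search concrete, here is a minimal sketch in Python. Nothing in it comes from Stack Overflow's tooling: the change identifiers are arbitrary, and is_broken is a hypothetical stand-in for "deploy this change and try to reproduce the bug."

    from typing import Callable, Sequence

    def first_bad_change(changes: Sequence[str],
                         is_broken: Callable[[str], bool]) -> str:
        """Binary-search an ordered batch for the first change that shows the bug.

        Assumes the bug, once introduced, stays present, and that the last
        change in the batch (the release as deployed) is known to be bad.
        """
        lo, hi = 0, len(changes) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if is_broken(changes[mid]):
                hi = mid           # the culprit is at mid or earlier
            else:
                lo = mid + 1       # the culprit is after mid
        return changes[lo]

    # A 1,000-change batch needs about 10 deploy-and-test rounds (log2 of 1,000);
    # a 5-change batch needs at most 3, and the culprit is usually obvious from
    # simply reading the handful of diffs.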
While these technical benefits are
worthwhile, the business benefits are
even more exciting:
˲ Their ability to compete improved. Confidence in the ability to add features and fix bugs led to the company becoming more aggressive about new features and fine-tuning existing ones. Customers noticed.
˲ Fewer missed opportunities. The
sales team had been turning away business because of the company’s inability to strike fast and take advantage of
opportunities as they arrived. Now the company could enter markets it hadn't been able to before.
˲ Enabled a culture of automation
and optimization. Rapid releases removed common excuses not to automate. New automation brought
consistency, repeatability, better error checking, and less manual labor.
Plus, automation could run any time, not just when the operations team was available.
The Failover Process
Stack Overflow’s main website infrastructure is in a datacenter in New York City. If the datacenter fails or needs to be taken down for maintenance, duplicate equipment and software are running in Oregon, in standby mode.
The failover process is complex. Database masters need to be transitioned. Services need to be reconfigured. It takes a long time and requires skills from four different teams. Every time the process happens it fails in new and exciting ways, requiring ad hoc solutions invented by whoever is doing the procedure.
In other words, the failover process is
risky. When Tom was hired at Stack, his
first thought was, “I hope I’m not on call
when we have that kind of emergency.”
Drunk driving is risky, so we avoid
doing it. Failovers are risky, so we
should avoid them, too. Right?
Wrong. There is a difference between behavior and process. Risky behaviors are inherently risky; they cannot be made less risky. Drunk driving
is a risky behavior. It cannot be done
safely, only avoided.
A failover is a risky process. A risky
process can be made less risky by doing
it more often.
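Doing it more often is also easier when the procedure lives in code instead of in people's heads. The following is a minimal sketch of a scripted runbook, not Stack Overflow's actual procedure; every step and function name is a hypothetical illustration.

    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass
    class Runbook:
        steps: list[tuple[str, Callable[[], None]]] = field(default_factory=list)
        failures: list[str] = field(default_factory=list)

        def step(self, name: str):
            # Decorator that registers a named step, in order.
            def register(fn: Callable[[], None]) -> Callable[[], None]:
                self.steps.append((name, fn))
                return fn
            return register

        def run(self) -> None:
            for name, fn in self.steps:
                try:
                    fn()
                    print(f"ok    {name}")
                except Exception as exc:
                    # Record the deviation instead of improvising around it.
                    self.failures.append(f"{name}: {exc}")
                    print(f"FAIL  {name}: {exc}")

    runbook = Runbook()

    @runbook.step("promote Oregon database replicas to master")
    def promote_databases() -> None:
        ...  # hypothetical: invoke the database team's promotion tooling

    @runbook.step("repoint services at the Oregon masters")
    def reconfigure_services() -> None:
        ...  # hypothetical: push new configuration to each dependent service

    @runbook.step("verify the site serves traffic from Oregon")
    def verify_traffic() -> None:
        ...  # hypothetical: run end-to-end health checks against the standby site

    if __name__ == "__main__":
        runbook.run()
        # Anything left in runbook.failures becomes a bug to fix before the next drill.

The specific steps matter less than the fact that a scripted procedure turns "new and exciting" failures into a recorded list that can be fixed before the next run.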
The next time a failover was attempted at Stack Overflow, it took 10
hours. The infrastructure in New York
had diverged from Oregon significantly. Code that was supposed to fail over seamlessly had been tested only in
isolation and failed when used in a real
environment. Unexpected dependencies were discovered, in some cases