issues were fixed. Code was changed,
better pretests were developed, and
drills gave each member of the SRE
(site reliability engineering) team a
chance to learn the process. Eventually
the overall process was simplified and
became easier to automate. The benefits Stack
Overflow observed included:
˲ Fewer surprises. The more frequent the drills, the smoother the process became.
˲ Reduced risk. The procedure was
more reliable because there were fewer
hidden bugs waiting to bite.
˲ Higher confidence. The company
had more confidence in the process,
which meant the team could now focus
on more important issues.
˲ Bugs fixed faster. The smaller accumulation of infrastructure and code
changes meant each drill tested fewer
changes. Bugs were easier to identify
and faster to fix.
˲ Bugs fixed during business hours.
Instead of finding workarounds
or implementing fixes at odd hours when
engineers were sleepy, the team worked on
bugs during the day, when engineers were available to discuss and implement higher-quality fixes.
˲ Better cross training. Practice
makes perfect. Operations team members all had a turn at doing the process
in an environment where they had help
readily available. No person was a single point of failure.
˲ Improved process documentation and automation. Documentation
improved while the drill was running.
Automation was easier to write because the repetition helped the team
see what could be automated or what
pieces were most worth automating.
˲ New opportunities revealed. The
drills were a big source of inspiration
for big-picture projects that would radically improve operations.
˲ Happier developers. There was less
chance of being woken up at 4 a.m.
˲ Happier operations team. The fear
of failovers was reduced, leading to
less stress. More people trained in the
failover procedure meant less stress
on the people who had previously been
single points of failure.
˲ Better morale. Employees could
schedule long vacations again.
The Monitoring Project
An IT department needed a monitoring system. The number of servers had
grown to the point where situational
awareness was no longer possible by
manual means. The lack of visibility into the company’s own network
meant that outages were often first reported by customers, and often after
the outage had been going on for hours
and sometimes days.
The system administration team
had a big vision for what the new monitoring system would be like. All services and networks would be monitored,
the monitoring system would run on a
pair of big, beefy machines, and when
problems were detected a sophisticated on-call schedule would be used to
determine whom to notify.
Six months into the project they had
no monitoring system. The team was
caught in endless debates over every
design decision: monitoring strategy,
how to monitor certain services, how
the pager rotation would be handled,
and so on. The hardware cost alone
was high enough to require multiple
levels of approval.
Logically the monitoring system
couldn’t be built until the planning
was done, but sadly it looked like the
planning would never end. The more
the plans were discussed, the more
issues were raised that needed to be
discussed. The longer the planning
lasted, the less likely the project would
come to fruition.
Fundamentally they were having
a big-batch problem. They wanted to
build the perfect monitoring system in
one big batch. This is unrealistic.
The team adopted a new strategy:
small batches. Rather than building
the perfect system, they would build a
small system and evolve it.
At each step they would be able to
show it to their co-workers and customers to get feedback. They could validate
assumptions for real, finally putting a
stop to the endless debates the requirements documents were producing. By
monitoring something—anything—
they would learn the reality of what
worked best.
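As an illustration of how small a first batch can be, here is a minimal sketch of a monitor that only checks whether a few TCP ports accept connections. It is not the system the team built; the function names, the target list format, and the "UP"/"DOWN" labels are assumptions made for this example.

```python
import socket


def check_tcp(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def run_checks(targets, checker=check_tcp):
    """Check each (name, host, port) target and return {name: "UP" or "DOWN"}.

    The checker is a parameter so that new kinds of checks can be swapped in
    later without changing this loop (hypothetical design choice).
    """
    return {
        name: "UP" if checker(host, port) else "DOWN"
        for name, host, port in targets
    }
```

Calling `run_checks([("web", "www.example.com", 80)])` with a real host in place of the placeholder returns a dictionary of statuses that can be printed or mailed. Everything else in the big vision, such as paging, schedules, and redundant hardware, can be added in later batches once this much has been shown to co-workers.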
Small systems are more flexible and
malleable; therefore, experiments are
easier. Some experiments would work
well, others would not. Because they
would keep things small and flexible,