and its user guide. The latest script and
document are both under Steve’s home
directory, in the bin/ folder, of course.
Zoë writes this down in her notes for
future reference, hoping devoutly that
she will get through this shift without
further alerts. She wonders whether
her tech lead or anyone else will follow
up on the problems uncovered during
the postmortem discussion, or whether future SREs are doomed to repeat
the same painful on-call experience.
Later that day Zoë attends an SRE
onboarding session, where the SRE
team meets with a product development team to talk about taking over
their service. Steve leads the meeting, asking several pointed questions about operational procedures
and current reliability problems with
the service, and asking the developer
team to make several operational and
feature changes before the SRE team
can take it over. Zoë has been to a few
such meetings already, which are led
either by Steve or another senior SRE.
She realizes the questions asked and
the actions assigned to the developers
seem to vary quite a bit, depending on
who is leading the meeting and what
types of product failures the SRE team
has dealt with in the past week.
She wishes vaguely that the team
had more consistent standards and
procedures but doesn’t quite know
how to achieve that goal. Later, she
hears two of the developers joking near
the coffee machine that many of the
questions seemed quite unrelated to
carrying a pager, and they had no idea
where those questions came from. She
wishes product development teams
could understand that SREs do a lot
more than carry pagers. Back at her
desk, however, Zoë finds several urgent
tickets to resolve, so she never follows
up on those thoughts.
Luckily, all the characters and episodes
in this story are fictional. Still, consider
whether any part of the story resembles
any of your real-life experiences. The
solution to this fictional team’s struggles is entirely obvious, and the next
section expands on this solution.
The Importance of Documentation
In the early stages of an SRE team’s
existence, the organization depends
• Reviewing planned system changes.
• Scaling systems sustainably through
mechanisms such as automation.
• Evolving systems by pushing for
changes that improve reliability and
velocity.
• Conducting incident responses
and blameless postmortems.
Once services reach end of life, SREs
decommission them in a predictable
fashion with clear messaging and documentation.
A mature SRE team likely has well-defined bodies of documentation associated with many SRE functions. If
you manage an SRE team or intend to
start one, this article will help you understand the types of documents your
team needs to write and why each type
is needed, allowing you to plan for and
prioritize documentation work along
with other team projects.
An SRE’s Story
Before discussing the nuances of SRE
documentation, let’s examine a night
and day in the life of Zoë, a new SRE.
Zoë is on her second on-call shift as an
SRE for Acme Inc.’s flagship AcmeSale
product. She has been through her induction process as a team member,
where she watched her colleagues while
they were on-call, and she took notes as
well as she could. Now she has the pager.
As luck would have it, the pager
goes off at 2:30 a.m. The alert says
“Ragnarok job flapping,” and Zoë
has no idea what it means. She flips
through her notes and finds the link
to the main dashboard page. Everything looks OK. She does a search on
the Acme intranet to find any document referencing Ragnarok, and after precious minutes go by, she finds
an outdated design document for the
service, which turns out to be a critical
dependency for AcmeSale.
Luckily, the design document links
to a “Ragnarok Ops” page, and that
page has links to a dashboard with
charts that look like they might be useful.
One of the charts displays a traffic
dip that looks alarming. The Ops page
also references a script called ragtool
that can apparently fix problems like
the one she is seeing, but this is the
first time she has heard of it. At this
point, she pages the backup on-call
SRE for help because he has years of experience
with the service and its management
tools. Unfortunately, she gets
no response. She checks her email and
finds a message from her colleague
saying he is offline for an hour because
of a health emergency. After a moment
of inner debate, she calls her tech lead,
but the call goes to voicemail. It looks
like she has to tackle this on her own.
After more searching to learn
about this mysterious ragtool script,
she finds a document with one-line
descriptions of its command-line options,
which also tells her where to
find the script. She runs ragtool --restart
and crosses her fingers.
Nothing changes, and in fact the traffic
drops even more. She reads frantically
through more command-line options
but is not sure whether they will
do more harm than good. Finally, she
concludes that ragtool --rebalance
--dc=atlanta might help, since another
chart indicates that the Atlanta
data center is having more trouble.
Sure enough, the line on the traffic
chart starts creeping upward, and she
thinks she is out of the woods. MTTR
(mean time to repair) is 45 minutes.
The next day Zoë has a postmortem
discussion about the incident with her
team. They are having this discussion
because the incident was a major outage
causing loss of revenue, and their manager has been asking them to do more
postmortems. She asks the team how
they would have handled the situation
differently, and she hears three different
approaches. There appears to be no standard troubleshooting process. Her colleagues also acknowledge that the
“Ragnarok job flapping” alert is poorly named, and that the
failure was a result of a well-known bug
in the product that hasn’t been a high
priority for the developer team.
Finally, Steve, her tech lead, asks,
“Which version of ragtool did you
use?” and then points out that the version she used was very old. A new release came out a week ago with brand-new documentation describing all
its new features and even explaining
how to fix the “Ragnarok job flapping”
problem. It might have reduced the
MTTR to five minutes.
The existence of the new version of
ragtool comes as a surprise to about
half the team, while the other half is
somehow familiar with the new version