˲ Locate as much documentation as
possible and do a gap analysis on content;
˲ Set up a basic structure for the site
so that new documentation can be created in the correct location;
˲ Port relevant existing documentation to a new location;
˲ Create a monitoring and reporting structure to track the progress of
migration;
˲ Archive and tear down old documentation;
˲ Perform periodic checks to verify that consistency/quality is being
maintained;
˲ Verify that commonly used search
terms bring up the right documents
near the top of the search results; and,
A note on repository maintenance: it
is important that docs are reviewed and
updated on a regular basis. The owner’s
name should be visible, as well as the
last reviewed date; both help readers judge
whether the documentation can be trusted.
In Zoë’s story, she found and used an obsolete
document for a critical operational tool and
thereby missed an opportunity to resolve an
incident quickly and avoid a major outage.
If documents cannot be trusted to be accurate
and current, SREs become less effective, which
directly affects the reliability of the services
they manage.
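The review rule above can be checked automatically. The following is a minimal sketch, not a description of any tooling used at Google: the `DocRecord` type, the `find_untrustworthy` function, and the 180-day review window are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple


@dataclass
class DocRecord:
    """Metadata a repository might keep per document (hypothetical shape)."""
    path: str
    owner: Optional[str]
    last_reviewed: Optional[date]


def find_untrustworthy(
    docs: List[DocRecord], today: date, max_age_days: int = 180
) -> List[Tuple[str, str]]:
    """Return (path, reason) pairs for docs with no owner or no recent review.

    The 180-day default is an assumed policy, not one stated in the article.
    """
    problems = []
    for doc in docs:
        if not doc.owner:
            problems.append((doc.path, "no owner listed"))
        if doc.last_reviewed is None:
            problems.append((doc.path, "never reviewed"))
        elif today - doc.last_reviewed > timedelta(days=max_age_days):
            problems.append((doc.path, "review overdue"))
    return problems
```

A report like this could feed the periodic consistency checks, so that stale documents are flagged before someone relies on them during an incident.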
Repository availability. SRE teams
must ensure documentation is available even during an outage that makes
the standard repository unavailable.
At Google, SREs have personal copies
of critical documentation. Each copy
is kept on an encrypted compact
storage device or similar detachable
but secure physical medium that all on-call SREs carry with them.
Documents for Service Decommissioning
Once services reach end of life, SREs
decommission them in a predictable
fashion. Here, we provide messaging
and documentation guidelines for
service deprecation leading to eventual decommissioning.
It is important to announce decommissioning
to current service users well
ahead of time and provide them with a
timeline and sequence of steps. Your
announcement should explain when
new users will no longer be accepted,
how existing and newly found bugs will
be handled, and when the service will
completely stop functioning. Be clear
about important dates and when you
will be reducing SRE support for the
service, and send interim announcements
as the timeline progresses.
Sending an email message is not
sufficient; you must also update
your main documentation pages, playbooks,
and codelabs. Also, annotate
header files if applicable. Capture the
details of the announcement in a document
(in addition to email), so it’s easy
to point users to the document. Keep
the email as short as possible, while
capturing the essential points. Provide
additional details in the document,
such as the business motivations for
decommissioning the service, which
tools your users can take advantage of
when migrating to the replacement
service, and what assistance is available
during migration. You should also
create a FAQ page for the project, growing
the page over time as you field new
questions from your users.
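One way to keep the short email and the longer document consistent is to generate the email from a single record of the timeline. The sketch below is only an illustration: the `DecommissionPlan` fields, the milestone wording, and the placeholder URL are assumptions, and "bug fixes end" is just one possible way to state how bugs will be handled.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class DecommissionPlan:
    """Key dates for a service deprecation (hypothetical field names)."""
    service: str
    no_new_users: date
    bug_fixes_end: date
    sre_support_reduced: date
    shutdown: date
    details_url: str  # the document holding motivation, tools, and FAQ

    def milestones(self) -> List[Tuple[str, date]]:
        return [
            ("New users no longer accepted", self.no_new_users),
            ("Bug fixes end", self.bug_fixes_end),
            ("SRE support reduced", self.sre_support_reduced),
            ("Service stops functioning", self.shutdown),
        ]

    def validate(self) -> None:
        """Raise ValueError if any milestone falls after the shutdown date."""
        for label, when in self.milestones():
            if when > self.shutdown:
                raise ValueError(f"'{label}' is scheduled after shutdown")


def render_announcement(plan: DecommissionPlan) -> str:
    """Build a short email body: dates in order, details left to the doc."""
    plan.validate()
    lines = [f"{plan.service} is being decommissioned."]
    for label, when in sorted(plan.milestones(), key=lambda m: m[1]):
        lines.append(f"{when.isoformat()}: {label}")
    lines.append(f"Details and FAQ: {plan.details_url}")
    return "\n".join(lines)
```

Interim announcements can be produced by calling `render_announcement` again as dates approach, so each message repeats the same timeline and points to the same document.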
Technical writers provide a variety of
services that make SREs effective and
productive. These services extend well
beyond writing individual documents
based on requirements received from
SRE teams.
Here is some guidance to technical
writers on best practices for working
with SRE teams.
˲ Technical writers should partner
with SREs to provide operational documentation
for running services and
product documentation for SRE products
and features.
˲ They can create and update doc repositories,
restructure and reorganize repositories
to align with user needs, and
improve individual docs as part of the
overall repository management effort.
˲ Writers should provide consulting
to assess, assist, and address documentation
and information management
needs. This involves conducting doc
assessments to gather requirements,
enhancing docs and sites created by
engineers, and advising teams on matters
related to documentation creation,
organization, redesign, findability, and
maintenance.
˲ Writers should evaluate and improve
documentation tools to provide
the best solutions for SRE.
Templates. Tech writers also provide
templates to make SRE documentation
easier to create and use. Templates
do the following:
˲ Make it easy for authors to create documentation
by providing a clear structure so
that engineers can populate it quickly with
relevant information;
˲ Ensure documentation is complete
by including sections for all required pieces
of documentation; and,
˲ Make it easy for readers to understand
the topic of the doc quickly, the type
of information it’s likely to contain, and
how it’s organized.
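The completeness goal in the list above lends itself to a simple automated check. This is a minimal sketch under stated assumptions: the section names in `REQUIRED_SECTIONS` are hypothetical rather than taken from any published template, and a heading is assumed to be a line containing only the section name.

```python
from typing import List, Sequence

# Hypothetical required sections for a playbook-style template.
REQUIRED_SECTIONS = ("Overview", "Owner", "Last reviewed", "Procedure")


def missing_sections(
    doc_text: str, required: Sequence[str] = REQUIRED_SECTIONS
) -> List[str]:
    """Return the required section names that never appear as a heading line.

    Matching ignores case and surrounding whitespace.
    """
    headings = {
        line.strip().lower() for line in doc_text.splitlines() if line.strip()
    }
    return [name for name in required if name.lower() not in headings]
```

For example, a draft containing only "Overview" and "Procedure" headings would be reported as missing "Owner" and "Last reviewed", prompting the author to fill in the remaining sections.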
Site Reliability Engineering contains
several examples of documentation
templates. To view the templates, visit
https://queue.acm.org/appendices/SRE_Templates.html
Conclusion
Whether you are an SRE, a manager
of SREs, or a technical writer, you now
understand the critical importance of
documentation for a well-functioning
SRE team. Good documentation enables SRE teams to scale up and take a
principled approach to managing new
and existing services.
Related articles on queue.acm.org
The Calculus of Service Availability
Ben Treynor, Mike Dahlin,
Vivek Rau, and Betsy Beyer
https://queue.acm.org/detail.cfm?id=3096459
Resilience Engineering: Learning to Embrace Failure
A discussion with Jesse Robbins, Kripa
Krishnan, John Allspaw, and Tom Limoncelli
https://queue.acm.org/detail.cfm?id=2371297
Reliable Cron across the Planet
Štepán Davidovic and Kavita Guliani
https://queue.acm.org/detail.cfm?id=2745840
References
1. Blank-Edelman, D. N. Seeking SRE: Conversations
About Running Production Systems at Scale. O’Reilly
Media, 2018.
2. Murphy, N., Beyer, B., Jones, C., Petoff, J. Site
Reliability Engineering: How Google Runs Production
Systems. O’Reilly Media, 2016.
Shylaja Nukala is a technical writing lead for Google
Site Reliability Engineering. She leads the documentation,
information management, and select training efforts for
SRE, Cloud, and Google engineers.
Vivek Rau is a Site Reliability Engineer at Google, working
on Customer Reliability Engineering (CRE). The CRE team
teaches customers core SRE principles, enabling them to
build and operate highly reliable products on the Google
Cloud Platform.
Copyright held by authors/owners.