writing a postmortem, see the postmortem template described in Site Reliability Engineering. 2)
Policies. Policy documents mandate
specific technical and nontechnical policies for production. Technical policies
can apply to areas such as production-change logging, log retention, internal
service naming (naming conventions
engineers should adopt as they implement services), and use of and access to
emergency credentials.
Policies can also apply to process.
Escalation policies help engineers classify production issues as emergencies
or non-emergencies and provide recommendations on the appropriate action
for each category; on-call expectations
policies outline the structure of the team
and responsibilities of team members.
Service-level agreement. An SLA is a
formal agreement with a customer on
the performance a service commits to
provide and what actions will be taken
if that obligation is not met. SRE teams
document the availability and latency SLAs of their services, and monitor service performance relative to those SLAs.
Documenting and publishing an
SLA, and rigorously measuring the end-user experience and comparing it with
the SLA, allows SRE teams to innovate
more quickly while preserving a good
user experience. SREs running services
with well-defined SLAs will detect outages faster and therefore resolve them
faster. Good SLAs also result in less friction between SRE and software engineering
(SWE) teams because those teams can
negotiate targets and results objectively,
and avoid subjective discussions of risk.
Note that an external, legally
enforceable agreement may not be applicable to most SRE teams. In these
cases, SRE teams can instead define a set of
service-level objectives (SLOs). An SLO
is a definition of the desired performance of a service for a single metric
such as availability or latency.
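As a rough illustration of monitoring performance against an SLO, the following minimal sketch compares measured availability with a target and reports how much of the implied error budget remains. The `Slo` structure, the function names, and the request counts are assumptions made for this example and do not describe any particular monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Desired performance for a single metric (hypothetical structure)."""
    metric: str    # e.g. "availability"
    target: float  # required fraction of good events, e.g. 0.999

    def __post_init__(self):
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def availability(good_events: int, total_events: int) -> float:
    """Fraction of events that succeeded in the measurement window."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Illustrative numbers only: 1,000,000 requests in the window, 500 of them failed.
slo = Slo(metric="availability", target=0.999)
measured = availability(good_events=999_500, total_events=1_000_000)
budget_left = error_budget_remaining(slo, 999_500, 1_000_000)
slo_met = measured >= slo.target
```

Expressing the result as a fraction of the budget, rather than only as a pass/fail flag, gives SRE and SWE teams a shared number to negotiate over.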
Documents for production products. SRE teams aim to spend 50% of
their time on project work, developing
software that automates away manual
work or improves the reliability of a
managed service. Here, we describe
documents that are related to the products and tools SREs develop.
These documents are important
because they enable users to find out
whether a product is right for them to
˲ SLO negotiation process and error
budgets;
˲ New launch criteria and launch
freeze policy (if applicable);
˲ Content and frequency of service
status reports from the SRE team;
˲ SRE staffing requirements; and,
˲ Feature roadmap planning process and priority of reliability features
(requested by SREs) versus new product functionality.
Documents for Running a Service
The core operational documents SRE
teams rely on to run production
services include service overviews,
playbooks and procedures, postmortems, policies, and SLAs. (Note: this
section appeared in the “Do Docs Better” chapter of Seeking SRE. 1)
Service overviews are critical to SREs’
understanding of the services they support. SREs need to know the system architecture, components and dependencies, and service contacts and owners.
Service overviews are a collaborative
effort between the development team
and the SRE team and are designed to
guide and prioritize SRE engagement
and uncover areas for further investigation. These overviews are often an output of the PRR process, and they should
be updated as services change (for example, when a new dependency is added).
A basic service overview provides
SREs with enough information about
the service to dig deeper. A complete
service overview provides a thorough
description of the service and how it interacts with the world around it, as well
as links to dashboards, metrics, and
related information that SREs need to
solve unexpected issues.
Playbook. Also called a runbook, this
quintessential operational doc lets on-call engineers respond to alerts generated by service monitoring. If Zoë’s
team, for example, had a playbook that
explained what the “Ragnarok job flapping” alert meant and told her what
to do, the incident could have been
resolved in a matter of minutes. Playbooks reduce the time it takes to mitigate an incident, and they provide useful links to consoles and procedures.
Playbooks contain instructions
for verification, troubleshooting, and
escalation for each alert generated
from network-monitoring processes.
Playbooks typically match alert names
generated from monitoring systems.
They contain commands and steps
that need to be tested and reviewed for
accuracy. They often require updates
when new troubleshooting processes
become available and when new failure
modes are uncovered or dependencies
are added.
Playbooks are not exclusive to
alerts and can also include production
procedures for pushing releases, monitoring, and troubleshooting. Other
examples of production procedures
include service turnup and turndown,
service maintenance, and emergency/escalation.
Postmortem. SREs work with large-scale, complex, distributed systems,
and they also enhance services with
new features and the addition of new
systems. Therefore, incidents and outages are inevitable given SRE scale and
velocity of change. The postmortem is
an essential tool for SRE, representing
its formalized process of learning from
incidents. In the hypothetical SRE story, Zoë’s team had no formal postmortem procedure or template and, therefore, no formal process for capturing
the learning from an incident and preventing it from recurring, so the team was
doomed to repeat the same problems.
SRE teams need to create a standardized postmortem document template
with sections that capture all the important information about an outage.
This template will ideally be structured
in a format that can be readily parsed
by data-analysis tools that report on
outage trends, using postmortems as a
data source. Each postmortem derived
from this template describes a production outage or paging event, including
(at minimum):
˲ Timeline;
˲ Description of user impact;
˲ Root cause; and,
˲ Action items/lessons learned.
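A template that data-analysis tools can parse could be as simple as a fixed record serialized to JSON. The following sketch is an assumption about what such a record might look like: the four required sections come from the list above, while `root_cause_category`, the class and function names, and the sample incident are invented for illustration.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field


@dataclass
class TimelineEvent:
    time: str         # ISO 8601 timestamp, kept as text for easy serialization
    description: str


@dataclass
class Postmortem:
    title: str
    timeline: list[TimelineEvent] = field(default_factory=list)
    user_impact: str = ""
    root_cause: str = ""
    root_cause_category: str = ""  # assumed extra field, used only for trend reports
    action_items: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of the minimum required sections that are still empty."""
        required = ("timeline", "user_impact", "root_cause", "action_items")
        return [name for name in required if not getattr(self, name)]

    def to_json(self) -> str:
        """Serialize the whole record, including nested timeline events."""
        return json.dumps(asdict(self), indent=2)


def root_cause_trend(postmortems: list[Postmortem]) -> Counter:
    """Count postmortems per root-cause category."""
    return Counter(pm.root_cause_category for pm in postmortems)


# Invented sample data, for illustration only.
complete = Postmortem(
    title="Example outage",
    timeline=[TimelineEvent("2024-01-01T10:00:00Z", "Alert fired")],
    user_impact="Placeholder: some requests failed for part of the window",
    root_cause="Placeholder: a bad configuration push",
    root_cause_category="config-change",
    action_items=["Placeholder: add a pre-push validation step"],
)
draft = Postmortem(title="Draft")

gaps = draft.missing_sections()
trend = root_cause_trend([complete, draft])
```

Keeping free-text sections alongside a small set of categorical fields lets the same document serve both human readers and outage-trend reports.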
The postmortem is written by a member of the group that experienced the
outage, preferably someone who was
involved and can take responsibility for
the follow-up. A postmortem needs to be
written in a blameless manner. It should
include the information needed to understand what happened, as well as a list
of action items that would significantly
reduce the possibility of recurrence, reduce the impact, and/or make recovery
more straightforward. (For guidance on