call SRE can flounder during a crisis,
turning a potentially minor incident
into a major outage.
Many SRE teams use checklists for
on-call training. An on-call checklist generally covers all the high-level areas team
members should understand well, with
subsections under each area. Examples
of high-level areas include production
concepts, front-end and back-end stack,
automation and tools, and monitoring
and logs. The checklist can also include
instructions about preparing for on-call
and tasks that need to be completed
when on call.
SREs also use role-play drills
(known within Google as Wheel of
Misfortune) to train team members. A Wheel of
Misfortune exercise presents an outage
scenario to the team, with a set of data
and signals that the hypothetical on-call SRE will need to use as input to resolve the outage. Team members take
turns playing the role of the on-call engineer in order to hone emergency mitigation and system-debugging skills.
Wheel of Misfortune exercises should
test whether individual SREs
know where to find the documentation
most relevant to troubleshooting and
resolving the outage at hand.
Repository management. SRE team
information can be scattered across a
number of sites, local team knowledge,
and Google Drive folders, which can
make it difficult to find correct and relevant information. As in the SRE example earlier, a critical operational tool
and its user manual were unavailable to
Zoë (the on-call SRE) because they were
hidden under the home directory of
her tech lead, and her inability to find
them greatly prolonged a service outage. To eliminate this type of failure,
it is important to define a consistent
structure for all information and ensure all team members know where to
store, find, and maintain information.
A consistent structure helps team
members find information quickly:
new team members can ramp up faster,
and on-call and on-duty engineers can resolve issues sooner.
Here are some guidelines to create
and manage a team documentation
repository:
˲ Determine relevant stakeholders
and conduct brief interviews to identify
all needs;
improve across key operational areas
such as on-call health, projects vs. interrupts, SLOs, and capacity planning.
Documents for Running SRE Teams
SRE teams need to have a cohesive set
of reliable, discoverable documentation
to function effectively as a team.
Team site. Creating a team site is important
because it provides a focal point
for information and documents about
the SRE team and its projects. At Google,
for example, many SRE teams use g3doc
(Google’s internal documentation platform, where
documentation lives in the source repository
alongside the associated code), but some
teams use a combination of Google Sites
and g3doc, with the g3doc pages closely
tied to the code/implementation details.
Team charter. SRE teams are expected to maintain a published charter that explains the rationale for the
team and documents its current major engagements. A charter serves to
establish the team identity, primary
goals, and role relative to the rest of
the organization.
A charter generally includes the following elements:
˲ A high-level explanation of the
space in which the team operates. This
includes the types of services the team
engages with (and how), related systems, and examples.
˲ A short description of the top two
or three services managed by the team.
This section also highlights key technologies used and the challenges of
running them, the benefits of SRE engagement, and what SRE does.
˲ Key principles and values for the
team.
˲ Links to the team site and docs.
Teams are also expected to publish
a vision statement (an aspirational description of what the team would like
to achieve in the long term) and a roadmap spanning multiple quarters.
Documents for New SRE Onboarding
SRE teams invest in training materials
and processes for new SREs because
training results in faster onboarding
to the production environment. SRE
teams also benefit from having new
members acquire the skills required to
join the ranks of on-call as early as possible. In the absence of comprehensive
training, as seen in Zoë’s story, the on-