heavily on the performance of highly
skilled individuals on the team. The
team preserves important operational concepts and principles as nuggets
of “tribal knowledge” that are passed
on verbally to new team members.
If these concepts and principles are
not codified and documented, they
will often need to be relearned—
painfully—through trial and error.
Sometimes team members perform
operational procedures as a strict sequence of steps defined by their predecessors in the distant past, without understanding the reasons these
steps were initially prescribed. If this
is allowed to continue, processes
eventually become fragmented and
tend to degenerate as the team scales
up to handle new challenges.
SRE teams can prevent this process decay by creating high-quality
documentation that lays the foundation for such teams to scale up
and take a principled approach
to managing new and unfamiliar
services. These documents capture
tribal knowledge in a form that is
easily discoverable, searchable, and
maintainable. New team members
are trained through a systematic and
well-planned induction and education program. These are the hallmarks
of a mature SRE team.
The remainder of this article describes the various types of documents
SREs create during the life cycle of the
services they support.
Documents for New
SREs conduct a production readiness
review (PRR) to ensure a service meets
accepted standards of operational
readiness, and that service owners
have the guidance they need to take advantage of SRE knowledge about running large systems.
A service must go through this review process prior to its initial launch
to production. (During this stage, the
service has no SRE support; the product development team supports the
service.) The goal of the prelaunch PRR
is just to ensure the service meets certain minimum standards of reliability
at the time of its launch.
A follow-on PRR can be performed
before SRE takeover of a service,
which may happen long after the initial
launch. For example, when an
SRE team decides to onboard a new
service, the team conducts a thorough
review of the production state
and practices of the new service. The
goals are to improve the service being
onboarded from a reliability and operational
sustainability perspective,
as well as to provide SREs with preliminary
knowledge about the service
for its operation.
SREs conducting a PRR before
service takeover may ask a more
comprehensive set of questions and
apply higher standards of reliability
and operational ease than when
conducting a PRR at the time of
the initial launch. They may intentionally
keep the launch-time PRR
“lighter” than the service takeover
PRR in order to avoid unduly slowing
down the developer team.
In Zoë’s SRE story, her team had no
standardized PRR process or checklist,
which meant they could miss asking
important questions during service
takeover. They therefore ran the risk
of encountering, while managing a new
service, many problems that were easily
foreseeable and could have been addressed
before SREs became responsible
for running the service.
An SRE PRR/takeover requires
the creation of a PRR template and a
process doc that describes how SRE
teams will engage with a new service,
and how SRE teams will use the PRR
template. The template used at the
time of takeover might be more comprehensive
than the one used at the
time of initial launch.
A PRR template covers several areas
and ensures that critical questions
about each area are answered. The accompanying
table lists some of the areas
and related questions that the tem-
The process doc should also identify
the kinds of documentation that
the SRE team should request from the
product development team as a prerequisite
for takeover. For example,
they might ask the developer team to
create initial playbook entries for stan-
In addition to these onboarding
documents, the SRE organization
must create overview documents that
explain the SRE role and responsibilities
in general terms to product development
teams. This serves to set their
expectations correctly. The first such
document would explain what SRE is,
covering all the topics listed at the beginning
of this article, including core
functions, the service life cycle, and
ties. A primary goal of this document
is to ensure developer teams do not
equate SREs with an Ops team or consider
pager response to be their sole
function. As shown in the earlier SRE
story, when developers do not fully
understand what SREs do before they
hand off a service to SREs, miscommunication
and misunderstandings
Additionally, an engagement model
document goes a little further in
setting expectations by explaining
how the SRE team will engage with
developer teams during and after service
takeover. Topics covered in this
• Service takeover criteria and the
Example PRR template areas

Area                           Example questions
Architecture and dependencies  What is your request flow from user to front end to back end?
                               Are there different types of requests with different latency requirements?
Capacity planning              How much traffic and rate of growth do you expect during and after the launch?
                               Have you obtained all the compute resources needed to support your traffic?
Failure modes                  Do you have any single points of failure in your design?
                               How do you mitigate unavailability of your dependencies?
Processes and automation       Are any manual processes required to keep the service running?
External dependencies          What third-party code, data, services, or events do the service or the launch depend upon?
                               Do any partners depend on your service? If so, do they need to be notified of your launch?
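
The areas and questions in the example PRR template lend themselves to a simple checklist structure. The following is a minimal sketch in Python, not a description of any tooling the article's authors use: the names Question, Area, build_template, unanswered, and is_complete are hypothetical, and the only content taken from the article is the list of areas and questions. It shows one way a team might track which questions a service owner has not yet answered.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Question:
    """One PRR question; the answer stays empty until the service owner responds."""
    text: str
    answer: str = ""

    @property
    def answered(self) -> bool:
        return bool(self.answer.strip())


@dataclass
class Area:
    """A PRR template area (hypothetical structure) with its questions."""
    name: str
    questions: List[Question] = field(default_factory=list)


def build_template() -> List[Area]:
    """Return a fresh template holding the areas and questions from the example table."""
    raw = {
        "Architecture and dependencies": [
            "What is your request flow from user to front end to back end?",
            "Are there different types of requests with different latency requirements?",
        ],
        "Capacity planning": [
            "How much traffic and rate of growth do you expect during and after the launch?",
            "Have you obtained all the compute resources needed to support your traffic?",
        ],
        "Failure modes": [
            "Do you have any single points of failure in your design?",
            "How do you mitigate unavailability of your dependencies?",
        ],
        "Processes and automation": [
            "Are any manual processes required to keep the service running?",
        ],
        "External dependencies": [
            "What third-party code, data, services, or events do the service or the launch depend upon?",
            "Do any partners depend on your service? If so, do they need to be notified of your launch?",
        ],
    }
    return [Area(name, [Question(q) for q in qs]) for name, qs in raw.items()]


def unanswered(template: List[Area]) -> Dict[str, List[str]]:
    """Map each area name to the questions that still lack an answer."""
    gaps: Dict[str, List[str]] = {}
    for area in template:
        missing = [q.text for q in area.questions if not q.answered]
        if missing:
            gaps[area.name] = missing
    return gaps


def is_complete(template: List[Area]) -> bool:
    """A review is complete, in this sketch, once every question has an answer."""
    return not unanswered(template)


if __name__ == "__main__":
    template = build_template()
    template[0].questions[0].answer = "User -> load balancer -> front end -> back end"
    for area_name, questions in unanswered(template).items():
        print(area_name)
        for q in questions:
            print("  -", q)
```

A lighter launch-time template and a more comprehensive takeover template could be modeled as two such lists, with the takeover list containing additional areas or questions.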