As detailed in Site Reliability Engineering: How
Google Runs Production Systems¹ (hereafter referred
to as the SRE book), Google products and services
seek high-velocity feature development while
maintaining aggressive service-level objectives (SLOs)
for availability and responsiveness. An SLO says
that the service should almost always be up, and the
service should almost always be fast; SLOs also provide
precise numbers to define what “almost always”
means for a particular service. SLOs are based on the
following observation:
The vast majority of software services and systems
should aim for almost-perfect reliability rather than
perfect reliability—that is, 99.999% or 99.99% rather
than 100%—because users cannot tell the difference
between a service being 100% available and less than
“perfectly” available. There are many other systems in
the path between user and service (laptop, home WiFi,
ISP, the power grid ...), and those systems collectively
are far less than 100% available.
Thus, the marginal difference between
99.99% and 100% gets lost in
the noise of other unavailability, and
the user receives no benefit from the
enormous effort required to add that
last fractional percent of availability.
Notable exceptions to this rule include
antilock brake control systems
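To make the "lost in the noise" argument concrete, here is a minimal Python sketch. It assumes the service and the user's path (laptop, WiFi, ISP, power) fail independently and in series, so their availabilities multiply. The 99.9% path figure and the function name are illustrative assumptions, not measurements from the SRE book.

```python
def end_to_end_availability(service: float, path: float) -> float:
    """Availability the user experiences, assuming independent failures in series."""
    return service * path


# Hypothetical figure: everything between the user and the service is 99.9% available.
PATH_AVAILABILITY = 0.999

for service in (0.9999, 1.0):
    seen = end_to_end_availability(service, PATH_AVAILABILITY)
    print(f"service at {service:.4%} -> user experiences {seen:.4%}")
```

Under these assumptions, moving the service from 99.99% to 100% changes what the user experiences by about 0.01 percentage points, while the path alone costs roughly ten times that.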
For a detailed discussion of how
SLOs relate to SLIs (service-level indicators) and SLAs (service-level agreements), see the “Service Level Objectives” chapter in the SRE book. That
chapter also details how to choose
metrics that are meaningful for a particular service or system, which in turn
drives the choice of an appropriate SLO
for that service.
This article expands upon the topic
of SLOs to focus on service dependencies. Specifically, we look at how the
availability of critical dependencies informs the availability of a service, and
how to design in order to mitigate and
minimize critical dependencies.
Most services offered by Google aim
to offer 99.99% (sometimes referred
to as the “four 9s”) availability to users. Some services contractually commit to a lower figure externally but set
a 99.99% target internally. This more
stringent target accounts for situations
in which users become unhappy with
service performance well before a contract violation occurs, as the number
one aim of an SRE team is to keep users
happy. For many services, a 99.99% internal target represents the sweet spot
that balances cost, complexity, and
availability. For some services, notably
global cloud services, the internal target is 99.999%.
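It can help to translate these targets into a downtime budget. The following Python sketch is plain arithmetic on the two targets named above; the function name and the choice of a 365-day year and a 30-day month as measurement windows are our assumptions, since the text does not specify a window.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days: float = 365.0) -> float:
    """Minutes of unavailability permitted over a window of `days` days."""
    return (1.0 - availability) * days * MINUTES_PER_DAY


for target in (0.9999, 0.99999):
    per_year = downtime_budget_minutes(target)            # assumed 365-day year
    per_month = downtime_budget_minutes(target, days=30)  # assumed 30-day month
    print(f"{target:.3%}: {per_year:.2f} min/year, {per_month:.2f} min/month")
```

A 99.99% target leaves a little under an hour of downtime per year, and a 99.999% target leaves about five minutes, which is why the stricter target is reserved for a smaller set of services.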
Observations And Implications
Let’s examine a few key observations
about and implications of designing
and operating a 99.99% service and
then move to a practical application.
Observation 1. Sources of outages.
Outages originate from two main
sources: problems with the service it-
You’re only as available as
the sum of your dependencies.
BY BEN TREYNOR, MIKE DAHLIN, VIVEK RAU, AND BETSY BEYER