Clarifying the “Rule of the Extra 9” for Nested Dependencies
A casual reader might infer that each additional link in a dependency chain calls for an additional 9, such that second-order dependencies need two extra 9s, third-order dependencies need three extra 9s, and so on.
This inference is incorrect. It is
based on a naive model of a dependency hierarchy as a tree with constant fan-out at each level. In such a model, as
shown in Figure 1, there are 10 unique
first-order dependencies, 100 unique
second-order dependencies, 1,000
unique third-order dependencies,
and so on, leading to a total of 1,111
unique services even if the architecture
is limited to four layers. A highly available service ecosystem with that many independent critical dependencies is clearly unrealistic.
Figure 1. Dependency hierarchy: Incorrect model.
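To make the arithmetic concrete, here is a minimal sketch (assuming the constant fan-out of 10 and the four layers described above) that reproduces the 1,111-service total implied by the naive model:

```python
# Naive tree model from Figure 1: a hypothetical constant fan-out of 10,
# four layers deep (the service itself plus three orders of dependencies).
FAN_OUT = 10
LAYERS = 4

services_per_layer = [FAN_OUT ** depth for depth in range(LAYERS)]  # [1, 10, 100, 1000]
total_unique_services = sum(services_per_layer)                     # 1,111

print(services_per_layer, total_unique_services)
```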
A critical dependency can by itself
cause a failure of the entire service (or
service shard) no matter where it appears in the dependency tree. Therefore, if a given component X appears
as a dependency of several first-order
dependencies of a service, X should be
counted only once because its failure
will ultimately cause the service to fail
no matter how many intervening services are also affected.
The correct rule is as follows:
˲ If a service has N unique critical
dependencies, then each one contributes 1/N to the dependency-induced
unavailability of the top-level service,
regardless of its depth in the dependency hierarchy.
˲ Each dependency should be counted only once, even if it appears multiple times in the dependency hierarchy (in other words, count only unique dependencies). For example, when counting dependencies of Service A in Figure 2, count Service B only once toward the total. (A brief sketch after this list illustrates the counting.)
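The following sketch collects unique critical dependencies and assigns each a 1/N share of the dependency-induced unavailability budget. The service names echo the Figure 2 example, but the graph shape and the budget figure here are assumptions for illustration only:

```python
# Hypothetical dependency graph: Service B appears under two different
# first-order dependencies of Service A, but it is counted only once.
GRAPH = {
    "Service A": ["Service C", "Service D"],
    "Service C": ["Service B"],
    "Service D": ["Service B"],   # Service B is shared; count it once
    "Service B": [],
}

def unique_critical_dependencies(graph, root):
    """Collect every service reachable from root, counting each only once."""
    seen = set()
    stack = list(graph[root])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen

deps = unique_critical_dependencies(GRAPH, "Service A")
n = len(deps)                     # 3 unique dependencies: B, C, and D

# Whatever unavailability is allotted to dependencies overall, each unique
# dependency contributes 1/N of it, regardless of its depth in the hierarchy.
dependency_budget = 0.00005       # hypothetical 0.005% expressed as a fraction
per_dependency_share = dependency_budget / n

print(sorted(deps), n, per_dependency_share)
```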
If you do not correct or address a discrepancy between planned and actual error-budget consumption, an outage will inevitably force the need to correct it.
Let’s consider an example service with
a target availability of 99.99% and work
through the requirements for both its
dependencies and its outage responses.
The numbers. Suppose your 99.99% available service has the following characteristics:
˲ One major outage and three minor outages of its own per year. Note
that these numbers sound high, but
a 99.99% availability target implies a
20- to 30-minute widespread outage
and several short partial outages per
year. (The math makes two assumptions: that a failure of a single shard is
not considered a failure of the entire
system from an SLO perspective, and
that the overall availability is computed with a weighted sum of regional/shard availabilities.)
˲ Five critical dependencies on other, independent 99.999% services.
˲ Five independent shards, which
cannot fail over to one another.
˲ All changes are rolled out progressively, one shard at a time.
The availability math plays out as follows (a short calculation sketch appears after this list):
˲ The total budget for outages for the
year is 0.01% of 525,600 minutes/year,
or 53 minutes (based on a 365-day year,
which is the worst-case scenario).
˲ The budget allocated to outages of critical dependencies is five independent critical dependencies with a budget of 0.001% each, or 0.005% in total; 0.005% of 525,600 minutes/year is 26 minutes.
˲ The remaining budget for outages
caused by your service, accounting for
outages of critical dependencies, is 53
- 26 = 27 minutes.
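The same arithmetic in a few lines, using the figures above (a 365-day year, a 0.01% budget, five dependencies at 0.001% each); the rounding mirrors the text above, which rounds each figure before subtracting:

```python
MINUTES_PER_YEAR = 365 * 24 * 60                      # 525,600 minutes (worst case)

total_budget = round(0.0001 * MINUTES_PER_YEAR)       # 0.01% of the year: 53 minutes

NUM_CRITICAL_DEPS = 5
dep_budget = round(NUM_CRITICAL_DEPS * 0.00001 * MINUTES_PER_YEAR)   # 0.005%: 26 minutes

own_budget = total_budget - dep_budget                # 53 - 26 = 27 minutes
print(total_budget, dep_budget, own_budget)           # 53 26 27
```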
Outage response requirements.
˲ Expected number of outages: 4 (1 full outage, 3 outages affecting a single shard)
˲ Aggregate impact of expected outages: (1 x 100%) + (3 x 20%) = 1.6 equivalent full outages
˲ Time available to detect and recover from an outage: 27/1.6 = 17 minutes
˲ Monitoring time allotted to detect
and alert for an outage: 2 minutes
˲ Time allotted for an on-call responder to start investigating an alert:
five minutes. (On-call means that a
technical person is carrying a pager
that receives an alert when the service
is having an outage, based on a monitoring system that tracks and reports
SLO violations. Many Google services
are supported by an SRE on-call rotation that fields urgent issues.)
˲ Remaining time for an effective mitigation: 10 minutes. (The sketch after this list walks through the arithmetic.)
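A small sketch of the response-time arithmetic, using the expected outage mix above (one full outage plus three single-shard outages, each shard being one-fifth of the service) and the 27-minute budget left after dependencies:

```python
OWN_BUDGET_MINUTES = 27                     # remaining budget from the math above

full_outages = 1
single_shard_outages = 3
shard_fraction = 1 / 5                      # one of five shards = 20% of the service

# Aggregate impact, expressed in equivalent full-service outages
aggregate_impact = full_outages + single_shard_outages * shard_fraction   # 1.6

per_outage_window = OWN_BUDGET_MINUTES / aggregate_impact   # ~17 minutes each
detect_and_alert = 2                        # monitoring detects and pages
start_investigating = 5                     # on-call responder engages
mitigation_window = per_outage_window - detect_and_alert - start_investigating

print(round(per_outage_window), round(mitigation_window))   # 17 10
```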
Implication. Levers to make a service more available. It’s worth looking
closely at the numbers just presented
because they highlight a fundamental
point: there are three main levers to
make a service more reliable.
˲ Reduce the frequency of outages—
via rollout policy, testing, design reviews, and other tactics.
˲ Reduce the scope of the average outage—via sharding, geographic isolation, graceful degradation, or customer isolation.
˲ Reduce the time to recover—via
monitoring, one-button safe actions
(for example, rollback or adding emergency capacity), operational readiness
practice, and so on.
You can trade among these three
levers to make implementation easier.
For example, if a 17-minute MTTR is
difficult to achieve, instead focus your
efforts on reducing the scope of the
average outage. Strategies for minimizing and mitigating critical dependencies are discussed in more depth later
in this article.
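For instance, here is a hypothetical illustration of that trade: holding the 27-minute budget fixed, reducing the scope or frequency of outages lengthens the window available to detect and recover from each one. The improved numbers below are assumptions, not targets from the example:

```python
OWN_BUDGET_MINUTES = 27

def recovery_window(full_outages, partial_outages, partial_scope):
    """Minutes available to detect and recover from each outage, given the
    expected outage mix and the fraction of users a partial outage affects."""
    aggregate_impact = full_outages + partial_outages * partial_scope
    return OWN_BUDGET_MINUTES / aggregate_impact

# Baseline from the example: 1 full outage plus 3 single-shard (20%) outages.
print(round(recovery_window(1, 3, 0.20)))    # ~17 minutes

# Hypothetical improvements: finer sharding or graceful degradation cuts a
# partial outage's scope to 10%, and rollout policy eliminates the full outage.
print(round(recovery_window(0, 3, 0.10)))    # ~90 minutes
```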