tion. One of the challenges of enterprise applications is that there is no consensus among vendors on
common standards around software
technologies, operating systems,
and workflow orchestration methodologies, such as release management
and patch management. Each vendor
provides its own flavor.
The role of SRE is to publish common standards for the portfolio of
tools and technology they support (the
base operating system, release management, and configuration frameworks) and the minimum operational
maturity they expect from the vendor
(for example, automated installs and
seamless patching workflows).
Mature enterprises that rely on
multiple software vendors recognize
the importance of having a baseline
ecosystem and strong operational
maturity. They not only consider business functionality, but also account for
ecosystem maturity when looking for
Change management. Change is
powerful. You can build a highly reliable system, but one small change (a
bad config push or a software bug)
can compromise the reliability of the
entire system. Preserving reliability
comes from having change-management rigor with a set of checks and
balances that can detect, prevent, or
minimize the impact of problems.
SRE should be responsible for maintaining this rigor. Consider the following checks and balances.
Measure, monitor, and alert.
Measure, monitor, and introduce thresholds to alert for everything that is on
the critical path of your SLO. This provides the ability to proactively detect
and fix issues.
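As a rough illustration of a threshold on the critical path of an SLO, the sketch below converts an availability SLO into a monthly downtime allowance and raises an alert once a chosen fraction of that allowance is used up. The function names, the 720-hour month, and the 50% `alert_fraction` default are illustrative assumptions, not values prescribed by the article.

```python
def downtime_allowance_hours(slo_percent, hours_in_month=720):
    """Maximum downtime per month that still satisfies the availability SLO."""
    return hours_in_month * (100 - slo_percent) / 100


def should_alert(observed_downtime_hours, slo_percent,
                 alert_fraction=0.5, hours_in_month=720):
    """Return True once observed downtime reaches the alert threshold.

    alert_fraction is a hypothetical tuning knob: 0.5 means "alert when half
    of the monthly downtime allowance has been consumed".
    """
    allowance = downtime_allowance_hours(slo_percent, hours_in_month)
    return observed_downtime_hours >= alert_fraction * allowance


if __name__ == "__main__":
    # A 95% SLO over a 720-hour month allows 36 hours of downtime.
    print(downtime_allowance_hours(95))
    print(should_alert(observed_downtime_hours=20, slo_percent=95))
```

Alerting before the allowance is fully spent is what makes detection proactive: the team still has room to fix the issue without missing the SLO.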
Streamline change. Require all
changes to go through validation and
regression testing. This should be en-
The net revenue (~$1.29 million)
clearly exceeds the target revenue of
$1.2 million, but 100% availability is
infeasible. Figure 5 illustrates how to
choose the perfect availability SLO that
meets the ROI.
Here are the key conclusions
reached in this scenario:
1. A 90% availability SLO generates
~$1.16 million in revenue, which falls
short of the target revenue of $1.2 million. This SLO is not feasible.
2. A 95% availability SLO generates
~$1.23 million in revenue, which comfortably meets (slightly exceeds) the
revenue objective of $1.2 million. This
SLO is feasible.
3. A 99% availability SLO generates
~$1.28 million in revenue, which far
exceeds the revenue objective of $1.2
million, but it comes with additional
• A 95% SLO guarantees no more than
36 hours of downtime per month and still
comfortably meets the target revenue.
• In contrast, a 99% SLO guarantees
no more than 7.2 hours of downtime per
month, but the cost of engineering and
support can be higher.
• As long as the cost to engineer
a 99% SLO does not exceed $80,000
($1.28 million minus $1.2 million), this is a
4. The net revenue growth for each
additional nine provides diminishing
returns (delta revenue)—for example,
between 99.99% and 99.999%:
• There is a significant reduction in
downtime per month from 4.32 minutes to 25.92 seconds, but the revenue
increase is only $116.64.
• To choose a 99.999% SLO, the
added engineering cost should be
Account for application dependencies. To design a system with a 99.9%
SLO, the rule of thumb is to have all critical
dependent systems provide an additional
nine (that is, 99.99%). This means
you have to factor in the reliability investment
(additional cost) for your application
and all of its critical dependencies,
because a system is only as available as
the sum of its dependencies.
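The rule of thumb can be checked with a small calculation. The sketch below assumes a simplified model that the article does not spell out: every critical dependency is required (serial), failures are independent, and the combined availability is therefore the product of the individual availabilities. The function names and the 99.95% figure for the application itself are hypothetical.

```python
from math import prod


def composite_availability(own, dependencies):
    """Availability of a service that needs itself and every dependency up.

    Assumes independent failures and strictly serial (all-required)
    dependencies, so the availabilities multiply.
    """
    return own * prod(dependencies)


def meets_slo(slo, own, dependencies):
    """True if the composite availability is at least the SLO."""
    return composite_availability(own, dependencies) >= slo


if __name__ == "__main__":
    own = 0.9995  # hypothetical availability of the application itself
    # Three dependencies with an additional nine (99.99%) keep a 99.9% SLO in reach.
    print(meets_slo(0.999, own, [0.9999, 0.9999, 0.9999]))
    # Three dependencies at only 99.9% each pull the composite below the SLO.
    print(meets_slo(0.999, own, [0.999, 0.999, 0.999]))
```

Under this model each dependency consumes part of the downtime allowance, which is why dependencies that are only as reliable as the target SLO leave no room for the application's own failures.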
Choose an SLO that fits the ROI curve.
The ideal SLO is one that delivers the
required functionality with a degree of
reliability that fits within the ROI curve.
In the previous scenario, the best SLO
would be 95%, because it is the least expensive option that meets the business
goal ($1.2 million).
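The selection logic in this scenario can be sketched in a few lines. The revenue figures are the unrounded Figure 5 values for the 90%, 95%, and 99% SLOs. The function names are hypothetical, and the sketch assumes, as the scenario does, that a lower SLO is always cheaper to engineer.

```python
def cheapest_feasible_slo(revenue_by_slo, target_revenue):
    """Return the lowest SLO whose net revenue meets the target, or None.

    Assumes engineering cost rises with the SLO, so the lowest feasible
    SLO is also the least expensive one.
    """
    feasible = [slo for slo, revenue in revenue_by_slo.items()
                if revenue >= target_revenue]
    return min(feasible) if feasible else None


def max_engineering_budget(revenue_by_slo, slo, target_revenue):
    """Extra engineering cost an SLO can absorb before missing the target."""
    return revenue_by_slo[slo] - target_revenue


if __name__ == "__main__":
    revenue_by_slo = {90: 1_166_400, 95: 1_231_200, 99: 1_283_040}
    target = 1_200_000
    print(cheapest_feasible_slo(revenue_by_slo, target))
    print(max_engineering_budget(revenue_by_slo, 99, target))
```

With the unrounded figures, the headroom for a 99% SLO comes out slightly above the roughly $80,000 quoted in the text, which uses revenues rounded to $1.28 million and $1.2 million.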
Overengineering reliability produces diminishing ROI. From the previous scenario, it is evident that increasing the availability of a service does not
always translate to a significant growth
in revenue. In fact, with each additional nine, the benefit of engineering
the reliability increases sublinearly,
breaking the ROI curve.
Preserving Enterprise Reliability
Reliability is not just a systems design
problem. You can have the world’s
best-designed system, but without
proper rigor and discipline, preserving core aspects of the system such as
availability, performance, and security
can become extremely difficult. Reliability is a responsibility that should
be shared across all teams involved in
the system, including vendors, development, and SRE. The SRE teams are
ultimately accountable, however, since
they are responsible for achieving their
SLOs. During the lifecycle of an application there are a few critical junctures
where maintaining proper rigor can
translate into preserving the reliability
of the service.
Design for standardization and
uniformity. Reliability is preserved
when you recognize the importance of
uniformity and invest in standardiza-
Figure 5. Selecting the right availability SLO.
SLO (%)   Nines   Downtime per month   Uptime per month   Net Revenue   Target Revenue   Delta Revenue
90        1       72 hours             648 hours          $1,166,400    < 1.2M           --
95        1.5     36 hours             684 hours          $1,231,200    > 1.2M           $64,800
99        2       7.2 hours            712.8 hours        $1,283,040    > 1.2M           $51,840
99.9      3       43.2 minutes         719.28 hours       $1,294,704    > 1.2M           $11,664
99.99     4       4.32 minutes         719.928 hours      $1,295,870    > 1.2M           $1,166.40
99.999    5       25.92 seconds        719.9928 hours     $1,295,987    > 1.2M           $116.64
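The rows of Figure 5 can be reproduced with a short calculation. This is a sketch under two assumptions inferred from the figure rather than stated in the text: a 720-hour (30-day) month, and revenue of $1,800 per hour of uptime (that is, $1,166,400 divided by 648 hours). The function and constant names are hypothetical.

```python
HOURS_PER_MONTH = 720     # assumption: 30-day month, implied by Figure 5
REVENUE_PER_HOUR = 1_800  # assumption: inferred from $1,166,400 / 648 hours


def slo_row(slo_percent):
    """Return (downtime_hours, uptime_hours, net_revenue) for an SLO."""
    uptime_hours = HOURS_PER_MONTH * slo_percent / 100
    downtime_hours = HOURS_PER_MONTH - uptime_hours
    net_revenue = uptime_hours * REVENUE_PER_HOUR
    return downtime_hours, uptime_hours, net_revenue


def figure5(slos=(90, 95, 99, 99.9, 99.99, 99.999)):
    """Build the table rows, including revenue gained over the previous SLO."""
    rows = []
    previous_revenue = None
    for slo in slos:
        downtime, uptime, revenue = slo_row(slo)
        delta = None if previous_revenue is None else revenue - previous_revenue
        rows.append((slo, downtime, uptime, revenue, delta))
        previous_revenue = revenue
    return rows


if __name__ == "__main__":
    for slo, downtime, uptime, revenue, delta in figure5():
        delta_text = "--" if delta is None else f"${delta:,.2f}"
        print(f"{slo:>7}%  {downtime:9.4f} h down  {uptime:9.4f} h up  "
              f"${revenue:,.2f}  {delta_text}")
```

Each additional nine cuts the remaining downtime by a factor of ten, so the delta revenue shrinks by roughly the same factor, which is the diminishing return the figure illustrates.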