Their approach could be purely business driven, and they may expect the
application never to go down. Likewise,
the vendor may not entirely understand
how the IT ecosystem is designed and
cannot operate independently to deliver the system. The SRE team should become a true partner to bring alignment
between the customer and vendor and
develop a shared understanding of the
overall objective, specific requirements,
and constraints of the domain.
Given the nature of third-party domains, it may be difficult to find a
perfect system that meets 100% of the
business functionality, as there are
many variables in the equation (for example, third-party software, hardware,
cost, and vendor). Therefore, working
closely with the customers in developing a set of detailed requirements
and distinguishing core vs. optional
requirements helps with the trade-off
analysis—for example, if the application has constraints, evaluating their
impact on business objectives or revisiting and adjusting customer requirements without compromising the
business objectives, or finding a new
Taking customers through this journey from beginning to end helps them
better understand the space and weigh
in on all important considerations, ultimately allowing them to make effective business-driven decisions.
SLOs as a means to customer happiness; solve for SLOs. Solving for
customer happiness based on objective goals is key; it is better to cater to
functionality based on the customers’
objectives in a measurable way (SLOs).
Customers have only one fundamental
criterion: Is the system able to translate business objectives into business
functionality in a cost-effective and reliable manner?
Having this objective view creates
a transparent and blameless culture.
The key point to remember, however, is
that SLOs are not fixed for life: As business needs evolve, the system SLOs
need to be revisited. Therefore, having
a strong discipline of revisiting the SLO
agreements periodically with the customer helps tackle these changes and
adjust the scope and expectations as
business needs evolve.
Vendor selection. Enterprise-application engineering with a vendor is a
pushing new code, vendors publishing
new security patches, or infrastructure
teams updating the software of the un-
Reliability is not a “build once
and forget for life” construct; it is a
continuous process of maintaining
and upholding its principles and
methodologies. Enterprises that recognize the need and invest in developing SRE skills stand out from the rest
because they recognize that without
these skills, enterprise reliability cannot be sustained.
Designing for enterprise reliability is a multidimensional problem that
spans multiple entities: customer, vendor, platform engineering, cost,
and the SRE organization. The rest of this article expands on these
axioms and describes the behaviors, principles, and methodologies that
influence and shape the discipline of enterprise reliability.
Customer objectives. “If you don’t understand your customer objectives,
then you do not need to exist as an org.” Whether you are a traditional
IT organization or a mature SRE org, this fundamental principle holds
true.
Translate customer objectives to SLOs. In an enterprise setting, a
typical customer is the owner of a business vertical such as legal,
finance, or HR trying to accomplish a specific business goal. Having a
well-defined set of business objectives lays the foundation for
developing concrete functional requirements, allowing you to effectively
translate those requirements into quantifiable and measurable outcomes,
also known as SLOs.
Defining SLOs early on leads to a better design and implementation of
the overall system. Arriving at a clear set of measurable SLOs, however,
is an exhaustive process with a lot of considerations (for example, what
is technically feasible vs. infeasible, expensive vs. cost effective,
reliable vs. fragile). Closely involving the customer and vendor
throughout this process is crucial, as it develops a shared
understanding of requirements, constraints, and trade-offs, and helps
reconcile the gap between aspirational and achievable SLOs.
Documenting the SLOs, including a strong rationale for the established
targets and thresholds (for example, 99.9% uptime), is key, as this
becomes the contract among all the parties (SRE team, software vendor,
and customer). This rigor also creates a culture of transparency and
openness to inform how the system should be designed and how the service
should operate. For a deep dive into engineering SLOs effectively, refer
to the SLO chapter in the SRE book.
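As a rough illustration of what a documented SLO might contain, the sketch below records a target together with its rationale and the parties that agreed to it. The class name, field names, and the sample rationale text are hypothetical and chosen only for this example; the article does not prescribe a format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SLORecord:
    """Hypothetical record of one documented SLO (illustrative only)."""
    name: str
    target_percent: float      # agreed threshold, e.g. 99.9 for "99.9% uptime"
    rationale: str             # why this target was chosen
    parties: Tuple[str, ...]   # everyone bound by the agreement

    def is_met(self, observed_percent: float) -> bool:
        """Return True when the observed value reaches the agreed target."""
        return observed_percent >= self.target_percent


# Example values are assumptions made up for this sketch.
uptime_slo = SLORecord(
    name="monthly uptime",
    target_percent=99.9,
    rationale="Placeholder rationale agreed with the business owner.",
    parties=("SRE team", "software vendor", "customer"),
)

print(uptime_slo.is_met(99.95))  # observed value above the target
print(uptime_slo.is_met(99.5))   # observed value below the target
```

Keeping the rationale and the parties next to the number is one way to make the record usable as the shared contract the text describes, and it gives periodic SLO reviews a single artifact to revisit.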
Empathy toward customer and vendor is key. Customers (business owners)
may not always have the same level of understanding of the problem space.
Business owner: The owner or person leading a business vertical such as legal, finance, or HR; business owner and customer are
Customer: The business owner of a business vertical such as legal, finance, or HR.
Enterprise applications: Software owned by an external company, also known as third-party software.
SLO: Service-level objective; a quantifiable objective that measures the effectiveness of
SRE: Site reliability engineering; the enterprise’s support organization. Some enterprises may not have a dedicated SRE team; instead, they have support teams such as DevOps and system or IT admins. The principles and methodologies outlined here are generic enough that they can be applied to any
User: An employee of an organization who represents the consumer of an enterprise application.
Vendor: A provider of third-party
Figure 1. Reliability as a function of availability.
Availability (%)   Nines   Downtime / Month
99.9               3       43.2 minutes
99.99              4       4.32 minutes
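The downtime figures in Figure 1 follow from simple arithmetic: the unavailable fraction multiplied by the number of minutes in a month. The sketch below assumes a 30-day month, which is the assumption that reproduces the figure's values; the function name is illustrative.

```python
def allowed_downtime_minutes(availability_percent: float,
                             days_in_month: int = 30) -> float:
    """Minutes of downtime per month permitted at a given availability.

    Assumes a 30-day month by default (43,200 minutes).
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = days_in_month * 24 * 60
    unavailable_fraction = (100 - availability_percent) / 100
    return total_minutes * unavailable_fraction


for pct in (99.9, 99.99):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% availability allows about {minutes:.2f} minutes/month")
```

Each additional nine cuts the permitted downtime by a factor of ten, which is why the step from three nines to four nines moves the budget from roughly 43 minutes to roughly 4 minutes per month.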