Functional and infrastructure requirements can heavily influence the
design and delivery of an application.
Therefore, evaluating the feasibility of these requirements is a crucial
step in engineering the reliability of
Common application platform.
Most enterprises rely on third-party
software to support the operations
and needs of their business verticals
(Figure 2). Running different third-party applications, however, can lead
to a large number of disparate systems
within an enterprise. Not having a
common baseline across applications
makes maintaining the reliability and
efficiency of service more difficult over
time. This creates a lot of overhead for
the SRE team and increases the organization’s operational costs.
A common platform provides a standard operating environment in which
to run all of a system’s applications,
enhancing the overall reliability and efficiency of an enterprise. The key principle of implementing a common platform is to identify, build, and enforce
a set of shared modules and standards
that can be reused across the applications that support the business verticals.
On the other hand, overengineering a
common platform can have a negative
impact. If a platform has too many standards in place or becomes too rigid,
an enterprise’s delivery and execution
speed can decrease significantly.
The goal is to develop a strategy that
allows enterprises to find the right balance between optimizing for reliability and maintaining the development
speed needed to deliver and support
business functionality. Finding this
balance requires a careful analysis of
the trade-offs and net benefits.
Common platform layout. An application platform consists of a set
of modules that can be grouped into
three main categories (Figure 3):
˲ Infrastructure deployment modules.
˲ Application management modules.
˲ Common service modules.
Infrastructure deployment modules
provide intent-based deployment of
an end-to-end application environment
based on a set of resource requirements
such as CPU, memory,
operating systems, and the number of
instances. This mechanism is highly
efficient since the workflows only need
to be configured once and can be triggered
as needed. It also provides a
standardized, consistent, and predictable
environment, which improves
Many enterprises are already embracing
open-source technologies
to help them manage the underlying
infrastructure of their applications.
Tools such as Terraform provide abstractions
to handle the provisioning
and deployment of end-to-end environments
agnostic to the underlying
platform (for example, on premises
Application management modules
handle critical workflows during the
life of an application. A few examples
of these workflows include:
˲ Configuration management work-
flows to deploy application configu-
˲ Release management workflows to
manage software releases and rollbacks.
˲ Security management workflows to
manage secrets and certification de-
Software solutions such as Puppet,
Chef, and Ansible provide frameworks
and solutions for enterprises to
orchestrate these workflows across
Common service modules manage
the standardized workflows that can
be shared across all applications, such
as logging, monitoring, and reporting.
This layer can also include custom service
modules for the specific needs of
an enterprise, such as a custom web
front end or a single sign-on service.
Some examples of common service modules include:
˲ Monitoring module to collect
and publish metrics for reporting
˲ Backup module to execute backups,
retention, and recovery.
˲ Log collection module to securely
ship logs to a centralized log service.
˲ Custom WebLogic/Tomcat as a service
offering middleware capabilities.
˲ Managed DBaaS (database as a
service) module to manage database
Combining infrastructure deployment,
application management, and
common service modules creates a platform
that enables enterprises to move
away from managing monolithic applications
and into a new realm of modular,
extensible, and reusable applications.
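The intent-based deployment described for the infrastructure deployment modules can be sketched roughly as follows. This is a minimal illustration, not the interface of Terraform or any other tool: the `EnvironmentIntent` class, the `plan_instances` function, and the `sales-app` values are all hypothetical names chosen for this example. The idea is that the resource requirements are declared once and then expanded into identical, predictable instance descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentIntent:
    """Hypothetical resource requirements for one application environment."""
    app_name: str
    cpu_cores: int
    memory_gb: int
    operating_system: str
    instances: int


def plan_instances(intent: EnvironmentIntent) -> list[dict]:
    """Expand an intent into one identical description per instance."""
    if intent.instances < 1:
        raise ValueError("an environment needs at least one instance")
    return [
        {
            "name": f"{intent.app_name}-{index}",
            "cpu_cores": intent.cpu_cores,
            "memory_gb": intent.memory_gb,
            "operating_system": intent.operating_system,
        }
        for index in range(intent.instances)
    ]


# Assumed example values; a real platform would read these from a request.
intent = EnvironmentIntent("sales-app", cpu_cores=4, memory_gb=16,
                           operating_system="linux", instances=3)
plan = plan_instances(intent)
print([instance["name"] for instance in plan])
```

Because every instance is derived from the same declared intent, the resulting environment is consistent by construction, which is the property the platform relies on.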
Cost engineering. When enterprises
opt for third-party software, they are
making a cost- and ROI-based decision
to use a “reliable” enterprise application
that delivers the business functionality
in a cost-effective manner. Determining
the right reliability-to-cost
trade-off that sustains the ROI curve is
the crux of cost engineering.
Reliability-to-cost trade-off. Figure 4
illustrates how reliability (the number
of nines) translates into availability
and downtime. Each additional nine cuts
the remaining downtime by a factor of 10,
so the absolute reduction shrinks with
every nine. While it is extremely
tempting to add a nine, it is important
to recognize that engineering an additional
nine can be expensive, and
overengineering reliability produces
diminishing ROI. To understand this,
let’s look at the following scenario.
Enterprise ABC is looking for a
third-party sales application that can
provide market analysis and insights.
The sales team predicts they can generate
an average of $600/hour of revenue
by leveraging those insights.
Their revenue target per quarter is approximately
$1.2 million. What is the
required uptime (availability SLO) for
the application?
If the application were available
100% of the time, the maximum revenue
would be:
Net revenue = hours in a quarter (3 months × 30 days × 24 hours = 2,160)
× earnings per hour ($600)
$1,296,000 (~$1.3M) = 2,160 hours in a quarter × $600 per hour
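The arithmetic above can be checked with a short sketch. It assumes, as the scenario implies, that revenue accrues at a flat $600 for every hour the application is available and at $0 while it is down; the variable names are chosen for this example only. Under that assumption, the availability needed to reach the quarterly target is the target divided by the maximum revenue.

```python
HOURS_PER_QUARTER = 3 * 30 * 24   # 2,160 hours, using 30-day months
REVENUE_PER_HOUR = 600            # dollars per available hour
QUARTERLY_TARGET = 1_200_000      # dollars

# Revenue if the application were up for every hour of the quarter.
max_revenue = HOURS_PER_QUARTER * REVENUE_PER_HOUR

# Fraction of the quarter the application must be up to hit the target
# (assumes revenue is lost linearly with downtime).
required_availability = QUARTERLY_TARGET / max_revenue

# Hours of downtime the target can absorb.
allowed_downtime_hours = (max_revenue - QUARTERLY_TARGET) / REVENUE_PER_HOUR

print(f"Maximum revenue: ${max_revenue:,}")                          # $1,296,000
print(f"Required availability: {required_availability * 100:.2f}%")  # 92.59%
print(f"Allowed downtime: {allowed_downtime_hours:.0f} hours")       # 160 hours
```

Under these simplifying assumptions, the target tolerates roughly 160 hours of downtime per quarter, which is well below even two nines of availability.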
Figure 4. Availability and reliability.
Percent   Nines     Downtime / Qtr    Downtime / Month
90        1 nine    9 days            3 days
99        2 nines   21.6 hours        7.2 hours
99.9      3 nines   2.16 hours        43.2 minutes
99.99     4 nines   12.96 minutes     4.32 minutes
99.999    5 nines   1.30 minutes      25.9 seconds
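The downtime figures in Figure 4 follow directly from the number of nines. The sketch below reproduces them, assuming a 30-day month and a 90-day quarter as in the revenue calculation; the `downtime_hours` helper is a name introduced here for illustration.

```python
QUARTER_HOURS = 90 * 24   # 2,160 hours
MONTH_HOURS = 30 * 24     # 720 hours


def downtime_hours(nines: int, period_hours: float) -> float:
    """Allowed downtime for a given number of nines over a period."""
    unavailability = 10 ** -nines   # 1 nine -> 0.1, 2 nines -> 0.01, ...
    return period_hours * unavailability


for nines in range(1, 6):
    percent = 100 * (1 - 10 ** -nines)
    decimals = max(nines - 2, 0)    # 90, 99, 99.9, 99.99, 99.999
    per_quarter = downtime_hours(nines, QUARTER_HOURS)
    per_month = downtime_hours(nines, MONTH_HOURS)
    print(f"{percent:.{decimals}f}%  {nines} nine(s)  "
          f"{per_quarter:.4f} h/quarter  {per_month:.4f} h/month")
```

Each step down the table divides the allowed downtime by 10, so the hours gained by adding a nine shrink quickly while the engineering effort typically does not.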