forced as a strong requirement across
all teams that introduce changes.
Dedicated canary environment. Every
critical production application should
have a dedicated canary environment
as a prerequisite. It should be an exact
replica of the production environment.
This allows for testing user-facing impact such as load and performance.
Phased rollouts help reveal unforeseen issues (those not uncovered by
tests) that are discovered only in production. This provides the agility to roll
back the changes quickly and minimize
Rollbacks and restore. Another key
discipline is to ensure every change can
be rolled back. It is particularly important to understand the dependency
graph of the change and ensure an
atomic rollback. This is difficult in complex systems, but in such cases having a
clear restore point is key for most critical changes.
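As a rough illustration of using the dependency graph to plan a rollback, the sketch below derives a rollback order by reversing the deployment order. The component names and the `rollback_order` helper are hypothetical and not from the article. It assumes the change can be described as a mapping from each component to the components it depends on.

```python
from graphlib import TopologicalSorter


def rollback_order(depends_on):
    """Return components in the order they should be rolled back.

    depends_on maps each component to the set of components it depends on.
    Deployment goes dependencies-first; rollback reverses that order so that
    dependents are reverted before the components they rely on.
    """
    deploy_order = list(TopologicalSorter(depends_on).static_order())
    return list(reversed(deploy_order))


# Hypothetical change touching three components.
change = {
    "frontend": {"api"},
    "api": {"schema"},
    "schema": set(),
}

print(rollback_order(change))  # ['frontend', 'api', 'schema']
```

If the graph contains a cycle, `TopologicalSorter` raises `graphlib.CycleError`, which is a useful early signal that the change cannot be rolled back step by step and needs a restore point instead.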
Error budgets are a simple concept.
Every service has a target SLO, and if
it exceeds that SLO, then that positive
delta of uptime becomes the budget to
use in pushing any changes or releases.
This is a powerful concept explained in
depth in the SRE book [1]. Sharing this
rigor with your application development team is a good way to ensure service reliability.
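The error-budget idea described above can be sketched in a few lines. This is a minimal illustration, assuming availability is measured as the ratio of good events to total events over some window. The function names and the example numbers are hypothetical.

```python
def remaining_error_budget(slo_target, good_events, total_events):
    """Return measured availability minus the SLO target.

    A positive result is the budget available for pushing changes;
    a zero or negative result means the SLO is not being exceeded.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    measured = good_events / total_events
    return measured - slo_target


def can_release(slo_target, good_events, total_events):
    """Allow a release only while there is a positive budget left."""
    return remaining_error_budget(slo_target, good_events, total_events) > 0


# Hypothetical service with a 99.9% SLO over one million requests.
print(can_release(0.999, 999_500, 1_000_000))  # True: measured 99.95%
print(can_release(0.999, 998_000, 1_000_000))  # False: measured 99.8%
```

In practice the measurement window and the definition of a "good" event are policy choices that the SRE and application development teams would agree on together.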
Outages and incidents. No matter
how reliable a system is, you should anticipate and prepare for a disaster. Rather than solving for no outages, which
is impractical, the focus should be on
effectively managing the outage (minimizing downtime) and learning from it,
so the same patterns don’t repeat.
Resiliency testing. The goal here is
to stress test application resiliency by
breaking the system, observing the effects of the breakage, and subsequently
improving the reliability of the application.
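A toy version of this break, observe, and improve loop is sketched below. It is not a real chaos-engineering tool: `FlakyDependency`, `call_with_retries`, and `success_ratio` are hypothetical names, the fault is an injected `ConnectionError`, and the "improvement" being evaluated is a simple retry.

```python
import random


class FlakyDependency:
    """Stand-in for a downstream service that fails at a configurable rate."""

    def __init__(self, failure_rate, seed=0):
        self._rng = random.Random(seed)  # seeded so runs are repeatable
        self.failure_rate = failure_rate

    def call(self):
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return "ok"


def call_with_retries(dep, attempts):
    """Call the dependency up to `attempts` times; return None if all fail."""
    for _ in range(attempts):
        try:
            return dep.call()
        except ConnectionError:
            continue
    return None


def success_ratio(failure_rate, attempts, requests=1000):
    """Observe the fraction of requests that succeed under injected faults."""
    dep = FlakyDependency(failure_rate)
    ok = sum(call_with_retries(dep, attempts) == "ok" for _ in range(requests))
    return ok / requests


# Break the dependency 30% of the time and compare one attempt with three.
print(success_ratio(0.3, attempts=1))
print(success_ratio(0.3, attempts=3))
```

Comparing the two ratios shows how much a mitigation helps under a given fault rate, which is the kind of observation a resiliency test is meant to produce.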
Incident preparedness. The SRE team
should periodically run fire drills to
practice incident management that involves extensive coordination with partner teams, timely communication to
stakeholders, and restoring the service
as soon as possible. Responding to and
handling an actual incident without this
preparation can reduce the speed and
effectiveness of restoring the service.
Learning from outages. A repeated
outage is not an outage anymore; it is a
mistake. For every outage there should
be a thorough post-mortem that clearly
identifies the root cause of the outage
and focuses on what went wrong and
what can be improved going forward.
It is critical for enterprises to foster a
blameless post-mortem culture that
focuses on improving the reliability of
The Future of Enterprise Reliability
Over the past few years, cloud platform providers have increasingly focused on enterprises, offering a suite
of secure, reliable, and cost-effective
products from highly scalable compute, storage, and networking services to modernized managed offerings
such as container as a service (Kubernetes), serverless, and DBaaS. In addition, cloud providers are delivering
advanced services in the realms of AI
(artificial intelligence), ML (machine
learning), and big data, opening a
wide range of possibilities for enterprises to rethink and transform their
This shift represents a tremendous
opportunity for enterprises to embrace
and adopt the cloud. Undertaking such
a large-scale migration, however, introduces a new challenge: How can enterprises adapt and rapidly evolve without
reducing their reliability?
Cloud migration strategy.
Enterprises typically have complex business
requirements, so a lift-and-shift strategy to migrate 100% of their workloads
to a single cloud provider may not be
feasible. A hybrid cloud environment
provides the flexibility for workloads to
operate seamlessly across both public
and private cloud environments. This
approach greatly simplifies the cloud
adoption strategy and provides a controlled environment that ensures a predictable level of reliability throughout
the transition to the cloud.
Enterprises that thoughtfully embrace the hybrid cloud strategy have
less risk in terms of overall reliability
and have a faster path to cloud transformation. Investing in a common
application platform, coupled with
the adoption of technologies such as
Kubernetes (https://kubernetes.io/),
Istio (https://istio.io/), and serverless
computing ( https://en.wikipedia.org/
the flexibility to operate workloads,
agnostic to the cloud provider. Technologies such as the GCP (Google
Cloud Platform) Anthos platform
can also help enterprises expedite
their transition to the cloud in a reliable and efficient manner.
VEC ecosystem. Developing a strong
relationship among vendors, enterprises, and cloud providers is pivotal
to the future of enterprise reliability.
Cloud providers need to motivate software vendors, through partnership
programs, to modernize third-party
software embracing cloud-based technologies and building certified
multicloud-compliant software offerings.
This VEC (vendor-enterprise-cloud)
ecosystem coupled with the technological shift will bring a rapid
transformation shaping the enterprise domain.
Maintaining enterprise reliability is a continuous process that has reached
a crucial moment with the advent of
the cloud. The next decade will be the
era of large-scale enterprise transformations leveraging cloud capabilities, and only those enterprises that
grasp the discipline of reliability engineering will be able to transform
successfully into the realm of cloud-based enterprise computing.
1. Jones, C., Wilkes, J., Murphy, N. and Smith, C. Service-level objectives. Site Reliability Engineering. B. Beyer,
C. Jones, J. Petoff, and N.R. Murphy, eds. O’Reilly
Media, 2016; https://landing.google.com/sre/sre-book/
2. Treynor, B., Dahlin, M., Rau, V., Beyer, B. The calculus
of service availability. acmqueue 15, 2 (2017); https://
Sanjay Sha is an SRE Manager at Google. With more than
14 years’ experience running several large-scale systems
at Google, he currently leads the Enterprise domain,
managing SRE teams supporting Google’s key business
verticals. He is currently working on the Corp to Cloud
initiative to run Google’s internal enterprise workloads
Copyright held by author/owner.
Publication rights licensed to ACM.