˲ Add a margin of uncertainty to the forecast where possible, by provisioning three to five times the resources implied by the forecast (see the sketch after this list).
˲ While traffic from brand-new products is harder to predict, it is also usually small, so you can overprovision for this traffic without incurring too much cost.
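To make the margin concrete, here is a rough back-of-the-envelope sketch in Python. The traffic numbers, the per-server capacity, and the provision_for_forecast helper are hypothetical; a real capacity plan would also account for redundancy and regional placement.

import math

def provision_for_forecast(forecast_qps: float,
                           qps_per_server: float,
                           margin: float = 3.0) -> int:
    # Pad the forecast by a 3x-5x margin of uncertainty, then
    # convert the padded demand into a whole number of servers.
    return math.ceil((forecast_qps * margin) / qps_per_server)

# A brand-new product forecast at 400 QPS, where one server
# comfortably handles 200 QPS at acceptable latency:
print(provision_for_forecast(400, 200, margin=3.0))  # 6 servers
print(provision_for_forecast(400, 200, margin=5.0))  # 10 servers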
Lesson 4. Measure
Service Efficiency
SRE teams should regularly measure
the efficiency of each service they run,
using load tests and benchmarking
programs to determine how many user
requests per second can be handled
with acceptable response times, given
a certain quantity of computing resources (CPU, memory, disk I/O, and
network bandwidth). While performance testing may seem an obvious
best practice, in real life teams frequently forget about service efficiency.
They may benchmark a service once a
year, or just before a major release, and
then assume unconsciously that the
service’s performance remains constant between benchmarks. In reality,
even minor changes to the code, or to
user behavior, can affect the amount of
resources required to serve a given volume of traffic.
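As an illustration of the kind of measurement involved, the following sketch steps up the offered load against a handler and reports the 99th-percentile latency at each rate. The handle_request stub, the request rates, and the 50 ms target are placeholders, not Google's internal benchmarking tooling.

import time
import statistics
from concurrent.futures import ThreadPoolExecutor

def handle_request() -> None:
    # Stand-in for real request-handling work.
    time.sleep(0.002)

def p99_latency_at(qps: int, duration_s: float = 2.0) -> float:
    # Offer load at a fixed rate and record per-request latencies.
    latencies = []
    def timed_call():
        start = time.monotonic()
        handle_request()
        latencies.append(time.monotonic() - start)
    with ThreadPoolExecutor(max_workers=64) as pool:
        for _ in range(int(qps * duration_s)):
            pool.submit(timed_call)
            time.sleep(1.0 / qps)        # pace requests at the offered rate
    return statistics.quantiles(latencies, n=100)[98]   # 99th percentile

TARGET_P99_S = 0.050
for qps in (100, 200, 400, 800):
    p99 = p99_latency_at(qps)
    verdict = "meets" if p99 <= TARGET_P99_S else "misses"
    print(f"{qps} QPS: p99 = {p99 * 1000:.1f} ms ({verdict} the 50 ms target)")

The highest rate that still meets the latency target, divided by the resources used, is the efficiency figure worth tracking over time.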
A common way of finding out that
a service has become less efficient is
through a product outage. The SRE
team may think they have enough capacity to serve peak traffic even with
two datacenters’ worth of resources
turned down for maintenance or
emergency repairs, but when that rare event occurs and both datacenters really are down during peak traffic hours, the service's performance degrades so badly that it causes a partial outage or becomes too slow to be usable. In the worst case,
this can turn into a “cascading failure”
where all serving clusters collapse like
a row of dominoes, inducing a global
product outage.
Ironically, this type of massive
failure is triggered by the system’s attempt to recover from smaller failures.
One cluster of servers happens to receive higher load because of geography and/or user behavior, and that load is large enough to crash every server in the cluster. The traffic load-balancing system observes these servers going offline and performs a failover operation, diverting all the traffic formerly destined for the crashed cluster to nearby clusters instead. As a result, each of those nearby clusters is now even more overloaded, and its servers crash as
well, resulting in more traffic being
sent to even fewer live servers. The cycle
repeats until every single server is dead
and the service is globally unavailable.
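The dynamic is easy to see in a toy model. The following sketch uses made-up cluster names and capacities and a deliberately simplified even-split load balancer: any cluster whose load exceeds its capacity crashes, and its traffic fails over to the survivors.

def simulate_cascade(total_qps: float, capacity_qps: dict) -> None:
    # capacity_qps maps cluster name -> serving capacity in QPS.
    live = dict(capacity_qps)
    load = {c: total_qps / len(live) for c in live}   # even split
    # Push extra regional load onto one cluster to start the cascade.
    first = next(iter(load))
    load[first] += 0.3 * total_qps

    while live:
        crashed = [c for c in live if load[c] > live[c]]
        if not crashed:
            print("stable:", {c: round(load[c]) for c in live})
            return
        for c in crashed:
            print(f"{c}: {load[c]:.0f} QPS exceeds {live[c]:.0f} QPS capacity, crashing")
            del live[c]
        if live:
            # The load balancer diverts all traffic to the surviving clusters.
            load = {c: total_qps / len(live) for c in live}
    print("all clusters down: global outage")

simulate_cascade(
    total_qps=9000,
    capacity_qps={"us-east": 4000, "us-west": 4000, "eu-west": 4000},
)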
Services can avoid cascading failures by using the drop overload technique.
Here the server code is designed to detect when it is overloaded and randomly drop some incoming requests under
those circumstances, rather than attempting to handle all requests and
eventually melting down. This results
in a degraded customer experience for
users whose requests are dropped, but
that can be mitigated to a large extent
by having the client retry the request;
in any case, slower responses or outright error responses to a fraction of
users are a lot better than a global service failure.
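As a sketch of what drop overload can look like in server code: the concurrency limit and helper names below are hypothetical, and real implementations often shed load probabilistically or by request priority, or based on CPU and queue length, rather than on a single in-flight counter.

import threading

MAX_IN_FLIGHT = 200        # derived from the service's own benchmarks
_in_flight = 0
_lock = threading.Lock()

class Overloaded(Exception):
    # Signals a retryable "try again later" response (e.g., HTTP 503).
    pass

def handle_with_shedding(handler, request):
    # Reject the request up front if the server is already at its
    # concurrency limit, instead of queueing it and melting down.
    global _in_flight
    with _lock:
        if _in_flight >= MAX_IN_FLIGHT:
            raise Overloaded("server over capacity, please retry")
        _in_flight += 1
    try:
        return handler(request)
    finally:
        with _lock:
            _in_flight -= 1

On the client side, retrying a rejected request after a short randomized backoff converts most of these rejections into slightly slower responses rather than user-visible errors.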
It would be better, of course, to
avoid this situation altogether, and the
only way to do that is to regularly measure service efficiency to confirm the
SRE team’s assumptions about how
much serving capacity is available. For
a service that ships out releases daily
or more frequently, daily benchmarking is not an extreme practice—
benchmarking can be built into the automated release testing procedure. When
newly introduced performance regressions are detected early, the team can
provision more resources in the short
term and then get the performance
bugs fixed in the long term to bring resource costs back in line.
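One way to wire this into a release pipeline is a gate that compares the candidate build's measured capacity against a stored known-good baseline. The file name, JSON field, and 5% tolerance below are placeholders for whatever a team's own benchmarking setup produces.

import json
import sys

REGRESSION_TOLERANCE = 0.05   # block the release beyond a 5% capacity drop

def check_efficiency(baseline_path: str, candidate_qps: float) -> bool:
    # Compare the candidate's QPS-at-target-latency against the
    # last known-good baseline recorded by a previous benchmark run.
    with open(baseline_path) as f:
        baseline_qps = json.load(f)["qps_at_target_latency"]
    drop = (baseline_qps - candidate_qps) / baseline_qps
    if drop > REGRESSION_TOLERANCE:
        print(f"FAIL: capacity fell {drop:.1%} "
              f"({baseline_qps:.0f} -> {candidate_qps:.0f} QPS)")
        return False
    print(f"OK: capacity change {-drop:+.1%} is within tolerance")
    return True

if __name__ == "__main__":
    # Usage: check_efficiency.py <candidate_qps_from_benchmark_run>
    ok = check_efficiency("benchmark_baseline.json", float(sys.argv[1]))
    sys.exit(0 if ok else 1)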
If you run your service on a cloud platform, your provider may offer an autoscaling service that automatically provisions more resources when your service load increases. This
setup may be better than running
products on premises or in a datacenter with fixed hardware resources, but
it still does not get you off the hook for
regular benchmarking. Even though
the risk of a complete outage is lower,
you may find out too late that your
monthly cloud bill has increased dramatically just because someone modified the encoding scheme used for
compressing data, or made some other
seemingly innocuous code change. For these reasons, it is a best practice to measure service efficiency regularly.b

b For additional details, see “Managing Load,” Chapter 11 in the SRE workbook, which contains two case studies of managing overload.

Conclusion
The metrics discussed in this article should be useful to those who run a service and care about reliability. If you measure these metrics, set the right targets, and do the work to measure them accurately rather than approximately, you should find that your service runs better, you experience fewer outages, and you see a lot more user adoption. Most of us like those three properties.

Related articles
on queue.acm.org

A Purpose-built Global Network: Google’s Move to SDN
A discussion with Amin Vahdat, David Clark, and Jennifer Rexford
https://queue.acm.org/detail.cfm?id=2856460

From Here to There, the SOA Way
Terry Coatta
https://queue.acm.org/detail.cfm?id=1388788

Voyage in the Agile Memeplex
Philippe Kruchten
https://queue.acm.org/detail.cfm?id=1281893

References
1. Beyer, B., Jones, C., Petoff, J. and Murphy, N.R. Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media, 2016.
2. Beyer, B., Murphy, N.R., Rensin, D.K., Kawahara, K. and Thorne, S. The Site Reliability Workbook: Practical Ways to Implement SRE. O’Reilly Media, 2018.
3. Brutlag, J. Speed matters. Google AI Blog, 2009; https://research.googleblog.com/2009/06/speed-matters.html.
4. Kübler-Ross, E. Kübler-Ross model; https://en.wikipedia.org/wiki/K%C3%BCbler-Ross_model.
5. PageSpeed. Analyze and optimize your website with PageSpeed tools; https://developers.google.com/speed/.
6. Tassone, E. and Rohani, F. Our quest for robust time series forecasting at scale. The Unofficial Google Data Science Blog; http://www.unofficialgoogledatascience.com/2017/04/our-quest-for-robust-time-series.html.
7. Treynor, B. Metrics that matter (Google Cloud Next), 2017; https://youtu.be/iF9NoqYBb4U.

Benjamin Treynor Sloss started programming at age 6 and joined Oracle as a software engineer at 17. He has also worked at Versant, E.piphany, SEVEN, and (currently) Google. His team of approximately 4,700 is responsible for site reliability engineering, networking, and datacenters worldwide.

Shylaja Nukala is a technical writing lead for Google Site Reliability Engineering. She leads the documentation, information management, and select training efforts for SRE, Cloud, and Google engineers.

Vivek Rau is a site reliability engineer at Google, working on customer reliability engineering (CRE). The CRE team teaches customers core SRE principles, enabling them to build and operate highly reliable products on the Google Cloud Platform.

Copyright held by authors/owners.