practice
Site reliability engineering, or SRE, is a software-engineering specialization that focuses on the
reliability and maintainability of large systems. In its
experience in the field, Google has found some critical
but oft-neglected metrics that are important for
running reliable services.
This article, based on Ben Treynor’s talk at the
Google Cloud Next 2017 conference,[7] addresses those
metrics, specifically for product development and
SRE teams, managers of such teams, and anyone
else who cares about the reliability of Web products
or infrastructure. To further explain its approach
to product reliability, Google has published Site
Reliability Engineering: How Google Runs Production
Systems[1] (hereafter referred to as the SRE book) and
The Site Reliability Workbook: Practical
Ways to Implement SRE[2] (hereafter referred to as the SRE workbook).
One of the most important choices
in offering a service is which service
metrics to measure, and how to evaluate them. The difference between
great, good, and poor metric and metric threshold choices is frequently the
difference between a service that will
surprise and delight its users with how
well it works, one that will be acceptable for most users, and one that will
actively drive away users—regardless
of what the service actually offers.
For example, it is not uncommon
to measure the QPS (queries per second) received at a Web or API server,
and to assess that this metric indicates
good service health if the graph of the
metric over time has a smooth sinusoidal diurnal curve with no unexpected
spikes or troughs, and the peaks of the
curve are rising over time, indicating
user growth. Yet this is a poor metric
choice—at best it will provide the operator with a lagging indicator of large-scale problems. It misses a host of real,
common problems, including partial
unreachability, error rates in the 0.1%–3% range, high latency, and intervals of
bad results.
These problems lead to unhappy
users and service abandonment—yet
throughout it all, the QPS Received
graph continues to show its happy sinusoidal curves and to provide a soothing
sense that all is well. The best that can
be said about the QPS Received metric
is that it’s relatively simple to implement—and even that is a problem, because it is often implemented early and
thus takes the place of more sophisticated and useful metrics that would
provide an operator with more accurate
and useful data about the service.
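The blind spot described above can be illustrated with a small sketch. It is not taken from the article or from any Google tooling: the `Request` record, the helper names, and the synthetic traffic are all hypothetical, chosen only to show two windows with identical QPS Received where one of them hides a 2% error rate and a slow tail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """Hypothetical per-request record; field names are assumptions."""
    ok: bool            # True if the response was successful and correct
    latency_ms: float   # time taken to serve the request


def qps_received(requests: List[Request], window_s: float) -> float:
    """Requests received per second over the window (ignores outcome)."""
    return len(requests) / window_s


def error_rate(requests: List[Request]) -> float:
    """Fraction of requests that failed."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if not r.ok) / len(requests)


def latency_percentile(requests: List[Request], pct: int) -> float:
    """Nearest-rank percentile of latency; pct is an integer in 1..100."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


WINDOW_S = 10.0

# Synthetic healthy window: 1,000 fast, successful requests.
healthy = [Request(ok=True, latency_ms=50.0) for _ in range(1000)]

# Synthetic degraded window: same volume, but 20 failures (2%)
# and 100 requests with a 900 ms latency.
degraded = (
    [Request(ok=False, latency_ms=50.0) for _ in range(20)]
    + [Request(ok=True, latency_ms=900.0) for _ in range(100)]
    + [Request(ok=True, latency_ms=50.0) for _ in range(880)]
)

for name, window in (("healthy", healthy), ("degraded", degraded)):
    print(
        name,
        f"qps={qps_received(window, WINDOW_S):.1f}",
        f"error_rate={error_rate(window):.3f}",
        f"p99_ms={latency_percentile(window, 99):.0f}",
    )
```

In this sketch both windows report the same QPS Received, so a graph of that metric alone would look identical for the two. Only the error rate and the tail latency separate the healthy window from the degraded one.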
What follows are the types of metrics the Google SRE team has adopted
for Google services. These metrics are
not particularly easy to implement,
and they may require changes to a service to instrument properly. It has been
our consistent experience at Google,
however, that every service team that
Metrics That Matter
DOI: 10.1145/3303874
Article development led by queue.acm.org
Critical but oft-neglected service metrics that every SRE and product owner should care about.
BY BENJAMIN TREYNOR SLOSS, SHYLAJA NUKALA, AND VIVEK RAU