if you are measuring only server-side latency, because you will be completely unaware the product is slow for users. Even if you get anecdotal reports of slowness and try to follow up on them, you will have no way of determining which subset of users is experiencing slowness, and when.
To measure the actual user experience, you have to measure and record client-side latency. It can be hard work to instrument the client code to capture this latency metric and then to ship client-side metrics back to the datacenter for analysis. The work may be further complicated by the need to handle broken network connections by storing the data and uploading it later. Though difficult, client-side metrics are essential and achievable.
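The store-and-upload-later idea can be sketched as a small buffer that keeps measurements until an upload succeeds. This is only an illustration under assumed names (`LatencyReporter` and `send` are hypothetical, not a real API); production code would batch, throttle, and persist across restarts:

```javascript
// Illustrative sketch: buffer client-side latency measurements and retry
// uploads, so a broken network connection does not lose data.
class LatencyReporter {
  constructor(send) {
    this.send = send;   // transport callback; throws when the network is down
    this.pending = [];  // measurements recorded but not yet delivered
  }

  record(action, startMs, endMs) {
    this.pending.push({ action, latencyMs: endMs - startMs });
  }

  // Attempt an upload; on failure, keep the data queued for a later flush.
  flush() {
    if (this.pending.length === 0) return true;
    try {
      this.send(this.pending);
      this.pending = [];
      return true;
    } catch (e) {
      return false; // offline: data stays buffered for a later upload
    }
  }
}
```

A real client would call `flush` on a timer and on reconnect events, and would use an asynchronous transport such as a background upload or beacon.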
For a browser application, you can write additional JavaScript that gathers these statistics for users on different platforms, in different countries, and so on, and sends them back to the server. For a thick client, the path is more obvious, but it’s still important to measure the time from the moment the user interacts with the client until the response is delivered. Either way, instrumenting the user experience takes a relatively small fraction of the effort previously expended to write the entire application, and the payback for this incremental effort is high.
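Measuring from the user's action to the delivered response can be as simple as wrapping each action handler. A minimal sketch with hypothetical names (`instrument` and `report` are not from any real framework); a real browser client would use `performance.now()` and await asynchronous responses:

```javascript
// Illustrative sketch: wrap an action handler so every invocation reports
// user-perceived latency, from the user's action until the response arrives.
function instrument(action, handler, report) {
  return (...args) => {
    const start = Date.now();          // the moment the user acts
    const result = handler(...args);   // do the real work
    report({ action, latencyMs: Date.now() - start });
    return result;
  };
}
```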
To take an example from Google’s own history, when Gmail was launched, most users accessed it through a Web browser (not a mobile client), and Google’s Web client code had no instrumentation to capture client-side latency. So, we relied on server-side latency data, and the response time seemed quite acceptable. When Google finally launched an instrumented JavaScript client, at first we did not believe the data it was sending back; it seemed impossible the user experience was that bad. We went through the denial stage for a while, and then anger, and eventually got to bargaining.4 We made some major changes to how the Gmail server and its client worked to improve our client-side latency, and the reward was a visible inflection point in Gmail’s growth once the user experience improved. The long-term trends in our monitoring dashboards showed users responding to the improved product
experience. The instrumentation cost around 3% of the effort expended to write the original application, and every team that implements these metrics is happy afterward that it made the effort to do so. The metrics investment is small compared with the overall effort to build and launch the service in the first place, and the prompt payback in user satisfaction and usage growth is outsized relative to the effort required. We believe you will find this is true for your service, too.
Lesson 1. Measure the Actual User Experience
The SRE book emphasizes that speed matters to users, as demonstrated by Google’s research on shifts in behavior when users are exposed to delayed responses from a Web service.3 When services get too slow, users start to disengage, and when they get even slower, users leave. “Speed matters” is a good axiom for SREs to apply when thinking about what makes a service attractive to users.
A good follow-up question is, “Speed for whom?” Engineers often think about measuring speed on the server side, because it is relatively easy to instrument servers to export the required metrics, and standard monitoring tools are designed to capture such metrics from servers in dashboards and highlight anomalies with alerts. What this standard setup is measuring is the interval between the point in time when a user request enters a datacenter and the point in time when a response to that request leaves the datacenter. In other words, the metric being captured is server-side latency. Measuring server-side latency is not sufficient, though it is better than not measuring latency at all. Measuring and reporting on server-side latency can be a useful stopgap while solving the harder problem of measuring client-side latency.
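The server-side interval described here (request enters the datacenter, response leaves it) can be made concrete with a small wrapper. This is a sketch under assumed names (`withServerLatency` and `exportMetric` are illustrative, not a real monitoring API):

```javascript
// Illustrative sketch of the server-side view: latency measured from the
// moment a request enters the server to the moment the response leaves it.
// Everything that happens on the client's side of the network is invisible.
function withServerLatency(handler, exportMetric) {
  return (request) => {
    const start = Date.now();          // request enters the datacenter
    const response = handler(request); // application work
    exportMetric({ path: request.path, serverMs: Date.now() - start });
    return response;                   // response leaves the datacenter
  };
}
```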
The problem is that users have no interest in this server-side metric. Users care about how fast or slow the application is when responding to their actions, and, unfortunately, this can have very little correlation with server-side latency. Perhaps these users have a cheap phone, on a slow 2G network, in a country far away from your servers; if your product doesn’t work for them, all your hard work building great features will be wasted, because users will be unhappy and will use a different product. The problem will be compounded