Optimizely.com: “Why? Because looking at your average response time is
like measuring the average temperature of a hospital. What you really care
about is a patient’s temperature, and
in particular, the patients who need
the most help.” 16 In the next section
we will meet median values and quantiles, which are better suited for this
kind of performance analysis.
Spike erosion. Viewing metrics as
line plots in a monitoring system often reveals a phenomenon called spike
erosion. 5 To reproduce this phenomenon, select a metric (for example, ping
latencies) that experiences spikes at
discrete points in time. Zoom in
on one of those spikes and read the
height of the spike off the y-axis. Now
zoom out of the graph and read the
height of the same spike again. Are
they equal?
Figure 10 shows such a graph. The
spike height has decreased from 0.8
to 0.35.
How is that possible? The result is
an artifact of a rollup procedure that
is commonly used when displaying
graphs over long time ranges. The
amount of data gathered over the period of one month (more than 40,000
minutes) is larger than the number
of pixels available for the plot. Therefore, the data has to be rolled up into
larger time periods before it can be
plotted. When the mean value is used
for the rollups, the single spike is averaged with an increasing number
of “normal” samples and hence decreases in height.
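The effect is easy to reproduce numerically. The following is a minimal sketch, not taken from any particular monitoring system: the function name rollup_mean and the sample data (a flat baseline of 0.1 with a single spike of 0.8, one sample per minute) are assumptions made for illustration.

```python
def rollup_mean(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    rolled = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        rolled.append(sum(chunk) / len(chunk))
    return rolled


# Hypothetical data: two hours of one-minute samples with a single spike.
samples = [0.1] * 120
samples[60] = 0.8

# The wider the rollup window, the lower the spike appears in the plot.
for window in (1, 5, 30, 60):
    peak = max(rollup_mean(samples, window))
    print(f"window={window:>2} min  displayed peak={peak:.3f}")
```

With these made-up numbers the displayed peak drops from 0.8 at one-minute resolution to 0.24 with five-minute windows, and it keeps shrinking toward the baseline as the window grows.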
How to do better? The immediate
way of addressing this problem is to
choose an alternative rollup method,
such as max values, but this sacrifices
information about typical values. Another, more elegant solution is to roll
up values as histograms and display a
two-dimensional heat map instead of a
line plot for larger view ranges.
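Both alternatives can be sketched in a few lines. The code below is an illustration under stated assumptions, not the rollup implementation of any specific product: the names rollup_max and rollup_histogram, the fixed bin width, and the latency values in milliseconds are all hypothetical.

```python
from collections import Counter


def rollup_max(samples, window):
    """Keep only the largest sample of each non-overlapping window."""
    return [max(samples[i:i + window]) for i in range(0, len(samples), window)]


def rollup_histogram(samples, window, bin_width):
    """Return one histogram per window, mapping bin index to sample count.

    Bin k covers the value range [k * bin_width, (k + 1) * bin_width).
    """
    histograms = []
    for i in range(0, len(samples), window):
        chunk = samples[i:i + window]
        histograms.append(Counter(int(x // bin_width) for x in chunk))
    return histograms


# Hypothetical latencies in ms: a 10 ms baseline with one 80 ms spike.
samples = [10] * 120
samples[60] = 80

print(rollup_max(samples, 60))
for hist in rollup_histogram(samples, 60, bin_width=20):
    print(dict(hist))
```

The max rollup preserves the spike but reports nothing about the 59 ordinary samples in the same window. The histogram rollup keeps both: each window becomes one column of a heat map, where the count in each bin determines the color of the cell.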
Deviation measures. Once the mean
value μ of a dataset has been established, the next natural step is to measure the deviation of the individual
samples from the mean value. The following three deviation measures are
often found in practice.
The maximal deviation is defined as
$\mathrm{maxdev}(x_1, \dots, x_n) = \max\{\, |x_i - \mu| \;:\; i = 1, \dots, n \,\}$,
and gives an upper bound for the distance to the mean in the dataset.
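As a small worked example, the maximal deviation can be computed directly from its definition. This is a sketch under assumptions: the helper names mean and maxdev and the latency values are made up for illustration.

```python
def mean(xs):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(xs) / len(xs)


def maxdev(xs):
    """Largest absolute distance of any sample from the mean value."""
    mu = mean(xs)
    return max(abs(x - mu) for x in xs)


# Hypothetical latencies in ms; a single outlier dominates the result.
latencies_ms = [2, 3, 3, 4, 13]
print(mean(latencies_ms), maxdev(latencies_ms))
```

For these samples the mean is 5 and the maximal deviation is 8, determined entirely by the one outlier at 13 ms.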
Figure 12. The cumulative distribution function for a dataset of request rates. (x-axis: Request Rates in rps; y-axis: Cumulative Sample Frequency.)
Figure 13. Histogram metric with quantile QMIN(0.8) over 1H windows. (x-axis: June 19 to June 26; y-axis: Latency in ms.)
Figure 14. Histogram metric with inverse quantile CDF(3ms) over 1H windows. (x-axis: June 19 to June 26; y-axes: Latency in ms (Heatmap) and Inverse Percentile in % (Line).)