the SLA. Figure 14 shows a plot of the
inverse quantile CDF(X1h, 3ms), which
takes values on the right axis between
0% and 100%. The SLA violation manifests as the inverse quantile dropping below 80%.
Hence, quantiles and inverse quantiles give complementary views of the
current service level.
This article has presented an overview of some statistical techniques
that find applications in IT operations. We discussed several visualization methods, their qualities and
relations to each other. Histograms
were shown to be an effective tool for
capturing data and visualizing sample distributions. Finally, we have
seen how to analyze request latencies
with (inverse) percentiles.
1. Cockcroft, A. Monitorama—Please, no more Minutes,
Milliseconds, Monoliths or Monitoring Tools, 2014;
2. Georgii, H.-O. Stochastik. DeGruyter, 2002.
3. Gunther, N. J. Guerrilla Capacity Planning.
Springer-Verlag, Berlin, 2007.
4. Hartmann, H. Statistics for Engineers, 2015; https://
5. Hartmann, H. Show Me the Data, 2016; http://www.
6. HDR Histogram: A high dynamic range histogram;
7. Histogram; https://en.wikipedia.org/wiki/Histogram.
8. IPython; http://ipython.org.
9. Izenman, A.J. Modern Multivariate Statistical
Techniques. Springer-Verlag, New York, 2008.
10. Janert, P.K. Data Analysis with Open Source Tools.
11. Lp space; https://en.wikipedia.org/wiki/Lp_space.
12. Matplotlib; http://matplotlib.org.
13. Quantile; https://en.wikipedia.org/wiki/Quantile.
14. Schlossnagle, T. The problem with math: why your
monitoring solution is wrong, 2015; http://www.
15. Schwarz, B. Practical Scalability Analysis with
the Universal Scalability Law, 2015; https://www.
16. Ugurlu, D. The Most Misleading Measure of Response
Time: Average, 2013; https://blog.optimizely.
17. Waskom, M. Seaborn: statistical data visualization,
18. Waskom, M. Seaborn example gallery, 2012–2015;
Heinrich Hartmann is chief data scientist for the Circonus
monitoring platform, leading the efforts to make Circonus
the best tool to monitor APIs and services. Previously,
he worked as an independent consultant for a number of
different companies and research institutions.
Copyright held by author.
Publication rights licensed to ACM. $15.00
• at least q × n samples are less than
or equal to y;
• at least (1 – q) × n samples are greater than or equal to y.
Familiar examples are the minimum, which is a 0-quantile; the maximum, which is a 1-quantile; and the
median, which is a 0.5-quantile. Common names for special quantiles include percentiles for k/100-quantiles
and quartiles for k/4-quantiles.
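The two counting conditions in the definition can be checked directly. The following is a minimal sketch, not part of the original listing; the function name is_quantile and the sample data are chosen for illustration only.

```python
def is_quantile(X, q, y):
    """Return True if y satisfies both counting conditions of a q-quantile of X."""
    n = len(X)
    at_most = sum(1 for x in X if x <= y)   # samples less than or equal to y
    at_least = sum(1 for x in X if x >= y)  # samples greater than or equal to y
    return at_most >= q * n and at_least >= (1 - q) * n

X = [3, 1, 4, 1, 5]  # made-up sample data
print(is_quantile(X, 0.0, min(X)))  # the minimum as a 0-quantile
print(is_quantile(X, 1.0, max(X)))  # the maximum as a 1-quantile
print(is_quantile(X, 0.5, 3))       # 3 as a median of X
```

Because the check only counts samples, it works for any candidate value y, whether or not y occurs in the dataset.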
Note that quantiles are not unique.
There are ways of making them
unique, but those involve a choice that
is not obvious. Wikipedia lists nine
different choices that are found in
common software products [13].
Therefore, if people talk about the
q-quantile or the median, one should always
be careful and question which choice
was made.
As a simple example of how quantiles are non-unique, take a dataset with
two values X = [10, 20]. Which values are
medians, 0-quantiles, 0.25-quantiles?
Try to figure it out yourself.
The good news is that q-quantiles
always exist and are easy to compute.
Indeed, let S be a sorted copy of the
dataset X such that the smallest element of X is equal to S[0] and the largest element of X is equal to S[n – 1].
If d = floor(q × (n – 1)), then S[d] will
have d + 1 samples S[0],...,S[d], which
are less than or equal to S[d], and n
– d samples S[d],...,S[n – 1], which
are greater than or equal to S[d]. It
follows that S[d] = y is a q-quantile.
The same argument holds true for
d = ceil(q × (n – 1)).
The following listing is a Python implementation of this construction:
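import math
def quantile_range(X, q):
    S = sorted(X)
    r = q * (len(X) - 1)
    # Return the samples at the floor and ceiling of the fractional index r
    return (S[math.floor(r)], S[math.ceil(r)])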
It is not difficult to see this construction consists of the minimal
and maximal possible q-quantiles.
The notation Qmin (X, q) represents
the minimal q-quantile. The minimal
quantile has the property Qmin (X, q) ≤
y if and only if at least n × q samples of
X are less than or equal to y. A similar
statement holds true for the maximal
quantile when checking ratios of samples
that are greater than y.
Quantiles are closely related to
the cumulative distribution functions discussed in the previous section. Those concepts are inverse to
each other in the following sense: If
CDF(X, y) = q, then y is a q-quantile for
X. Because of this property, cumulative distribution function values are
also referred to as inverse quantiles.
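As a small illustration of this inverse relationship, the sketch below computes an empirical CDF value and confirms that y then meets the two counting conditions of a q-quantile. The helper names cdf and is_quantile are hypothetical, the data is made up, and the CDF is assumed to be the fraction of samples less than or equal to y.

```python
def cdf(X, y):
    """Empirical CDF (assumed definition): fraction of samples in X that are <= y."""
    return sum(1 for x in X if x <= y) / len(X)

def is_quantile(X, q, y):
    """Check the two counting conditions of a q-quantile."""
    n = len(X)
    at_most = sum(1 for x in X if x <= y)
    at_least = sum(1 for x in X if x >= y)
    return at_most >= q * n and at_least >= (1 - q) * n

X = [1.2, 0.8, 2.5, 3.1, 0.9, 4.7, 1.1, 2.0]  # made-up latencies in ms
y = 2.0
q = cdf(X, y)                  # the inverse quantile of y
print(q, is_quantile(X, q, y))
```

Reading the result in the other direction, the value q reports which share of the samples lies at or below the threshold y.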
Quantiles and CDFs provide a powerful
method to measure service levels. To
see how this works, consider the following
SLA that is still commonly seen
in practice: “The mean response time
of the service shall not exceed three
milliseconds when measured each
minute over the course of one hour.”
An SLA that captures the quality of
service as experienced by the customers
looks like this: 80% of all requests
served by the API within one hour
should complete within 3ms.
Not only is this SLA easier to formulate, but it also avoids the above problems. A single long-running request
does not violate the SLA, and a busy
period with long response times will
violate the SLA if more than 20% of all
queries are affected.
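To illustrate this robustness, the following sketch compares the mean with the share of fast requests on made-up latency data. The data and the helper name fraction_within are hypothetical, and the one-minute measurement windows of the mean-based SLA are ignored for simplicity.

```python
from statistics import mean

def fraction_within(latencies_ms, threshold_ms):
    """Fraction of requests that completed within threshold_ms."""
    return sum(1 for t in latencies_ms if t <= threshold_ms) / len(latencies_ms)

# Made-up data: 99 fast requests and one long-running request
single_outlier = [1.0] * 99 + [500.0]
# Made-up data: a busy period in which 30% of requests are slow
busy_period = [1.0] * 70 + [10.0] * 30

for latencies_ms in (single_outlier, busy_period):
    mean_ok = mean(latencies_ms) <= 3.0                       # mean-based criterion
    percentile_ok = fraction_within(latencies_ms, 3.0) >= 0.8  # 80% within 3 ms
    print(mean_ok, percentile_ok)
```

In this data the single slow request pushes the mean above 3 ms while 99% of requests are still fast, whereas the busy period fails the 80% criterion because 30% of requests are affected.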
To check the SLA, here are two
equivalent formulations in terms of
quantiles and CDFs:
• The minimal 0.8-quantile is at
most 3ms: Qmin (X1h, 0.8) ≤ 3ms.
• The 3-ms inverse quantile is at
least 0.8: CDF(X1h, 3ms) ≥ 0.8.
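Both checks can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the article's implementation: q_min is computed directly from the property that Qmin (X, q) ≤ y exactly when at least n × q samples are less than or equal to y, the helper names are hypothetical, and the two one-hour windows are made-up data.

```python
import math

def q_min(X, q):
    """Smallest sample y with at least q * n samples <= y, for 0 < q <= 1.

    Note: floating-point rounding of q * n can shift the index at exact boundaries.
    """
    S = sorted(X)
    k = max(math.ceil(q * len(S)) - 1, 0)
    return S[k]

def cdf(X, y):
    """Fraction of samples in X that are <= y."""
    return sum(1 for x in X if x <= y) / len(X)

def sla_ok_quantile(X1h, q=0.8, limit_ms=3.0):
    return q_min(X1h, q) <= limit_ms

def sla_ok_cdf(X1h, q=0.8, limit_ms=3.0):
    return cdf(X1h, limit_ms) >= q

# Made-up one-hour windows of latencies in ms
good_hour = [1.0, 1.5, 2.0, 2.2, 2.5, 2.8, 2.9, 3.0, 3.0, 9.0]  # 9 of 10 within 3 ms
bad_hour = [1.0, 1.5, 2.0, 2.2, 2.5, 2.8, 2.9, 4.0, 6.0, 9.0]   # 7 of 10 within 3 ms

for X1h in (good_hour, bad_hour):
    print(sla_ok_quantile(X1h), sla_ok_cdf(X1h))
```

On this data the two formulations agree for each window: both accept the first hour and both reject the second.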
Here X1h denotes the samples that
lie within a one-hour window. Both
formulations can be used to monitor service levels effectively. Figure
13 shows Qmin (X1h, 0.8) as a line plot.
Note how on June 24 the quantile rises above 3ms, indicating a violation of