The essential idea is to store, instead of the individual samples as a list, the vector of bin counts that occurs as an intermediate result of the histogram computation. The example in Figure 5 arrived at the following values:

sample_count = [0, 10, 8, 4, 25, 23, 4, 2]
The precise memory representation used for storing histograms varies. The important point is that the sample count of each bin is available. Histograms allow approximate computation of various summary statistics, such as mean values and quantiles, which are covered next. The precision depends on the bin sizes.
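This approximation can be sketched as follows. The bin counts are the ones from Figure 5; the unit-width bin boundaries and the function names are illustrative assumptions, not from the text. Every sample in a bin is represented by the bin midpoint, which is where the precision limit comes from.

```python
# Sketch: approximate mean and quantiles from bin counts alone.
# `edges` (assumed unit-width boundaries) and the function names
# are illustrative; the counts are the ones from Figure 5.

def approx_mean(edges, counts):
    # Represent every sample in a bin by the bin midpoint.
    total = sum(counts)
    mids = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]
    return sum(m * c for m, c in zip(mids, counts)) / total

def approx_quantile(edges, counts, q):
    # Walk the cumulative counts until the q-fraction of all
    # samples is reached, then return that bin's midpoint.
    target = q * sum(counts)
    cumulative = 0
    for (lo, hi), c in zip(zip(edges, edges[1:]), counts):
        cumulative += c
        if cumulative >= target:
            return (lo + hi) / 2
    return edges[-1]

edges = [0, 1, 2, 3, 4, 5, 6, 7, 8]    # assumed bin boundaries
counts = [0, 10, 8, 4, 25, 23, 4, 2]   # sample_count from Figure 5
print(approx_mean(edges, counts))        # approximate mean
print(approx_quantile(edges, counts, 0.5))  # approximate median
```

The error of both estimates is bounded by half a bin width, which is why the choice of bin sizes governs the precision.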
Also, histograms can be aggregated easily. If request latencies are available for each node of a database cluster as histograms with the same bin choices, then you can derive the latency distribution of the whole cluster by adding the sample counts for each bin. The aggregated histogram can be used to calculate mean values and quantiles over the whole cluster. This is in contrast to the situation in which mean values or quantiles are computed for the nodes individually: it is not possible to derive, for example, the 99th percentile of the whole cluster from the 99th percentiles of the individual nodes.14
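With identical bin choices on every node, the aggregation is just an element-wise sum of the count vectors. A minimal sketch, with made-up per-node counts (the function name is illustrative):

```python
# Sketch: aggregating per-node histograms that share the same
# bin boundaries. The node data here is made up for illustration.

def merge_histograms(*histograms):
    # With identical bins everywhere, cluster-wide bin counts
    # are the element-wise sum of the per-node counts.
    return [sum(counts) for counts in zip(*histograms)]

node_a = [0, 10, 8, 4, 25, 23, 4, 2]
node_b = [3, 7, 12, 9, 18, 20, 6, 1]
cluster = merge_histograms(node_a, node_b)
print(cluster)  # element-wise sums; cluster-wide mean values and
                # quantiles can now be estimated from this vector
```

Note that no such merge exists for precomputed per-node percentiles, which is exactly the limitation the text describes.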
High-dynamic range histogram. A high-dynamic range (HDR) histogram provides a pragmatic choice of bin width that allows a memory-efficient representation suitable for capturing data over the very wide ranges that are common in machine-generated data such as I/O latencies. At the same time, HDR histograms tend to produce acceptable visual representations in practice.
An HDR histogram changes the bin width dynamically over the value range. A typical HDR histogram has a bin size of 0.1 between 1 and 10, with bin boundaries 1, 1.1, 1.2, ..., 9.9, 10. Similarly, between 10 and 100 the bin size is 1, with boundaries 10, 11, 12, ..., 100. This pattern is repeated for all powers of 10, so that there are 90 bins between 10^k and 10^(k+1). The general definition is a little more complex and lengthy, so it is not reproduced here.
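The binning pattern just described can be sketched as a function that maps a positive value to its bin: within [10^k, 10^(k+1)) the bin width is 10^(k-1), giving 90 bins per power of ten. The function name is illustrative, and this is only the simple positive-value case, not the general definition.

```python
import math

# Sketch of the HDR-style binning pattern described above:
# within [10^k, 10^(k+1)) the bin width is 10^(k-1), so each
# power of ten is split into 90 bins. Name is illustrative.

def hdr_bin(value):
    # Returns (k, i): the value lies in the i-th bin of width
    # 10^(k-1), i.e. in [i * 10^(k-1), (i + 1) * 10^(k-1)).
    k = math.floor(math.log10(value))
    i = math.floor(value / 10 ** k * 10)  # i runs from 10 to 99
    return k, i

print(hdr_bin(3.1415))  # bin [3.1, 3.2)
print(hdr_bin(271.8))   # bin [270, 280)
```

Because the relative bin width is constant (at most 10% of the value), the representation stays compact even when the data spans many orders of magnitude.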
Software products make default choices for the value range and bin width. Typically the value range is taken to be the range of the data, and equally spaced bins are used. Several formulas exist for selecting the number of bins that yield "ideal" results under certain assumptions, in particular n^(1/2) (Excel) and 3.5σ/n^(1/3) (Scott's rule7). In practice, these choices do not yield satisfying results when applied to operations data, such as request latencies, that contain outliers.
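The two default formulas can be sketched directly; the sample data below is made up, with a single latency outlier added to show why equally spaced bins break down on such data.

```python
import math
import statistics

# Sketch of the two default bin choices mentioned above.
# The latency samples are made up for illustration.

def sqrt_bin_count(samples):
    # Excel-style choice: number of bins = sqrt(n).
    return math.ceil(math.sqrt(len(samples)))

def scott_bin_width(samples):
    # Scott's rule: bin width = 3.5 * sigma / n^(1/3).
    sigma = statistics.stdev(samples)
    return 3.5 * sigma / len(samples) ** (1 / 3)

samples = [12, 15, 14, 16, 13, 15, 900]  # one latency outlier
print(sqrt_bin_count(samples))
print(scott_bin_width(samples))  # a huge bin width: the outlier
                                 # inflates sigma, so nearly all
                                 # samples land in a single bin
```

With the outlier included, Scott's rule yields a bin width of several hundred, collapsing the bulk of the distribution into one bin, which illustrates the unsatisfying results mentioned above.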
Histogram as aggregation method. When measuring high-frequency data such as I/O latencies, which can arrive at rates of more than 1,000 samples per second, storing all individual samples is no longer feasible. If you are willing to forget about ordering and sacrifice some accuracy, you can save massive amounts of space by using histograms.
Figure 6. Histogram plot with value range (500, 2,200) and 100 equally sized bins. (x-axis: request rate in rps)

Figure 7. Request rate histogram (50 bins) presented as a heat map. (x-axis: request rate in rps)

Figure 8. Request latency heat map over time in Circonus. (x-axis: time of day)

Figure 9. Rug plot of a two-modal dataset.