Histograms are addressed in more
Scatter plots. The scatter plot is
the most basic visualization of a two-dimensional dataset. For each pair of
values (x, y), a point is drawn on a canvas
at the coordinates (x, y) of a Cartesian coordinate system.
The scatter plot is a great tool to
compare two metrics. Figure 4 plots
the request rates of two different database nodes in a scatter plot. In the
top plot, the points are mainly
concentrated on a diagonal line, which
means that if one node serves many
requests, then the other does so
as well. In the bottom plot, the points
are scattered all over the canvas, which
represents a highly irregular load distribution and might indicate a problem with the database configuration.
In addition to the fault-detection scenario outlined above, scatter plots are
also an indispensable tool for capacity
planning and scalability analysis. 3, 15
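As a rough illustration of the two situations described for Figure 4, the following Python sketch draws a scatter plot as a character grid. The function name text_scatter, the grid size, and the request rates are made up for this example and are not taken from the figure.

```python
def text_scatter(xs, ys, width=40, height=12):
    """Render paired samples as a character grid; '*' marks an occupied cell."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column/row index; "or 1" guards against a zero range.
        col = int((x - x_min) / ((x_max - x_min) or 1) * (width - 1))
        row = int((y - y_min) / ((y_max - y_min) or 1) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the canvas
    return "\n".join("".join(line) for line in grid)


# Made-up request rates (rps) for two hypothetical database nodes.
node_a = [100, 200, 300, 400, 500, 600]
node_b_balanced = [110, 190, 310, 390, 510, 590]   # rises and falls with node_a
node_b_irregular = [500, 120, 590, 110, 300, 450]  # no clear relation to node_a

print(text_scatter(node_a, node_b_balanced))
print()
print(text_scatter(node_a, node_b_irregular))
```

With the balanced data the marks line up from the lower left to the upper right of the grid; with the irregular data they spread over the whole canvas.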
Line plots. The line plot is by far the
most popular visualization method
seen in practice. It is a special case
of a scatter plot, where time stamps
are plotted on the x-axis. In addition,
a line is drawn between consecutive
points. Figure 3 shows an example of
a line plot.
The addition of the line provides the
impression of a continuous transition
between the individual samples. This
impression should always be treated with caution: for example, the fact that the CPU was idle at
1:00 PM and at 1:01 PM does not mean
that it did no work in between.
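The CPU example can be made concrete with a small Python sketch. The per-second utilization values are invented for illustration, and the collector is assumed to take a point-in-time reading once per minute.

```python
# Hypothetical per-second CPU utilization (percent) over two minutes:
# idle except for a 30-second burst that starts at second 15.
per_second = [0.0] * 121
for t in range(15, 45):
    per_second[t] = 90.0

# A collector that reads the value once per minute sees only these samples.
sampled = [per_second[t] for t in range(0, 121, 60)]

print(sampled)           # readings taken at 0 s, 60 s, and 120 s
print(max(per_second))   # peak utilization that fell between the readings
```

All three readings are 0.0, so a line drawn through them suggests an idle machine, although the utilization reached 90 percent in between.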
Sometimes the actual data points
are omitted from the visualization altogether and only the line is shown. This
is a bad practice and should be avoided.
The line plot is a great tool to surface time-dependent patterns such as
periods or trends. For time-independent questions (typical values, for
example), other methods such as rug
plots might be better suited.
Which one to use? Choosing a suitable visualization method depends on
the question to be answered. Is time
dependence important? Then a line
plot is likely a good choice. If not, then
rug plots or histograms are likely better tools. Do you want to compare different metrics with each other? Then
consider using a scatter plot.
Rug plots are suitable for all questions where the temporal ordering of
the samples is not relevant, such as
common values or outliers. Problems
occur if there are multiple samples with
the same sample value in the dataset.
Those samples will be indistinguishable in the rug plot. This problem can
be addressed by adding a small random
displacement (jitter) to the samples.
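A minimal sketch of such a jitter step in Python follows. The function name jitter, its width parameter, and the sample values are assumptions made for this example; the choice of a uniform displacement is likewise only one possible option.

```python
import random


def jitter(samples, width, seed=None):
    """Return a copy of samples, each shifted by a uniform random offset in [-width, width]."""
    rng = random.Random(seed)  # a fixed seed makes the displacement reproducible
    return [s + rng.uniform(-width, width) for s in samples]


# Made-up request rates (rps) containing repeated values.
rates = [1200, 1200, 1200, 1500, 1500, 1800]
jittered = jitter(rates, width=5, seed=42)
print(jittered)
```

The width should be small compared with the spacing of the values of interest, so that identical samples become distinguishable without visibly distorting the distribution.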
Despite its simple and honest character, the rug plot is not commonly
used. Histograms or line plots are used
instead, even if a rug plot would be more suitable.
Histograms. The histogram is a
popular visualization method for one-dimensional
data. Instead of drawing
rugs on an axis, the axis is divided
into bins and bars of a certain height
are drawn on top of them, so that
the number of samples within a bin
is proportional to the area of the bar.
The use of a second dimension often makes a histogram easier to comprehend than a rug plot. In particular,
questions such as “What fraction of the
samples lies below y?” can be answered approximately by comparing areas. This
convenience comes at the expense of
an extra dimension and of additional choices that have to be made about
value ranges and bin sizes.
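The binning step and the area comparison can be sketched in Python as follows. The function names, the equal-width bins, the value range, and the sample data are assumptions chosen for this illustration. With equal-width bins the bar area is proportional to the bin count, so the counts can stand in for the areas.

```python
def histogram(samples, low, high, bins):
    """Count samples per equal-width bin on [low, high); values outside are ignored."""
    width = (high - low) / bins
    counts = [0] * bins
    for s in samples:
        if low <= s < high:
            # min() guards against float rounding for values just below high.
            counts[min(int((s - low) / width), bins - 1)] += 1
    return counts


def fraction_below(counts, low, high, y):
    """Estimate the fraction of binned samples below y.

    Only bins lying entirely below y are counted, so the estimate is exact
    when y is a bin edge and an underestimate otherwise.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    width = (high - low) / len(counts)
    full_bins = max(0, min(int((y - low) / width), len(counts)))
    return sum(counts[:full_bins]) / total


# Made-up request rates in rps.
rates = [850, 950, 1100, 1150, 1250, 1300, 1350, 1450, 1600, 2100]
counts = histogram(rates, low=800, high=2200, bins=7)
print(counts)                                   # [2, 2, 3, 1, 1, 0, 1]
print(fraction_below(counts, 800, 2200, 1400))  # 0.7
```

Changing the value range or the number of bins changes the counts, which is the extra set of choices mentioned above.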
Figure 1. Rug plot of Web-request rates. (Axis: request rates in rps.)
Figure 2. Histogram of Web-request rates. (x-axis: request rates in rps.)
Figure 3. Line plot of Web-request rates. (x-axis: time offset in minutes.)