Figure 9. High-latency I/O.
another level (which was selected in
this screenshot). At various points it
appears as though a latency level has
been promoted to a higher level. This
was recently discovered and so far is
not clearly understood. It is provided
here as another example of unexpected details that latency heat maps have
exposed.
Other Applications
The previous examples showed latency heat maps for systems deploying the ZFS file system, accessed over NFS. Latency heat maps are also applicable to other local and remote file-system types (for example, UFS, HFS+, CIFS), where characteristics can be identified and interpreted in similar ways. For example, UFS (Unix file system) as deployed on Solaris executes a thread named fsflush to periodically write dirty data to disk. This can update the UFS cylinder group blocks that are spaced across the disk, resulting in high-latency I/O caused by seek and rotational latency. On older versions of Solaris the interval between writes was five seconds (tune_t_fsflushr), so on a latency heat map of disk I/O this behavior may be easy to identify, appearing as bursts of high latency spaced five seconds apart.
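As a rough sketch of how that signature could be checked for programmatically, the following assumes a list of hypothetical (timestamp_seconds, latency_ms) completion events collected by whatever tracing tool is at hand, and reports the spacing between seconds that contain slow I/O; fsflush-style behavior would show spacings clustering near five seconds. The event format, threshold, and function name are illustrative assumptions, not anything from the article.

```python
# Sketch: spot periodic bursts of high-latency disk I/O.
# 'events' is a hypothetical list of (timestamp_seconds, latency_ms) tuples.

def burst_spacing(events, threshold_ms=50.0):
    """Return the gaps (in seconds) between seconds containing slow I/O."""
    slow_seconds = sorted({int(ts) for ts, lat in events if lat >= threshold_ms})
    return [b - a for a, b in zip(slow_seconds, slow_seconds[1:])]

# Example: fast I/O throughout, plus bursts at t=5, 10, 15 seconds.
events = [(t, 0.2) for t in range(20)] + [(5.1, 80.0), (10.2, 75.0), (15.0, 90.0)]
print(burst_spacing(events))   # -> [5, 5], the fsflush-like signature described above
```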
The heat-map visualization can
also be applied to other metrics, apart
from latency. I/O size can be visualized
as a heat map with size (bytes) on the
y-axis, allowing any particularly large
or small I/O to be identified, either of
which is interesting for different reasons. I/O location can be visualized
as a heat map (as mentioned earlier)
with offset on the y-axis, allowing random or sequential I/O to be identified.
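The bucketing behind all of these variants is the same. Here is a minimal sketch (the sample format, field names, and function are assumptions, not the instrumentation used in the article) that quantizes time on the x-axis and a caller-chosen value into power-of-two buckets on the y-axis, with the count in each cell driving the pixel color:

```python
from collections import Counter

def heat_map_cells(samples, value_of, time_bucket_s=1.0):
    """Quantize (timestamp_s, record) samples into (time bucket, power-of-two
    value bucket) cells; the count in each cell drives the pixel color."""
    cells = Counter()
    for ts, record in samples:
        vbucket = int(value_of(record)).bit_length()    # log2-style value bucket
        cells[(int(ts / time_bucket_s), vbucket)] += 1
    return cells

# The same routine serves latency, I/O size, or offset heat maps by swapping
# the value_of function (field names here are hypothetical):
#   heat_map_cells(ios, value_of=lambda io: io["latency_us"])
#   heat_map_cells(ios, value_of=lambda io: io["bytes"])
#   heat_map_cells(ios, value_of=lambda io: io["offset"])
```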
Utilization of components can also
be visualized as a heat map showing
the percent utilization of individual
components, instead of displaying
an average percent utilization across
all components. Utilization can be
shown on the y-axis, and the number of
components at that utilization can be
shown by the color of the heat-map pixel. This is particularly useful for examining disk and CPU utilization to check
how load is balanced across these components. A tight grouping of darker colors shows load is balanced evenly, and
a cloud of lighter pixels shows it isn’t.
Outliers are also interesting: a sin-
gle CPU at 100% utilization may be
shown as a light line at the top of the
heat map and is typically the result
of a software scalability issue (single
thread of execution). A single disk at
100% utilization is also interesting and
can be the result of a disk failure. This
cannot be identified using averages or
maximums alone: a maximum cannot
differentiate between a single disk at
100% utilization and multiple disks at
100% utilization, which can happen
during a normal burst of load.
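To make that concrete, a single column of a utilization heat map is just a histogram of components per utilization bucket, which preserves exactly the count information that an average or maximum throws away. A minimal sketch follows, assuming per-component utilization percentages are already available from some collector (the data source and function name are assumptions):

```python
from collections import Counter

def utilization_column(per_component_util, bucket_pct=10):
    """One heat-map column: count of components per utilization bucket (0-100%)."""
    col = Counter()
    for util in per_component_util:
        col[min(int(util // bucket_pct) * bucket_pct, 100)] += 1
    return col

# One CPU pegged at 100% among mostly idle CPUs vs. all CPUs busy: both report a
# maximum of 100%, but the columns (and therefore the heat-map pixels) differ.
print(utilization_column([100, 5, 8, 3, 6, 4, 7, 2]))         # Counter({0: 7, 100: 1})
print(utilization_column([100, 98, 97, 99, 96, 95, 97, 98]))  # Counter({90: 7, 100: 1})
```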
Shouting at JBODs
Although not as beautiful as the previous examples, the story behind the next heat map has gained some notoriety and is worth including to stress that it was a latency heat map that identified the issue.
The system included several JBODs with dozens of disks and was performing a streaming write workload. I discovered that if I shouted into the JBODs as loudly as I could, the disks returned I/O with extremely high latency.
Figure 9 shows the heat map from this
unusual test.
The heat map shows two spikes in
latency, corresponding to each of my
shouts. We videotaped this discovery
and uploaded it to YouTube, where I
describe the effect as disk vibration.3
It has since been suggested that this is
better described as shock effects, not
vibration, because of the volume of the
shouts.
The affected disk I/O shown in the
heat map has very high latency—more
than one second. If average latency were tracked instead, a few high-latency I/O events might be drowned out on a system performing more than 8,000 faster I/O events at the same time. The
lesson from this experience was how
well latency heat maps could identify
this perturbation.
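A quick back-of-the-envelope check, using assumed numbers in the spirit of the text, shows why: a couple of one-second outliers among roughly 8,000 one-millisecond I/Os barely move the average, yet each occupies its own clearly visible high-latency pixel on the heat map.

```python
# Assumed workload: ~8,000 fast I/Os at about 1 ms plus two 1,000 ms outliers.
fast = [1.0] * 8000            # latency in ms
outliers = [1000.0, 1000.0]
latencies = fast + outliers

average = sum(latencies) / len(latencies)
print(f"average latency: {average:.2f} ms")                    # ~1.25 ms: outliers nearly invisible
print(f"I/Os over 500 ms: {sum(l > 500 for l in latencies)}")  # 2: obvious on a heat map
```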
Conclusion
Presenting latency as a heat map is an effective way to identify subtle characteristics that might otherwise be missed when examining only per-second average or maximum latency. Though many of the characteristics shown in this article are not yet understood, now that their existence is known we can study them and, over time, identify them properly. Some of the heat maps,
such as the rainbow pterodactyl, are
also interesting examples of how deep
and beautiful a simple visualization
can be.
Related articles
on queue.acm.org
Hard Disk Drives: The Good, the Bad
and the Ugly
Jon Elerath
http://queue.acm.org/detail.cfm?id=1317403
Hidden in Plain Sight
Bryan Cantrill
http://queue.acm.org/detail.cfm?id=1117401
Fighting Physics: A Tough Battle
Jonathan M. Smith
http://queue.acm.org/detail.cfm?id=1530063
References
1. Cantrill, B. Hidden in plain sight. ACM Queue 4, 1 (Feb. 2006), 26–36.
2. Gregg, B. DRAM latency; Feb. 6, 2009; http://blogs.sun.com/brendan/entry/dram_latency.
3. Gregg, B. Shouting in the datacenter; http://www.youtube.com/watch?v=tDacjrsCeq4.
4. Leventhal, A. Flash storage memory. Commun. ACM 51, 7 (July 2008), 47–51.
5. taztool; http://www.solarisinternals.com/si/tools/taz/index.php.
6. ZFS; http://en.wikipedia.org/wiki/ZFS.
Brendan Gregg (brendan.gregg@oracle.com) is a principal software engineer at Oracle, and works on performance analysis and observability in the Fishworks advanced development team. He is also the creator of the DTrace Toolkit and is the co-author of "Solaris Performance and Tools."
Copyright © 2010, Oracle and/or its affiliates. All rights reserved.