Heat maps are a unique and powerful way to
visualize latency data. Explaining the results,
however, is an ongoing challenge.
BY BRENDAN GREGG
When I/O latency is presented as a visual heat map,
some intriguing and beautiful patterns can emerge.
These patterns provide insight into how a system is
actually performing and what kinds of latency end-user applications experience. Many characteristics
seen in these patterns are still not understood, but so
far their analysis is revealing systemic behaviors that
were previously unknown.
Latency is time spent waiting and has a direct
impact on performance when induced by a
synchronous component of an application request.
This makes interpretation straightforward—the
higher the latency, the worse the performance. Such
a simple interpretation is not possible for many
other types of statistics commonly examined
in performance analysis, such as utilization, IOPS
(I/O operations per second), and throughput. Those statistics
are often better suited for capacity planning and
for understanding the nature of workloads. For identifying performance
issues, however, understanding latency is essential.
For application protocols measured from the application server, latency can refer to the time from when
a request was received to when the
completion was sent—for example,
the time for a Web server to respond
to HTTP GETs or a file server to respond to NFS (Network File System)
operations. Such a measurement is
extremely important for performance
analysis since the client and end users are usually waiting during this time.
For resource components such as
disks, latency can refer to the time
interval between sending the I/O
request and receiving the completion interrupt. High disk latency
often translates to application performance issues, but not always: file
systems may periodically flush dirty
cached data to disks, but that
I/O is asynchronous to the application. For example, the Oracle Solaris
ZFS file system periodically flushes
transaction groups to disks, causing a spike in average disk latency.
This does not reflect the file-system
latency experienced by ZFS consumers, since the average disk latency includes asynchronous writes from the
transaction flush. (This misconception would be alleviated somewhat if
read and write latency were observed
separately, since the transaction flush
affects write latency only.)
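As a sketch of how read and write latency could be observed separately, the following DTrace script uses the io provider to time each disk I/O and aggregate the average latency by direction. The probe names are from the standard io provider; the b_flags test against B_READ is an assumption about the platform's buf structure, and the aggregation names are illustrative:

```d
/* Average disk I/O latency, split by direction (sketch). */
io:::start
{
        /* key the start timestamp by the buf pointer */
        start_time[arg0] = timestamp;
}

io:::done
/start_time[arg0]/
{
        this->dir = args[0]->b_flags & B_READ ? "read" : "write";
        @avg_ns[this->dir] = avg(timestamp - start_time[arg0]);
        start_time[arg0] = 0;
}
```

Run with a tool such as dtrace -s, this would print separate average latencies for reads and writes on exit, so a write-only spike from a transaction flush would not inflate the read figure.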
While it’s desirable to examine
latency, it has been historically difficult or impossible to measure directly for some components. For
example, examining application-level latency server side may have
involved instrumenting the application or examining network packet
captures and matching requests to
responses. With the introduction of
DTrace,1 however, measuring latency at arbitrary points has become
possible for production systems—
and in real time.
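As a minimal sketch of measuring latency at one such arbitrary point, a DTrace script can time a system call between its entry and return probes and summarize the results as a power-of-two distribution. The probes and the quantize() aggregation are standard DTrace features; the choice of read(2) and the aggregation label are illustrative:

```d
/* Latency of read(2) calls as a power-of-two distribution (sketch). */
syscall::read:entry
{
        self->ts = timestamp;
}

syscall::read:return
/self->ts/
{
        @lat_ns["read latency (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
}
```

The same entry/return pattern can be applied to application functions, file-system operations, or network events, which is what makes latency measurable at arbitrary points on production systems and in real time.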