for them. There was not enough time
to validate or double-check the num-
bers, and you could only hope there
were not too many errors.
Measurements gathered this way
are likely incomplete, misleading, or
even erroneous. This article describes
how to conduct performance measurement well. I first discuss five mistakes
that account for most of the problems
with performance measurements, all
of which occurred in the scenario I just
outlined. I then spell out four rules to
follow when evaluating performance.
These rules will help you avoid the mistakes and produce high-quality performance evaluations. Finally, I offer four
suggestions about infrastructure to assist in performance evaluation.
The most important idea overall, as
reflected in this article’s headline, is to
dig beneath the surface, measuring the
system in depth and detail from multiple
angles to create a complete and accurate
understanding of performance.
Most Common Mistakes
When performance measurements go
wrong, it is usually due to five common mistakes.
Mistake 1: Trusting the numbers.
Engineers are easily fooled during
performance measurements because
measurement bugs are not obvious.
Engineers are used to dealing with
functional bugs, which tend to be noticeable because they cause the system
to crash or misbehave. If the system
produces the desired behavior, it is
probably working. Engineers tend to
apply the same philosophy to performance measurements; if performance
numbers are being generated and the
system is not crashing, they assume
the numbers are correct.
Performance-measurement code is
just as likely to have bugs as any other
code, but the bugs are less obvious.
Most bugs in performance-measurement code do not cause crashes or
prevent numbers from appearing; they
simply produce incorrect numbers.
There is no easy way to tell from a number whether it is right or wrong, so engineers tend to assume the numbers are
indeed correct. This is a mistake. There
are many ways for errors to creep into
performance measurements. There may be bugs in the benchmarks or test applications, so the measured behavior is not the desired behavior. There may be bugs in the code that gathers metrics and processes them, as when, say, a clock is read at the wrong time or the 99th percentile is miscomputed. The system being measured may have functional bugs. And, finally, the system may have performance bugs, so the measurements do not reflect the system's true potential.
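To make the failure mode concrete, here is a minimal sketch in C of a hypothetical harness whose metric-gathering code has both of the bugs just mentioned: the clock is read at the wrong time, and the 99th percentile is miscomputed. Every name in it (do_request, now_sec, the buffer size) is a stand-in invented for this illustration, not code from any real system.

    /* measure.c -- sketch of a benchmark harness containing two of the
     * measurement bugs described above. do_request() is a stand-in for
     * whatever operation is actually being measured. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define NSAMPLES 10000

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Stand-in workload; the bugs below are in the harness, not here. */
    static void do_request(char *buf, size_t len) {
        memset(buf, 0xab, len);
    }

    static int cmp_double(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    int main(void) {
        static double lat[NSAMPLES];

        for (int i = 0; i < NSAMPLES; i++) {
            double start = now_sec();  /* BUG 1: clock read too early, so */
            char *buf = malloc(4096);  /* setup lands in the timed region */
            do_request(buf, 4096);     /* the only thing we meant to time */
            lat[i] = now_sec() - start;
            free(buf);
        }

        int idx = (int)(0.99 * NSAMPLES);

        /* BUG 2: indexing an unsorted array returns an arbitrary sample,
         * not the 99th percentile. */
        double bogus_p99 = lat[idx];

        /* The index is meaningful only after sorting. */
        qsort(lat, NSAMPLES, sizeof(lat[0]), cmp_double);
        double p99 = lat[idx];

        printf("unsorted 'p99': %.9f sec   sorted p99: %.9f sec\n",
               bogus_p99, p99);
        return 0;
    }

Both values print in a plausible range; nothing crashes and nothing flags the first one as garbage, which is exactly why such bugs survive.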
I have been involved in dozens of performance-measurement projects and cannot recall a single one in which the first results were correct. In each case there were multiple problems from the list just outlined. Only after working through them all did my colleagues and I obtain measurements that were meaningful.

Mistake 2: Guessing instead of measuring. The second common mistake is to draw conclusions about a system’s performance based on educated guesses, without measurements to back them up. For example, I found the following explanation in a paper I reviewed recently: “ ... throughput does not increase with the number of threads ... This is because the time taken to traverse the relatively long linked list bounds server performance.” There was no indication that the authors had measured the actual length of the list or the time taken to traverse it, yet they stated their conclusion as fact. I frequently encounter unsubstantiated conclusions in papers; there were at least five other occurrences in the paper with the quote.
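Substantiating a claim like the one quoted above is usually cheap. The following C sketch shows the sort of measurement that would do it; the list length, node layout, and iteration count are hypothetical values chosen for illustration, not numbers from the paper. If the measured traversal time is a small fraction of the per-request service time, the list cannot be what bounds throughput.

    /* listbench.c -- the kind of direct measurement that would support (or
     * refute) the claim that list traversal bounds server performance. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    struct node { struct node *next; long key; };

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        enum { LEN = 10000, ITERS = 1000 };  /* hypothetical length */
        struct node *head = NULL;

        /* Build a list of the length observed (or claimed) in the system. */
        for (long i = 0; i < LEN; i++) {
            struct node *n = malloc(sizeof(*n));
            n->key = i;
            n->next = head;
            head = n;
        }

        /* Time complete traversals; accumulate into a volatile so the
         * compiler cannot optimize the loop away. */
        volatile long sink = 0;
        double start = now_sec();
        for (int it = 0; it < ITERS; it++)
            for (struct node *n = head; n != NULL; n = n->next)
                sink += n->key;
        double per_traversal = (now_sec() - start) / ITERS;

        printf("one traversal of %d nodes: %.9f sec\n", LEN, per_traversal);
        (void)sink;
        return 0;
    }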
Educated guesses are often correct and play an important role in guiding performance measurement; see Rule 3 (Use your intuition to ask questions, not answer them). However, engineers’ intuition about performance is not reliable. When my students and I designed our first log-structured file system,4 we were fairly certain that reference patterns exhibiting locality would result in better performance than those without locality. Fortunately, we decided to measure, to be sure. To our surprise, the workloads with locality behaved worse than those without. It took considerable analysis to understand this behavior. The reasons were subtle, but they exposed important properties of the system and led us to a new policy for garbage collection that improved the system’s performance significantly. If we