When a thread's sample shows it is executing the target line, it increments the global counter. When another thread processes its own sample, it compares its local counter to the global counter; if the local counter is behind, the thread pauses and increments its local counter until it catches up.
This sampling scheme is extremely elegant, and it allows Coz to set the virtual speedup of an execution simply by adjusting the ratio of the delay to the sampling period.
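The counter protocol above can be sketched in a few lines. This is a single-process illustration of the bookkeeping only (the framing and names are mine; the real Coz does this across live threads, driven by hardware-counter signals):

```python
# A single-process sketch of the delay bookkeeping described above (framing
# and names are mine; Coz itself does this across threads via perf signals).

import time

DELAY = 0.001          # pause per owed delay; kept below the target's runtime
global_delays = 0      # total delays owed, bumped on samples in the target line
local_delays = {}      # per-thread count of delays already taken

def on_sample(thread_id, in_target_line):
    """Process one periodic sample from thread_id."""
    global global_delays
    if in_target_line:
        # Caught running the target line: every *other* thread now owes one
        # more pause. Credit this thread so it never pauses for its own hit.
        global_delays += 1
        local_delays[thread_id] = local_delays.get(thread_id, 0) + 1
    owed = global_delays - local_delays.get(thread_id, 0)
    if owed > 0:
        time.sleep(owed * DELAY)                 # fall behind by the owed amount
        local_delays[thread_id] = global_delays  # mark those delays as taken

# Interleaved samples from two threads; thread 0 hits the target line twice.
on_sample(0, in_target_line=True)
on_sample(1, in_target_line=False)   # thread 1 owes one pause and takes it
on_sample(0, in_target_line=True)
on_sample(1, in_target_line=False)   # thread 1 catches up again
print(global_delays, local_delays)   # 2 {0: 2, 1: 2}
```

Because pausing happens lazily at each thread's next sample rather than at every execution of the target line, the scheme needs no instrumentation on the hot path.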
Finally, Coz’s predictions proved
to be accurate across a wide range of
benchmarks. Perhaps most impressive, the authors used Coz to fix a nasty, longstanding performance
bug in a deployed hash table. Coz
reported the greatest optimization
opportunity in a file-compression application was a method that traversed
the linked list of a hash-table bucket.
Coz continued to identify this particular line even after the authors increased the number of buckets. Upon
closer inspection, the authors discovered the table’s hashing function was
assigning entries to only 3% of the
available buckets. Fixing the buggy
hash function required changing
three lines of code and led to a nearly
9% end-to-end benchmark speedup.
Coz not only identified the bug and
predicted the impact of a fix, but it allowed the authors to discover and resolve the problem in just a few hours.
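To make that failure mode concrete, here is a toy reconstruction in Python; the masking bug and the names below are invented for illustration and are not the benchmark's actual code:

```python
# A toy reconstruction (mine, not the benchmark's code) of the failure mode
# Coz exposed: a hash function that maps keys into a tiny fraction of the
# available buckets, so chains stay long no matter how many buckets exist.

def buggy_hash(key, nbuckets):
    # Uses only the low three bits of the key, so at most 8 buckets are
    # ever populated, regardless of nbuckets.
    return (key & 0x7) % nbuckets

def fixed_hash(key, nbuckets):
    # Reduces the whole key; Python's built-in hash() stands in for a
    # proper integer mixing function here.
    return hash(key) % nbuckets

def bucket_utilization(hash_fn, keys, nbuckets):
    """Fraction of buckets that receive at least one key."""
    used = {hash_fn(k, nbuckets) for k in keys}
    return len(used) / nbuckets

keys = range(10_000)
print(bucket_utilization(buggy_hash, keys, 256))  # 0.03125 -- about 3% of buckets
print(bucket_utilization(fixed_hash, keys, 256))  # 1.0 -- every bucket used
```

This also shows why adding buckets could not help: the buggy function ignores the extra capacity entirely, which is exactly why the same line kept surfacing in Coz's reports.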
Above all else, what makes Coz
noteworthy is that it exemplifies the
best kind of systems work: it is elegant,
insightful, and practical all at once.
Papers that simultaneously achieve all
of these qualities are exceedingly rare,
and I suspect that practitioners and researchers will continue returning to it
for many years to come.
Landon P. Cox is an associate professor in the
Department of Computer Science at Duke University,
Durham, NC, USA.
Copyright held by owner/author.
WHEN PROGRAMMERS WANT to improve
their program’s performance, they often turn to profilers to tell them what
code to optimize. A profiler can measure many aspects of a program’s behavior, such as how many times each
method is typically called, how long
each method typically takes, and which
methods typically lie on the critical
path. Unfortunately, this information
is not always relevant, and it can often
cause programmers to waste their time
on optimizations that have little impact on overall performance.
In particular, conventional profilers
struggle to help developers optimize
multithreaded programs. A simple but
useful example is a program in which
an initial thread waits for several worker
threads to complete. If each worker runs
for approximately the same amount
of time, then most profilers will report
that each worker accounted for an equal
share of the execution time. At the same
time, optimizing any individual worker
will have a minimal impact on overall
execution time since the program will
only finish after all workers are done.
A programmer could waste countless
hours optimizing code without making
the program run any faster.
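A quick back-of-the-envelope sketch (the numbers and names are mine, not from the paper) makes the trap concrete: the profiler's per-thread shares look uniform, yet halving one worker leaves the end-to-end time untouched.

```python
# Hypothetical illustration: why per-thread time attribution misleads
# when an initial thread must join several parallel workers.

def makespan(worker_times):
    """Wall-clock time when workers run in parallel and the program
    finishes only after the slowest worker is done."""
    return max(worker_times)

workers = [10.0, 10.0, 10.0, 10.0]  # four equal workers, in seconds

# A conventional profiler reports each worker's share of total CPU time:
shares = [t / sum(workers) for t in workers]
print(shares)                        # every worker looks equally "hot"

# But optimizing any single worker barely moves end-to-end time:
optimized = [5.0] + workers[1:]      # make worker 0 twice as fast
print(makespan(workers))             # 10.0
print(makespan(optimized))           # still 10.0: bounded by the others
```

The profiler's answer (25% apiece) is accurate but useless: no single worker is on the critical path by itself.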
In the following paper, Curtsinger
and Berger describe a better approach
called causal profiling; causal profilers
tell programmers exactly how much
speed-up bang to expect for their optimization buck. That is, a causal profiler can predict with spooky accuracy
that speeding up a line of code by x%
will improve overall responsiveness or
throughput by y%. At first blush this
seems like magic. No tool can make an
arbitrary line of code arbitrarily faster
and then measure the sped-up line’s
impact on overall execution time. And
yet Curtsinger and Berger developed a
tool called Coz that would seem to do
the impossible.
The key observations underlying Coz are that arbitrarily accelerating code is infeasible but arbitrarily slowing code is quite easy, and that all a tool has to do is adjust the relative speeds of code fragments in order to predict how a change in one fragment will impact overall performance. Put another way, Curtsinger and Berger realized one can measure the impact of speeding up a target code fragment by x% by slowing everything but the target by x% and measuring the overall slowdown.
This wonderful technique is called
a virtual speedup, and it is one of my
favorite pieces of work in the past few
years. Using virtual speedups for code
profiling is such a clever, simple, and
powerful idea that you wish you had
thought of it yourself.
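To see why the trick works, consider a purely sequential sketch (mine, with made-up numbers): speeding the target up by a factor of 1 + x is equivalent to slowing everything else down by that factor and then rescaling the whole run, because uniformly scaling an execution preserves all relative speeds.

```python
# A back-of-the-envelope model (mine, not Coz's implementation) of a
# virtual speedup for a sequential program split into a target fragment
# and everything else.

def true_speedup_time(t_target, t_other, x):
    """Actual runtime if the target really ran (1 + x) times faster."""
    return t_target / (1 + x) + t_other

def virtual_speedup_time(t_target, t_other, x):
    """Runtime with everything *but* the target slowed by (1 + x), then
    the whole uniformly-slowed run rescaled by 1 / (1 + x)."""
    slowed = t_target + t_other * (1 + x)
    return slowed / (1 + x)

t_target, t_other, x = 3.0, 7.0, 0.5   # target is 30% of a 10-second run
print(true_speedup_time(t_target, t_other, x))     # 9.0
print(virtual_speedup_time(t_target, t_other, x))  # 9.0 -- same prediction
```

In this sequential model the two quantities agree exactly; the substance of Coz is making the same accounting work for concurrent executions, using the per-thread delay machinery described below.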
While the insights underlying Coz are undeniably clever, the implementation and evaluation results are equally impressive. Conceptually, whenever the target line of code runs, Coz pauses all other threads for a delay that is less than the average runtime of the target line. However, naïvely pausing threads each time the target line executes would require heavy code instrumentation that could significantly skew the results and undermine the utility of the profiler. Instead, Coz uses a lightweight sampling scheme that requires no instrumentation at all. In Coz, each thread maintains a local counter and periodically samples its program counter and stack using Linux's interface to the CPU's hardware performance counters.
Technical Perspective
Measuring Optimization Potential with Coz
By Landon P. Cox
research highlights
DOI: 10.1145/3205913
To view the accompanying paper,
visit doi.acm.org/10.1145/3205911