mainly from starting and stopping sampling with the perf_event API at thread creation and exit. This cost could be
amortized by sampling globally instead of per-thread, which
would require root permissions on most machines. If the
perf_event API supported sampling all threads in a process, this overhead could be eliminated. Delay overhead,
the largest component of Coz’s total overhead, could be
reduced by allowing programs to execute normally for some
time between each experiment. Increasing the time
between experiments would significantly reduce overhead,
but a longer profiling run would be required to collect a sufficient number of samples.
Efficiency summary. Coz’s profiling overhead is on average 17.6% (minimum: 0.1%, maximum: 65%). For all but
three of the benchmarks, its overhead was under 30%.
Given that the widely used gprof profiler can impose much
higher overhead (e.g., 6 times for ferret, versus 6% with
Coz), these results confirm that Coz has sufficiently low
overhead to be used in practice.
5. RELATED WORK
Causal profiling differs from past profiling techniques,
which have focused primarily on collecting as much
detailed information as possible about a program without
disturbing its execution. Profilers have used a wide variety
of techniques to gather different types of information in
different settings, which we summarize here.
5.1. General-purpose profilers
General-purpose profilers are designed to monitor where a
program spends its execution time. Profilers such as gprof
and oprofile are typical of this category [7, 11]. While oprofile
uses sampling exclusively, gprof mixes sampling and
instrumentation to measure both execution time and col-
lect call graphs, which show how often each function was
called, and where it was called from. Later extensions to
this work have reduced the overhead of call graph profiling
and added additional detail with path profiling.
The resulting improvement corresponds to a speedup of the line Coz identified by 96%.
For this speedup, Coz predicted a performance improve-
ment of 9%, very close to our observed speedup of 8.95%.
Results for ferret are similar; Coz predicted a speedup of
21.4%, and we observe an actual speedup of 21.2%.
4.4. Efficiency
We measure Coz’s profiling overhead on the PARSEC benchmarks running with the native inputs. The sole exception is
streamcluster, where we use the test inputs because execution time was excessive with the native inputs.
Figure 6 breaks down the total overhead of running Coz
on each of the PARSEC benchmarks by category. The average overhead with Coz is 17.6%. Coz collects debug information at startup, which contributes 2.6% to the average
overhead. Sampling during program execution and attributing these samples to lines using debug information is
responsible for 4.8% of the average overhead. The remaining
overhead (10.2%) comes from the delays Coz inserts to perform virtual speedups.
These results were collected by running each benchmark in four configurations. First, each program was run
without Coz to measure a baseline execution time. In the
second configuration, each program was run with Coz, but
execution terminated immediately after startup work was
completed. Third, programs were run with Coz configured
to sample the program’s execution but not to insert delays
(effectively testing only virtual speedups of size zero).
Finally, each program was run with Coz fully enabled. The
difference in execution time between each successive configuration gives us the startup, sampling, and delay overheads, respectively.
Reducing overhead. Most programs have sufficiently long
running times (mean: 103s) to amortize the cost of processing
debug information, but especially large executables can be
expensive to process at startup (e.g., x264 and vips). Coz
could be modified to collect and process debug information
lazily to reduce startup overhead. Sampling overhead comes
[Plot: causal profiles for fluidanimate, one panel each for Line 151 and Line 184; x-axes show 0–100% line speedup.]
Figure 5. COZ output for fluidanimate, prior to optimization. COZ
finds evidence of contention in two lines in parsec_barrier.cpp,
the custom barrier implementation used by both fluidanimate and
stream-cluster. This causal profile reports that optimizing either line
will slow down the application, not speed it up. These lines precede
calls to pthread_mutex_trylock on a contended mutex. Optimizing
this code would increase contention on the mutex and interfere
with the application’s progress. Replacing this inefficient barrier
implementation sped up fluidanimate and streamcluster by 37.5%
and 68.4% respectively.
[Plot: percent overhead of COZ for each PARSEC benchmark (blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264) and the mean, split into Delay, Sampling, and Startup components.]
Figure 6. Percent overhead for each of COZ’s possible sources of
overhead. Delay is the overhead from adding delays for virtual
speedups, Sampling is the cost of collecting and processing
samples, and Startup is the initial cost of processing debugging
information. Note that sampling results in slight performance
improvements for swaptions, vips, and x264.