the functionality of general-purpose profilers remains largely unchanged. This kind of profiling information is useful for identifying where a program spends its time, but not necessarily where developers should focus their optimization efforts.
5.2. Parallel profilers
Several techniques have been used to identify performance
and scalability bottlenecks in parallel programs. Systems
such as IPS trace the execution of a running program to
identify its critical path, the longest sequence of dependencies in the complete program dependence graph.13 While
this approach can work well for message-passing systems, it
would require instrumenting all memory accesses in a modern shared-memory parallel program; this would impose
substantial overhead, likely distorting the results far too
much to be representative of an un-profiled execution.
Other parallel profilers, such as FreeLunch and the WAIT
tool, identify code that runs while some of a program’s
threads sit idle.1,4 These systems assign some level of blame
for blocking to all of a program’s code. The idea is that code
running while other threads are blocked must be responsible
for the reduced parallelism. This heuristic works well for
some parallel performance issues, but not all performance
bottlenecks change a thread’s scheduler state.
5.3. Profiling for scalability
Several systems have been developed to measure potential parallelism in serial programs.6,16,17 Other systems instead examine parallel programs to predict how well the program will scale to larger numbers of hardware threads.10 These approaches are distinct from and complementary to causal profiling: these tools help developers parallelize and scale applications, while Coz helps developers improve an existing parallel program at its current level of parallelism.
5.4. Performance experimentation
Coz is a significant departure from past profiling techniques in that it intentionally perturbs a program’s execution to model the effect of an optimization. While this technique is unique among software profilers, the idea of a performance experiment has appeared in other systems. Mytkowicz et al.14 use delays to validate the output of profilers on single-threaded Java programs. Snelick et al.15 use delays to profile parallel programs. The latter approach measures the effect of slowdowns in combination, which requires a complete execution of the program for each of an exponential number of configurations. While these techniques involve performance experiments, Coz is the first system to use performance perturbations to create the effect of an optimization.
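The virtual speedup experiment at the heart of this approach can be illustrated with a small model. The sketch below is illustrative code, not Coz’s implementation: it models a two-thread program in which the line of interest lives on thread 1, and instead of actually optimizing that line, every execution of it pauses the other thread; subtracting the total inserted delay from the measured runtime predicts the effect of a real optimization. All function and parameter names here are invented for the example.

```python
def predicted_runtime(t1, t2, executions, line_cost, speedup):
    """Predict the effect of speeding up one line via virtual speedup.

    Thread 1 (total work t1) contains the line of interest, which runs
    `executions` times at `line_cost` each; thread 2 has total work t2,
    so the program's runtime is max(t1, t2). Rather than optimizing the
    line, we pause the *other* thread by speedup * line_cost at every
    execution, measure the perturbed runtime, and subtract the total
    inserted delay.
    """
    delay = executions * speedup * line_cost  # total inserted pause
    perturbed = max(t1, t2 + delay)           # runtime we would measure
    return perturbed - delay                  # predicted optimized runtime


def actual_runtime(t1, t2, executions, line_cost, speedup):
    """Runtime if the line really were a `speedup` fraction faster."""
    return max(t1 - executions * speedup * line_cost, t2)


# When thread 1 dominates, the prediction matches a real optimization;
# when thread 2 dominates, both correctly report no improvement.
assert predicted_runtime(100, 60, 10, 2, 0.5) == actual_runtime(100, 60, 10, 2, 0.5)
assert predicted_runtime(100, 120, 10, 2, 0.5) == actual_runtime(100, 120, 10, 2, 0.5)
```

In this simplified model the prediction is exact; a real execution is not this clean, and Coz approximates the experiment by sampling executions of the selected line rather than intercepting every one.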
6. Conclusion
Profilers are the primary tool in the programmer’s toolbox
for identifying performance tuning opportunities. Previous
profilers only observe actual executions and correlate code
with execution time or performance counters. This information can be of limited use because the amount of time spent
does not necessarily correspond to where programmers
should focus their optimization efforts. Past profilers are
also limited to reporting end-to-end execution time, an unimportant quantity for servers and interactive applications whose key metrics of interest are throughput and latency. Causal profiling is a new, experiment-based
approach that establishes causal relationships between
hypothetical optimizations and their effects. By virtually
speeding up lines of code, causal profiling identifies and
quantifies the impact on either throughput or latency of any
degree of optimization to any line of code. Our prototype
causal profiler, Coz, is efficient, accurate, and effective at
guiding optimization efforts. Coz is now a standard package on current Debian and Ubuntu platforms; it can be installed via the command sudo apt-get install coz-profiler, or built from source on any Linux distribution. All source is available at http://coz-profiler.org.
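For throughput and latency, Coz relies on developer-inserted progress points rather than end-to-end runtime; as the full paper describes, the latency between a pair of progress points can be derived from throughput alone using Little’s Law,12 with no per-request timestamps. A minimal sketch of that arithmetic follows (the names are illustrative, not Coz’s API):

```python
def average_latency(arrivals, completions, throughput):
    """Derive mean latency from two counters via Little's Law (L = lambda * W).

    With one progress point at request arrival and another at completion,
    the number of requests currently in flight is arrivals - completions;
    dividing that by the completion rate (throughput) gives the mean
    latency without timestamping any individual request.
    """
    in_flight = arrivals - completions  # L: mean number in the system
    return in_flight / throughput       # W = L / lambda


# 500 arrivals and 480 completions observed, at 40 completions/second:
# 20 requests in flight / 40 per second = 0.5 seconds average latency.
assert average_latency(500, 480, 40) == 0.5
```

This is why a causal profiler can report the latency impact of a virtual speedup: it only needs to observe how the two counters and the throughput change during the experiment.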
Acknowledgments
This material is based upon work supported by the National
Science Foundation under Grants No. CCF-1012195 and
CCF-1439008. Charlie Curtsinger was supported by a Google
PhD Research Fellowship. The authors thank Dan Barowy,
Steve Freund, Emma Tosch, John Vilk, and Tim Harris for
their feedback and helpful comments.
References
1. Altman, E.R., Arnold, M., Fink, S., Mitchell, N. Performance analysis of idle programs. In OOPSLA (ACM, New York, NY, 2010).
2. Curtsinger, C., Berger, E.D. Stabilizer: Statistically sound performance evaluation. In ASPLOS (ACM, New York, NY, 2013).
3. Curtsinger, C., Berger, E.D. Coz: Finding code that counts with causal profiling. In SOSP (ACM, New York, NY, 2015), 184–197.
4. David, F., Thomas, G., Lawall, J., Muller, G. Continuously measuring critical section pressure with the free-lunch profiler. In OOPSLA (ACM, New York, NY, 2014), 291–307.
5. Free Software Foundation. Debugging with GDB, 10th edn. The Free Software Foundation, Boston, MA.
6. Garcia, S., Jeon, D., Louie, C.M., Taylor, M.B. Kremlin: Rethinking and rebooting gprof for the multicore age. In PLDI (ACM, New York, NY, 2011), 458–469.
7. Graham, S.L., Kessler, P.B., McKusick, M.K. Gprof: A call graph execution profiler. In SIGPLAN Symposium on Compiler Construction (ACM, New York, NY).
8. Intel. Intel VTune Amplifier, 2015.
9. kernel.org. perf: Linux profiling with performance counters, 2014.
10. Kulkarni, M., Pai, V.S., Schuff, D.L. Towards architecture independent metrics for multicore performance analysis. SIGMETRICS Performance Evaluation Review 38, 3 (2010), 10–14.
11. Levon, J., Elie, P. OProfile: A system profiler for Linux, 2004.
12. Little, J.D. OR FORUM: Little’s Law as viewed on its 50th anniversary. Operations Research 59, 3 (2011).
13. Miller, B.P., Yang, C.-Q. IPS: An interactive and automatic performance measurement tool for parallel and distributed programs. In ICDCS (1987), 482–489.
14. Mytkowicz, T., Diwan, A., Hauswirth, M., Sweeney, P.F. Evaluating the accuracy of Java profilers. In PLDI (ACM, New York, NY, 2010).
15. Snelick, R., JáJá, J., Kacker, R., Lyon, G. Synthetic-perturbation techniques for screening shared memory programs. Software Practice & Experience 24, 8.
16. von Praun, C., Bordawekar, R., Cascaval, C. Modeling optimistic concurrency using quantitative dependence analysis. In PPoPP (ACM, New York, NY, 2008), 185–196.
17. Zhang, X., Navabi, A., Jagannathan, S. Alchemist: A transparent dependence distance profiling infrastructure. In CGO (IEEE Computer Society, 2009).
Copyright held by owners/authors. Publication rights licensed to ACM.
Charlie Curtsinger (curtsinger@grinnell.edu), Department of Computer Science, Grinnell College, USA.
Emery D. Berger, College of Information and Computer Sciences, University of Massachusetts Amherst, USA.