Figure 2. The gprof and causal profiles for the code in Figure 1. In the causal profile, the y-axis shows the program speedup that would be
achieved by speeding up each line of code by the percentage on the x-axis. The gray area shows standard error. Gprof reports that a and b
comprise similar fractions of total runtime, but optimizing a will improve performance by at most 4.5%, while optimizing b would have no
effect on performance. The causal profile predicts both outcomes within 0.5%.
[Figure 2 panels: (a) a gprof profile for example.cpp, attributing 55.2% of runtime to a() (7.20 s self, 1 call) and 45.2% to b() (5.89 s self, 1 call); (b) the causal profile for example.cpp, plotting program speedup (0% to 6%) against line speedup (0% to 100%) for line 2 (a) and line 5 (b).]

2. It presents Coz, a causal profiler that works on unmodified Linux binaries. It describes Coz's implementation (Section 3), and demonstrates its efficiency and effectiveness.
92 COMMUNICATIONS OF THE ACM | JUNE 2018 | VOL. 61 | NO. 6
mimic the effect of optimizing a specific fragment of code
by a fixed amount. Fragments could be functions, basic
blocks, or source lines. A fragment is virtually sped up by
inserting pauses to slow all other threads each time the
fragment runs. The key insight is that this slowdown has the
same relative effect as running that fragment faster, thus
“virtually” speeding it up. Figure 3 shows the equivalence of
virtual and actual speedups.
Figure 3. An illustration of virtual speedup: (a) shows the original
execution of two threads running functions f and g; (b) shows the
effect of actually speeding up f by 40%; (c) shows the effect of
virtually speeding up f by 40%. Each time f runs in one thread, all
other threads pause for 40% of f's original execution time (shown as
ellipses). The difference between the runtime in (c) and the original
runtime plus n_f · d (the number of times f ran, multiplied by the delay
size) is the same as the effect of actually optimizing f.
Each performance experiment measures the effect of
virtually speeding up a fragment of code by a specific amount.
Speedups are measured in percent change; a speedup of 0%
means the fragment’s runtime is unchanged, while 75%
means the fragment takes a quarter of its original runtime.
By conducting many performance experiments over a range
of virtual speedups, a causal profiler can predict the effect
of any potential optimization on a program’s performance.
Causal profiling further departs from conventional profiling by making it possible to view the effect of optimizations
on both throughput and latency. To profile throughput,
developers specify a progress point, indicating a line in the
code that corresponds to the end of a unit of work. For example, a progress point could be the point at which a transaction concludes, when a web page finishes rendering, or
when a query completes. A causal profiler then measures the
rate of visits to each progress point to determine any potential optimization’s effect on throughput. To profile latency,
programmers instead place progress points at the start and
end of an event of interest, such as when a transaction begins
and completes. A causal profiler then reports the effect of
potential optimizations on the average latency between
those two progress points.
To demonstrate the effectiveness of causal profiling, we
have developed Coz, a causal profiler for Linux. We show that
causal profiling accurately predicts optimization opportunities, and that it is effective at guiding optimization efforts. We
apply Coz to Memcached, SQLite, and the extensively studied
PARSEC benchmark suite. Guided by Coz's output, we improve
the performance of Memcached by 9%, SQLite by 25%, and six
PARSEC applications by as much as 68%. Our changes typically
require modifying under 10 lines of code. We also show that
Coz imposes low execution overhead. Where possible, we
compare the observed performance improvements to Coz's
predictions. In each case, we find that the real effect of our
optimization matches Coz's prediction.
1.1. Contributions
This paper makes the following contributions:
1. It presents causal profiling, which identifies code
where optimizations will have the largest impact.
Using virtual speedups and progress points, causal profiling directly measures the effect of potential optimizations on both throughput and latency (Section 2).