Figure 4. COZ and perf output for SQLite before optimizations. The three lines in the causal profile correspond to the function prologues for
sqlite3MemSize, pthreadMutexLeave, and pcache1Fetch. A small optimization to each of these lines will improve program performance,
but beyond about a 25% speedup, COZ predicts that the optimization would actually lead to a slowdown. Changing indirect calls into direct
calls for these functions improved overall performance by 25.6% ± 1.0%. While the perf profile includes the functions we changed, they
account for just 0.15% of the sampled runtime.
Causal profile for SQLite
Perf profile for SQLite
0% 50% 0% 50% 0% 50% P
r
og
r
a
m
s
p
ee
d
up
Line 16916 Line 18974 Line 40345
25%
0%
−25%
−50%
... 32 lines ...
Runtime Symbol
85.55% _raw_spin_lock
0.09% sqlite3MemSize
... 28 lines ...
0.03% pcache1Fetch
0.03% kmem_cache_free
0.03% update_cfs_rq_blocked_load
0.03% pthreadMutexLeave
Line speedup
(b) Perf’s output for SQLite before optimizations. (a) COZ’s output for SQLite before optimizations.
Coz identifies the source line hashtable.c:217 as the
best opportunity for optimization. This code is the top of the
while loop in hashtable_search that traverses the
linked list of entries that have been assigned to the same
hash bucket. These results suggest that dedup’s shared hash
table has a significant number of collisions. Increasing the
hash table size had no effect on performance. This discovery
led us to examine dedup’s hash function, which could also
be responsible for the large number of hash table collisions.
We discovered that dedup’s hash function maps keys to just
2.3% of the available buckets; over 97% of buckets were
never used during the entire execution.
JUNE 2018 | VOL. 61 | NO. 6 | COMMUNICATIONS OF THE ACM 97
proach to parallelism. The application spawns worker
threads that execute in a series of concurrent phases, with
each phase separated from the next by a barrier. We placed a
progress point immediately after the barrier, so it executes
each time all threads complete a phase of the computation.
For dedup, Coz identified the top of the while loop that
traverses a hash bucket’s linked list. By replacing the degen-
erate hash function, we reduced the average number of ele-
ments in each hash bucket from 76. 7 to 2.09. This change
reduces the number of iterations from 77. 7 to 3.09 (account-
ing for the final trip through the loop). This reduction
parallel file compression via deduplication. This process
is divided into three main stages: fine-grained fragmenta-
tion, hash computation, and compression. We placed a
progress point immediately after dedup completes com-
pression of a single block of data (encoder.c:189).
In both benchmarks, Coz also identified the same points
of contention: PARSEC’s barrier implementation. Figure 5
shows the evidence for this contention in fluidanimate. This
profile tells us that optimizing the indicated line of code
would actually slow down the program, rather than speed it
up. Both lines run immediately before entering a loop that
repeatedly calls pthread_mutex_trylock. Removing
this spinning from the barrier would reduce contention, but
it was easier to replace the custom barrier with the default
pthread_barrier implementation. This one line change
led to a 37.5% ± 0.56% speedup in fluidanimate, and a 68.4%
± 1.12% speedup in streamcluster.
The original hash function adds characters of the hash
table key, which leads to virtually no high order bits being
set. The resulting hash output is then passed to a bit shifting procedure intended to compensate for poor hash functions. We removed the bit shifting step, which increased
hash table utilization to 54.4%. We then changed the hash
function to bitwise XOR 32 bit chunks of the key. Our new
hash function increased hash table utilization to 82.0% and
resulted in an 8.95% ± 0.27% performance improvement.
The entire optimization required changing just three lines
of code, and the entire profiling and tuning effort took just
two hours.
Effectiveness Summary. Our case studies confirm that Coz
is effective at identifying optimization opportunities and
guiding performance tuning. In every case, the information
Coz provided led us directly to the optimization we implemented. In most cases, Coz identified around 20 lines of
interest, with as many as 50 for larger programs (Memcached
and x264). Coz identified optimization opportunities in all
of the PARSEC benchmarks, but some required more invasive changes that are out of scope for this paper.
4. 3. Accuracy
Comparison with gprof. We ran both the original and optimized versions of dedup with gprof. The optimization
opportunities identified by Coz were not obvious in gprof’s
output. Overall, hashtable_search had the largest share
of highest execution time at 14.38%, but calls to
hashtable_search from the hash computation stage
accounted for just 0.48% of execution time; Gprof’s call
graph actually obscured the importance of this code. After
optimization, hashtable_search’s share of execution
time reduced to 1.1%.
For most of the optimizations described above, it is not possible to quantify the effect our optimization had on the specific lines that Coz identified. However, for two of our case
studies—ferret and dedup—we can directly compute the
effect our optimization had on the line Coz identified and
compare the resulting speedup to Coz’s predictions. Our
results show that Coz’s predictions are highly accurate.
Case study: Fluidanimate and streamcluster. The fluidanimate and streamcluster benchmarks use a similar ap-