Table 2. Summary of optimization results. All benchmarks were run 10 times before and after optimization. Standard error for speedup was computed using Efron's bootstrap method. All speedups are statistically significant at the 99.9% confidence level (α = 0.001) using the one-tailed Mann-Whitney U test. Diff size reports the number of lines removed and added.

Application     Speedup           Diff size   Source lines
Memcached        9.39% ± 0.95%    −6, +2        10,475
SQLite          25.60% ± 1.00%    −7, +7        92,635
blackscholes     2.56% ± 0.41%    −61, +4          342
dedup            8.95% ± 0.27%    −3, +3         2,570
ferret          21.27% ± 0.17%    −4, +4         5,937
fluidanimate    37.5%  ± 0.56%    −1, +0         1,015
streamcluster   68.4%  ± 1.12%    −1, +0         1,779
swaptions       15.8%  ± 1.10%    −10, +16         970
implementation-specific functions, but most of these functions are only ever given a default value determined by compile-time options. The three functions Coz identified
unlock a standard pthread mutex, retrieve the next item
from a shared page cache, and get the size of an allocated
object. These simple functions do very little work, so the
overhead of the indirect function call is relatively high.
Replacing these indirect calls with direct calls resulted in a
25.60% ± 1.00% speedup.
Comparison with conventional profilers. Unfortunately,
running SQLite with gprof segfaults immediately. The
application does run with the Linux perf tool, which
reports that the three functions Coz identified account for
just 0.15% of total runtime (Figure 4b). Using
perf, a developer would be misled into thinking that optimizing these functions would be a waste of time. Coz accurately shows that the opposite is true: optimizing these
functions has a dramatic impact on performance.
Case study: Memcached. Memcached is a widely-used
in-memory caching system. To evaluate cache performance, we ran a benchmark ported from the Redis performance benchmark. This program spawns 50 parallel clients that collectively issue 100,000 SET and GET requests
for randomly chosen keys. We placed a progress point at
the end of the process_command function, which handles each client request.
Most of the lines Coz identifies are cases of contention,
with a characteristic downward-sloping causal profile plot.
One such line is at the start of item_remove, which locks
an item in the cache and then decrements its reference
count, freeing it if the count goes to zero. To reduce lock
initialization overhead, Memcached uses a static array of
locks to protect items, where each item selects its lock
using a hash of its key. Consequently, locking any one item
can potentially contend with independent accesses to
other items whose keys happen to hash to the same lock
index. Reference counts are updated atomically inside this
critical section, and the thread that decrements the reference count to zero has exclusive access to the item. As a
result, we could safely remove the lock from this function,
which yielded a 9.39% ± 0.95% speedup.
Case study: Dedup. The dedup application performs
Optimizations & phase correction. Coz attempts to minimize the number of delays inserted when all threads must
be paused. Minimizing delays reduces the performance
overhead of profiling without changing the accuracy of virtual speedups. The profile analysis also includes a correction for programs with phases; if left uncorrected, phases
could overstate the potential effect of an optimization. For
details, see our SOSP paper.³
Our evaluation answers the following questions: (1) Does
causal profiling enable effective performance tuning?
(2) Are Coz's performance predictions accurate? (3) Is
Coz's overhead low enough to be practical?
4.1. Experimental setup
We perform all experiments on a 64-core, four-socket
Advanced Micro Devices (AMD) Opteron machine with 60GB
of memory, running Linux 3.14 with no modifications. All
applications are compiled using GNU Compiler Collection
(GCC) version 4.9.1 at the -O3 optimization level and debug
information generated with -g. We disable frame pointer
elimination with the -fno-omit-frame-pointer flag so
Linux can collect accurate call stacks with each sample. Coz
is run with the default sampling period of 1ms, with sample
processing set to occur after every 10 samples. Each performance experiment ends with a 10ms cooling-off period, allowing any remaining samples to be processed before the next experiment begins. Due to space limitations, we only profile throughput (and not latency) in this evaluation.
4.2. Effectiveness
We demonstrate causal profiling’s effectiveness through
case studies. Using Coz, we collect causal profiles for
Memcached, SQLite, and the PARSEC benchmark suite.
Using these causal profiles, we were able to make small
changes to two of the real applications and six PARSEC
benchmarks, resulting in performance improvements as
large as 68%. Table 2 summarizes the results of our optimization efforts. We describe some of our experience using Coz below.
Case study: SQLite. The SQLite database library is
widely used by many applications to store relational data.
This embedded database, which can be included as a single large C file, is used by many applications such as
Firefox, Chrome, Safari, Opera, Skype, and iTunes, and is a
standard component of Android and iOS. We evaluated
SQLite performance using a write-intensive parallel workload, where each thread rapidly inserts rows into its own
private table. While this benchmark is synthetic, it exposes any scalability bottlenecks in the database engine itself
because all threads should theoretically operate independently. We placed a progress point in the benchmark itself (which is linked with the database), which executes
after each insertion.
Coz identified three important optimization opportunities, as shown in Figure 4a. At startup, SQLite populates a
large number of structs with function pointers to