This result lets Coz virtually speed up selected lines without instrumentation. Inserting a delay that is one quarter of
the sampling period will virtually speed up the selected line
by 25%.
Pausing threads. When one thread receives a sample in
the line selected for virtual speedup, all other threads
must pause. Rather than using POSIX signals, which
would have prohibitively high overhead, Coz controls
inter-thread pausing using counters. The first counter,
shared by all threads, records the number of times each
thread should have paused so far. Each thread has a local
counter of the number of times that thread has already
paused. Whenever a thread’s local count of pauses is less
than the number of required pauses in the global counter,
a thread must pause (and increment its local counter). To
signal all other threads to pause, a thread simply increments both the global counter and its own local counter.
Every thread checks if pauses are required after processing its own samples.
Thread creation. To start sampling and adjust delays, Coz
intercepts calls to the pthread_create function. When a
new thread is created, Coz first begins sampling in the new
thread. It then inherits the parent thread’s local delay count;
any previously inserted delays to the parent thread also
delayed the creation of the new thread.
Handling suspended threads. Coz only collects samples
and inserts delays in a thread while that thread is actually
executing. As a result, a backlog of required delays will accumulate in a thread while it is suspended. When a thread is
suspended on a blocking I/O operation, this is the desired
behavior; pausing the thread while it is already suspended
on I/O would not delay the thread’s progress. Coz simply
adds these delays after the thread unblocks.
However, a thread can also be suspended while waiting
for another thread using pthreads synchronization operations. As with blocking I/O, required delays will accumulate
while the thread is suspended but Coz may not need to
insert all of these delays when the thread resumes. When a
thread resumes after waiting on a lock, another thread must
have released the lock. If the unlocking thread has executed
all the required delays, then the blocked thread has effectively already been delayed. The suspended thread should be
credited for any delays inserted in the thread responsible for
waking it up. Otherwise, the thread should insert all the necessary delays that accumulated during the time the thread
was suspended. To implement this policy, Coz forces
threads to execute all required delays before blocking or
waking other threads.
Attributing samples to source lines. Samples are attributed to source lines using the source map constructed at
startup. When a sample does not fall in any in-scope
source line, the profiler walks the sampled callchain to
find the first in-scope address. This policy has the effect of
attributing all out-of-scope execution to the last in-scope
callsite responsible. For example, a program may call
printf, which calls vfprintf, which in turn calls
strlen. Any samples collected during this chain of calls
will be attributed to the source line that issues the original
printf call.
If the real runtime of f was t ¯f − d, but we forced every
thread in the program to pause for time d after f ran
(including the thread that just executed f) we would measure the same total runtime as with a virtual speedup. The
only difference between virtual speedup and a real speedup
with these additional pauses is that we use the time d to
allow one thread to finish executing f. The pauses inserted
for virtual speedup increase the total runtime by
nf · d, where nf is the total number of times f is called by any
thread. Subtracting nf · d from the total runtime with virtual
speedup gives us the execution time we would measure if f
had runtime t ¯f − d.
Implementing virtual speedup with sampling. The previ-
ous discussion of virtual speedups assumes an implemen-
tation where each execution of small code fragment—a
single line of code—causes all other threads instanta-
neously pause for a fraction of the time require to run the
line. Unfortunately, this approach would incur prohibi-
tively high overhead that would distort program execution.
Instead, Coz periodically samples the program counter
and counts samples that fall in the line selected for virtual
speedup. Other threads are delayed proportionally to the
number of samples. The number of samples in f with a sam-
pling period of P is approximately,
( 1)
In our original model of virtual speedups, delaying other
threads by time d each time the selected line is executed has
the effect of shortening this line’s runtime by d. With sam-
pling, only some executions of the selected line will result in
delays. The effective runtime of the selected line when sam-
pled is t ¯f − d, while executions of the selected line that are not
sampled simply take time t ¯f . The effective average time to run
the selected line is,
( 2)
Using ( 1), this reduces to,
( 3)
The relative difference between t and ēf, the amount of
virtual speedup, is simply,
( 4)
Table 1. Notation used in Section 3. 4.
Table of notation
t ¯f Average runtime of code fragment f
d Length of delay inserted for virtual speedup
nf Total number of executions of code fragment f
ēf Effective average runtime of f with virtual speedup
P Sampling period
sf Number of samples in code fragment f