Consider an asymmetric processor
with one fast core and nine slow cores,
where the fast core delivers approximately twice as much single-threaded
performance as the slow core. Let’s say
we run a workload of one sequential
application and one parallel application, and the parallel application has
nine threads. Under a naïve scheduling
policy that equally shares fast and slow
Figure 2. an illustration of how a Pa scheduler would accelerate a parallel application
limited by a sequential bottleneck on an amP processor.
Sequential phase runs
on a SLOW core
Sequential phase runs
on a FAST core
Parallel phase
Sequential phase
Completion time
Parallel phase
Sequential
phase
Completion time
Completion time
reduced by 25%
Figure 3. Speedup achieved with a Pa algorithm over the asymmetry-agnostic default
scheduler on an emulated amP system.
30%
Speedup over Default Scheduler
25%
20%
15%
10%
5%
0%
vips
Rna
wupwise
applu
swim
swaptions
tPC-C
FFtW
bodytrack
BLaSt
Figure 4. Speedup achieved with a Pa algorithm over the asymmetry-agnostic default
scheduler using the busy-wait and adaptive synchronization modes.
PA (busy-wait mode) PA (adaptive mode)
Speedup over Default Scheduler
40%
30%
20%
10%
0%
art
ammp
mG
Ft
semphy
cores among threads, each thread will
run on the fast core 10% of the time and
on the slow core 90% of the time. (Note
that existing operating-system schedulers would not share complex and simple
cores equally. Distribution of time on
different core types would be arbitrary.
We assume a policy that shares cores
equally to simplify the example.) To simplify comparison of different scheduling policies, we use as our performance
measure the overall workload speedup
relative to running all threads on slow
cores the entire time. Under this naïve
policy, each thread will speed up by 1. 1×
relative to running on a slow core (to
work this out, consider that each thread
runs at a speed of 2× for 10% of the time
and at a speed of 1× for 90% of the time),
and the workload-wide speedup will also
be 1. 1×. Note that when computing the
speedup for a parallel application we assume that the speedup for the entire application is close to the average speedup
of its threads rather than the aggregate
speedup—this assumption is reasonable
if threads of a parallel computation are
working on a common task as opposed
to performing unrelated tasks requiring
no inter-thread communication.
Under a PA policy, the single-threaded application will run on the fast core
for the entire time, and the threads of
the parallel application will run on slow
cores. As a result, the single-threaded
application will speed up by a factor of
2×, and the parallel application will have
no speedup ( 1×). The average speedup
for the two applications will be 1. 5×, or
40% better than under the naïve policy.
As another example, consider a parallel application with 50% of its code
executing sequentially. An asymmetry-unaware scheduler may fail to assign
the bottleneck sequential phase to run
on a fast core, but a PA scheduler would
make sure to accelerate it. Suppose the
fast core runs twice as fast as the slow
core: the PA scheduler would deliver up
to 25% performance improvement to
that application (see Figure 2).
Figure 3 shows the performance of a
number of parallel applications on an
emulated AMP system with our implementation of a PA scheduler in OpenSolaris relative to the default asymmetry-agnostic scheduler in that operating
system. To emulate AMP we use a real
multicore system (AMD Opteron with
16 cores), and we emulate fast cores by