using a high clock frequency (2.3GHz) and slow cores by using a low frequency (1.15GHz). The implementation of the algorithm and the experimental platform are described in more detail in a previous work.10 In this experiment we use four fast and 12 slow cores.
The applications used in Figure 3 are
drawn from several benchmark suites,
such as SPEC OpenMP 2001, PARSEC,
MineBench, and NAS. RNA is a bioinformatics application performing RNA
The figure shows a variety of speedup
values. Applications with significant sequential phases (40%–60%) experience
a performance improvement of up to
26% relative to an asymmetry-unaware
scheduler. Applications with small sequential phases (wupwise, for example)
experience no speedup from the PA policy, because they do not stand to benefit
from acceleration of sequential phases
on fast cores.
Here, we discuss two main challenges involved in delivering the benefits
of the PA scheduling policy to applications: effectively detecting sequential
and parallel phases and avoiding performance overhead that may result
from cross-core migrations.
Detecting sequential and parallel
phases. The simplest way to detect parallel and sequential phases in an application is to use the runnable thread
count as a heuristic. If an application
uses a large number of threads, then its
runnable thread count will be high, and
this application would be in a parallel
phase. Conversely, an application with
a single runnable thread would be in a
sequential phase. A convenient property of the runnable thread count is that in modern multithreading environments it is visible to the operating system, because these systems map application-level threads to kernel-level threads.
Therefore, by monitoring the runnable
thread count an operating system can
distinguish between parallel and sequential phases in applications.
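As an illustration of this heuristic, the sketch below approximates it from user level on Linux by counting a process's threads that are currently in the runnable ('R') state via /proc; an asymmetry-aware scheduler would track the same count directly inside the kernel. The /proc parsing is deliberately simplified (it assumes thread names without spaces), and the one-runnable-thread threshold mirrors the discussion above.

/* Sketch: classify a process as being in a sequential or parallel phase
 * by counting its runnable ('R') threads through /proc.  An OS-level
 * scheduler would read this count directly from its run queues. */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

static int count_runnable_threads(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);

    DIR *dir = opendir(path);
    if (!dir)
        return -1;

    int runnable = 0;
    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        if (de->d_name[0] == '.')
            continue;                        /* skip "." and ".." */

        char stat_path[128];
        snprintf(stat_path, sizeof(stat_path),
                 "/proc/%d/task/%s/stat", (int)pid, de->d_name);
        FILE *f = fopen(stat_path, "r");
        if (!f)
            continue;

        /* Field 3 of .../stat is the thread state; this simple parse
         * assumes the thread name (field 2) contains no spaces. */
        char state = '?';
        if (fscanf(f, "%*d %*s %c", &state) == 1 && state == 'R')
            runnable++;
        fclose(f);
    }
    closedir(dir);
    return runnable;
}

int main(int argc, char **argv)
{
    pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : getpid();
    int runnable = count_runnable_threads(pid);
    printf("%d runnable thread(s): %s phase\n",
           runnable, runnable > 1 ? "parallel" : "sequential");
    return 0;
}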
Unfortunately, in some situations
using the runnable thread count for
detection of sequential phases might
not work. In particular, an application
could be running nonscalable code
while still using a large number of runnable threads. We describe two scenarios where this might happen and
discuss potential remedies.
In one scenario, an application might
be susceptible to an external scalability
bottleneck—for example, as a result of
memory bandwidth contention. In this
case the system memory bus is saturated, and using additional threads does
not speed up the application because
those threads do not contribute to useful computation. A sensible way to solve
this problem is to reduce the number of
threads used in an application to the
point where the application operates
at its peak efficiency. Essentially, this
boils down to configuring the number
of threads properly in a parallel application. Suleman et al. describe a technique called feedback-driven threading, which makes it possible to dynamically determine the optimal thread count for an application.
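The following sketch conveys only the general flavor of such feedback, not the actual feedback-driven threading mechanism described by Suleman et al.: it times a short calibration slice of the parallel work at several candidate thread counts and keeps the fastest. The run_work_slice() hook and the doubling search over thread counts are assumptions made for illustration.

/* Sketch of the general idea behind picking a thread count by feedback:
 * time a short calibration slice of the parallel work at several
 * candidate thread counts and keep the one that runs fastest. */
#include <omp.h>
#include <stdio.h>

/* Hypothetical hook supplied by the application: runs one representative
 * slice of its parallel work with the currently configured thread count. */
extern void run_work_slice(void);

int pick_thread_count(int max_threads)
{
    int best_count = 1;
    double best_time = 1e30;

    for (int n = 1; n <= max_threads; n *= 2) {   /* try 1, 2, 4, ... */
        omp_set_num_threads(n);
        double start = omp_get_wtime();
        run_work_slice();
        double elapsed = omp_get_wtime() - start;

        /* When an external bottleneck such as memory-bandwidth contention
         * sets in, adding threads stops paying off and the elapsed time
         * flattens or grows; such counts never become the best. */
        if (elapsed < best_time) {
            best_time = elapsed;
            best_count = n;
        }
    }
    return best_count;
}

The application would then run its remaining parallel work with the returned count (for example, via omp_set_num_threads); more elaborate schemes recalibrate periodically as the application's behavior changes.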
In another scenario, an application might be limited by internal scalability bottlenecks: for example, there might be a load imbalance where some threads do more work than others, or there might be synchronization bottlenecks where one thread executes the
code in a critical section while other
threads wait. When threads wait they
may either block, relinquishing the
CPU, or busy-wait, spinning on the CPU
in a tight loop. If threads block, then
the runnable thread count is reduced
and any such reduction is visible to the operating system; but if threads busy-wait, sequential phases might be hidden from it.
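To make the distinction concrete, here is a small illustrative sketch using POSIX threads (the ready flags and function names are assumptions, not drawn from any particular application): the blocking waiter drops out of the runnable state, while the busy-waiting one does not.

/* Two ways a thread can wait, as seen by the operating system.  A blocked
 * waiter is removed from the run queue (its state is no longer 'R'), so
 * the runnable thread count drops; a busy-waiting thread stays runnable
 * even though it does no useful work. */
#include <pthread.h>
#include <stdatomic.h>
#include <sched.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready;                       /* protected by lock */
static atomic_int ready_flag;           /* polled by the spinning waiter */

/* Blocking wait: the kernel puts the thread to sleep, so the runnable
 * thread count reflects the sequential phase. */
static void wait_blocking(void)
{
    pthread_mutex_lock(&lock);
    while (!ready)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

/* Busy-wait: the thread keeps spinning (state 'R'), hiding the sequential
 * phase from a runnable-thread-count heuristic. */
static void wait_spinning(void)
{
    while (!atomic_load_explicit(&ready_flag, memory_order_acquire))
        sched_yield();                  /* stays runnable the whole time */
}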
Whether the application uses blocking or busy-waiting depends on the
implementation of the synchronization
primitives, which are used to construct
critical sections or barriers. Busy-waiting
makes sense on a multiprocessor when
the wait times are expected to be short.
Blocking a thread is an expensive operation that should be avoided during short
waiting periods. If the wait time is long,
however, blocking is usually preferred
so as to avoid wasting CPU resources.
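A minimal sketch of one way to combine the two, spinning briefly and blocking only if the wait drags on, follows; it uses POSIX threads, and SPIN_LIMIT and the helper names are illustrative assumptions rather than any particular library's implementation.

/* Sketch of a spin-then-block wait: busy-wait briefly in case the wait is
 * short, then fall back to blocking so a long wait neither burns CPU time
 * nor inflates the runnable thread count.  SPIN_LIMIT is an arbitrary
 * illustrative value; real libraries tune or adapt it. */
#include <pthread.h>
#include <stdatomic.h>

#define SPIN_LIMIT 1000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static atomic_int ready;

void wait_spin_then_block(void)
{
    /* Phase 1: short busy-wait, cheap if the flag is set almost at once. */
    for (int i = 0; i < SPIN_LIMIT; i++)
        if (atomic_load_explicit(&ready, memory_order_acquire))
            return;

    /* Phase 2: the wait looks long; block so the OS can see that this
     * thread is no longer runnable. */
    pthread_mutex_lock(&lock);
    while (!atomic_load(&ready))
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

void signal_ready(void)
{
    pthread_mutex_lock(&lock);
    atomic_store_explicit(&ready, 1, memory_order_release);
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}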
One of the most popular strategies used in synchronization libraries is the adaptive mode, where a thread busy-waits
Figure 5. Two configurations of an AMP system. Large squares represent memory domains with one or more cores and a last-level cache (LLC) inside the domain. "Fast" cores are denoted by large red boxes; "slow" cores are denoted by small blue boxes.
Figure 6. Performance overhead relative to the default scheduler on a migration-unfriendly topology (as in Figure 5a) and on a migration-friendly topology (as in Figure 5b, but with three slow cores in each memory domain). Lower numbers denote higher overhead. (Y-axis: speedup over the default scheduler.)