tempts to avoid cross-memory-domain
migrations whenever possible. We measured the overhead by making the designated fast cores on these two systems
run at the same frequency as the slow
cores. No performance gains were
to be expected from asymmetry-aware
scheduling in this setup, but the overhead was still
present, because our scheduler still migrated threads across cores as if
the system were asymmetric.
Comparing application performance under the PA scheduler with performance under the
default scheduler reveals the migration-related overhead.
In this setup, any performance degradation
under the PA scheduler is attributable to
migration overhead. As Figure 6 shows,
performance overhead can be quite
significant on a migration-unfriendly
system, but it becomes negligible on a
migration-friendly system coupled with
a topology-aware scheduler.
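
The comparison described above reduces to a simple relative slowdown. The following is a minimal sketch, assuming overhead is expressed as the percentage increase in completion time under the PA scheduler relative to the default scheduler when all cores run at the same frequency. The function name and the sample numbers are hypothetical and are not measurements from Figure 6.

```python
def migration_overhead_pct(default_time_s: float, pa_time_s: float) -> float:
    """Percent slowdown under the PA scheduler relative to the default one.

    Assumes both runs used cores at identical frequency, so any slowdown
    under the PA scheduler is attributed to thread migrations.
    """
    if default_time_s <= 0:
        raise ValueError("default_time_s must be positive")
    return (pa_time_s - default_time_s) * 100.0 / default_time_s


# Hypothetical completion times in seconds (illustrative, not measured data).
overhead = migration_overhead_pct(default_time_s=100.0, pa_time_s=118.0)
print(f"migration overhead: {overhead:.1f}%")
```

A value near zero would correspond to the migration-friendly case, and a large positive value to the migration-unfriendly one.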
In summary, a parallelism-aware
scheduling policy can deliver real performance improvements on asymmetric hardware for parallel applications
limited by sequential phases. The key
is to configure the synchronization library to “reveal” the sequential phases
to the scheduler. To avoid cross-memory-domain migration overhead, AMP
systems should be designed so that
fast cores share a memory domain with
some of the slow cores, and they should be paired
with a topology-aware scheduler that
minimizes cross-domain migrations.
A Scheduler Catering to Microarchitectural Diversity. Remember that the
idea of catering to microarchitectural
diversity of the workload is to assign
CPU-intensive threads (or phases of execution) to fast cores and memory-intensive threads (or phases) to slow cores.
Recall from Figure 1 that CPU-intensive
code will experience a higher relative
speedup running on fast vs. slow cores
than memory-intensive code, so scheduling it on fast cores is more efficient
in a cost-benefit analysis. Just like the
PA policy, this policy will be inherently
unfair: it may improve performance
of some applications at the expense of
others, but it will improve the efficiency
of the system as a whole.
The biggest challenge in implementing such an algorithm is to classify threads or phases of execution as
CPU-intensive or memory-intensive at
scheduling time. Two approaches were
proposed in the research community
to address this challenge. The first approach entails running each thread on
cores of different types, registering the
speedup obtained on a fast core relative to a slow core and using the resulting relative speedup as the measure for
classifying the applications. In a scheduling algorithm, a thread with a larger
relative speedup would be given preference to run on a fast core, and a thread
with a lower relative speedup would be
more likely to run on a slow core. Since
this approach relies on direct measurement of relative speedup, we refer to it
as the direct measurement approach.
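
The ranking step of the direct measurement approach can be sketched as follows. This is an illustration under stated assumptions, not the scheduler described in the article: it assumes each thread has already been sampled once on each core type, that throughput is recorded as instructions per second, and that fast cores are handed out strictly in order of relative speedup. The names `ThreadSample` and `assign_cores` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThreadSample:
    name: str
    fast_ips: float  # instructions per second observed on a fast core
    slow_ips: float  # instructions per second observed on a slow core

    @property
    def relative_speedup(self) -> float:
        # Speedup of a fast core relative to a slow core for this thread.
        return self.fast_ips / self.slow_ips


def assign_cores(samples, num_fast_cores):
    """Give fast cores to the threads with the largest relative speedup."""
    ranked = sorted(samples, key=lambda s: s.relative_speedup, reverse=True)
    fast = [s.name for s in ranked[:num_fast_cores]]
    slow = [s.name for s in ranked[num_fast_cores:]]
    return fast, slow


# Hypothetical samples: "C" is the most CPU-intensive, "B" the most memory-bound.
samples = [
    ThreadSample("A", fast_ips=4.0e9, slow_ips=2.0e9),
    ThreadSample("B", fast_ips=3.0e9, slow_ips=2.5e9),
    ThreadSample("C", fast_ips=5.0e9, slow_ips=2.0e9),
]
fast, slow = assign_cores(samples, num_fast_cores=1)
print("fast:", fast, "slow:", slow)
```

The sketch omits the periodic remeasurement that phase changes would require; it only shows how measured speedups translate into core preferences.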
A second approach, referred to as
the modeling approach, is to model
the speedup on a fast vs. slow core using a summary of an application’s runtime properties obtained either offline
or online. Modeling is less accurate
than direct measurement but does not
require running each thread on each
type of core, thus avoiding potential
load imbalance and expensive cross-core migration (we elaborate on these
issues later). In an effort to build an
asymmetry-aware algorithm that caters
to microarchitectural diversity, we have
experimented with both methods.
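
To make the modeling approach concrete, here is a deliberately simple sketch. It is not the model used in the experiments described here: it assumes the only difference between core types is clock frequency, that compute time scales inversely with frequency, and that time stalled on memory does not change. The parameter names `mem_stall_fraction` (fraction of slow-core execution time spent waiting on memory) and `freq_ratio` (fast-core frequency divided by slow-core frequency) are hypothetical.

```python
def estimated_speedup(mem_stall_fraction: float, freq_ratio: float) -> float:
    """Estimate fast-vs-slow speedup without running on both core types.

    Assumption: only the compute portion (1 - mem_stall_fraction) shrinks
    by freq_ratio on a fast core; the memory-stall portion is unchanged.
    """
    if not 0.0 <= mem_stall_fraction <= 1.0:
        raise ValueError("mem_stall_fraction must be in [0, 1]")
    if freq_ratio <= 0.0:
        raise ValueError("freq_ratio must be positive")
    fast_time = (1.0 - mem_stall_fraction) / freq_ratio + mem_stall_fraction
    return 1.0 / fast_time


# A fully CPU-bound thread gains the whole frequency ratio; a fully
# memory-bound thread gains nothing under this model.
for m in (0.0, 0.5, 1.0):
    print(m, round(estimated_speedup(m, freq_ratio=2.0), 3))
```

Under these assumptions the estimate needs only a per-thread summary statistic, which is what lets a model avoid running every thread on every core type, at the cost of accuracy.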
The direct measurement approach
manifested several performance problems. Consider a scenario where each
thread must be run on each core type
to determine its relative speedup. Given
that a running thread may switch phases of execution (that is, it may be doing
different types of processing at different
points in time), this measurement must
be repeated periodically; otherwise, the
scheduler might be operating on stale
data. Since the number of threads will
typically be larger than the number of
fast cores, there will always be a high
demand for running on fast cores for
the purpose of remeasuring relative
speedup. As a result, threads that are
“legitimately” assigned to run on fast
cores by the scheduling policy will observe undue interference from threads
trying to measure their speedup there.
Furthermore, having too many threads
“wanting” to run on scarce fast cores
may cause load imbalance, with fast
cores being busy and slow cores being
idle. When we used this direct measurement approach in an asymmetry-aware
algorithm, we found that these problems made it difficult to deliver signifi-