cant performance improvement relative
to an asymmetry-agnostic scheduler.9,11
The modeling approach involved
predicting relative speedup on different core types using certain properties
of the running programs. Since we were
keen on evaluating this approach on
real hardware (simulators put limitations on the length and number of experiments that can be performed), we
could experiment only with the asymmetry that was caused by the differences in the clock frequency of different
cores. As a result, our relative speedup
model was tuned to work for this specific type of asymmetric hardware.
(This is the only type of AMP configuration available on today’s commercial
hardware.) At the same time, we do not
see any fundamental reasons why our
model could not be adapted to work on
other single-ISA asymmetric systems.
Recall that the main factor determining how much speedup a program
would obtain from running on a fast
core is how memory-intensive the program is. A good way to capture memory-intensity is via a memory reuse profile,
a compact histogram showing how well
a program reuses its data.3 If a program
frequently reuses the memory locations it has touched in the past, then
the memory reuse profile will capture
the high locality of reference. If a program hardly ever touches the memory
values used in the past (as would a video-streaming application, for example),
the memory reuse profile will capture
that as well. Memory reuse profiles are
so powerful that they can be used to
predict with high accuracy the cache-miss rate of a program in a cache of any
size and associativity. This is precisely
the feature that we relied on in evaluating memory-intensity of programs and
building our estimation model.
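To make the idea concrete, here is a minimal sketch of how a reuse profile can be turned into a miss-rate prediction. It is not the model used in our scheduler: it assumes a fully associative LRU cache (so it ignores associativity), treats each address as one cache line, and the function names reuse_distance_histogram and predicted_miss_rate are hypothetical.

```python
from collections import Counter


def reuse_distance_histogram(trace):
    """Build a reuse profile from a sequence of accessed addresses.

    The reuse distance of an access is the number of distinct addresses
    touched since the previous access to the same address. First-time
    accesses are counted under the key None (cold misses).
    """
    stack = []  # LRU stack: the most recently used address is last
    hist = Counter()
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            hist[len(stack) - 1 - idx] += 1
            stack.pop(idx)
        else:
            hist[None] += 1
        stack.append(addr)
    return hist


def predicted_miss_rate(hist, cache_lines):
    """Estimate the miss rate of a fully associative LRU cache.

    Assumption: an access hits only if its reuse distance is smaller
    than the number of lines in the cache.
    """
    total = sum(hist.values())
    if total == 0:
        return 0.0
    misses = sum(count for dist, count in hist.items()
                 if dist is None or dist >= cache_lines)
    return misses / total


# A loop over three addresses reuses its data; a streaming scan never does.
looping = reuse_distance_histogram(["a", "b", "c", "a", "b", "c"])
streaming = reuse_distance_histogram(range(8))
```

The same histogram answers the question for any cache size: the looping trace misses on every access in a two-line cache but only on its cold accesses in a three-line cache, while the streaming trace misses everywhere.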
Without going into much detail, in
our scheduling system we associate a
memory reuse profile with each thread.
We refer to this profile as the architectural signature, since it captures how
the program uses the architectural features of the hardware. The idea is that
an architectural signature may contain
a broad range of properties needed to
model performance on asymmetric
hardware, but for our target AMP system, using just a memory reuse profile was sufficient. Using that profile,
the scheduler predicts each program’s
miss rate in the LLC, and using that
miss rate, it estimates the approximate fraction of CPU cycles that this
program will spend waiting on main
memory. The scheduler can then trivially estimate the speedup that each program will experience running on a fast
core relative to a slow core (see Figure
7). Then the scheduler simply assigns
threads with higher estimated speedups to run on fast cores and threads
with lower estimated speedups to run
on slow cores, making sure to preserve
the load balance and fairly distribute
CPU cycles. The resulting scheduler
is called HASS (heterogeneity-aware
signature-supported), and more details
about its implementation are available
in our earlier publication.11
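The estimation and assignment steps can be sketched as follows. This is an illustration under stated assumptions rather than the HASS implementation: it assumes that time stalled on main memory does not depend on core frequency, that compute time scales inversely with frequency, and it uses made-up values for base CPI and memory latency. The function names and the workload dictionary are hypothetical, and the sketch leaves out load balancing and fair sharing of CPU cycles.

```python
def time_per_instruction_ns(misses_per_instr, freq_ghz,
                            base_cpi=1.0, mem_latency_ns=100.0):
    """Average time per instruction: compute time plus memory stall time.

    base_cpi and mem_latency_ns are illustrative assumptions, not
    measured values.
    """
    compute_ns = base_cpi / freq_ghz               # shrinks on a faster core
    stall_ns = misses_per_instr * mem_latency_ns   # assumed frequency-independent
    return compute_ns + stall_ns


def estimated_speedup(misses_per_instr, fast_ghz=2.3, slow_ghz=1.15):
    """Estimated speedup on the fast core relative to the slow core."""
    return (time_per_instruction_ns(misses_per_instr, slow_ghz)
            / time_per_instruction_ns(misses_per_instr, fast_ghz))


def assign_threads(misses_by_thread, n_fast_cores):
    """Place the threads with the highest estimated speedup on fast cores."""
    ranked = sorted(misses_by_thread,
                    key=lambda name: estimated_speedup(misses_by_thread[name]),
                    reverse=True)
    return ranked[:n_fast_cores], ranked[n_fast_cores:]


# Hypothetical workload: predicted LLC misses per instruction for each thread.
workload = {"cpu_a": 0.0001, "mem_a": 0.02, "cpu_b": 0.0005, "mem_b": 0.03}
fast, slow = assign_threads(workload, n_fast_cores=2)
```

Under these assumptions a thread that never misses in the LLC gets the full frequency ratio as its speedup, and the estimate approaches 1 as the miss rate grows, so the two CPU-bound threads in the example land on the fast cores.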
To evaluate how well the approach
based on architectural signatures
helps the scheduler determine the optimal assignment of threads to cores,
we compare the resulting performance
with that under the best static assignment. A static assignment is one where
the mapping of threads to cores is determined at the beginning of execution of a particular workload and never
changed thereafter. The best static
assignment is not known in advance,
but can be obtained experimentally by
trying all static assignments and picking the one with the best performance.
The best static assignment is the theoretical optimum for our signature-supported algorithm, since the algorithm relies on
static information (the architectural signature) to perform the assignment
and does not change an assignment
once it is determined.
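The exhaustive search for the best static assignment can be sketched as below. In the real experiment each candidate assignment is scored by running the workload on hardware; here the measure callback, the toy scoring function, and the thread and core names are hypothetical stand-ins used only to make the sketch runnable.

```python
from itertools import permutations


def best_static_assignment(threads, cores, measure):
    """Try every one-to-one mapping of threads to cores and keep the best.

    measure(mapping) must return a score where higher is better. In an
    actual experiment it would run the workload under that mapping.
    """
    best_mapping, best_score = None, float("-inf")
    for perm in permutations(cores, len(threads)):
        mapping = dict(zip(threads, perm))
        score = measure(mapping)
        if score > best_score:
            best_mapping, best_score = mapping, score
    return best_mapping, best_score


# Toy stand-in for a measurement: assumed per-thread speedups on a fast core.
toy_speedup = {"cpu_a": 2.0, "cpu_b": 1.9, "mem_a": 1.1, "mem_b": 1.2}


def toy_measure(mapping):
    # A thread contributes its speedup on a fast core and 1.0 on a slow core.
    return sum(toy_speedup[t] if core.startswith("fast") else 1.0
               for t, core in mapping.items())


threads = ["cpu_a", "cpu_b", "mem_a", "mem_b"]
cores = ["fast0", "fast1", "slow0", "slow1"]
best_mapping, best_score = best_static_assignment(threads, cores, toy_measure)
```

With four threads and four cores there are only 24 mappings to try, which is why the best static assignment is practical to obtain experimentally for workloads of this size.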
Figure 8 shows the performance obtained using our signature-supported
algorithm relative to the best static
assignment. We show the overall performance for seven workloads. Each
workload is constructed of four SPEC
CPU2000 applications, two of which
are memory-intensive and two of which
are CPU-intensive. Each workload is
executed on an emulated AMP system
with two fast cores and two slow cores,
so one single-threaded application is
running on each core. The fast cores
run at 2.3GHz, and the slow cores run
at 1.15GHz. We used the AMD Opteron
(Barcelona) system for this experiment.
The best static assignment always
results in running the CPU-intensive
applications on the fast cores and