rithms for operating systems that wish
to take full advantage of AMPs. We also
discuss our experience in investigating
the design of such algorithms and describe a few surprising findings about
the implementation of these scheduling strategies,
which we think will be important to
those who are developing or adapting
an operating system for this upcoming
class of processor and platform.
Specialization on AMPs
Efficiency of AMP systems could be
improved using two kinds of specialization: the first caters to diversity in
thread-level parallelism; the second
caters to microarchitectural diversity of
the workload.
Catering to diversity in thread-level
parallelism. Diversity in thread-level
parallelism refers to the two broad categories into which applications can be
classified: scalable parallel applications and sequential applications. Scalable parallel applications use multiple
threads of execution, and increasing
the number of threads typically leads
to reduced execution time or increased
amount of work performed in a unit
of time. Sequential applications, on
the other hand, typically use only one
or a small number of threads, and it is
difficult to structure their computation
so that it runs efficiently in a multithreaded environment. In addition to
purely parallel or purely sequential applications,
there is a hybrid type, where
an application might have phases of
highly parallel execution intermixed
with sequential phases.
These two types of applications
require different types of processing
cores to achieve the best trade-off in
performance and energy consumption.
Suppose we have a scalable parallel application with a choice of running it on
a processor either with a few complex
and powerful cores or with many simple
low-power cores. For example, suppose
we have a processor with four complex
and powerful cores and another area-equivalent and power-budget-equivalent processor consisting of 16 simple,
low-power cores. Suppose further that
each simple core delivers roughly half
the performance of one complex core.
(The conversion ratios for performance and power between
complex and simple cores were taken
from Hill and Marty.4) We configure the
number of threads in the application
to equal the number of cores, which is
a standard practice for compute-intensive applications. If we run this parallel application on the processor with
complex cores, then each thread will
run roughly twice as fast as a thread
running on the processor with simple
cores (assuming that threads are CPU-intensive and that synchronization and
other overhead is negligible), but we
can use only four threads on the complex-core processor vs. 16 threads on
the simple-core processor. Since using
additional threads results in a proportional performance improvement in
this application, we get twice as much
performance running on a simple-core
processor as on a complex-core processor. Recalling that these two processors
use the same power budget, we achieve
twice as much performance per watt.

Figure 1. Relative speedup experienced by applications from the SPEC CPU2000
benchmark suite from running on a fast core (2.3GHz) vs. a slow core (1.15GHz) of an
emulated AMP system. The maximum achievable speedup is a factor of 2. The more
memory-intensive the application is, the less speedup it experiences. More details on the
experimental setup can be found in Shelepov.11
[Bar chart. Vertical axis: speedup factor on a 2.3GHz core vs. a 1.15GHz core, from 1.00 to 2.00. Horizontal axis: SPEC CPU2000 benchmarks.]
Contrast this to running a sequential application, which cannot increase
its performance by using additional
threads. Therefore, using a single
thread, it will run half as fast on a
simple-core processor as on a complex-core processor, meaning we get
twice as much performance per watt
running on the complex-core system.
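The arithmetic behind both cases can be checked with a small sketch. It is only an illustration of the example above, not code from any scheduler: the `Chip` class, the `perf_per_watt` helper, and the normalized power budget of 1.0 are hypothetical names and units introduced here. It assumes, as the text does, perfect scaling for CPU-bound threads and that each chip draws its full power budget regardless of how many cores are busy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Hypothetical description of one processor in the example."""
    cores: int
    perf_per_core: float       # relative to one complex core
    power_budget: float = 1.0  # both chips share the same budget (assumed unit)


COMPLEX = Chip(cores=4, perf_per_core=1.0)
SIMPLE = Chip(cores=16, perf_per_core=0.5)


def perf_per_watt(chip: Chip, threads: int) -> float:
    """Aggregate performance per unit of power for a CPU-bound application.

    Assumes ideal scaling (no synchronization overhead) and that idle
    cores are not powered down, so the whole budget is always consumed.
    """
    busy_cores = min(threads, chip.cores)
    return busy_cores * chip.perf_per_core / chip.power_budget


# Scalable parallel application: one thread per core on each chip.
parallel_gain = perf_per_watt(SIMPLE, 16) / perf_per_watt(COMPLEX, 4)

# Sequential application: a single thread on each chip.
sequential_gain = perf_per_watt(COMPLEX, 1) / perf_per_watt(SIMPLE, 1)

print(parallel_gain, sequential_gain)  # prints "2.0 2.0"
```

Under these assumptions the simple-core chip wins by a factor of 2 for the parallel application and the complex-core chip wins by the same factor for the sequential one.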
An experienced reader will observe
that power consumption on a simple-core system for this single-application
workload could be reduced by turning
off unused cores. Unfortunately, it is
not always possible to turn off unused
cores completely, especially if they are
located in the same power domain as
the active cores. Furthermore, an operating-system power manager may be
configured to avoid putting the unused
cores in a deep sleep state, because
bringing the cores up from this state
takes time. Thus, if a new application
begins running or if the operating system needs a core for execution, then additional latency will be incurred while
the dormant core is being brought back
to the active power state.
This example demonstrates that applications with different levels of parallelism require different types of cores to
achieve the optimal performance-per-watt ratio. AMP systems offer the potential to resolve this dilemma by providing
cores of both types. Another advantage of having both “fast” and “slow”
cores in the system is that the fast ones
can be used to accelerate sequential
phases of parallel applications, mitigating the effect of sequential bottlenecks
and reducing effective serialization. Hill
and Marty demonstrated that for parallel applications with sequential phases
AMPs can potentially offer performance
significantly better than SMPs, even when
sequential phases constitute as little
as 5% of the code.4
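To make this comparison concrete, here is a minimal sketch of the speedup model from Hill and Marty's analysis as it is commonly stated. The function names are ours, and the choice of a 16-unit chip budget with one 4-unit fast core is an assumption made to match the earlier example. The model assumes a core built from r base-core units of area delivers sqrt(r) times the performance of a base core, which agrees with the "one complex core equals four simple cores in area and two in performance" ratio used above.

```python
import math


def perf(r: float) -> float:
    """Performance of a core built from r base-core units (assumed sqrt model)."""
    return math.sqrt(r)


def speedup_symmetric(f: float, n: int, r: int) -> float:
    """Speedup of a symmetric chip: n units split into n/r cores of size r.

    f is the parallelizable fraction of the work; the sequential part runs
    on one core, the parallel part on all n/r cores.
    """
    sequential = (1 - f) / perf(r)
    parallel = f * r / (perf(r) * n)
    return 1.0 / (sequential + parallel)


def speedup_asymmetric(f: float, n: int, r: int) -> float:
    """Speedup of an asymmetric chip: one core of size r plus n - r base cores.

    The sequential part runs on the big core; the parallel part uses the
    big core and all base cores together.
    """
    sequential = (1 - f) / perf(r)
    parallel = f / (perf(r) + n - r)
    return 1.0 / (sequential + parallel)


if __name__ == "__main__":
    f, n = 0.95, 16  # 5% sequential code, 16 base-core units of area
    print("16 simple cores:        ", speedup_symmetric(f, n, 1))
    print("4 complex cores:        ", speedup_symmetric(f, n, 4))
    print("1 complex + 12 simple:  ", speedup_asymmetric(f, n, 4))
```

With 5% sequential code, this model gives a speedup of roughly 9.1 for 16 simple cores, roughly 7.0 for four complex cores, and roughly 10.8 for the asymmetric configuration. The sketch ignores scheduling, synchronization, and memory effects.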
Catering to microarchitectural diversity of the workload. The relative benefit
that an application derives from running on a fast core rather than a slow