one depends on the microarchitectural
properties of the program [2, 5, 6, 11]. Some
programs are very efficient at using the
CPU pipeline: they have a high degree
of instruction-level parallelism, meaning that a processor can issue many instructions in parallel without running
out of work. These programs also show good
locality of memory accesses. As a result,
they rarely access main memory and
thus rarely stall the processor. We refer
to these programs as CPU-intensive.
At the other extreme are programs
that use the CPU pipeline very inefficiently. They typically have a high processor cache-miss rate and thus stall
the CPU pipeline, because they have to
wait while their data is being fetched
from main memory. We refer to these
programs as memory-intensive. (Note
that this is not the same as an I/O-bound application, which often relinquishes the CPU when it must perform
device I/O. A memory-intensive application might run on the CPU 100% of its
allotted time, but it would use the CPU
inefficiently.)
CPU-intensive programs use the
hardware of fast cores very efficiently;
thus, they run much faster on fast cores than on slow
cores. Memory-intensive applications, on the other hand, derive relatively little benefit from running on fast
cores. Figure 1 shows some example
speedup ratios of applications in the
SPEC CPU2000 suite on an emulated
AMP system. The AMP was emulated on an SMP using dynamic
frequency scaling. The fast core was
emulated by running a core at 2.3GHz;
the slow core, by running a core at
1.15GHz. Note that some
applications experience a 2× speedup,
which matches the ratio of CPU frequencies between
the two cores. These are the CPU-intensive applications that have a high
utilization of the processor’s pipeline
functional units. Other applications
experience only a fraction of the achievable speedup. These are the memory-intensive applications that often stall the
CPU as they wait for data to arrive from
the main memory, so increasing the
frequency of the CPU does not directly
translate into better performance for
them. For example, equake, a memory-intensive
application, speeds up by only
25% when running on the fast core.
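The relationship between memory stalls and speedup can be illustrated with a back-of-the-envelope model. The sketch below assumes, as a simplification that is not taken from the measurements above, that run time splits into a part that scales with core frequency and a memory-stall part that does not. The function and parameter names are hypothetical.

```python
def speedup(freq_scaled_fraction: float, freq_ratio: float) -> float:
    """Speedup on the fast core under a two-component run-time model.

    freq_scaled_fraction: share of slow-core run time spent on work that
        scales with clock frequency (assumed model parameter, 0..1).
    freq_ratio: fast-core frequency divided by slow-core frequency.
    """
    if not 0.0 <= freq_scaled_fraction <= 1.0:
        raise ValueError("freq_scaled_fraction must be in [0, 1]")
    if freq_ratio <= 0.0:
        raise ValueError("freq_ratio must be positive")
    # Memory-stall time is assumed to be unaffected by core frequency.
    stall_fraction = 1.0 - freq_scaled_fraction
    fast_time = freq_scaled_fraction / freq_ratio + stall_fraction
    return 1.0 / fast_time


def freq_scaled_fraction_for(observed_speedup: float, freq_ratio: float) -> float:
    """Invert the model: which fraction would explain an observed speedup?"""
    if observed_speedup <= 0.0 or freq_ratio <= 1.0:
        raise ValueError("need observed_speedup > 0 and freq_ratio > 1")
    return (1.0 - 1.0 / observed_speedup) / (1.0 - 1.0 / freq_ratio)


# Frequencies of the emulated fast and slow cores, in GHz.
RATIO = 2.3 / 1.15

print(speedup(1.0, RATIO))                    # fully frequency-bound program
print(freq_scaled_fraction_for(1.25, RATIO))  # a 25% speedup, as for equake
```

Under this model, a program whose run time scales entirely with frequency reaches the full 2× speedup, while a 25% speedup corresponds to only about 40% of the slow-core run time scaling with frequency. The model ignores cache and memory-system effects, so it is an intuition aid rather than a predictor.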
For systemwide efficiency, it is
more profitable to run CPU-intensive
programs on fast cores and memory-intensive programs on slow cores. This
is what catering to microarchitectural
diversity of the workload is all about.
Recent work from the University of California, San Diego and HP demonstrated
that AMP systems can offer up to 63%
better performance than can an SMP
that is comparable in area and power,
provided that the operating system employs a scheduling policy that caters to
the microarchitectural diversity of the
workload [6].
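The placement rule described above can be sketched in a few lines: rank threads by how much they speed up on a fast core and give the fast cores to the highest-ranked ones. This is a minimal illustration, not the policy from the cited work; it assumes the per-thread speedup ratios are already known, and all names and values other than the 1.25 figure for equake are hypothetical.

```python
from typing import Dict, List, Tuple


def assign_by_speedup(
    speedups: Dict[str, float], num_fast_cores: int
) -> Tuple[List[str], List[str]]:
    """Split threads into (fast-core, slow-core) groups by speedup ratio.

    speedups maps a thread name to its fast-core/slow-core speedup ratio.
    """
    if num_fast_cores < 0:
        raise ValueError("num_fast_cores must be non-negative")
    # Threads that gain the most from a fast core come first.
    ranked = sorted(speedups, key=lambda name: speedups[name], reverse=True)
    return ranked[:num_fast_cores], ranked[num_fast_cores:]


# Illustrative workload: only the equake ratio comes from the text.
workload = {"cpu_bound_a": 2.0, "equake": 1.25, "mixed_b": 1.6}
fast, slow = assign_by_speedup(workload, num_fast_cores=1)
print("fast cores:", fast)
print("slow cores:", slow)
```

With one fast core, the thread with the 2.0 ratio is placed on it and the memory-intensive equake stays on a slow core. In a real system the speedup ratios are not given in advance and would have to be estimated online.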
Asymmetry-Aware Scheduling
Employing specialization is the key to
realizing the potential of AMP systems.
Specialization on AMP systems will not
be delivered by the hardware; it is up
to the software to employ asymmetry-aware scheduling policies that tailor
asymmetric cores to the instruction
streams that use them most efficiently.
A thread scheduler must be aware of the
asymmetric properties of the system
and assign threads to cores in consideration of the characteristics of both.
In this section we report on our experience in designing and implementing
such asymmetry-aware scheduling algorithms in a real operating system. We
first describe an algorithm that caters
to diversity in thread-level parallelism
and then an algorithm that caters to diversity in the workload’s microarchitectural properties.
A Scheduler Catering to Diversity in
Thread-Level Parallelism. The idea behind our parallelism-aware (PA) scheduler is simple: it assigns threads running
sequential applications or sequential
phases of parallel applications to run on
fast cores and threads running highly
scalable parallel code to run on slow
cores. The following example demonstrates that the PA scheduling policy can
achieve a much better system efficiency
than an asymmetry-unaware policy. We
emphasize that the goal in using the PA
policy is to maximize systemwide efficiency, not to improve performance of
particular applications. As a result, this
policy is inherently unfair: some
threads will have a higher priority than
others in running on fast cores. Implications of a policy that equally shares
fast cores among all threads are demonstrated in our earlier study [9].
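The core idea of the PA policy can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation described in this article: it assumes that an application with exactly one runnable thread is sequential or in a sequential phase, and the `Thread` type and `pa_assign` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Thread:
    tid: int
    app: str
    runnable: bool = True


def pa_assign(threads: List[Thread], num_fast_cores: int) -> Tuple[List[int], List[int]]:
    """Return (fast-core tids, slow-core tids) for the runnable threads."""
    # Count runnable threads per application (assumed proxy for parallelism).
    counts: Dict[str, int] = {}
    for t in threads:
        if t.runnable:
            counts[t.app] = counts.get(t.app, 0) + 1

    runnable = [t for t in threads if t.runnable]
    # A lone runnable thread is treated as sequential code.
    sequential = [t for t in runnable if counts[t.app] == 1]
    fast = sequential[:num_fast_cores]
    fast_tids = {t.tid for t in fast}
    # Everything else, including overflow sequential threads, runs on slow
    # cores. This sketch does not backfill idle fast cores.
    slow = [t for t in runnable if t.tid not in fast_tids]
    return [t.tid for t in fast], [t.tid for t in slow]


threads = [
    Thread(1, "a"),                  # sequential application
    Thread(2, "b"), Thread(3, "b"), Thread(4, "b"),  # parallel phase
    Thread(5, "c"), Thread(6, "c", runnable=False),  # sequential phase of "c"
]
fast_tids, slow_tids = pa_assign(threads, num_fast_cores=2)
print("fast:", fast_tids)
print("slow:", slow_tids)
```

Here threads 1 and 5 receive the two fast cores, while the three threads of the parallel application share the slow cores. The unfairness noted above is visible in the sketch: threads of application "b" never reach a fast core while sequential threads are present.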