tive benchmarks on a system with a
given number of cores, such that there
is exactly one benchmark per core. On
four- and six-core systems, there were
210 such combinations, whereas an
eight-core system had 45 combinations, and a 10-core system had only
one combination. For each combination, we predicted the best schedule.
The average performance degradation
from the actual best for each of these
estimated best schedules is reported in
each bar in the figure.
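These counts are simply binomial coefficients. Assuming a pool of 10 benchmarks, which the reported counts imply, they can be checked in a few lines of Python:

# Number of ways to pick one benchmark per core from a pool of 10
# benchmarks (the pool size of 10 is inferred from the counts above).
from math import comb

for cores in (4, 6, 8, 10):
    print(cores, "cores:", comb(10, cores), "combinations")
# 4 cores: 210, 6 cores: 210, 8 cores: 45, 10 cores: 1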
The first thing we learned from the
metric in Figure 5 was that the Pain
model is effective for helping the scheduler find the best thread assignment. It
produces results that are within 1% of
the actual best schedule. (The effect on
the actual execution time is explored
later). We also found that choosing
a random schedule produces significantly worse performance, especially
as the number of cores grows. This is
significant in that a growing number
of cores is the expected trend for future multicore systems.
Figure 6 also indicates that the Pain model, approximated by way of an online metric, works very well, coming within
just 3% of the actual best schedule. At
the same time, the SDC, a well-proven
model from an earlier study, turns out
to be less accurate. These results—
both the effectiveness of the approximated Pain metric and the disappointing performance of the older SDC
model—were quite unexpected. Who
could have imagined that the best way
to approximate the Pain metric would
be to use the LLC miss rate? In other
words, the LLC miss rate of a thread is
the best predictor of both how much
the thread will suffer from contention
(its sensitivity) and how much it will
hurt others (its intensity). As explained
at the beginning of this article, while
there was limited evidence indicating
that the miss rate predicts contention,
it ran counter to the memory-reuse-based approach, which was supported
by a much larger body of evidence.
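To make this concrete, here is a minimal sketch, in Python, of how miss rates could stand in for both sensitivity and intensity when scoring candidate thread placements. The pairwise-product form of the pain estimate, the benchmark names, and the miss-rate numbers are illustrative assumptions, not the exact formula used in our study.

from itertools import permutations

# Hypothetical per-thread LLC miss rates (misses per 1,000 instructions).
miss_rate = {"mcf": 25.0, "milc": 18.0, "namd": 0.4, "povray": 0.1}

def pair_pain(a, b):
    # A thread's sensitivity (how much it suffers) and its intensity (how
    # much it hurts its neighbor) are both proxied by its LLC miss rate,
    # so the pain of co-scheduling a and b is counted in both directions.
    return miss_rate[a] * miss_rate[b] + miss_rate[b] * miss_rate[a]

def schedule_pain(domains):
    # Total predicted pain of a placement: sum over each memory domain
    # (here, a pair of cores sharing an LLC) of the pain of its two threads.
    return sum(pair_pain(a, b) for a, b in domains)

# Enumerate all ways to split four threads into two 2-core memory domains
# and pick the placement with the lowest predicted pain.
threads = list(miss_rate)
placements = [((threads[0], rest[0]), (rest[1], rest[2]))
              for rest in permutations(threads[1:])]
best = min(placements, key=schedule_pain)
print(best, schedule_pain(best))
# (('mcf', 'povray'), ('milc', 'namd')) 19.4

The placement that separates the two high-miss-rate threads wins, which is exactly the behavior the approximated Pain metric is meant to encourage.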
Our investigation of this paradox led us to examine the causes of contention on multicore systems. We performed several experiments that aimed to isolate and quantify the degree of contention for various types of shared resources: the cache, the memory controller, the bus, and the prefetching hardware. The precise setup of these experiments is described in another study.10 We arrived at the following conclusion: memory-intensive applications, meaning those with high LLC miss rates, must be kept apart.
That is, they should not be co-scheduled in the same memory domain. Although some researchers have already
suggested this approach, it is not well
understood why using the miss rate
as a proxy for contention ought to be
effective, particularly in that it contradicts the theory behind the popular
memory-reuse model. Our findings
should help put an end to this controversy.
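One straightforward way to act on this finding is to balance the aggregate miss rate across memory domains, pairing the most intensive threads with the least intensive ones. The following sketch, with made-up miss rates and a simple snake-order distribution, illustrates that separation policy; it is not the exact algorithm used in our prototypes.

# Spread high-miss-rate threads across memory domains by pairing each
# intensive thread with a non-intensive partner (illustrative policy only).
def separate(miss_rates, domains):
    # miss_rates: {thread: LLC miss rate}; domains: number of memory domains.
    ranked = sorted(miss_rates, key=miss_rates.get, reverse=True)
    assignment = {d: [] for d in range(domains)}
    # Deal threads out in snake order: most intensive first, one per domain,
    # then fill the domains back up in reverse order with the rest.
    for i, t in enumerate(ranked):
        rnd, pos = divmod(i, domains)
        d = pos if rnd % 2 == 0 else domains - 1 - pos
        assignment[d].append(t)
    return assignment

print(separate({"mcf": 25.0, "lbm": 20.0, "namd": 0.4, "povray": 0.1}, 2))
# {0: ['mcf', 'povray'], 1: ['lbm', 'namd']}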
Based on this new knowledge, we
have built a prototype of a contention-aware scheduler that measures the threads' miss rates online and decides how to place threads on cores
based on that information. Here, we
present some experimental data showing the potential impact of this contention-aware scheduler.
Based on our understanding of contention on multicore processors, we have
built a prototype of a contention-aware
scheduler for multicore systems called
Distributed Intensity Online (DIO).
The DIO scheduler distributes intensive applications across memory domains (and by intensive we mean those with high LLC miss rates) after measuring the applications' miss rates online.
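Obtaining the miss rates online does not require kernel changes; on Linux, for example, they can be sampled from user level with the perf tool. The sketch below shells out to perf stat and computes misses per 1,000 instructions for a given process. The chosen events, the CSV field positions, and the one-second sampling window are assumptions that may need adjusting for a particular perf version and processor.

import subprocess

def llc_miss_rate(pid, interval=1):
    # Sample LLC misses and instructions for `pid` over `interval` seconds
    # using perf stat in CSV mode (-x,); perf writes its results to stderr.
    cmd = ["perf", "stat", "-x,", "-e", "LLC-load-misses,instructions",
           "-p", str(pid), "--", "sleep", str(interval)]
    out = subprocess.run(cmd, capture_output=True, text=True).stderr
    counts = {}
    for line in out.splitlines():
        fields = line.split(",")        # value, unit, event name, ...
        if len(fields) > 2 and fields[0].strip().isdigit():
            counts[fields[2]] = int(fields[0])
    misses = counts.get("LLC-load-misses", 0)
    instructions = counts.get("instructions", 1)
    return 1000.0 * misses / instructions    # misses per 1,000 instructions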
Another prototype scheduler, called
Power Distributed Intensity (Power
DI), is intended for scheduling the applications in a workload across multiple
machines in a data center. One of its
goals is to save power by determining
how to employ as few systems as possible without hurting performance.
The following are performance results
of these two schedulers.
Distributed Intensity Online.
Different workloads offer different opportunities to achieve performance
improvements through the use of a contention-aware scheduling policy. For
example, a workload consisting of non-memory-intensive applications (those
with low cache miss rates) will not experience any performance improvement
since there is no contention to alleviate
in the first place. Therefore, for our experiments we constructed eight-application workloads containing from two
to six memory-intensive applications.
We picked eight workloads in total, all
consisting of SPEC CPU2006 applications, and then executed them under
the DIO and the default Linux sched-