Figure 6. The percentage by which performance of schedules estimated to be
best according to various modeling techniques varies from the actual best schedules.
Low bars are good.
[Bar chart. Y-axis: Worse than Actual Best (0 to 8). X-axis: 4-core, 6-core,
8-core, and 10-core systems. Series: Pain, Approx-Pain, SDC, Random.]
Figure 7. A breakdown of factors causing performance degradation due to contention
for shared hardware on multicore systems based on tests using select applications in
the SPEC CPU2006 suite.
[Stacked bar chart. Y-axis: 0 to 100. X-axis: gcc, lbm, mcf, milc, soplex, sphinx.
Legend: Prefetch cache/controller bus.]
Figure 8. A breakdown of factors causing performance degradation due to contention
for shared hardware on multicore systems based on tests using select applications in
the SPEC CPU2006 suite. These experiments were performed on an Intel Xeon (Clovertown)
processor. We also obtained data showing that cache contention is not dominant on AMD
Opteron systems.
[Stacked bar chart. Y-axis: Contribution to Total Degradation (0% to 100%).
X-axis: soplex, gcc, lbm, mcf, sphinx, milc. Series: Prefetch, L2, FSB,
Memory controller.]
the evaluation results significantly.
Certainly, the error is not large enough
to lead to a choice of the “wrong” best
schedule.
Evaluation of Modeling Techniques
Here, we present the results obtained using our semi-analytical
methodology, followed by the performance results obtained via
experiments only. Figure 6 compares the actual best schedules (found
using the method described earlier) with the estimated best schedules
constructed using various methods, showing how much performance
degrades when an estimated best schedule is used instead. The blue
bar, labeled Pain, corresponds to the model that uses memory-reuse
profiles to estimate Pain and find the best schedule (using the method
set forth in Figure 4). For the red bar, labeled Approx-Pain, the Pain
for a given application running with another is estimated with the aid
of data obtained online (we explain which data this is at the end of
this section); once Pain has been estimated, we can once again use the
method shown in Figure 4. The SDC bar corresponds to a previously
proposed model based on memory-reuse profiles³ that estimates the
performance degradation of an application when it shares a cache with
a co-runner. This estimated degradation is then used in place of
Pain(A|B); apart from that, the method shown in Figure 4 applies.
Although SDC is rather complex for use in a scheduler, we compared it
with our new models to evaluate how much performance was being
sacrificed by using a simpler model. Finally, the bar labeled Random
shows the results for selecting a random thread placement.
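To make the selection step concrete, here is a minimal Python sketch of
picking an estimated best schedule from pairwise Pain estimates. It is
an illustration under stated assumptions, not the procedure of Figure 4
itself: the function names, the `pain` dictionary layout, and the choice
to score a schedule by summing Pain(A|B) + Pain(B|A) over its co-runner
pairs are all hypothetical. The same code would apply to Approx-Pain or
SDC by substituting their estimates for the values in `pain`.

```python
def pairings(apps):
    """Yield every way to split an even-length list of apps into
    unordered co-runner pairs (two apps per memory domain)."""
    if not apps:
        yield []
        return
    first, rest = apps[0], apps[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def schedule_pain(schedule, pain):
    """Assumed scoring rule: add up the estimated pain each
    application suffers from its co-runner, in both directions."""
    return sum(pain[(a, b)] + pain[(b, a)] for a, b in schedule)


def estimated_best(apps, pain):
    """Return the pairing with the lowest total estimated pain."""
    return min(pairings(apps), key=lambda s: schedule_pain(s, pain))
```

Exhaustive enumeration is only practical for small workloads, since the
number of pairings grows quickly with the number of applications; it is
used here purely to keep the example short.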
Figure 6 shows how much worse the schedule chosen with each method
ended up performing relative to the actual best schedule. This value
was computed using the method shown in Figure 4. Ideally, the
difference from the actual best should be small, so in considering
Figure 6, remember that low bars are good. Results are shown for four
different systems, with four, six, eight, and 10 cores. In all cases
there were two cores per memory domain. (Actual results from a system
with a larger number of cores per memory domain are shown later.)
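As a rough illustration of the "worse than actual best" metric, the
Python sketch below compares the average measured degradation of a
chosen schedule with that of the actual best schedule. The exact
formula belongs to Figure 4, which is not reproduced here, so the
relative-difference definition, the function names, and the `measured`
dictionary layout are assumptions made for this example only.

```python
def average_degradation(schedule, measured):
    """Mean measured degradation over every application in the
    schedule; measured[(a, b)] is a's slowdown when paired with b."""
    per_app = []
    for a, b in schedule:
        per_app.append(measured[(a, b)])
        per_app.append(measured[(b, a)])
    return sum(per_app) / len(per_app)


def percent_worse_than_best(chosen, actual_best, measured):
    """Assumed metric: relative increase, in percent, of the chosen
    schedule's average degradation over the actual best schedule's."""
    chosen_avg = average_degradation(chosen, measured)
    best_avg = average_degradation(actual_best, measured)
    return (chosen_avg / best_avg - 1.0) * 100.0
```

Under this definition a method that picks the actual best schedule
scores zero, which matches the reading that low bars are good.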
Each bar represents the average for all
the benchmark pairings that could be
constructed out of our 10 representa-