best schedule. Figure 4 is a summary of
this procedure.
The estimated best schedule can
be obtained either by using the Pain
metric constructed via actual memory-reuse profiles or by approximating the
Pain metric using online data.
Once the best schedule has been estimated, we must compare the performance of the workload in the estimated best schedule with the performance
achieved in the actual best schedule.
The most direct way of doing this is to
run the estimated best schedule on real
hardware and compare its performance
with that of the actual best schedule,
which can be obtained by running all
schedules on real hardware and then
choosing the best one. Although this
is the most direct approach (which
we used for some experiments in the
study), it limited the number of workloads we could test because running all
possible schedules for a large number
of workloads is time consuming.
To evaluate a large number of workloads in a short amount of time, we
invented a semi-analytical evaluation
methodology that relies partially on
data obtained from tests on a real system and otherwise applies analytical
techniques. Using this approach, we
selected 10 benchmark applications
from the SPEC CPU2006 suite to use in
the evaluation. They were chosen using
the minimum spanning-tree-cluster-ing method to ensure the applications
represented a variety of memory-access
patterns.
We then ran all possible pairings of
these applications on the experimental
platform, a Quad-Core Intel Xeon system, where each Quad-Core processor
looked like the system depicted in Figure 1. In addition to running each possible pair of benchmark applications
on the same memory domain, we ran
each benchmark alone on the system.
This gave us the measure for the actual
degradation in performance for each
benchmark as a consequence of sharing a cache with another benchmark,
as opposed to running solo. Recall that
degradation relative to performance in
the solo mode is precisely the quantity
approximated by the Pain metric. So by
comparing the scheduling assignment
constructed based on the actual degradation to that constructed based on
the Pain metric, we can evaluate how
in evaluating the
new models for
cache contention,
our goal was to
determine how
effective the
models would be
for constructing
contention-free
thread schedules.
good the Pain metric is in finding good
scheduling assignments.
Having the actual degradations enabled us to construct the actual best
schedule using the method shown in
Figure 5. The only difference was that,
instead of using the model to compute
Pain(A,B), we used the actual performance degradation that we had measured on a real system (with Pain being
equal to the sum of degradation of A
running with B relative to running solo
and the degradation of B running with
A relative to running solo).
Once we knew the actual best schedule, we needed a way to compare it with
the estimated best schedule. The performance metric was the average degradation relative to solo execution for
all benchmarks. For example, suppose
the estimated best schedule was {(A,B),
(C,D)}, while the actual best schedule
was {(A,C), (B,D)}. We computed the
average degradation for each schedule to find the difference between the
degradation in the estimated best versus the degradation in the actual one,
as indicated in Figure 5. The notation
Degrad(A|B) refers to the measured
performance degradation of A when
running alongside B, relative to A running solo.
This illustrates how to construct estimated best and actual best schedules
for any four-application workload on a
system with two memory domains so
long as actual pair-wise degradations
for any pair of applications have been
obtained on the experimental system.
Using the same methodology, we can
evaluate this same model on systems
with a larger number of memory domains. In that case, the number of possible schedules grows, but everything
else in the methodology remains the
same. In using this methodology, we
assumed that the degradation for an
application pair (A,B) would be the
same whether it were obtained on a
system where only A and B were running or on a system with other threads,
such as (C,D) running alongside (A,B)
on another domain. This is not an entirely accurate assumption since, with
additional applications running, there
will be a higher contention for the
front-side bus. Although there will be
some error in estimating schedule-av-erage degradations under this method,
the error is not great enough to affect