Power DI uses an experimentally derived threshold of 1,000 misses per million instructions; an application whose
LLC miss rate exceeds that amount is
considered memory intensive.
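As a minimal sketch of this classification rule, the following Python normalizes a raw LLC miss count to misses per million instructions and compares it with the 1,000 threshold. The function and constant names are illustrative assumptions, not part of Power DI; in practice the two counts would come from hardware performance counters.

```python
# Threshold from the text: 1,000 LLC misses per million instructions.
MISS_THRESHOLD_PER_MILLION = 1_000


def misses_per_million(llc_misses, instructions):
    """Normalize a raw LLC miss count to misses per million instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return llc_misses * 1_000_000 / instructions


def is_memory_intensive(llc_misses, instructions,
                        threshold=MISS_THRESHOLD_PER_MILLION):
    """Return True when the miss rate strictly exceeds the threshold."""
    return misses_per_million(llc_misses, instructions) > threshold


# Hypothetical counter readings, for illustration only.
print(is_memory_intensive(llc_misses=2_500, instructions=1_000_000))
print(is_memory_intensive(llc_misses=400, instructions=1_000_000))
```

The comparison is strict because the text says the miss rate must exceed the threshold; an application sitting exactly at 1,000 is not classified as memory intensive in this sketch.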
Because we did not have a data-center setup available to evaluate
this algorithm, we simulated a multi-server environment in the following
way. The in-house AKULA scheduling
simulator created a schedule for a given workload on a specified data-center
setup, which in this case consisted of
16 eight-core systems, assumed by
the simulator to be Intel Xeon dual
quad-core servers. Once the simulated
scheduler decided how to assign applications across machines and memory domains within each machine, we
computed the performance of the entire workload from the performance
of the applications assigned by the
scheduler to each eight-core machine.
The performance on a single system
could be easily measured via experimentation. This simulation method
was appropriate for our environment:
there was no network communication among the running applications, so inferring the overall performance from the performance
of the individual systems was reasonable.
To estimate the power consumption, we used a rather simplistic model
(measurements with an actual power
meter are still under way) that nonetheless captures
the right relationships among the power
consumed under various load conditions.
We assumed that a memory domain
where all the cores are running applications consumes one unit of power. A
memory domain where one of its two
cores is busy consumes 0.75 units of
power. A memory domain where all
cores are idle is assumed to be in a very
low power state and thus consumes 0
units of power. We did not model the
latency of power-state transitions.
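The power model above can be sketched in a few lines of Python. This is an illustration rather than the authors' simulator code: the names are hypothetical, and it assumes two cores per memory domain, which is what "one of two cores" implies. Under that assumption, the 16 eight-core machines provide 64 memory domains.

```python
# Power units per memory domain, keyed by the number of busy cores.
# The values come from the model in the text; two cores per domain is assumed.
POWER_UNITS = {0: 0.0, 1: 0.75, 2: 1.0}


def domain_power(busy_cores):
    """Power drawn by one memory domain with the given number of busy cores."""
    if busy_cores not in POWER_UNITS:
        raise ValueError(f"unsupported busy-core count: {busy_cores}")
    return POWER_UNITS[busy_cores]


def total_power(busy_cores_per_domain):
    """Sum the power over all memory domains; transition latency is ignored."""
    return sum(domain_power(b) for b in busy_cores_per_domain)


# 64 applications on 64 two-core domains, placed in two extreme ways.
spread = total_power([1] * 64)                # one application per domain
clustered = total_power([2] * 32 + [0] * 32)  # two per domain, the rest idle
print(spread, clustered)
```

With these figures, spreading the 64 applications costs 48 units while fully clustering them costs 32, which is the power saving that clustering trades against possible contention.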
We constructed a workload of 64
SPEC CPU2006 applications randomly
drawn from the benchmark suite. We
varied the fraction of memory-intensive
applications in the workload from
zero to 100%. The effectiveness of
scheduling strategies differed according
to the number of memory-intensive
applications. For example, if there
were no memory-intensive applications,
it was perfectly fine to cluster all
the applications to the greatest extent
possible. Conversely, if all the applications
were memory intensive, then the
best policy was to spread them across
memory domains so that no two applications
would end up running on
the same memory domain. An intelligent
scheduling policy must be able to
decide to what extent clustering must
be performed given the workload at
hand.
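To make the clustering trade-off concrete, here is a deliberately simplified placement sketch in Python. It is not the Power DI implementation: it assumes two cores per memory domain, gives every memory-intensive application a domain of its own, and packs the remaining applications two per domain. All names are hypothetical.

```python
def place(apps, cores_per_domain=2):
    """Assign applications to memory domains.

    apps is a list of (name, memory_intensive) pairs. Memory-intensive
    applications are spread (one per domain); the others are clustered
    into as few domains as possible. Returns a list of domains, each a
    list of application names.
    """
    domains = []
    open_domain = None  # domain currently being filled with clustered apps
    for name, memory_intensive in apps:
        if memory_intensive:
            domains.append([name])  # never shares its domain in this sketch
        elif open_domain is None or len(open_domain) == cores_per_domain:
            open_domain = [name]
            domains.append(open_domain)
        else:
            open_domain.append(name)
    return domains


# Hypothetical workload: two memory-intensive and three other applications.
workload = [("a", False), ("b", True), ("c", False), ("d", False), ("e", True)]
print(place(workload))
```

With no memory-intensive applications the sketch degenerates to full clustering, and with only memory-intensive applications it degenerates to full spreading, matching the two extremes described above.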
Conclusion
Contention for shared resources significantly impedes the efficient operation of multicore systems. Our research has provided new methods for
mitigating contention via scheduling
algorithms. Although it was previously thought that the most significant
reason for contention-induced performance degradation had to do with
shared cache contention, we found
that other sources of contention—
such as shared prefetching hardware
and memory interconnects—are just
as important. Our heuristic—the LLC
miss rate—proves to be an excellent
predictor for all types of contention.
Scheduling algorithms that use this
heuristic to avoid contention have the
potential to reduce the overall completion time for workloads, avoid poor
performance for high-priority applications, and save power without sacrificing performance.
Related articles
on queue.acm.org
Maximizing Power Efficiency with
Asymmetric Multicore Systems
Alexandra Fedorova, Juan Carlos Saez,
Daniel Shelepov, and Manuel Prieto
http://queue.acm.org/detail.cfm?id=1658422
The Future of Microprocessors
Kunle Olukotun
http://queue.acm.org/detail.cfm?id=1095418
Alexandra Fedorova is an assistant professor of
computer science at Simon Fraser University in Vancouver,
Canada, where she co-founded the SyNAR (Systems,
Networking and Architecture) research lab. Her research
interests span operating systems and virtualization
platforms for multicore processors, with a specific focus
on scheduling. Recently she started a project on tools and
techniques for parallelization of video games, which has
led to the design of a new language for this domain.
Sergey Blagodurov is a Ph.D. student in computer
science at Simon Fraser University, Vancouver, Canada.
His research focuses on operating-system scheduling
on multicore processors and exploring new techniques
to deliver better performance on non-uniform memory
access (NUMA) multicore systems.
Sergey Zhuravlev is a Ph.D. student in computer science
at Simon Fraser University, Vancouver, Canada. His
recent research focuses on scheduling on multiprocessor
systems to avoid shared resource contention as well as
simulating computing systems.