hit rate by improving the replacement policy (e.g., Jaleel et al.14 among many others). All these attempt to exploit different heuristics of program behavior to predict a block's re-reference interval and mirror the Belady-optimal4 policy as closely as possible. While CCWS also attempts to
maximize cache efficiency, it does so by shortening the
re-reference interval rather than by predicting it. CCWS
has to balance the shortening of the re-reference interval by limiting the number of eligible wavefronts while
still maintaining sufficient multithreading to cover
most of the memory and operation latencies. Other
schemes attempt to manage interference among heterogeneous workloads,12,21 but each thread in our workload has roughly similar characteristics. Recent work
has explored the use of prefetching on GPUs. Prefetching cannot improve performance when an application is bandwidth limited, whereas CCWS can help in such cases by reducing off-chip traffic. Concurrent to our
work, Jaleel et al.13 propose the CRUISE scheme, which
uses LLC utility information to make high-level scheduling decisions in multiprogrammed chip multiprocessors
(CMPs). Our work focuses on the first level cache in a
massively multithreaded environment and is applied at a
much finer grain. Scheduling decisions made by CRUISE
tie programs to cores, whereas CCWS makes issue-level
decisions on which bundle of threads should enter the
execution pipeline next.
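To make the difference in granularity concrete, the sketch below illustrates an issue-stage decision of the kind CCWS makes every cycle: only wavefronts currently marked eligible by a locality-aware throttle may issue, whereas a CRUISE-style scheduler assigns whole programs to cores and leaves the per-cycle choice untouched. The structure and field names are illustrative assumptions, not taken from either scheme.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-wavefront state; field names are our own, not taken
// from the paper or from any real GPU pipeline.
struct Wavefront {
    uint32_t id;
    bool     ready;         // has an instruction whose operands are available
    bool     ccws_eligible;  // allowed to issue by the locality-aware throttle
};

// Issue-level decision made every cycle: pick the next wavefront that is
// both ready and currently eligible.  A CRUISE-style scheduler would instead
// make one coarse decision (which program runs on which core) and never
// intervene at this granularity.
int select_wavefront(const std::vector<Wavefront>& wfs, int last_issued) {
    int n = static_cast<int>(wfs.size());
    for (int i = 1; i <= n; ++i) {            // round-robin, but only over
        int idx = (last_issued + i) % n;      // the eligible subset
        if (wfs[idx].ready && wfs[idx].ccws_eligible)
            return idx;
    }
    return -1;  // nothing can issue this cycle
}
```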
Others have also studied issue-level scheduling algorithms on GPUs. Lakshminarayana and Kim17 explore
numerous warp scheduling policies in the context of a GPU
without hardware-managed caches and show that, for applications that execute symmetric (balanced) dynamic instruction counts per warp, a fairness-based warp and DRAM
access scheduling policy improves performance. Several works
have explored the effect of two-level scheduling on GPUs.
A two-level scheduler exploits inter-wavefront locality while
ensuring wavefronts reach long latency operations at different times by scheduling groups of wavefronts together.
However, Figure 2 demonstrates that the highly cache-sensitive benchmarks we studied will benefit more from
exploiting intra-wavefront locality than inter-wavefront
locality. Previous schedulers do not take into account the
effect issuing more wavefronts has on the intra-wavefront
locality of those wavefronts that were previously scheduled.
In the face of L1D thrashing, the round-robin nature of their
techniques will destroy older wavefronts' intra-wavefront locality. Our follow-up work on Divergence-Aware Warp Scheduling (DAWS)23 builds upon the insights in
CCWS. DAWS attempts to proactively predict the cache footprint of each warp. It does this by considering the impact of
memory and control divergence while taking into account
the loop structure of computation kernels. In that work, we
demonstrate through an example that DAWS enables non-optimized GPU code, where locality is not managed by the
programmer, to perform within 4% of optimized code.
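The following sketch gives a flavor of the per-warp footprint accounting DAWS performs before warps enter a loop; the fields, the simple lines-per-load formula, and the admission loop are illustrative assumptions rather than the exact mechanism described in that paper.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical summary of one warp's behavior inside a loop body.  The
// field names and the footprint formula below are our own simplification;
// DAWS derives this information from predicted control and memory
// divergence, and its exact accounting differs.
struct WarpLoopProfile {
    uint32_t active_lanes;         // lanes still active after control divergence
    uint32_t loads_per_iteration;  // static memory instructions in the loop body
    bool     memory_divergent;     // one cache line per lane vs. one per warp
};

// Estimate how many L1D cache lines one loop iteration of this warp touches.
uint32_t footprint_lines(const WarpLoopProfile& w) {
    uint32_t lines_per_load = w.memory_divergent ? w.active_lanes : 1;
    return w.loads_per_iteration * lines_per_load;
}

// Admit warps into the loop only while their combined predicted footprint
// fits in the L1D, so each admitted warp keeps its data resident across
// iterations.
std::vector<int> admit_warps(const std::vector<WarpLoopProfile>& warps,
                             uint32_t l1d_lines) {
    std::vector<int> admitted;
    uint32_t used = 0;
    for (int i = 0; i < static_cast<int>(warps.size()); ++i) {
        uint32_t need = footprint_lines(warps[i]);
        if (used + need > l1d_lines) break;
        used += need;
        admitted.push_back(i);
    }
    return admitted;
}
```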
Current GPU architectures are excellent at accelerating applications with copious parallelism whose memory access is regular and statically predictable. Modern CPUs feature deep cache hierarchies and a relatively large amount of cache per thread, making them better suited for workloads with irregular locality. However, the limited thread count and memory bandwidth of CPUs restrict their ability to exploit pervasive parallelism.
Each design has problems running the important class of
highly parallel irregular applications where threads access
data from disparate regions of memory. The question of how future architectures can accelerate these applications remains open.
Many highly parallel, irregular applications are commercially important and found in modern data centers. The highly cache-sensitive benchmarks in our paper include several such workloads: Memcached,10 a parallel garbage collector,2 and a breadth-first search graph traversal5 program. We demonstrate that these applications are highly sensitive to L1 cache capacity when naive thread schedulers are used. Furthermore, we show that a relatively small L1 cache can capture their locality and improve performance, provided an intelligent issue-level thread scheduling scheme is used.
Although our work focuses primarily on performance, the impact of CCWS, and consequently of fine-grained thread scheduling, on power consumption is important. As modern chips become progressively more power limited, preserving locality in the cache can be an effective way to reduce power consumption. CCWS can be tuned to reduce the number of data cache misses even further, at the expense of some performance.
Intellectually, we feel this work offers a new perspective on fine-grained memory system management. We
believe that integrating the cache system with the issue-level thread scheduler to change the access stream
seen by the memory system opens up a new direction of
research in cache management. CCWS demonstrates that
a relatively simple, dynamically adaptive, feedback-driven
scheduling system can vastly improve the performance of
an important class of applications.
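As an illustration of such a feedback-driven scheduler, the sketch below throttles the number of issue-eligible wavefronts based on a per-wavefront lost-locality score (for example, raised when a wavefront misses on a line it recently evicted). The scoring constants and the linear scaling are illustrative assumptions, not the exact point system used by CCWS.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative feedback loop in the spirit of CCWS: wavefronts that keep
// missing on lines they themselves recently evicted accumulate a
// lost-locality score, and the scheduler shrinks the set of wavefronts
// allowed to issue until the scores decay.  Constants and names are ours,
// not the paper's exact values.
struct Throttle {
    std::vector<uint32_t> score;   // lost-locality score per wavefront
    uint32_t score_limit = 64;     // total score at which throttling is hardest
    uint32_t min_eligible = 1;

    explicit Throttle(size_t num_wavefronts) : score(num_wavefronts, 0) {}

    void on_lost_locality(size_t wf) { score[wf] += 4; }  // e.g., victim-tag hit
    void decay() { for (auto& s : score) s = (s > 0) ? s - 1 : 0; }

    // More detected thrashing -> fewer wavefronts may issue memory requests.
    uint32_t eligible_count() const {
        uint32_t total = 0;
        for (auto s : score) total += s;
        uint32_t n = static_cast<uint32_t>(score.size());
        if (total >= score_limit) return min_eligible;
        // Scale linearly between all wavefronts and the minimum.
        return std::max(min_eligible,
                        n - (n - min_eligible) * total / score_limit);
    }
};
```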
We thank Norm Jouppi, Yale N. Patt, Aamer Jaleel, Wilson
Fung, Hadi Jooybar, Inderpreet Singh, Tayler Hetherington,
Ali Bakhoda, and our anonymous reviewers for their insightful feedback. We also thank Rimon Tadros for his work on
the garbage collector benchmark. This research was funded
in part by a grant from Advanced Micro Devices Inc.
1. Bakhoda, A., Yuan, G., Fung, W., Wong, H., Aamodt, T. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of International Symposium on Performance Analysis of Systems and Software (ISPASS 2009).
2. Barabash, K., Petrank, E. Tracing garbage collection on highly parallel platforms. In Proceedings of International Symposium on Memory Management (ISMM 2010).
3. Beckmann, B.M., Marty, M.R., Wood, D.A. ASR: Adaptive selective replication for CMP caches. In Proceedings of International Symposium on Microarchitecture (MICRO 39) (2006), 443–454.
4. Belady, L.A. A study of replacement algorithms for a virtual-storage computer. IBM Syst. J. 5, 2 (1966).
5. Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J., Lee, S.H., Skadron, K. Rodinia: A benchmark suite