that are predicted to cause interference. Changing the thread
scheduler to improve cache effectiveness introduces a new
challenge not encountered by traditional cache management
techniques. The goal of the scoring system is to maximize
throughput, not just cache hit rate. Typically, massively multithreaded architectures hide long-latency operations by issuing instructions from more threads. The CCWS scoring system balances the trade-off between increasing latency tolerance and reducing the need for latency hiding: interleaving more threads improves latency tolerance, while interleaving fewer threads decreases the frequency of long-latency events. This is akin to finding the right balance between not doing too many things at once and still managing to do more than one thing at a time.
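To make this balance concrete, the following C++ sketch shows one way a score-based scheduler might decide which wavefronts may issue memory instructions. It is a minimal sketch, assuming each wavefront carries a score that grows when data it recently lost from the cache is re-requested; the names (Wavefront, lost_locality_score, score_cutoff) are illustrative stand-ins, not the exact hardware structures from the paper.

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative sketch: each wavefront carries a score that is raised
// when locality it once had in the cache is detected to be lost.
struct Wavefront {
    int id;
    uint32_t lost_locality_score;
    bool may_issue_loads;
};

// Wavefronts with the most locality to protect are considered first.
// Once the summed scores pass a cutoff, the remaining wavefronts are
// descheduled from issuing loads, trading some latency tolerance for
// fewer long-latency cache misses.
void update_issue_permissions(std::vector<Wavefront>& wavefronts,
                              uint32_t score_cutoff) {
    std::sort(wavefronts.begin(), wavefronts.end(),
              [](const Wavefront& a, const Wavefront& b) {
                  return a.lost_locality_score > b.lost_locality_score;
              });
    uint32_t cumulative = 0;
    for (Wavefront& w : wavefronts) {
        cumulative += w.lost_locality_score;
        w.may_issue_loads = (cumulative <= score_cutoff);
    }
}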
1.3. The effect of intelligently managing your attention
Intelligently managing how your attention is allocated can
increase your multitasking effectiveness. In the context of
our problem, this is equivalent to improving the cache’s
replacement policy. To contrast the effects of hardware
thread scheduling and replacement policy, consider Figures
3 and 4. They present the memory accesses created by two
different thread schedulers in terms of cache lines touched.
In this example, we assume each instruction generates four
memory requests and that we are using a fully associative, four-entry cache with a least recently used (LRU) replacement policy. Figure 3 illustrates what happens when the scheduler attempts to fill empty cycles by executing another wavefront. Under this schedule, no wavefront finds its data in the cache, because each wavefront's data is evicted by requests from other wavefronts before that wavefront is scheduled again. In fact, even if a Belady optimal replacement policy⁴ were used, only 4 hits in the cache would be possible. The scheduler in Figure 4
is aware of the data reuse present in each wavefront’s access
stream and has rearranged the wavefronts' issue order to be
more cache-friendly. The scheduler in Figure 4 chooses to
schedule the memory accesses from wavefront 0 together,
even if it means leaving the processor idle while the requests
for wavefront 0 return. This results in 12 cache hits, capturing every redundant access made by the wavefronts.
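The hit counts quoted above are easy to verify in a few lines of code. The following self-contained C++ sketch replays both access streams of Figures 3 and 4 through a four-entry, fully associative LRU cache and reports 0 and 12 hits, respectively:

#include <algorithm>
#include <cstdio>
#include <deque>
#include <vector>

// Counts hits for an access stream on a fully associative LRU cache.
int count_hits(const std::vector<char>& stream, size_t capacity) {
    std::deque<char> cache;  // front = most recently used
    int hits = 0;
    for (char line : stream) {
        auto it = std::find(cache.begin(), cache.end(), line);
        if (it != cache.end()) {
            ++hits;
            cache.erase(it);   // refresh recency on a hit
        } else if (cache.size() == capacity) {
            cache.pop_back();  // evict the least recently used line
        }
        cache.push_front(line);
    }
    return hits;
}

int main() {
    // Figure 3: round-robin issue order W0,W1,W2,W0,W1,W2.
    std::vector<char> round_robin = {
        'A','B','C','D','E','F','G','H','I','J','K','L',
        'A','B','C','D','E','F','G','H','I','J','K','L'};
    // Figure 4: cache-aware issue order W0,W0,W1,W1,W2,W2.
    std::vector<char> cache_aware = {
        'A','B','C','D','A','B','C','D','E','F','G','H','E','F','G','H',
        'I','J','K','L','I','J','K','L'};
    std::printf("round-robin: %d hits\n", count_hits(round_robin, 4));  // 0
    std::printf("cache-aware: %d hits\n", count_hits(cache_aware, 4));  // 12
}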
1.4. The effect of doing less all the time
Another simple way to decrease interference among tasks is
to place a hard limit on the number of tasks performed at
once. Our paper proposes a simple mechanism called Static
Wavefront Limiting (SWL) that does something similar. SWL is
implemented as a minor extension to the wavefront scheduling logic where a cap is placed on the number of wavefronts
that are actively scheduled when the program is launched.
Figure 5 shows the effect that limiting the number of wavefronts
actively scheduled on a core has on the cache performance and
system throughput of our highly cache-sensitive applications.
It shows that peak throughput occurs at a level of multithreading less than maximum concurrency, but
greater than the optimum for cache miss rate (which limits
concurrency to a single wavefront). In SWL, the programmer must specify a limit on the number of wavefronts when
launching the kernel. This technique is useful if the user
knows or can easily compute the optimal number of wavefronts prior to launching the kernel.
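Because SWL only gates which wavefronts are eligible for scheduling, it amounts to a small change to the issue loop. Below is a minimal sketch of the idea, assuming wavefronts are simply indexed per core; the names are ours, not GPGPU-Sim's.

#include <vector>

// Sketch of Static Wavefront Limiting: wavefronts at an index at or
// beyond the programmer-supplied cap are never considered for issue.
struct WavefrontSlot {
    bool has_ready_instruction;
};

// 'limit' is fixed when the kernel is launched, e.g., the value found
// best for this kernel and input in a sweep like the one in Figure 5.
int pick_wavefront_swl(const std::vector<WavefrontSlot>& slots,
                       unsigned limit) {
    for (unsigned i = 0; i < slots.size() && i < limit; ++i)
        if (slots[i].has_ready_instruction)
            return static_cast<int>(i);
    return -1;  // nothing ready among the actively scheduled wavefronts
}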
Our full paper²² demonstrates that the optimal number of wavefronts differs across benchmarks. Moreover, we find that this number changes within each benchmark when its input data changes. This dependence on both benchmark and input data makes an adaptive system such as CCWS, which responds to the locality of threads as they run, desirable.
2. GPU ARCHITECTURE
Our work studies modifications to the GPU-like accelerator
architecture illustrated in Figure 6. The workloads we study
are written in OpenCL or CUDA. An application begins execution on a host CPU, which then launches a kernel containing a large number of threads on the GPU. Our baseline system is modeled with GPGPU-Sim 3.x.
Our work focuses on the decision made by the wavefront
issue arbiter (WIA) shown in Figure 6. An in-order scoreboard and decode unit control when each instruction in an instruction buffer (I-Buffer) is ready to issue. The WIA
decides which of these ready instructions issues next.
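This flow can be summarized in a short sketch. It is a hedged, simplified model, assuming a flat buffer of decoded instructions and a trivial lowest-wavefront-id policy standing in for the real arbiter; the types are illustrative, not GPGPU-Sim's internal interfaces.

#include <cstdint>
#include <vector>

// One decoded instruction waiting in the I-Buffer.
struct IBufferEntry {
    bool     valid;             // filled in by the decode unit
    bool     scoreboard_ready;  // set once the scoreboard clears hazards
    uint32_t wavefront_id;
};

// The WIA's job, reduced to its essence: among all valid, hazard-free
// instructions, pick which one issues this cycle (here, the instruction
// from the lowest-numbered wavefront).
int wia_select(const std::vector<IBufferEntry>& ibuffer) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(ibuffer.size()); ++i) {
        const IBufferEntry& e = ibuffer[i];
        if (!e.valid || !e.scoreboard_ready) continue;  // cannot issue
        if (best == -1 || e.wavefront_id < ibuffer[best].wavefront_id)
            best = i;
    }
    return best;  // -1 if no instruction can issue this cycle
}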
Figure 3. Example access pattern (represented as cache lines
accessed) resulting from a throughput-oriented round-robin
scheduler. The letters (A, B, C, ...) represent cache lines accessed.
Wi indicates which wavefront generated this set of accesses. For
example, the first four accesses to cache lines A,B,C,D are generated
by one instruction in wavefront 0.
[Figure 3 contents: accesses A,B,C,D (W0), E,F,G,H (W1), I,J,K,L (W2), then A,B,C,D (W0), E,F,G,H (W1), I,J,K,L (W2); the cache records 0 hits.]
Figure 4. Example access pattern resulting from a hardware
scheduler aware of its effect on the caching system. The red boxes
highlight where scheduling improves locality by changing the memory access order.
[Figure 4 contents: accesses A,B,C,D (W0), A,B,C,D (W0), E,F,G,H (W1), E,F,G,H (W1), I,J,K,L (W2), I,J,K,L (W2); the second pass over each group hits, for 12 hits.]
Figure 5. Average misses per thousand instructions (MPKI) and
harmonic mean (HMEAN) performance improvement of highly
cache-sensitive benchmarks with different levels of multithreading.
Instructions per cycle (IPC) is normalized to 32 wavefronts.
[Figure 5 plot: MPKI and HMEAN normalized IPC versus the number of wavefronts actively scheduled (0 to 35).]