CACHING IN SHARED MEMORY

The SpMV algorithms outlined here are fairly simplistic. We can make a number of optimizations in both the CPU and GPU codes that can improve performance, including loop unrolling, matrix reordering, and register blocking. 10 The parallel kernels can also be reimplemented in terms of data-parallel scan operations. 11

One of the important architectural features exposed by CUDA is the presence of the per-block shared memory, a small on-chip memory with very low latency. Taking advantage of this memory can deliver substantial performance improvements. One common way of doing this is to use shared memory as a software-managed cache to hold frequently reused data, shown in figure 9.

In the context of sparse matrix multiplication, we observe that several rows of A may use a particular array element x[i]. In many common cases, and particularly when the matrix has been reordered, the rows using x[i] will be rows near row i. We can therefore implement a simple caching scheme and expect to achieve some performance benefit. The block of threads processing rows i through j will load x[i] through x[j] into its shared memory. We will unroll the multiply_row() loop and fetch elements of x from the cache whenever possible. The resulting code is shown in figure 9. Shared memory can also be used to make other optimizations, such as fetching Ap[row+ 1] from an adjacent thread rather than refetching it from memory.

Because the Tesla architecture provides an explicitly managed on-chip shared memory rather than an implicitly active hardware cache, it is fairly common to add this sort of optimization. Although this can impose some additional development burden on the programmer, it is relatively minor, and the potential performance benefits can be substantial. In the

example shown in figure 9, even this fairly simple use of shared memory returns a roughly 20 percent performance improvement on representative matrices derived from 3D surface meshes. The availability of an explicitly managed memory in lieu of an implicit cache also has the advantage that caching and prefetching policies can be specifically tailored to the application needs.

EXAMPLE: PARALLEL REDUCTION

Suppose that we are given a sequence of N integers that must be combined in some fashion (e.g., a sum). This occurs in a variety of algorithms, linear algebra being a common example. On a serial processor, we would write a simple loop with a single accumulator variable to construct the sum of all elements in sequence. On a parallel machine, using a single accumulator variable would create a global serialization point and lead to very poor performance. A well-known solution to this problem is the so-called parallel reduction algorithm. Each parallel thread sums a fixed-length subsequence of the input. We then collect these partial sums together, by summing

 

__global__ void plus_reduce(int *input, unsigned int N, int *total)

{ unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

 

// Each block loads its elements into shared memory, padding // with 0 if N is not a multiple of blocksize __shared__ int x[blocksize]; x[tid] = (i<N) ? input[i] : 0; __syncthreads();

// Every thread now holds 1 input value in x[] // // Build summation tree over elements. See attached figure. for(int s=blockDim.x/2; s>0; s=s/2) {

if(tid < s) x[tid] += x[tid + s];

__syncthreads();

}

// Thread 0 now holds the sum of all input values

// to this block. Have it add that sum to the running total if( tid == 0 ) atomicAdd(total, x[tid]);

}

FIG 10

References:

http://www.acmqueue.com

Archives