unsigned int blocksize = 128; // or any size up to 512
unsigned int nblocks = (num_rows + blocksize - 1) / blocksize;
csrmul_kernel<<<nblocks,blocksize>>>(Ap, Aj, Av, num_rows, x, y);
FIG 8
elements records the extent
of each row in the previous
arrays; the entries for row
i in Aj and Av extend from
index Ap[i] up to, but not
including, index Ap[i+ 1].
This implies that Ap[0] will
always be 0 and Ap[n] will
always be the number of nonzero elements in the matrix.
Figure 4 shows an example of the CSR representation of a
simple matrix.
Given a matrix A in CSR form, we can compute a
single row of the product y = Ax using the multiply_row()
procedure shown in figure 5.
Computing the full product is then simply a matter of
looping over all rows and computing the result for that
row using multiply_row(), as shown in figure 6.
This algorithm can be translated into a parallel
CUDA kernel quite easily. We simply spread the loop in
csrmul_serial() over many parallel threads. Each thread
will compute exactly one row of the output vector y.
Figure 7 shows the code for this kernel. Note that it
looks extremely similar to the serial loop used in the
csrmul_serial() procedure. There are really only two points
of difference. First, the row index is computed from the
block and thread indices assigned to each thread. Second,
we have a conditional that evaluates a row product only if
the row index is within the bounds of the matrix (this is
necessary since the number of rows n need not be a multiple of the block size used in launching the kernel).
Assuming that the matrix data structures have already
been copied to the GPU device memory, launching this
kernel will look like the code in figure 8.
The pattern that we see here is a common one. The
original serial algorithm is a loop whose iterations are
independent of each other. Such loops can be parallelized
quite easily by simply assigning one or more iterations of
the loop to each parallel thread. The programming model
provided by CUDA makes expressing this type of parallelism particularly straightforward.
This general strategy of decomposing computations
into blocks of independent work, and more specifically
breaking up independent loop iterations, is not unique to
CUDA. This is a common approach used in one form or
another by various parallel programming systems, including OpenMP and Intel’s Threading Building Blocks.
__global__
void csrmul_cached(unsigned int *Ap, unsigned int *Aj,
float *Av, unsigned int num_rows,
const float *x, float *y)
{
// Cache the rows of x[] corresponding to this block.
__shared__ float cache[blocksize];
unsigned int block_begin = blockIdx.x * blockDim.x;
unsigned int block_end = block_begin + blockDim.x;
unsigned int row = block_begin + threadIdx.x;
// Fetch and cache our window of x[].
if( row<num_rows) cache[threadIdx.x] = x[row];
__syncthreads();
if( row<num_rows )
{
unsigned int row_begin = Ap[row];
unsigned int row_end = Ap[row+ 1];
float sum = 0, x_j;
for(unsigned int col=row_begin; col<row_end; ++col)
{
unsigned int j = Aj[col];
// Fetch x_j from our cache when possible
if( j>=block_begin && j<block_end )
x_j = cache[j-block_begin];
else
x_j = x[j];
sum += Av[col] * x_j;
}