unsigned int blocksize = 128; // or any size up to 512 unsigned int nblocks = (num_rows + blocksize - 1) / blocksize; csrmul_kernel<<<nblocks,blocksize>>>(Ap, Aj, Av, num_rows, x, y);
elements records the extent of each row in the previous arrays; the entries for row i in Aj and Av extend from index Ap[i] up to, but not including, index Ap[i+ 1].
This implies that Ap[0] will always be 0 and Ap[n] will always be the number of nonzero elements in the matrix. Figure 4 shows an example of the CSR representation of a simple matrix.
Given a matrix A in CSR form, we can compute a single row of the product y = Ax using the multiply_row() procedure shown in figure 5.
Computing the full product is then simply a matter of looping over all rows and computing the result for that row using multiply_row(), as shown in figure 6.
This algorithm can be translated into a parallel CUDA kernel quite easily. We simply spread the loop in csrmul_serial() over many parallel threads. Each thread will compute exactly one row of the output vector y. Figure 7 shows the code for this kernel. Note that it looks extremely similar to the serial loop used in the csrmul_serial() procedure. There are really only two points of difference. First, the row index is computed from the block and thread indices assigned to each thread. Second, we have a conditional that evaluates a row product only if the row index is within the bounds of the matrix (this is necessary since the number of rows n need not be a multiple of the block size used in launching the kernel).
Assuming that the matrix data structures have already been copied to the GPU device memory, launching this kernel will look like the code in figure 8.
The pattern that we see here is a common one. The original serial algorithm is a loop whose iterations are independent of each other. Such loops can be parallelized quite easily by simply assigning one or more iterations of the loop to each parallel thread. The programming model provided by CUDA makes expressing this type of parallelism particularly straightforward.
This general strategy of decomposing computations into blocks of independent work, and more specifically breaking up independent loop iterations, is not unique to CUDA. This is a common approach used in one form or another by various parallel programming systems, including OpenMP and Intel’s Threading Building Blocks.
__global__ void csrmul_cached(unsigned int *Ap, unsigned int *Aj,
float *Av, unsigned int num_rows, const float *x, float *y)
{ // Cache the rows of x[] corresponding to this block. __shared__ float cache[blocksize];
unsigned int block_begin = blockIdx.x * blockDim.x; unsigned int block_end = block_begin + blockDim.x; unsigned int row = block_begin + threadIdx.x;
// Fetch and cache our window of x[]. if( row<num_rows) cache[threadIdx.x] = x[row]; __syncthreads();
if( row<num_rows )
{ unsigned int row_begin = Ap[row]; unsigned int row_end = Ap[row+ 1]; float sum = 0, x_j;
for(unsigned int col=row_begin; col<row_end; ++col)
{
unsigned int j = Aj[col];
// Fetch x_j from our cache when possible
if( j>=block_begin && j<block_end )
x_j = cache[j-block_begin];
else x_j = x[j];
sum += Av[col] * x_j;
}
References:
Archives