EXAMPLE: SPARSE MATRIX-
VECTOR PRODUCT
A variety of parallel algorithms can be written in
CUDA in a fairly straightforward manner, even
when the data structures
involved are not simple
regular grids. SpMV (sparse
matrix-vector multiplication) is a good example of
an important numerical
building block that can be
parallelized quite directly
using the abstractions
provided by CUDA. The
kernels we discuss here,
when combined with the
provided CUBLAS vector
routines, make writing
iterative solvers such as
the conjugate gradient9
method straightforward.
A sparse n × n matrix is
one in which the number
of nonzero entries m is
only a small fraction of the }
total. Sparse matrix rep- }
resentations seek to store
only the nonzero elements
of a matrix. Since it is fairly
typical that a sparse n × n
matrix will contain only
m=O(n) nonzero elements,
this represents a substantial savings in storage space
and processing time.
One of the most common representations for
general unstructured
sparse matrices is the CSR
(compressed sparse row)
representation. The m
nonzero elements of the
matrix A are stored in row-major order in an array Av.
A second array Aj records }
the corresponding column }
index for each entry of Av.
Finally, an array Ap of n+ 1
float multiply_row(unsigned int rowsize,
unsigned int *Aj, // column indices for row
float *Av, // non-zero entries for row
float *x) // the RHS vector
{
float sum = 0;
for(unsigned int column=0; column<rowsize; ++column)
sum += Av[column] * x[Aj[column]];
return sum;
}
FIG 5
void csrmul_serial(unsigned int *Ap, unsigned int *Aj,
float *Av, unsigned int num_rows,
float *x, float *y)
{
for(unsigned int row=0; row<num_rows; ++row)
{
unsigned int row_begin = Ap[row];
unsigned int row_end = Ap[row+ 1];
y[row] = multiply_row(row_end-row_begin, Aj+row_begin,
Av+row_begin, x);
FIG 6
__global__
void csrmul_kernel(unsigned int *Ap, unsigned int *Aj,
float *Av, unsigned int num_rows,
float *x, float *y)
{
unsigned int row = blockIdx.x*blockDim.x + threadIdx.x;
if( row<num_rows )
{
unsigned int row_begin = Ap[row];
unsigned int row_end = Ap[row+ 1];
y[row] = multiply_row(row_end-row_begin, Aj+row_begin,
Av+row_begin, x);
FIG 7