so-called saxpy kernel defined by the BLAS (basic linear algebra subprograms) library. The code for performing this computation on both a serial processor and in parallel using CUDA is shown in figure 1.
The __global__ declaration specifier indicates that the procedure is a kernel entry point. CUDA programs launch parallel kernels with the extended function-call syntax
tern. The serial code consists of a loop where each iteration is independent of all the others. Such loops can be mechanically transformed into parallel kernels: each loop iteration becomes an independent thread. By assigning a single thread to each output element, we avoid the need for any synchronization among threads when writing results to memory.
The text of a CUDA kernel is simply a C function for one sequential thread. Thus, it is generally straightforward to write and is typically simpler than writing parallel code for vector operations. Parallelism is determined clearly and explicitly by specifying the dimensions of a grid and its thread blocks when launching a kernel.
Parallel execution and thread management are automatic. All thread creation, scheduling, and termination are handled for the programmer by the underlying system. Indeed, a Tesla-architecture GPU performs all thread management directly in hardware. The threads of a block execute concurrently and may synchronize at a barrier by calling the __syncthreads() intrinsic. This guarantees that no thread participating in the barrier can proceed until all participating threads have reached the barrier. After passing the barrier, these threads are also guaranteed to see all writes to memory performed by participating threads before the barrier. Thus, threads in a block may communicate with each other by writing and reading per-block shared memory at a synchronization barrier.
Since threads in a block may share local memory and synchronize via barriers, they will reside on the same physical processor or multiprocessor. The number of
// Invoke serial SAXPY kernel thread blocks can, however, greatly exceed the number of
saxpy_serial(n, 2.0, x, y); processors. This virtualizes the processing elements and
gives the programmer the flexibility to parallelize at what-
ever granularity is most convenient. This allows intuitive
problem decompositions, as the number of blocks can
be dictated by the size of the data being processed rather
than by the number of processors in the system. This also
allows the same CUDA program to scale to widely varying
numbers of processor cores.
To manage this processing element virtualization and
if( i<n ) y[i] = alpha*x[i] + y[i]; provide scalability, CUDA requires that thread blocks exe- } cute independently. It must be possible to execute blocks in any order, in parallel or in series. Different blocks have
// Invoke parallel SAXPY kernel (256 threads per block) no means of direct communication, although they may int nblocks = (n + 255) / 256; coordinate their activities using atomic memory operations saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y); on the global memory visible to all threads—by atomi-
FIG 1 cally incrementing queue pointers, for example.
This independence requirement allows thread blocks to be scheduled in any order across any number of cores, making the CUDA model scalable across an arbitrary
kernel<<<dimGrid, dimBlock>>>(... parameter list ...);
where dimGrid and dimBlock are three-element vectors of type dim3 that specify the dimensions of the grid in blocks and the dimensions of the blocks in threads, respectively. Unspecified dimensions default to 1.
In the example, we launch a grid that assigns one thread to each element of the vectors and puts 256 threads in each block. Each thread computes an element index from its thread and block IDs and then performs the desired calculation on the corresponding vector elements. The serial and parallel versions of this code are strikingly similar. This represents a fairly common pat-
void saxpy_serial(int n, float alpha, float *x, float *y)
{ for(int i = 0; i<n; ++i)
y[i] = alpha*x[i] + y[i]; }
Computing y ← ax + y in parallel using CUDA __global__ void saxpy_parallel(int n, float alpha, float *x, float *y) {
References:
Archives