pairs of partial sums in parallel. Each step of this pair-wise summation cuts the number of partial sums x[0] x[ 1] x[ 2] x[ 3] in half and ultimately produces the final sum after log N steps. Note that
2 this implicitly builds a tree structure over the initial partial sums. x[0] x[ 1] x[ 2] x[ 3]
In the example shown in figure 10, each thread simply loads one element of the input sequence (i.e., x[0] x[ 1] x[ 2] x[ 3] it initially sums a subsequence of length one). At the end of the reduction, x[0] x[ 1] x[ 2] x[ 3] we want thread 0 to hold the sum of all elements initially loaded by the threads of its block. We can achieve this in parallel by summing values in a tree-like pattern. The loop in this kernel implicitly builds a summation tree over the input elements. The action of this loop for the simple case of a block of eight threads is illustrated in figure 11. The steps of the loop are shown as successive levels of the diagram and edges indicate from where partial sums are being read.
At the end of this loop, thread 0 holds the sum of all the values loaded by this block. If we want the final value of the location pointed to by total to contain the total of all elements in the array, we must combine the partial sums of all the blocks in the grid. One strategy would be to have each block write its partial sum into a second array and then launch the reduction kernel again, repeating the process until we had reduced the sequence to a single value. A more attractive alternative supported by the Tesla architecture is to use atomicAdd(), an efficient atomic read-modify-write primitive supported by the memory subsystem. This eliminates the need for additional temporary arrays and repeated kernel launches.
Parallel reduction is an essential primitive for parallel programming and highlights the importance of per-block shared memory and low-cost barriers in making cooperation among threads efficient. This degree of data shuffling
x[ 4]
x[ 5]
x[ 6]
x[ 7]
x[ 4]
x[ 5]
x[ 6]
x[ 7] x[i] += x[i+ 4];
x[ 4]
x[ 5]
x[ 6]
x[ 7] x[i] += x[i+ 2];
x[ 4]
x[ 5]
x[ 6]
x[ 7] x[i] += x[i+ 1];
among threads would be prohibitively expensive if done in off-chip global memory.
CUDA is a model for parallel programming that provides a few easily understood abstractions that allow the programmer to focus on algorithmic efficiency and develop scalable parallel applications. In fact, CUDA is an excellent programming environment for teaching parallel programming. The University of Virginia has used it as just a short, three-week module in an undergraduate computer architecture course, and students were able to write a correct k-means clustering program after just three lectures. The University of Illinois has successfully taught a semester-long parallel programming course using CUDA to a mix of computer science and non-computer science majors, with students obtaining impressive speedups on a variety of real applications, including the previously mentioned MRI reconstruction example.
CUDA is supported on NVIDIA GPUs with the Tesla unified graphics and computing architecture of the GeForce 8-series, recent Quadro, Tesla, and future GPUs.
References:
Archives