• Understand the paradigm. What are data-parallel computation and the streaming-memory model?
• Understand your code. Which portions operate at which level of granularity?
• Understand the environment. How does it help solve the problem?
If targeting a GPU, are there operations that can leverage the existing graphics-related hardware? Are your data types small enough? GPUs are designed to operate on small data elements so media data (image/video pixels or audio samples) is a good fit. Or when sorting on the GPU, working with key-index pairs separately is often a win. Then the actual movement of data records can be done on the CPU, or on the GPU as a separate pass.
GPUs are optimized for work with 1D, 2D, or 3D arrays of similar data elements. Array operations are often faster using GPU hardware because it can transparently optimize them for spatially coherent access.
When reading such arrays, the GPU can easily linearly interpolate regular array data. This effectively enables a floating-point (fuzzy) array index. Many mathematical algorithms use either a simple linear interpolation of array elements or slightly higher-order schemes that can be implemented as a few linear interpolations. GPU hardware has a significant proportion of silicon allocated to optimizing the performance of these operations.
Algorithms that involve an accumulation or summation of values into a set of results (instead of just a write/ copy) can leverage yet another large chunk of special silicon on GPUs: the blender is designed for efficiently compositing or accumulating values into an array. Some matrix math algorithms and reduction operations have shown benefits here.
REGISTER PRESSURE
Some architectures (such as GPUs) are flexible in that they can assign variable numbers of threads to a core based on how many registers each thread uses. This enables more threads to be used when fewer temporary registers are needed, but reduces the threads available (and the paral-
lelism) for algorithms that need more registers. The key is to break tasks into simpler steps that can be executed across even more parallel threads. This is the essence of data-parallel programming.
For example, a standard 8x8 image DCT (discrete cosine transform) algorithm operates on transposed data for its second half. The transpose can takes dozens of registers to execute in place, but breaking it into two passes so that the transpose happens in the intervening I/O results in only a handful of registers needed for each half. This approach improved performance from far slower than a CPU to three times that of a highly optimized SSE assembly routine.
HINTS FOR REDUCTIONS
Reductions are common operations: find the total, average, min, max, or histogram of a set of data. The computations are easily data-parallel, but the output write is an example of cross-thread communication that must be managed carefully
Initial implementations allocated a single shared location for all the threads to write into, but execution was completely serialized by write contention to that location. Allocating multiple copies of the reduction destination and then reducing these down in a separate step was found to be much faster. The key is to allocate enough intermediate locations to cover the number of cores ( hundreds) and, therefore, performance level that you want to scale to.
PROGRAMMING THE MEMORY SUBSYSTEM
The data-parallel paradigm extends to the memory subsystem as well. A full data-parallel machine is able not only to process individual data elements separately, but also to read and write those elements in parallel. This characteristic of the memory subsystem is as important to performance as the execution model. For example, I/O ports are a shared resource, and performance is improved if multiple threads are not contending for the same one.
Data structures manipulated imply memory-access patterns. We have seen cases where switching from pointer-based data structures such as linked lists or sparse trees to data-parallel-friendly ones (regular arrays, grids, packed
References:
Archives