the first output element in a row r is r , while the second
0
is o = r + r , and the nth output element is o = r + r
101 n01
+ … + r ). Using this, a list of record sizes can be accu-
n
mulated to compute the absolute addresses where each
record element is to be written. Then the writes can occur
completely in parallel. Note that if the writes are done
in order, the memory-access pattern is still completely
sequential. 4
Before starting to write your code, check for tasks that are known data-parallel cases. Often you can find library routines already available for accelerating common tasks using data-parallel hardware. Most data-parallel programming environments include such libraries as a convenient way for users to begin adopting their technology.
If you need to write custom data-parallel code, the process is similar to a localized optimization effort. You can adopt data-parallel programming incrementally, since you can identify and optimize the key inner loops one at a time, without perturbing the larger-scale structure of the code base. Here are the basic steps for converting code to the data-parallel model: 1. Identify a key task that looks data-parallel. 2. Identify a data-parallel algorithm for this task. 3. Select a data-parallel programming environment. 4. Implement code. 5. Evaluate performance scaling rate. 6. Go to step 1.
Look for a segment of code that doesn’t rely greatly on cross communication between data elements, or conversely, a set of data elements that can be processed without requiring too much knowledge of each other. Look for data-access patterns that can be regularized, as opposed to arbitrary/random (such as linear arrays versus sparse-tree data structures).
While searching for candidates to parallelize, you can evaluate performance potential via Amdahl’s law: just comment out this candidate task (simulate infinite parallelism) and check to see total performance change. If there isn’t a significant improvement, going through the effort of parallelizing won’t pay off.
or in routines developed by Tera/Cray for its vector processors. For example, bitonic sorts were identified as interesting before computers were developed, but fell out of favor during the rise of current cache-based machines. Other examples are radix sorts, and prefix sum (scan) operations used for packing sparse data.
Many data-parallel programming environments are available today. Many of the criteria to use in evaluation are the same as for any development environment. The areas to look for are:
• Abstraction level. Do you need a library, a set of data-abstraction utilities, or a language?
• Syntax clarity. Are limitations of the implementation explicit in the syntax or hidden by it?
• Maintainability. Would the resulting code complexity be manageable?
• Support. Are there user groups or support services?
• Availability. How broadly distributed is the environment or any hardware that it requires?
• Compatibility. Is the environment compatible with a broad range of systems or only a specific subset?
• Lifespan. Is the environment compatible with future hardware, even from the same vendor?
• Documentation. Do the docs make sense? Are the samples useful?
• Cost. How much will it cost users of your product to get any required hardware or software?
STEP 4: IMPLEMENT CODE Code it up, at least at the pseudocode level. If implementation turns out to require more than one or two places where interthread communication is required, then this may not be a sufficiently data-parallel algorithm. In that case, it may be necessary to look for another algorithm (step 2) or another task to parallelize (step 1).
STEP 5: EVALUATE PERFORMANCE SCALING Performance at a given core count is interesting but not the key point. (If you are going to check that, be sure to compare using a realistic “before” case.) A more important metric to check is how the new code scales with increasing core count. If there is no sign of a performance plateau, the system will have some scaling headroom. After all, absolute performance relative to a single core is not as relevant as how it scales with core-count growth over time.
References:
Archives