DATA-PARALLEL PROGRAMMING

Given the difficulty of finding enough subsystem tasks to assign to dozens of cores—the only elements of which there are a comparable number are data elements—the data-parallel approach is simply to assign an individual data element to a separate logical core for processing. Instead of breaking code down by subsystems, we look for fine-grained inner loops within each subsystem and parallelize those.

For some tasks, there may be thousands to millions of data elements, enabling assignment to thousands of cores. (Although this may turn out to be a limitation in the future, it should enable code to scale for another decade or so.) For example, a modern GPU can support hundreds of ALUs (arithmetic logic units) with hundreds of threads per ALU for nearly 10,000 data elements on the die at once.

The history of data-parallel processors began with the efforts to create wider and wider vector machines. Much of the early work on both hardware and data-parallel algorithms was pioneered at companies such as MasPar, Tera, and Cray.

Today, a variety of fine-grained or data-parallel programming environments are available. Many of these have achieved recent visibility by supporting GPUs. They can be categorized as follows:

Older languages (C*, MPL, Co-Array Fortran, Cilk, etc.). Several languages have been developed for fine-grained parallel programming and vector processing. Many add only a very small difference in syntax from well-known languages. Few of them support a variety of platforms and they may not be available commercially or be supported long term as far as updates, documentation, and materials.

Newer languages (XMT-C, CUDA, CAL, etc.). These languages are being developed by the hardware company involved and therefore are well supported. They are also very close to current C++ programming models syntactically; however, this can cause problems because the language then provides no explicit representation of the unique aspects of data-parallel programming or the processor hardware. Although this can reduce the changes required for an initial port, the resulting code hides the parallel behavior, making it harder to comprehend, debug, and optimize. Simplifying the initial port of serial code through syntax is not that useful to begin with, since for best performance it is often an entire algorithm that must be replaced with a data-parallel version. Further, in the interest of simplicity, these APIs may not

expose the full features of the graphics-specific silicon, which implies an underutilized silicon area.

Array-based languages (RapidMind, Acceleware, Microsoft Accelerator, Ct, etc.). These languages are based on array data types and specific intrinsics that operate on them. Algorithms converted to these languages often result in code that is shorter, clearer, and very likely faster than before. The challenge of restructuring design concepts into array paradigms, however, remains a barrier to adoption of these languages because of the high level of abstraction at which it must be done.

GPU evolution has
been driven by 3D rendering,
an embarrassingly
data-parallel problem.

Graphics APIs (OpenGL, Direct3D). Recent research in GPGPU (general-purpose computing on graphics processing units) has found that while the initial ramp-up of using graphics APIs can be difficult, they do provide a direct mapping to hardware that enables very specific optimizations, as well as access to hardware features that other approaches may not allow. For example, work by Naga Govindaraju1 and Jens Krüger2 relies on access to fixed-function triangle interpolators and blending units that the newer languages mentioned here often do not expose. Further, there is good commercial support and a large and experienced community of developers already using them.

GPUS AS DATA-PARALLEL MACHINES The GPU is the second-most-heavily used processor in a typical PC. It has evolved rapidly over the past decade to reach performance levels that can exceed the CPU by

References:

http://www.acmqueue.com

Archives