distributing data. Our guarded optimism
regarding existing languages extends
even to parallelization if, by that, we
mean finding concurrency. If data distribution is needed to achieve high-end performance, however, new programming languages or constructs
seem that much more crucial.
HPC seldom uses locks. More typically, concurrency is related to data-parallel loops—for example, time
stepping all particles or grid elements
concurrently. Meanwhile, clusters of
commodity computers have become
the price-performance winners in
HPC. Therefore, parallelization also
involves the decomposition of data
over cluster nodes. In HPC, nodes typically share data through explicit message passing, for example with MPI.
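As a small illustration of what that explicit sharing looks like, the C sketch below exchanges one boundary slab between neighboring ranks with MPI_Sendrecv; the ring of neighbors, the slab size, and the variable names are illustrative assumptions rather than code from any particular application.

/* Minimal sketch of explicit message passing between cluster nodes.
 * The ring of neighbors and the slab size are illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>

#define SLAB 1024   /* number of boundary values exchanged (assumed) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *mine = calloc(SLAB, sizeof *mine);  /* boundary data this rank owns */
    double *halo = calloc(SLAB, sizeof *halo);  /* copy of a neighbor's boundary */

    int right = (rank + 1) % nprocs;            /* neighbors in a ring */
    int left  = (rank + nprocs - 1) % nprocs;

    /* The programmer spells out every transfer: send to the right,
     * receive from the left. */
    MPI_Sendrecv(mine, SLAB, MPI_DOUBLE, right, 0,
                 halo, SLAB, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(mine);
    free(halo);
    MPI_Finalize();
    return 0;
}

Every such transfer, and the placement of the data it moves, is the programmer's responsibility.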
Consider the alternating direction implicit (ADI) method for solving
partial differential equations. Specifically, consider a 3D rectangular grid,
such as that shown in the accompanying figure. Physically, the
information on any grid cell propagates throughout the 3D volume, ultimately influencing all other cells.
Computationally, we can restrict data
propagation along only the x-axis in
one phase of computation, later along
the y-axis, and finally along the z-axis.
Ultimately, the computed physics
should remain unchanged.
Such an algorithm organizes computation along “pencils” of cells. For
example, in the first phase, all cells in
an X-aligned pencil can be updated
based solely on data values within this
pencil. Indeed, all X-aligned pencils
can be updated concurrently; then
all Y-aligned pencils; and finally, all
Z-aligned pencils. If there are N³ elements in the grid, then there are N²
pencils in each of the X, Y, or Z phases.
That is to say, there is considerable
concurrency. The BT and sPPM codes
both are organized like this, as are
multidimensional FFTs. The pseudocode might look like Table 2.
Figure: 3D grid showing pencils of cells (x-, y-, and z-axes).
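Table 2 itself is not reproduced here; as a rough stand-in, the C sketch below expresses the same three-phase structure with parallel loops in place of FORALL statements over subroutine calls. The array mydata, the grid size, and the placeholder sweep routine are illustrative assumptions; a real ADI sweep would solve a tridiagonal system along each pencil rather than the simple averaging shown.

/* Sketch of the three sweep phases over pencils of cells.  The update
 * performed by sweep() is a placeholder; only the loop structure matters. */
#define N 64
static double mydata[N][N][N];     /* mydata[z][y][x], x contiguous */

/* Update one pencil of N cells starting at p with element stride `stride`. */
static void sweep(double *p, int stride)
{
    for (int i = 1; i < N; i++)    /* placeholder: simple forward averaging */
        p[i * stride] = 0.5 * (p[i * stride] + p[(i - 1) * stride]);
}

void adi_step(void)
{
    /* Phase 1: all N*N X-aligned pencils are independent. */
    #pragma omp parallel for collapse(2)
    for (int z = 0; z < N; z++)
        for (int y = 0; y < N; y++)
            sweep(&mydata[z][y][0], 1);

    /* Phase 2: all N*N Y-aligned pencils are independent. */
    #pragma omp parallel for collapse(2)
    for (int z = 0; z < N; z++)
        for (int x = 0; x < N; x++)
            sweep(&mydata[z][0][x], N);

    /* Phase 3: all N*N Z-aligned pencils are independent. */
    #pragma omp parallel for collapse(2)
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            sweep(&mydata[0][y][x], N * N);
}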
Each subroutine call can be made
concurrently with all other calls in
the same FORALL statement. There
is, however, no way of distributing the
elements of MYDATA onto multiple
processors so that each processor has
all the data it needs for all stages of
computation. If a particular processor “owns” MYDATA(X,Y,Z), then to
process the X-aligned pencil through that cell it
needs all MYDATA(:,Y,Z) values. Then,
to process the Y-aligned pencils through those cells, it
needs all MYDATA(:,:,Z) values. Finally, to process the Z-aligned pencils through that whole plane,
it needs all MYDATA(:,:,:) values, that is, the entire array.
Therefore, while this example is
rife with concurrency, distributed-memory systems face a great challenge
both in exchanging data between processors explicitly and in distributing
data so that such costly exchanges are
minimized.
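One common remedy in such codes is to redistribute, or transpose, the data between phases so that each processor holds complete pencils for the current sweep direction. The sketch below shows the kind of collective exchange this implies, using MPI_Alltoall; the chunk layout and the name transpose_exchange are illustrative assumptions, and a real code must also pack and unpack the chunks to match its decomposition.

/* Sketch of the explicit, all-to-all data exchange needed to repartition
 * pencil data between sweep phases.  Chunk sizes are illustrative. */
#include <mpi.h>

/* Each rank has arranged its outgoing data so that the chunk destined
 * for rank r occupies sendbuf[r*chunk .. (r+1)*chunk - 1]. */
void transpose_exchange(double *sendbuf, double *recvbuf,
                        int chunk, MPI_Comm comm)
{
    /* Every rank exchanges a chunk with every other rank: this is the
     * costly movement that a good data distribution tries to minimize. */
    MPI_Alltoall(sendbuf, chunk, MPI_DOUBLE,
                 recvbuf, chunk, MPI_DOUBLE, comm);
}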
Similar issues arise even in shared-memory systems. It may be possible
for all processors to access all elements in place, but these accesses
must be coordinated, whether to prevent race conditions or to maintain
cache coherency. Even shared-memory systems benefit from spatial locality,
since each processor can then work with
complete cache lines.
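A small sketch of both points, with an illustrative grid: the reduction clause coordinates the one accumulator that every thread touches, and keeping the contiguous x index innermost lets each processor work through whole cache lines.

/* Sketch of coordinated, cache-friendly access to shared data. */
#define NG 64
static double grid[NG][NG][NG];    /* grid[z][y][x], x contiguous */

double grid_sum(void)
{
    double sum = 0.0;
    /* reduction(+:sum) prevents a race on the shared accumulator;
     * each thread sums a private copy that is combined at the end. */
    #pragma omp parallel for reduction(+ : sum)
    for (int z = 0; z < NG; z++)
        for (int y = 0; y < NG; y++)
            for (int x = 0; x < NG; x++)   /* innermost x: whole cache lines */
                sum += grid[z][y][x];
    return sum;
}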
If we focus on the easier
problem of concurrency, we could in
the long term help keep the HPC programmer from having to parallelize
explicitly. We would benefit from improvements in software. Existing compilers already identify some opportunities for automatic parallelization.
This includes progress on autoscoping—that is, automatically analyzing
source code to determine the usage
(private, shared, read-only shared,
replicated, and so on) of variables so
that a loop could be parallelized. Automatic analysis would be aided by
whole-program or interprocedural
analysis.
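To make the scoping question concrete, the loop below is annotated with the classifications an autoscoping compiler would have to infer on its own; the arrays, the scale factor, and the specific clauses are illustrative.

/* Sketch of the variable classifications autoscoping must discover. */
#define LEN 1000
static double a[LEN], b[LEN];   /* shared: each iteration writes a distinct a[i] */
static double scale = 2.0;      /* read-only shared: never written in the loop */

void scaled_copy(void)
{
    double tmp;                 /* private scratch value (one copy per thread) */
    #pragma omp parallel for shared(a, b, scale) private(tmp)
    for (int i = 0; i < LEN; i++) {
        tmp = scale * b[i];
        a[i] = tmp;
    }
}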
Runtime management of concurrency would also help. Loops might
be nested, or loop iterations might be
unbalanced. Loop counts and processor counts might not be known until
runtime. Static analysis alone cannot
balance computational loads or judge
the trade-off between fine-grained parallelism (for maximum concurrency)
and coarse-grained parallelism (to
amortize the costs of parallelization).
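The sketch below illustrates deferring such decisions to runtime: the iteration count arrives only as an argument, and schedule(runtime) leaves the chunking policy, and with it the granularity and load balance, to the OMP_SCHEDULE environment variable. The uneven work function is an illustrative stand-in.

/* Sketch of concurrency decisions deferred to runtime. */
#include <math.h>

#define MAXN 100000
static double out[MAXN];

static void work(int i)          /* illustrative: cost grows with i,
                                    so iterations are unbalanced */
{
    double s = 0.0;
    for (int k = 0; k < i; k++)
        s += sqrt((double)k);
    out[i] = s;                  /* each iteration writes its own slot */
}

void run(int n)                  /* n (assumed <= MAXN) is not known until runtime */
{
    /* schedule(runtime) leaves the chunking policy to OMP_SCHEDULE,
     * so granularity and load balance can be tuned without recompiling. */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++)
        work(i);
}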
Simpler concurrency for the HPC
programmer would also benefit from
hardware improvements. Large, globally addressable memories help.
Processors run faster with cached
data, however, so coherency must
be managed. Hardware can support
concurrency with atomic operations,
transactional memory, and active
messages.
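As one small example of such support, C11 exposes atomic operations directly, so a shared counter can be updated without a lock; the counter below is illustrative, and transactional memory and active messages would require platform-specific mechanisms not shown here.

/* Sketch of hardware-supported atomic operations via C11 atomics. */
#include <stdatomic.h>

static atomic_long cells_done;   /* shared progress counter (illustrative) */

void mark_done(long n)
{
    /* One atomic read-modify-write instruction replaces a lock. */
    atomic_fetch_add(&cells_done, n);
}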
While concurrency seems the relatively
simpler problem, managing data distribution
looks like a much more difficult task. This
is one area where new programming
languages could really offer help.
For example, the NPB BT benchmark has a cousin, BT I/O, which adds
I/O to the test. This offers a test of
adaptive maintenance—that is, adding functionality to an already written
program. The comparison was almost
a joke: setting up I/O in the original,
distributed-memory version of the
code added 144 source lines, while
the rewritten, shared-memory version
needed only one extra line!
algorithmic complexity
Performance and parallelization are
not the only pressures causing large