distributions of particles. Let N be the number of particles
and let p be number of processes. The number of octants is
proportional to the number of particles. The first communication cost is associated with the parallel sort of the input
particles. We have used a comparison sort instead of binning, so its time complexity is (
combination of sample sort and bitonic sort).
6 Exchanging the “ghost”
octants has the same complexity as the reduce-broadcast
algorithm described in Section 3. 3, that is, , where m
is the maximal number of “shared” octants between two processes. For a uniform grid, m can be estimated as O( (N/p)2/3)
divided by p. The communication also includes the exchange
of source densities. Assuming that the bandwidth of the system is reasonably high, we can neglect all lower order communication terms. In summary (assuming large enough
bandwidth but assuming nothing about latency), the overall
complexity of the setup phase is . For
the evaluation, we have .
Algorithm 3. Reduce and Scatter
Input: partial upward densities of “shared” octants at “
contributor” processes
Input: r (rank of current process); assume communicator size
is 2d
output: upward densities of “shared” octants at “user”
processes
//Define shared octants (for each process):
S = {b ∈ LET : #(Pu(b) ∪ Pc(b ) ) > 1}
//Loop over communication rounds (hypercube dimensions)
Fori:=d−1to0
//Process s is our partner during this communication round
1. s := r XOR 2i
2. us = s AND (2d − 2i)
3. ue = s OR (2i − 1)
4. Send to s :
5. qs = r AND (2d − 2i)
6. qe = r OR (2i − 1)
7. Delete
//Reduction
8. Recv from s and append S
9. Remove duplicates for S
10. Sum up densities for duplicate octants.
For a nonuniform particle distribution, the complexity
estimates for the setup and evaluation phases include an
additional term. Since we do not have a (nontrivial)
bound on m, this result is worse than what is theoretically
possible by Teng’s algorithm. (He partitions the tree using
both work and communication costs instead of just using
the work at the leaf octants.)
4. iNtRA-NoDe PARALLeLism
A key ingredient to the efficiency of the overall algorithm is highly tuned intra-node performance. Many
current supercomputing systems have heterogeneous
nodes, meaning they have both conventional general-purpose multicore CPU processors and more specialized
high-speed co-processors, such as GPUs. Consequently,
our implementation can exploit both CPUs (via OpenMP)
and GPUs (via CUDA).
Shared memory parallelization is common for either
type of processor. For the S2U, D2T, ULI, WLI, VLI, and XLI
steps, we can visit the octants in an embarrassingly parallel
way. Moreover, all octant visits include octant-to-octant or
octant-to-particle interactions expressed as matrix–vector
multiplications. The matrices are dense except for the VLI
calculations, which correspond to a diagonal translation.
Thus, overall there are two levels of parallelism: across
octants and across the rows of the corresponding matrix.
The same approach to parallelism applies on both CPU and
GPU co-processors, because they have essentially the same
architectural features for exploiting parallelism: shared
memory address spaces, a multilevel memory hierarchy, and
vector units for regular data parallelism. The U2U and D2D
remain sequential in our current implementation, though
they could in principle be parallelized using rake and compress methods.
14
Although the architectural features are similar between
CPUs and GPUs, differences in their scale directly affect
the algorithm and implementation. The main algorithmic
tuning parameter for the FMM is the maximum number of
particles per octant, q. Increasing q makes the leaf octants
larger and the overall tree shorter, which in turn increases
the number of flops performed by the compute-bound ULI,
WLI, and XLI steps, while decreasing the flop-cost of the
memory-bound VLI and other phases. Figure 3(a) shows
an example of this behavior. A processor’s relative balance
of performance and bandwidth will change these curves,
thereby changing the optimal value of q.
Indeed, this phenomenon gives rise to the overall performance behavior shown in Figure 3(b), where we compare
the single-socket execution times for three single-socket
implementations: (i) single-threaded CPU with explicit
vectorization using SSE intrinsics; (ii) multithreaded CPU
with four threads and SSE; and (iii) GPU using CUDA. Note
that combined threading and vectorization yields a 7. 4×
improvement over the single-threaded code. The GPU
code is 32× faster than the single-threaded CPU code and
4. 3× faster than the multithreaded CPU code, even including time to send/receive data from the GPU. Interestingly,
the GPU win comes not just from parallelization and tuning but because faster processing enables better
algorithmic tuning of q. In particular, the faster we can perform
the compute-intensive ULI step, the larger we can make q
thereby reducing the VLI and other memory-bound steps.
In other words, there is a synergy between faster processing and algorithmic tuning in which we can trade more
flops (ULI step) for less memory communication (e.g., VLI
step). Such techniques are likely to apply more generally as
many-core processors increase in use.
5. NumeRiCAL exPeRimeNts
This section evaluates the overall scalability of our implementation on two different architectures and in both strong
and weak scaling regimes. (We have used the Stokes kernel
instead of the Laplacian. K(r) is defined as 1/|r|( 1 − r ⊗ r)/