Algorithm 1.
// APPROXIMATE INTERACTIONS
// ( 1) S2U: source-to-up step
∀b ∈ L : source to up densities interaction
// ( 2) U2U: up-to-up step (upward)
Postorder traversal of T
∀b ∈ T : interact(b, P(b ))
// (3a) VlI : V-list step
∀b ∈ T : ∀a ∈ V(b ) interact(b, a);
// (3b) XlI : X-list step
∀b ∈ T : ∀a ∈ X(b ) : interact(b, a);
// ( 4) D2D: down-to-down step (downward)
Preorder traversal of T
∀b ∈ T : interact(b, a);
// (5a) WlI: W-list step
∀b ∈ L : ∀a ∈ W(b ) : interact(b, a);
// (5b) D2T : down-to-targets step
∀b ∈ L : evaluate potential on xi ∈ b;
//DIRECT INTERACTIONS
// UlI: U-list step (direct sum)
∀b ∈ L : ∀a ∈ Ub : interact(b, a);
Our overall strategy is to use MPI-based distributed memory parallelism to construct the global tree and to partition
it into overlapping subtrees (the LETs) in order to remove
dependencies between steps (3a)/(5a) and (3b)/(5b). We handle
the concurrent writes explicitly and we use shared-memory-based parallelism on GPUs within each MPI process to
accelerate the direct interactions and steps ( 1), ( 3), ( 5) of the
indirect interactions in Algorithm 1. In the following sections, we give the details of our approach.
3. DistRiButeD memoRy PARALLeLism
The main components of our scheme are ( 1) the tree
construction, in which the global FMM tree is built and each
process receives its locally essential tree and a set of leaves
for which it assumes ownership and ( 2) the evaluation, in
which each process evaluates the sum at the particles of the
leaf octants it owns. This sum has two components: direct
interactions (evaluated exactly) and indirect interactions
(evaluated approximately).
The input consists of the particles and their source densities. The output of the algorithm is the potential at the
particles. Before we describe the algorithm, we need the following definition:
27
locally essential tree (le T): Given a partition of L across
processes so that Lk is the set of leaf-octants assigned to the
process k (i.e., the potential in these octants is computed by
process k), the LET for process k is defined as the union of
the interaction lists of all owned leaves and their ancestors:
The basic idea27 for a distributed memory implementation of the FMM algorithm is to partition the leaves of the FMM
tree across processes, construct the LET of each process,
and then compute the N-body sum in parallel. There are two
communication-intensive phases in this approach: the first
phase is the LET construction and the second phase is an allreduce during evaluation. Next, we discuss the main components in these two phases.
3. 1. tree construction
The inputs for the tree construction procedure are the
particles. The output is the local essential tree on each
process, which is subsequently used in the computation,
along with geometrical domain decomposition of the unit
cube across MPI processes. The latter is used throughout
the algorithm. The tree construction involves ( 1) the construction of a distributed linear complete octree that contains only the leaf octants and ( 2) the construction of the
per-process LETs.
We start by creating the distributed and globally Morton-sorted array containing all the leaves from the global tree.
The algorithms for this task are known (we use the one
described in Sundar et al.;
24 see also Hariharan and S.
Aluru13), and their main ingredient is the parallel Morton-sort of particles. This sort determines the overall complexity
of the tree construction.
The distribution of the leaves between processes induces
a geometric partitioning of the unit cube: each process controls the volume covered by the leaves it owns. By Wk, we will
denote the region “controlled” by process k. Each process
stores the overall geometric partitioning: we use an MPI_
AllGather to exchange the first and last octants of their
region, which is all is needed to define the partition.
Next, each process adds all ancestor octants to its local
leaves, thus creating a local tree. The task of exchanging
information about “ghost” octants still remains to be com-
pleted. To accomplish that, let us introduce the following
definitions:
• “Contributor” processes of an octant b ∈ T:
Let Ikk′ be the set of octants to which process k contributes
and which process k′ uses. Then, process k must send all
octants in Ikk′ to MPI process k′. Figure 2 provides an illustration of this procedure. That is, each process contributor of
each octant sends the octant data to all users of that octant.
Once the exchange of the Ikk′ lists has been completed, all
MPI processes insert received octants into their local trees.
This concludes the construction of the per-process LETs.
We summarize the construction of the local essential trees
in Algorithm 2.
The correctness of the construction is based on the
direct relation between LET and FMM. Consider the potential generated by the sources enclosed by some octant b.
In order to evaluate this potential outside the volume covered by C(P(b )), one does not need information regarding sources or upward densities associated with b, since
the upward density of some ancestor of b would be used
instead. This observation can be used to formalize the correctness of the LET construction. See Lashuk et al.
17 for a
more detailed discussion.