figure 2. Communication of ghost octants. Process 0 sends green
octants to process 1, red octants to process 2, and brown octants
to both 1 and 2. White octants in lower-left corner are “internal” to
process 0 and not sent to anyone. the procedure is applied to both
leaves and non-leaf octants. (a) finer level of octree. (b) Coarser
level of octree.
2
2
2
2
1
1
(a)
(b)
Algorithm 2. LET construction
Input: distributed set of particles x;
output: LET on each process k
1. Lk = Points20ctree(x)
2. Bk = Lk ∪ A(Lk)
3. Ikk′ := {b Î Bk : P(b ) or C(P(b) ) overlaps with Ωk′}
4. "k¢ : k¢ ¹ k
Send Ikk′ to process k¢
Recv Ik′k from process k¢
Insert Ik′k in Bk
5. Return Bk
//MPI
//MPI
//MPI
In the last step of Algorithm 2, every process independently builds U, V, W, and X lists for the octants in its LET
which enclose the local particles (where the potential is to be
evaluated). All necessary octants are already present in the
LET, so no further communication is required in this step.
3. 2. Load balancing
Assigning each process an equal chunk of leaves may lead to
a substantial load imbalance during the interaction evaluation for nonuniform octrees. In order to overcome this difficulty, we use the following load-balancing algorithm.
After the LET setup, each leaf is assigned a weight, based
on the computational work associated with its U, V, W, and
X lists (which are already built at this particle). Then, we
repartition the leaves to ensure that total weight of the leaves
owned by each process is approximately equal. We use
Algorithm 1 from Sundar et al.
24 to perform this repartitioning in parallel. Note that similarly to the previous section,
each process gets a contiguous chunk of global (distributed)
Morton-sorted array of leaves. In the final step, we rebuild
the LET and the U, V, W, and X lists on each process.
Note that we repartition the leaves based solely on
work balance ignoring the communication costs. Such an
approach is suboptimal, but is not expensive to compute
and works reasonably well in practice. In addition to leaf-based partitioning, the partitioning at a coarser level can
also be considered;
25 we have not tested this approach.
3. 3. fmm evaluation
Three communication steps are required for the potential
evaluation. The first is to communicate the source densities
for the U-list calculation. This communication is “local” in a
sense that each process typically communicates only with its
spatial neighbors. Thus, a straightforward implementation
(say, using MPI_Isend) is acceptable.
The second communication step is to sum up the upward
densities of all the contributors of each octant, the U2U step.
During this step (bottom-up traversal), each process only
builds partial upward densities of octants in its LET. The
partial upward densities of an octant do not include contributions from descendants of the octant that belong to other MPI
processes. For this reason, we need a third step, a communication step, to collect the upward densities to the users of
each octant. This communication step must take place after
the U2U step and before the VLI and XLI steps. Once this step
is completed, every process performs a top-down traversal of
its LET without communicating with other processes.
Algorithm 3 describes a communication procedure that
combines the second and the third steps mentioned above.
This algorithm resembles the standard “AllReduce” algorithm on a hypercube. What follows is the informal description of the algorithm.
At the start, each processor r forms a pool S of octants
(and their upward densities) which are “shared”, that is, not
used by r alone. The MPI communicator (set of all MPI processes) is then split into two halves (for simplicity, we shall
assume that the communicator size is a power of two) and
each processor from the first half is paired with a “peer” processor from the other half.
Now consider those octants in S which are “used” by
some processor from the other half of the communicator (not necessarily a peer of r). If such octants exist, r
sends them (together with upward densities) to its “peer”.
According to the same rule, the “peer” will send some (
possibly none) octants and densities to r. The received octants
will be merged into S, eliminating duplicate octants while
summing densities for duplicate octants.
At this point, no further communication between the
two halves is required, and the algorithm proceeds recursively by applying itself to each of the halves. We finally note
that after each communication round, S is purged from the
“transient” octants which are no longer necessary for communications and not used locally at r.
The time complexity of this algorithm is not worse than .
To be more specific, assuming that no process uses more than
m shared octants and no process contributes to more than m
shared octants, for a hypercube interconnect, the communication complexity of Algorithm 3 is ,
where ts and tw are the latency and the bandwidth constants,
respectively. See Lashuk et al.
17 for a proof.
After the three communication steps, all the remaining steps of Algorithm 1 can be carried out without further
communication.
3. 4. Complexity for uniform distributions of particles
We have all the necessary ingredients to derive the overall
complexity of the distributed memory algorithm for uniform