interaction approximation scheme. Our implementation
is based on the kernel-independent FMM.
28, 29 However, this
choice affects only the construction of certain operators
used in the algorithm. It does not affect the overall “flow”
of the algorithm or its distributed-memory parallelization.
(It does affect single-core performance as vectorization and
fine-grain parallelism depend on the FMM scheme.)
Our parallel algorithm can be summarized as follows.
Given the particles, distributed in an arbitrary way across
MPI processes, and their associated source densities, we
seek to compute the potential at each point. (For simplicity, in this paper, we assume that source and target points
coincide.) First, we sort the points in Morton order,
24 a kind
of space-filling curve, and then redistribute them so that
each process owns a contiguous chunk of the sorted array.
Second, we create the so-called locally essential tree (LET,
defined in Section 3), for each process in parallel. Third, we
evaluate the sums using the LETs across MPI processes and
using GPU and OpenMP acceleration within each process.
All the steps of the algorithm are parallelized, including tree
construction and evaluation.
related work: There is a considerable literature on parallel algorithms for N-body problems. A work–depth analysis of the algorithm gives O(N) work and O(log N) depth for
particle distributions that result in trees that have depth
bounded by O(log N). Greengard and Gropp analyzed and
implemented the algorithm using the concurrent-read,
exclusive-write (CREW) PRAM model for the case of uniform
particle distributions.
9 The implementation is straightforward using level-by-level traversals of the tree and using data
parallelism. The complexity of the algorithm, omitting the
accuracy-related term, is O(N/p)+O(log p), where p is the
number of threads. This result extends to message passing
implementations.
The implementation, parallel scalability, and complexity
analysis become harder when one deals with nonuniform
distributions of particles. Callahan and Kosaraju2 propose
an exclusive read exclusive write (EREW)-PRAM algorithm
with p = N processors which is work optimal, takes log p time,
and does not depend on the distribution of the points. To our
knowledge, that algorithm has not been used in practice. Our
algorithm uses many ideas that first appeared in the seminal
work of Warren and Salmon, who introduced space-filling
curves using Morton-ordering for partitioning the points
across processors, the notion of local essential trees (LET),
and a parallel tree construction algorithm.
27 These ideas
were widely adopted and analyzed in other works.
7, 23
Teng was the first to provide a complexity analysis that
accounts for communication costs for particle distributions
whose tree depth is bounded by log N (in finite floating point
precision, all particle distributions satisfy this assumption).
25 The key idea is to build a communication graph in
which graph nodes correspond to tree octants and edges to
interactions between octants. Teng shows that FMM communication graphs have a bisector whose edge cut (in three
dimensions) is bounded by O(N2/3(log N)4/3); to compare,
the bisector for a uniform grid is O(N2/3). This result shows
that scalable FMM calculations are theoretically possible.
Teng outlines a divide-and-conquer parallel algorithm to
construct and partition the tree. Morton-order sorting is
used to build the tree and repartitioning is done using a geometric graph partitioning method that guarantees low edge-cut and good load balancing.
18 Assuming ( 1) a parallel radix
sort is used to partition the points and ( 2) the depth of the
FMM tree grows logarithmically with N, the overall algorithm
scales as O(N/p) + O(log p). The constants in the complexity
estimates involve flop rate, memory latency, and bandwidth
parameters of the underlying machine architecture.
Work that is focused more on actual implementations
and performance analysis includes Kurzak and Pettitt15
and Ogata et al.
19 Several groups have been working on
GPU acceleration.
5, 11, 12, 20, 30 Finally, for nearly uniform
distributions of particles, one can use fast Fourier transforms (particle-in-cell methods) to calculate the sums in
O(N log N) time.
Contributions: To our knowledge, there are no FMM
implementations that demonstrate scalability on thousands
of MPI processes for highly nonuniform particle distributions. Our main contribution is to produce and demonstrate
the scalability of such an implementation that exploits all
levels of parallelism. In Section 3, we discuss the distributed
memory parallelism and introduce a novel FMM-specific allreduce algorithm that uses hypercube routing. In Section 4,
we introduce the shared memory parallelism of our implementation, and in Section 5, we discuss our experiments
and the scalability results.
2. outLiNe of fmm
In this section, we describe the FMM data structures, the
main algorithmic steps, and the parallel scalability of the
method. Without loss of generality for the rest of the paper,
we assume that the source particles yj and the target particles xi coincide. The FMM tree construction ensures that
every leaf octant contains no more than q particles. The
sequential algorithm is simple: the root octant is chosen to
be a cube that contains all the particles; we insert the particles in the tree one by one; we subdivide an octant when it
contains more than q particles. The parallel construction is
more involved, and we describe it in Section 3.
After the tree construction, for each octant b we need
to construct its interaction lists. These lists contain other
octants in the tree. Each list represents a precisely defined
spatial neighborhood around b, and for every list we have to
compute “interactions” between b and the octants in that
list. By “interactions,” we refer to floating point operations
that correspond to matrix–matrix or matrix–vector multiplications. Once the interaction lists have been constructed, the
tree is traversed twice: fist bottom-up and then top-down. In
each traversal for each octant, we perform calculations that
involve other octants in its interaction lists.
One difficulty in FMM codes is the efficient mathematical representation and implementation of the
octant–octant interaction matrices so that we achieve algorithmic efficiency without compromising accuracy. The size
of the matrices varies, but typical dimensions are 100–1000,
depending on the implementation. FMM algorithm designers seek to reduce the size of the matrices, to employ sym-metries to reduce storage needs, to use precomputation,