A Massively Parallel Adaptive
Fast Multipole Method on
Heterogeneous Architectures
Abstract
We describe a parallel fast multipole method (FMM) for
highly nonuniform distributions of particles. We employ
both distributed memory parallelism (via MPI) and shared
memory parallelism (via OpenMP and GPU acceleration) to
rapidly evaluate two-body nonoscillatory potentials in three
dimensions on heterogeneous high performance computing architectures. We have performed scalability tests with
up to 30 billion particles on 196,608 cores on the AMD/
CRAY-based Jaguar system at ORNL. On a GPU-enabled system (NSF’s Keeneland at Georgia Tech/ORNL), we observed
30× speedup over a single core CPU and 7× speedup over a
multicore CPU implementation. By combining GPUs with
MPI, we achieve less than 10ns/particle and six digits of
accuracy for a run with 48 million nonuniformly distributed
particles on 192 GPUs.
( 1)
and complexity bounds? And how do we implement these
approximations efficiently?
One of the first successful attempts toward fast schemes
was the Barnes–Hut method1 that brought the complexity
down to O(N log N). Greengard and Rokhlin3, 10, 22 solved the
problem. They introduced the fast multipole method (FMM)
that uses a rapidly convergent approximation scheme that
comes with rigorous error bounds and reduces the overall
complexity (in three dimensions) to O(log(1/)3N), where
is the desired accuracy of the approximation to the exact
sum. This result is impressive: we can evaluate the sum in
Equation ( 1) in arbitrary precision in linear time.
So how does FMM work? Here is a sketch. Given the positions of particles for which we want to compute all pairwise
interactions, we build an octree (henceforth we just consider
FMM in three dimensions) so that any leaf node (or octant)
has approximately a prescribed number of particles; then
we perform tree traversals to evaluate the sum. We perform
a postorder (bottom-up) traversal followed by a preorder traversal (top-down). The postorder traversal consists of calculations that involve an octant and its children. The preorder
traversal is more complex, as it involves several octants in a
neighborhood around the octant being visited.
At this level of abstraction, we begin to see how we might
parallelize the FMM. When the distribution of particles is
uniform, the analysis is relatively straightforward: we perform the traversals in a level-by-level manner, using data
parallelism across octants in each level. The longest chain of
dependencies is roughly O(log N). The challenge lies in the
case of nonuniform particle distributions, in which even the
sequential case becomes significantly more complicated.
3
Summary of the method: We describe and evaluate an
implementation of the FMM for massively parallel distributed memory systems, including heterogeneous architectures based on multicore CPU and GPU co-processors. We
use the message passing interface (MPI) for expressing
distributed memory parallelism, and OpenMP and CUDA
for shared memory and fine-grain parallelism. There exist
many variants of FMM, depending on the kernel and the
1. iNtRoDuCtioN
N-body problems are the computational cornerstone of
diverse problems in mathematical physics,
16 machine learning,
8 and approximation theory.
4 Formally, the problem is
how to efficiently evaluate
which is essentially an O(N2) computation of all pairwise
interactions between N particles. We refer to f as the “
potential,” s as the “source density,” x as the coordinates of the
target particles, y as the coordinates of the source particles, and K as the interaction kernel. For example, in electrostatic or gravitational interactions K(x, y)=K(r)= with
r = x − y and |r| its Euclidean norm. Equation ( 1) expresses
an N-body problem.