and to use fast approximations (e.g., fast Fourier transforms
or low-rank singular value decompositions). All we need to
remember here is that once we fix the desired accuracy, the
computation of the interactions between an octant b and the
octants in its interaction lists can be completed in O( 1) time.
Now let us give the precise definition of these lists. For
each octant b in the tree, we create four lists of octants, the
so-called U-, V-, W- and X-list.
3 The U-list is defined only for
leaf octants. For a leaf octant b, the U-list of b consists of all
leaf octants adjacent to b, including b itself. (We say that a is
adjacent to b if b and a share a vertex, an edge, or a face.) For
any octant b, its colleagues are defined as the adjacent octants
that are in the same tree level. The V-list of an octant b (leaf
or non-leaf) consists of those children of the colleagues of b’s
parent octant, P(b), which are not adjacent to b. The W-list
is only created for a leaf octant b and contains an octant a
if and only if a is a descendant of a colleague of b, a is not
adjacent to b, and the parent of a is adjacent to b. The X-list of
an octant b consists of those octants a which have b on their
W-list. These definitions are summarized in Table 1.
For each octant b ∈ T, we store its level and two vectors,
u and d. The u vector is an approximate representation of the
potential generated by the source densities inside b. This
representation is sufficiently accurate only if the evaluation
particle is outside the volume covered by b and the colleagues
of b. For the u vector we will use the name “upward equivalent density” or simply “upward density.” This terminology
is motivated by Ying et al.
28, 29
The d vector is an approximate representation of the
potential generated by the source densities outside the volume
covered by b and colleagues of b. This approximate representation is sufficiently accurate only if the evaluation particle is
enclosed by b. For the d vector we will use the name “
downward equivalent density” or simply “downward density.”
For leaf octants (b ∈ L), we also store x, s, and f. Given the above
table 1. Notation. here we summarize the main notation used to
define the interaction lists for each octant in the fmm tree. (the
symbol “\” denotes standard set exclusion.) Note that a OEW(b ) need
not be a leaf octant; conversely, since W(b) exists only when b OE L,
a OEX(b ) implies that a OE L. We say that a is adjacent to b if b and a
share a vertex, an edge, or a face. see figure 1 for an example of the
u, V, W, and x lists in two dimensions.
a, b
q
octant lists
T
L
P(b)
A(b)
K(b)
D(b)
C(b) (colleagues)
J(b)
U-list: a Î U(b)
v-list: a Î V(b)
W-list: a Î W(b)
X-list: a Î X(b)
I(b) (interaction)
octants in the FMM tree
FMM tree level
Maximum number of particles/octant
The FMM tree
All leaf octants
Parent octant of b (/ 0 for root of T )
All ancestor octants of b
The eight children octants of b (/ 0 for b Î T )
All descendant octants of b in T
Adjacent octants to b, same level
Adjacent octants to b (arbitrary level)
a Î L adjacent to b (note: b Î U(b))
a Î K(C(P(b) ) )\C(b)
a Î D(C(b) )\J(b), P(a) Î J(b)
iff b Î W(a)
V(b) ∪ U(b) ∪ W(b) ∪ X(b)
definitions, approximately evaluating the sum in Equation
( 1) involves an upward computation pass of T, followed by a
downward computation pass of T, seen more specifically in
Algorithm 1. Let us reiterate that the details of the “
interact” step in Algorithm 1 depend on list type and the underlying FMM scheme.
2. 1. Parallelizing the evaluation phase
Understanding the dependencies in Algorithm 1 and devising efficient parallel variants can be accomplished by
using distributed and shared memory programming models. These models can then be implemented using MPI,
OpenMP, and GPU application programming interfaces.
There are multiple levels of concurrency in FMM: across
steps (e.g., the S2U and ULI steps have no dependencies),
within steps (e.g., a reduction on octants of an octant b during the VLI step), and vectorizations of the per-octant calculations. In fact, one can pipeline the S2U, U2U, and VLI steps
to expose even more parallelism. We do not take advantage
of pipelining because it reduces the modularity and generality of the implementation.
The generic dependencies of the calculations outlined in
Algorithm 1 are as follows: The Approximate Interactions
(except for steps (5a) and (5b)) and Direct Interactions
parts can be executed concurrently. For the approximate
interaction calculations, the order of the steps denotes
dependencies, for example, step ( 2) must start after step ( 1)
has been completed. Steps (3a) and (3b) can be executed in
any order. However, concurrent execution of steps (3a) and
(3b) requires a concurrent write with accumulation. This is
also true for the steps (5a) and (5b), they are independent up
to concurrent writes.
figure 1. fmm lists. here we show an example of the u, V, W, and x
lists in two dimensions for a tree octant B. one can verify that I(B) is
inside the domain enclosed by C(P(B) ).
U
V
U
V
U
V
V
V
V
V
U
B
U
X
V
U
WW
U WW
UU
WW
W WW
WU
V
V
V
V
V
V
V
V
X