figure 3. intra-node optimization. (a) optimal particles per octant: increasing the leaf sizes increases the relative costs of uLi, WLi, and xLi
while decreasing the costs of the other phases. (b) implementation comparison: solid horizontal bars indicate the observed wall-clock time
on each platform. in the hybrid CPu+GPu implementation, uLi runs on the GPu while the other steps run on the CPu, yielding an overall
benefit from overlapping.
90
85
80
75
70
65
60
55
Time (s)
50
45
40
35
30
25
20
15
10
5
0
Step
ULI
WLI
XLI
VLI
Up
Down
95
100
90
85
80
75
70
65
60
Time (s)
55
50
45
40
35
30
25
20
15
10
5
0
89 s ( 1.0;)
Step
Others
ULI
12 s ( 7. 4;)
2. 8 s ( 32;)
50
PointsPerBox_str
(a)
100 250 500
750
CPU+SSE CPU+SSE+MT
GPU
(b)
r2), where ⊗ is the outer product between two three-dimensional vectors.)
Platforms and architectures: We tested our code on two
supercomputing platforms, one based on conventional
CPUs and the other based on hybrid CPU+GPU nodes.
The CPU system is Jaguar PF at the National Center for
Computational Sciences (UT/ORNL), ( http://www.nccs.
gov/computing-resources/jaguar/) a Cray XT5 with 224,256
cores (using 2.6-GHz hex-core AMD Opteron CPUs, 2GB/
core), and a three-dimensional-torus network ( 9. 6 GB/s link
bandwidth). Jaguar is ranked third on the Top 500 List (www.
top500.org) as of June 2011. The GPU system is the Keeneland
initial delivery system, also at UT/ORNL (http://keeneland.
gatech.edu). Keeneland is based on the Hewlett-Packard
SL390 servers accelerated with NVIDIA Tesla M2070 GPUs.
Keeneland has 120 compute nodes, each with dual-socket,
six-core Intel X5660 2.8-GHz Westmere processors and 3
GPUs per node, with 24GB of DDR3 host memory. Nodes are
interconnected with single rail QDR Infiniband.
MPI, strong scalability tests on Jaguar: Figure 4(a)
summarizes the strong scaling experiments. The problem has 100 million unknowns. We use a nonuniform
(line) distribution. On 24,576 cores, the evaluation phase
takes just 1. 7 s, which constitutes a speedup of 205× when
increasing cores by a factor of 512×. The setup phase
(tree construction) begins to dominate at this scale in the
strong-scaling regime.
MPI, weak scalability tests on Jaguar: Figure 4(b) summa-
rizes the weak scaling results. The problem size per core is
kept fixed to approximately 672,000 unknowns. The FMM
evaluation phase and setup phase (tree construction) times
increase because of several factors. First, since we have
highly nonuniform distributions (the max leaf-level is 23
and the min is 4—for a uniform tree, the leaf level would
be 9), as we increase the problem size, the tree gets deeper
and the cost per core increases (i.e., we observe a O(N log N)
scaling), as we have not reached the asymptotic O(N) phase
of the FMM. Secondly, for nonuniform trees, it is difficult to
load balance all phases of FMM. The solution to these scal-
ing problems is to employ the hypercube-like tree-broadcast
algorithm for all of the phases of FMM. (Currently it is used
only in the postorder communication phase of the evalua-
tion phase.) Finally, the setup phase is not multithreaded,
which is the subject of future work. Nevertheless, we still
attain a reasonably good utilization of compute resources:
the evaluation phase sustains over 1. 2 GFlopsper core (in
double-precision).
6. DisCussioN AND CoNCLusioN
We have presented several algorithms that taken together
expose and exploit concurrency at all stages of the fast multipole algorithms and employ several parallel programming