figure 4. strong and weak scalability on Jaguar Pf. (a) strong scaling on Jaguar Pf. the strong scalability result for 22m particles. each
bar is labeled by its absolute wall-clock time (seconds) and speedup relative to the 48-core case. there are six cores (and six openmP
threads) per mPi process. the finest level of the octree is nine and the coarsest is three. the evaluation and setup phases show 205× and
48× speedups, respectively, when increasing the core count by 512×. (b) Weak scaling on Jaguar Pf. the weak scalability of the simulation to
196,608 cores, with approximately 672,000 unknowns per core. each bar is labeled by wall-clock time (seconds) and parallel efficiency. We
use one mPi process per socket and all of the six openmP threads in each mPi process. the finest to coarsest octree levels range from 24 to
4. our implementation maintains parallel efficiency levels of 60% or more at scale.
1.05
1.00
62 s
10 s
4.0 s
193 s
0.95
459 s
( 1.0;)
( 7. 3;)
( 45;)
( 115;)
(0.64)
0.90
0.85
0.80
1. 7 s
152 s
0.75
(0.81)
100 s
0.70
6. 2 s
(205;)
43 s
(0.67)
0.65
Time
0.60
349 s
( 1.0;)
( 56;)
123 s
( 8.0;)
( 1.0)
75 s
0.55
(0.89)
0.50
67 s
0.45
0.40
( 1.0)
0.35
0.30
2. 3 s
0.25
93 s
0.20
4.0 s
( 48;)
77 s
19 s
(0.60)
0.15
0.10
110 s
( 1.0;)
( 27;)
56 s
(0.74)
( 5. 7;)
( 1.0)
0.05
0.00
Time (s)
80
90
100
110
120
130
140
150
160
170
180
190
200
70
60
50
40
30
20
10
0
48
( 1;)
No. cores
384
( 8;)
3072
( 64;)
Step
Evaluation
(a)
24576
(512;)
24,576
( 1;)
196,608
( 8;)
Setup
No. cores
(b)
98,304
( 4;)
figure 5. GPu strong scaling. We show strong scaling for fmm
evaluation of 48 million particles on up to 192 GPus of Keeneland
system, which completes in as few as 0.47 s. the solid point-line
shows the measured wall-clock time, and the dashed line shows
the ideal time under perfect speedup. each point is labeled by its
parallel efficiency (ideal = 1).
102
1.0
101
0.91
0.76
Time (s)
0.60
0.60
0.52
100
0.55
10–1
3
No. of GPUs ( 1 MPI task/GPU)
6
12
24
48
96
paradigms. We showed that we can efficiently scale the tree
setup with the major cost being the parallel sort, which in
turn exhibits textbook scalability. We described a new reduction scheme for the FMM algorithm and we demonstrated
overall scalability. We explored per-core concurrency using
the streaming paradigm on GPU accelerators with excellent speedups. FMM is a highly nontrivial algorithm with
several different phases, a combination of multiresolution
data structures, fast transforms, and highly irregular data
access. Yet, we were able to achieve significant speedups on
heterogeneous architectures.
On our largest run on 196,608 cores, we observed about
220 TFlops/s in the evaluation, about 15% of the Linpack
sustained performance. Assuming a 7× speedup on a
hybrid machine of similar size means that our FMM would
exceed one PetaFlop without further modifications. Our
main conclusion is that it is possible to construct highly
efficient scalable N-body solvers, at least up to hundreds of
thousands of cores. But what about one million or many
million cores?
One important lesson for us is that hybrid parallelism is
absolutely necessary; once we reached several thousands
of MPI processes, MPI resource management in the communication-intensive phases of the algorithm became
an issue. Employing shared memory parallelism within
the socket and combining with accelarators results in a
more scalable code. The maintainability overhead, however, can be significant especially for the GPU case and for