a complex code like FMM. Currently, we are working on
introducing shared-memory parallelism in the tree construction, sharing the workload between CPUs and GPUs
(essentially treating the GPU as another MPI process), and
introducing shared-memory parallelism in the upward
and downward computations.
Acknowledgments
This work was supported in part by the U.S. National
Science Foundation (NSF) grants CNS-0929947, CCF-
0833136, CAREER-0953100, OCI-0749285, OCI-0749334,
OCI-1047980, CNS-0903447 and Semiconductor Research
Corporation (SRC) award 1981, U.S. Department of Energy
(DOE) grant DEFC02-10ER26006/DE-SC0004915, and a
grant from the U.S. Defense Advanced Research Projects
Agency (DARPA). Any opinions, findings and conclusions
or recommendations expressed in this material are those
of the authors and do not necessarily reflect those of NSF,
SRC, DOE, or DARPA. Computing resources on the TeraGrid
systems were provided under the TeraGrid allocation grants
ASC-070050N, CCR-090024, ASC-100019, and MCA-04N026.
We would like to thank the TeraGrid support staff, and also
the staff and consultants at NCSA, TACC, and NICS from
whom we have received significant assistance.
References
1. barnes, J., hut, P. a hirerachical o(n
logn) force-calculation algorithm.
Nature 324, 4 (december 1986),
446–449.
2. callahan, P.b., kosaraju, S.r.
algorithms for dynamic closest
pair and n-body potential fields.
in Proceedings of the 6th Annual
ACM-SIAM Symposium on Discrete
Algorithms, SODA ’ 95 (Philadelphia,
Pa, USa, 1995), Society for
industrial and applied Mathematics,
263–272.
3. carrier, J., greengard, l., rokhlin, V.
a fast adaptive multipole algorithm
for particle simulations. SIAM J. Sci.
Stat. Comput. 9, 4 (1988), 669–686.
4. cherrie, J., beatson, r.k., newsam,
g.n. fast evaluation of radial basis
functions:methods for generalized
multiquadrics in rn. SIAM J. Sci.
Comput. 23, 5 (2002), 1549–1571.
5. darve, e., cecka, c., takahashi, t. the
fast multipole method on parallel
clusters, multicore processors, and
graphics processing units. Comptes
Rendus Mécanique 339( 2–3) (2011),
185–193.
6. grama, a., gupta, a., karypis, g., kumar,
V. an introduction to Parallel computing:
design and analysis of algorithms, 2nd
edn, addison Wesley, 2003.
7. grama, a.y., kumar, V., Sameh, a.
Scalable parallel formulations of
the barnes–hut method for n-body
simulations. in Proceedings of the
1994 conference on Supercomputing,
Supercomputing ’ 94 (los alamitos,
ca, USa, 1994), ieee computer
Society Press, 439–448.
8. gray, a., Moore, a. n-body’problems
in statistical learning. Adv. Neural
Inform. Process Syst. (2001),
521–527.
9. greengard, l., gropp, W.d. a parallel
version of the fast multipole method.
Comput. Math. Appl. 20, 7 (1990),
63–71.
10. greengard, l., rokhlin, V. a fast
algorithm for particle simulations.
J. Comput. Phys. 73 (1987),
325–348.
11. gumerov, n.a., duraiswami, r. fast
multipole methods on graphics
processors. J. Comput. Phys. 227, 18
(2008), 8290–8313.
12. hamada, t., narumi, t., yokota,
r., yasuoka, k., nitadori, k., taiji,
M. 42 tflops hierarchical n-body
simulations on gPUs with applications
in both astrophysics and turbulence.
in Proceedings of SC09, the Scxy
conference Series (Portland, oregon,
november 2009), acM/ieee.
13. hariharan, b., aluru, S. efficient
parallel algorithms and software for
compressed octrees with applications
to hierarchical methods. Parallel
Comput. 31 ( 3–4) (2005), 311–331.
14. JáJá, J. an introduction to Parallel
algorithms, addison Wesley, 1992.
15. kurzak, J., Pettitt, b.M. Massively
parallel implementation of
a fast multipole method for
distributed memory machines. J. Parallel
Distr. Comput. 65, 7 (2005), 870–881.
16. greengard, l. fast algorithms for
classical physics. Science 265, 5174
(1994), 909–914.
17. lashuk, i. et al. a massively parallel
adaptive fast-multipole method
on heterogeneous architectures.
in Proceedings of the Conference
on High Performance Computing
Networking, Storage and Analysis, SC
’09 (new york, ny, USa, 2009), acM,
58:1–58: 12.
18. Miller, g., teng, S., thurston, W.,
Vavasis, S. Separators for sphere-packings and nearest neighbor graphs.
J. ACM (JACM)
44, 1 (1997), 1–29.
19. ogata, S. et al. Scalable and portable
implementation of the fast multipole
method on parallel computers.
Comput. Phys. Comm. 153, 3 (2003),
445–461.
20. Phillips, J.c., Stone, J.e., Schulten, k.
adapting a message-driven parallel
application to gPU-accelerated
clusters. in SC ’08: Proceedings of
the 2008 ACM/IEEE Conference on
Supercomputing (2008), 1–9.
Ilya Lashuk ( lashuk2@llnl.gov),
research scientist, institute for Scientific
computing research, lawrence livermore
national laboratory, livermore, ca.
Aparna Chandramowlishwaran
( aparna@cc.gatech.edu), graduate
research assistant, computational Science
and engineering division, college of
computing, atlanta, ga.
harper Langston ( harper@cc.gatech.edu),
postdoctoral associate, computational
Science and engineering division, college
of computing, atlanta, ga.
Tuan-Anh nguyen ( tuananh@cc.gatech.edu),
graduate research assistant, computational
Science and engineering division, college of
computing, atlanta, ga.
Rahul Sampath ( rahul.sampath@gmail.com),
postdoctoral associate, oak ridge national
laboratory, oak ridge, tn.
and load balancing algorithms for
parallel adaptive n-body simulation.
SIAM J. Sci. Comput. 19, 2 (1998).
26. teyssier, r. et al. full-sky weak-lensing simulation with 70 billion
particles. Astron. Astrophys. 497, 2
(2009), 335–341.
27. Warren, M.S., Salmon, J.k. a parallel
hashed octtree n-body algorithm. in
Proceedings of Supercomputing, the
Scxy conference Series (Portland,
oregon, november 1993), acM/ieee.
28. ying, l., biros, g., Zorin, d. a kernel-independent adaptive fast multipole
method in two and three dimensions.
J. Comput. Phys. 196, 2 (2004),
591–626.
29. ying, l., biros, g., Zorin, d., langston,
h. a new parallel kernel-independent
fast multipole algorithm. in
Proceedings of SC03, the Scxy
conference Series (Phoenix, arizona,
november 2003), acM/ieee.
30. yokota, r., bardhan, J., knepley, M.,
barba, l., hamada, t. biomolecular
electrostatics using a fast multipole
beM on up to 512 gPUs and a billion
unknowns. Comput. Phys. Comm. 182, 6
(Jun. 2011), 1272–1283.
Aashay Shringarpure (aashay.
shringarpure@gmail.com).
Richard Vuduc ( richie@cc.gatech.edu),
assistant professor, computational
Science and engineering division,
college of computing, atlanta, ga.
Lexing Ying ( lexing@math.utexas.edu),
associate professor, Mathematics, the
University of texas at austin, tX.
Denis Zorin ( dzorin@cs.nyu.edu),
professor, courant institute of
Mathematical Sciences. new york
University, new york, ny.
George Biros ( gbiros@acm.org), W.a.
“tex” Moncrief, Jr. Simulation-based
engineering Sciences chair, institute
of computational engineering and
Sciences, the University of texas at
austin, tX.