4. ACCELERATING FULL CODES WITH RMA
To compare our protocols and implementation with the state of the art, we analyze a 3D FFT code as well as the MIMD Lattice Computation (MILC) application, a full production code of several hundred thousand lines of source that performs quantum field theory computations. Further application case studies can be found in the original SC13 paper; they include a distributed hashtable, representative of many big-data and analytics applications, and a dynamic sparse data exchange, representative of graph traversals and of complex modern scientific codes such as n-body methods.
In all codes, we keep most parameters constant to compare the performance of PGAS languages, message passing, and MPI RMA. Thus, we did not employ advanced concepts, such as MPI datatypes or process topologies, that are not available in all models (e.g., UPC and Fortran 2008).
4.1. 3D fast Fourier transform
We now discuss how to exploit overlap of computation and communication in a 3D fast Fourier transform. We use Cray's MPI and UPC versions of the NAS 3D FFT benchmark. Nishtala et al.12 and Bell et al.1 demonstrated that overlapping computation and communication can improve the performance of a 2D-decomposed 3D FFT. We compare the default “nonblocking MPI” version with the “UPC slab” decomposition, which starts to communicate the data of a plane as soon as it is available and completes the communication as late as possible. For a fair comparison, our foMPI implementation uses the same decomposition and communication scheme as the UPC version; it required only minimal code changes and results in the same code complexity.
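The following sketch illustrates this communication scheme with MPI-3 passive-target RMA. It is a simplified illustration, not the benchmark's actual code: compute_slab_fft, slab_target, and slab_disp are hypothetical stand-ins for the decomposition logic.

#include <mpi.h>

/* Hypothetical helpers standing in for the benchmark's own logic. */
extern void compute_slab_fft(double *slab);
extern int slab_target(int slab, int rank);
extern MPI_Aint slab_disp(int slab, int rank);

/* Pipeline the transpose: communicate each slab (plane) as soon as its
 * 1D FFTs are done; complete all transfers as late as possible. */
void transpose_slabs(double *slabs, int num_slabs, int slab_doubles,
                     MPI_Win win, int my_rank)
{
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);   /* one passive-target epoch */

    for (int s = 0; s < num_slabs; s++) {
        /* Puts issued for earlier slabs proceed in the background,
         * overlapping with the 1D FFTs of slab s. */
        compute_slab_fft(&slabs[(size_t)s * slab_doubles]);

        /* Communicate slab s as soon as it is available. */
        MPI_Put(&slabs[(size_t)s * slab_doubles], slab_doubles, MPI_DOUBLE,
                slab_target(s, my_rank), slab_disp(s, my_rank),
                slab_doubles, MPI_DOUBLE, win);
    }

    /* Complete communication as late as possible: flush the local puts,
     * then wait until every rank has delivered its slabs. */
    MPI_Win_flush_all(win);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_unlock_all(win);
}

The single lock_all/unlock_all epoch avoids per-slab synchronization; only the final flush and barrier remain on the critical path, which is what enables the overlap.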
Figure 5 illustrates the results for the strong-scaling class D benchmark (2048 × 1024 × 1024). UPC achieves a consistent speedup over message passing, mostly due to the overlap of communication and computation. foMPI has a somewhat lower static overhead than UPC and thus enables better overlap (cf. Figure 3b) and slightly higher performance.
4.2. MIMD lattice computation
The MIMD Lattice Computation (MILC) Collaboration studies Quantum Chromodynamics (QCD), the theory of the strong interaction.2 The group develops a set of applications, known as the MILC code, which regularly receives one of the largest allocations at US NSF supercomputer centers. The su3_rmd module, which is part of the SPEC CPU2006 and SPEC MPI benchmarks, is included in the MILC code.
The program performs a stencil computation on a 4D rectangular grid and decomposes the domain in all four dimensions to minimize the surface-to-volume ratio. To keep data consistent, neighbor communication is performed in all eight directions, and global allreductions are done regularly to check the solver convergence. The most time-consuming part of MILC is the conjugate gradient solver, which uses nonblocking communication overlapped with local computations.
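The sketch below shows the communication structure of one such solver iteration expressed with MPI-3 RMA instead of nonblocking point-to-point; the neighbor table, displacements, and the pack_face/compute_* routines are hypothetical placeholders rather than MILC's actual interfaces.

#include <mpi.h>

/* Hypothetical placeholders for the stencil kernel. */
extern void pack_face(double *buf, int dir);
extern void compute_interior(void);
extern void compute_boundary(double *local_resid);

void cg_iteration(double *faces, int face_doubles, const int neighbor[8],
                  const MPI_Aint halo_disp[8], MPI_Win win, MPI_Comm comm,
                  double *local_resid, double *global_resid)
{
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    /* Start the halo exchange in all eight directions (+/- in 4D). */
    for (int d = 0; d < 8; d++) {
        pack_face(&faces[(size_t)d * face_doubles], d);
        MPI_Put(&faces[(size_t)d * face_doubles], face_doubles, MPI_DOUBLE,
                neighbor[d], halo_disp[d], face_doubles, MPI_DOUBLE, win);
    }

    compute_interior();        /* overlapped with the outstanding puts */

    MPI_Win_flush_all(win);    /* my puts are complete at the targets  */
    MPI_Barrier(comm);         /* neighbors' puts have arrived locally */
    MPI_Win_unlock_all(win);

    compute_boundary(local_resid);

    /* Global reduction to check solver convergence. */
    MPI_Allreduce(local_resid, global_resid, 1, MPI_DOUBLE, MPI_SUM, comm);
}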
Figure 6 shows the execution time of the whole application for a weak-scaling problem with a local lattice of 4³ × 8, a size very similar to the original Blue Waters Petascale benchmark. Some computation phases (e.g., CG) execute up to 45% faster, yet we chose to report full-code performance. Cray's UPC and foMPI exhibit essentially the same performance, although the UPC code uses Cray-specific tuning15 while the MPI-3 code is portable to different architectures. The full-application performance gain over Cray's MPI-1 version is more than 15% for some configurations. The application was scaled successfully to up to 524,288 processes with all implementations. This result and our microbenchmarks demonstrate the scalability and performance of our protocols and show that the MPI-3 RMA library interface can achieve speedups competitive with compiled languages such as UPC and Fortran 2008 Coarrays while offering all of MPI's convenient functionality (e.g., topologies and datatypes). Finally, we illustrate that the new MPI-3 RMA semantics enable full applications to achieve significant speedups over message passing in a fully portable way. Since most existing codes are written in MPI, a step-wise transformation can be used to optimize the most critical parts first.
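As an illustration of such a step-wise transformation, the sketch below replaces a single performance-critical send/receive halo exchange with an RMA put into a window exposed over the existing receive buffer, leaving the surrounding message-passing code untouched. All names are hypothetical, and in production code the window would be created once and reused rather than per exchange.

#include <mpi.h>

/* Original MPI-1 exchange being replaced:
 *   MPI_Irecv(recv_buf, n, MPI_DOUBLE, left,  0, comm, &req);
 *   MPI_Send (send_buf, n, MPI_DOUBLE, right, 0, comm);
 *   MPI_Wait (&req, MPI_STATUS_IGNORE);
 */
void exchange_rma(double *send_buf, double *recv_buf, int n,
                  int right, MPI_Comm comm)
{
    MPI_Win win;

    /* Expose the existing receive buffer as an RMA window (hoist this
     * out of the hot path in real code). */
    MPI_Win_create(recv_buf, (MPI_Aint)n * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, comm, &win);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    MPI_Put(send_buf, n, MPI_DOUBLE, right, 0, n, MPI_DOUBLE, win);
    MPI_Win_flush_all(win);
    MPI_Barrier(comm);          /* the put from the left neighbor has landed */
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
}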
5. RELATED WORK
PGAS programming has been investigated in the context of UPC and Fortran 2008 Coarrays. For example, an optimized UPC Barnes-Hut implementation shows similarities to MPI-3 RMA programming by using bulk vectorized memory transfers combined with vector reductions instead of shared pointer accesses.17 Highly optimized PGAS applications often use a style that can easily be adapted to MPI-3 RMA.
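For illustration, the following hedged sketch expresses that style in MPI-3 RMA terms: one bulk MPI_Get fetches a block of remote bodies into a local buffer, which is then reduced by a simple, vectorizable loop instead of many fine-grained remote accesses. The body_t layout, the byte displacement (assuming a displacement unit of 1), and the mass reduction are invented stand-ins for the actual Barnes-Hut kernel.

#include <mpi.h>
#include <stdlib.h>

typedef struct { double x, y, z, mass; } body_t;

/* Fetch 'count' remote bodies with one bulk get and reduce them locally. */
double remote_mass_sum(int target, MPI_Aint disp_bytes, int count, MPI_Win win)
{
    body_t *buf = malloc((size_t)count * sizeof(body_t));

    MPI_Win_lock(MPI_LOCK_SHARED, target, MPI_MODE_NOCHECK, win);
    MPI_Get(buf, count * (int)sizeof(body_t), MPI_BYTE,
            target, disp_bytes, count * (int)sizeof(body_t), MPI_BYTE, win);
    MPI_Win_unlock(target, win);   /* completes the bulk transfer */

    double sum = 0.0;              /* local, vectorizable reduction */
    for (int i = 0; i < count; i++)
        sum += buf[i].mass;

    free(buf);
    return sum;
}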
[Figure 5. 3D FFT performance: GFlop/s versus number of processes for foMPI MPI-3, Cray UPC, Cray MPI-1, and the transport layer. The annotations represent the improvement of foMPI over message passing.]
[Figure 6. Full MILC code execution time: application completion time [s] versus number of processes for foMPI MPI-3, Cray UPC, Cray MPI-1, and the transport layer. The annotations represent the improvement of foMPI over message passing.]