communication and thus degrade performance.
The dominance of message passing has recently been
questioned as novel hardware mechanisms are introduced,
enabling new high-performance programming models.
Specifically, network interfaces evolve rapidly to implement a growing set of features directly in hardware. A key feature of today's high-performance networks is remote direct memory access (RDMA), which enables a process to directly access virtual memory at remote processes without involving the operating system or any activity at the remote side.
RDMA is supported by on-chip networks in, for example,
Intel’s SCC and IBM’s Cell systems, as well as off-chip networks such as InfiniBand, IBM’s PERCS or BlueGene/Q,
Cray’s Gemini and Aries, or even RDMA over Ethernet/TCP
(RoCE/iWARP).
RDMA support gave rise to Remote Memory Access (RMA), a powerful programming model that provides the programmer with a Partitioned Global Address Space (PGAS) abstraction, unifying the separate address spaces of processes while preserving the information about which parts are local and which are remote. A fundamental principle of RMA is that it decouples communication from synchronization and allows the two to be managed independently: processes use separate calls to initiate data transfers, to ensure the consistency of data in remote memories, and to notify remote processes. Thus, RMA generalizes the principles of shared memory programming to distributed memory computers, where data coherence is managed explicitly by the programmer to achieve the highest speedups.
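As a concrete illustration of this decoupling, consider the following minimal sketch in C using standard MPI-3 one sided calls (the window layout and the ring-neighbor exchange are our own illustrative choices): MPI_Put initiates the transfer, while a separate MPI_Win_flush completes it at the target.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* Expose one double per process as remotely accessible window memory. */
  double *base;
  MPI_Win win;
  MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &base, &win);
  base[0] = -1.0;

  MPI_Win_lock_all(0, win);            /* open a passive-target access epoch */

  /* Communication: initiate a put to the right neighbor in a ring. */
  double val = (double)rank;
  int target = (rank + 1) % size;
  MPI_Put(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);

  /* Synchronization: a separate call completes the transfer at the target. */
  MPI_Win_flush(target, win);

  /* Notification and consistency: the barrier signals that all puts are
     complete; MPI_Win_sync makes the received value visible to local loads. */
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Win_sync(win);
  printf("rank %d received %.0f\n", rank, base[0]);

  MPI_Win_unlock_all(win);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}

Note that the data transfer itself completes without any activity at the target; the barrier and MPI_Win_sync are needed here only so the target can safely read the received value from its own window.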
Hardware-supported RMA has benefits over message passing in the following three dimensions: (1) time, by avoiding synchronization overheads and the additional messages of rendezvous protocols; (2) energy, by eliminating the excessive copying of eager messages; and (3) space, by removing the need for receiver-side buffering. Several programming environments embrace RMA principles: PGAS languages such as Unified Parallel C (UPC) or Fortran 2008 Coarrays, and libraries such as Cray SHMEM or MPI-2 One Sided. Significant experience with these models has been gained in the past years1, 12, 17 and several key design principles for RMA programming have evolved.
Based on this experience, MPI's standardization body, the MPI Forum, has revamped the RMA (or One Sided) interface in the latest MPI-3 specification.11 MPI-3 RMA supports the newest generation of RDMA hardware and codifies existing RMA practice. A recent textbook4 illustrates how to use this interface to develop high-performance large-scale codes.
However, it has yet to be shown how to implement the new library interface to deliver the highest performance at the lowest memory overheads. In this work, we design and develop scalable protocols for implementing MPI-3 RMA over RDMA networks, requiring O(log p) time and space per process on p processes. We demonstrate that the MPI-3 RMA interface can be implemented while adding negligible overheads to the performance of the utilized hardware primitives.
In a wider sense, our work answers the question of whether the MPI-3 RMA interface is a viable candidate for moving towards exascale computing. Moreover, it illustrates that RMA principles provide significant speedups over message passing in both microbenchmarks and full production codes running on more than half a million processes. Finally, our work helps programmers to rigorously reason about application performance by providing a set of asymptotic as well as detailed performance models of RMA functions.
2. SCALABLE PROTOCOLS FOR RMA
We now describe protocols to implement MPI-3 RMA based on low-level RDMA functions. In all our protocols, we assume that we only have small bounded buffer space at each process (O(log p) for synchronization, O(1) for communication), no remote software agent, and only put, get, and some basic atomic operations (atomics) for remote accesses. Thus, our protocols are applicable to all current RDMA networks and are forward-looking towards exascale network architectures.
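For concreteness, the following header-style sketch shows the kind of low-level interface that these assumptions translate to; all names (rdma_put, rdma_get, and so on) are hypothetical and merely illustrate the assumed primitive set, not a real vendor API.

#include <stdint.h>
#include <stddef.h>

typedef struct rdma_ep rdma_ep_t;  /* endpoint (connection) to one remote process */

/* One-sided data movement: both calls complete without any software
   activity at the remote side. */
int rdma_put(rdma_ep_t *ep, const void *src, uint64_t remote_addr, size_t len);
int rdma_get(rdma_ep_t *ep, void *dst, uint64_t remote_addr, size_t len);

/* Basic remote atomics on 64-bit integers. */
int rdma_atomic_fadd(rdma_ep_t *ep, uint64_t remote_addr, int64_t operand,
                     int64_t *old_value);
int rdma_atomic_cas(rdma_ep_t *ep, uint64_t remote_addr, int64_t compare,
                    int64_t swap, int64_t *old_value);

/* Local call that blocks until all previously issued operations on this
   endpoint have completed remotely. */
int rdma_flush(rdma_ep_t *ep);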
We divide the RMA functionality of MPI into three separate concepts: (1) window creation, (2) communication functions, and (3) synchronization functions.
Figure 1a shows an overview of MPI’s synchronization
functions. They can be split into active target mode, in which
the target process participates in the synchronization, and
passive target mode, in which the target process is passive.
Figure 1b shows a similar overview of MPI’s communication
functions. Several functions can be completed in bulk with
bulk synchronization operations or using fine-grained request
objects and test/wait functions. However, we observed that the
completion model only minimally affects local overheads
and is thus not considered separately in the rest of this work.
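To make the two completion models concrete, the following sketch (C with MPI-3; the helper function, buffer, and target rank are illustrative and assume a passive-target epoch has already been opened on win, for example with MPI_Win_lock_all) contrasts bulk completion via a flush with fine-grained completion via a request object.

#include <mpi.h>

void completion_models(double *buf, int n, int target, MPI_Win win) {
  /* Bulk completion: issue several puts, then complete them together. */
  for (int i = 0; i < n; i++)
    MPI_Put(&buf[i], 1, MPI_DOUBLE, target, i, 1, MPI_DOUBLE, win);
  MPI_Win_flush(target, win);   /* completes all pending operations to 'target' */

  /* Fine-grained completion: a request-based put can be tested or waited
     on individually; MPI_Wait signals local completion only, so buf[0]
     may be reused afterwards. */
  MPI_Request req;
  MPI_Rput(&buf[0], 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win, &req);
  MPI_Wait(&req, MPI_STATUS_IGNORE);
}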
[Figure 1 graphic: (a) synchronization functions, split into active target (Fence; Post/Start/Complete/Wait) and passive target (Lock/Unlock; Lock_all/Unlock_all; Sync; Flush; Flush_local; Flush_all; Flush_local_all); (b) communication functions (Put; Get; Accumulate; Get_accumulate; Fetch_and_op; CAS), each with bulk or fine-grained completion. Each operation is annotated with its cost function P: {·} → T.]
Figure 1. An overview of MPI-3 RMA and associated cost functions. The figure shows abstract cost functions for all operations in terms of their input domains. (a) Synchronization and (b) communication. The symbol p denotes the number of processes, s is the data size, k is the maximum number of neighbors, and o defines an MPI operation. The notation P: {p} → T defines the input space for the performance (cost) function P. In this case, it indicates, for a specific MPI function, that the execution time depends only on p. We provide asymptotic cost functions in Section 2 and parametrized cost functions for our implementation in Section 3.