Figure 1 also shows abstract definitions of the performance models for each synchronization and communication operation. The performance model for each function
depends on the exact implementation. We provide a detailed
overview of the asymptotic as well as exact performance properties of our protocols and our implementation in the next
sections. The different performance characteristics of communication and synchronization functions make a unique
combination of implementation options for each specific
use-case optimal. Yet, it is not always easy to choose this best
variant. The exact models can be used to design close-to-opti-mal implementations (or as input for model-guided autotun-ing) while the simpler asymptotic models can be used in the
algorithm design phase as exemplified by Karp et al. 7
To support post-petascale computers, all protocols need
to implement each function in a scalable way, that is, consuming O(log p) memory and time on p processes. For the
purpose of explanation and illustration, we choose to discuss
a reference implementation as a use-case. However, all protocols and schemes discussed in the following can be used
on any RDMA-capable network.
2. 1. Use-case: Cray DMAPP and XPMEM
Our reference implementation used to describe RMA protocols and principles is called foMPI (fast one sided MPI).
foMPI is a fully functional MPI- 3 RMA library implementation
for Cray Gemini (XK5, XE6) and Aries (XC30) 3 systems. In
order to maximize asynchronous progression and minimize
overhead, foMPI interfaces to the lowest-level available
For inter-node (network) communication, foMPI uses the
RDMA API of Gemini and Aries networks: Distributed Memory
Application (DMAPP). DMAPP offers put, get, and a limited set
of atomic memory operations for certain 8 Byte datatypes.
For intra-node communication, we use XPMEM, 16 a portable
Linux kernel module that allows to map the memory of one
process into the virtual address space of another. All operations can be directly implemented with load and store instructions, as well as CPU atomics (e.g., using the x86 lock prefix).
foMPI’s performance properties are self-consistent (i.e.,
respective foMPI functions perform no worse than a combination of other foMPI functions that implement the same
functionality) and thus avoid surprises for users. We now proceed to develop algorithms to implement the window creation routines that expose local memory for remote access.
After this, we describe protocols for communication and
synchronization functions over RDMA networks.
2. 2. Scalable window creation
An MPI window is a region of process memory that is made
accessible to remote processes. We assume that communication memory needs to be registered with the communication subsystem and that remote processes require a remote
descriptor that is returned from the registration to access
the memory. This is true for most of today’s RDMA interfaces including DMAPP and XPMEM.
Traditional Windows. These windows expose existing
user-memory by specifying an arbitrary local base address.
All remote accesses are relative to this address. Traditional
windows are not scalable as they require Ω( p) storage on
each of the p processes in the worst case. Yet, they are useful when the library can only access user-specified memory.
Memory addresses are exchanged with two MPI_Allgather
operations: one for DMAPP and one for XPMEM.
Allocated Windows. These windows allow the MPI library
to allocate window memory and thus use identical base
addresses on all nodes requiring only O ( 1) storage. This can
be done with a system-wide symmetric heap or with the following POSIX-compliant protocol: ( 1) a leader process chooses a
random address and broadcasts it to other processes in the
window, and ( 2) each process tries to allocate the memory
with this specific address using mmap(). Those two steps are
repeated until the allocation was successful on all the processes
(this can be checked with MPI_Allreduce). This mechanism
requires O (log p) time (with high probability).
Dynamic Windows. Here, windows can be dynamically
resized by attaching or detaching memory regions with local
MPI_Win_attach and MPI_Win_detach calls. They can be
used in, for example, dynamic RMA-based data structures.
In our implementation, the former call registers a memory
region and inserts the information into a linked list; the latter
removes a region from the list. Both calls require O ( 1) memory
per region. The access to the list on a target is purely one sided.
We use a local cache to reduce the number of remote accesses;
a simple protocol uses gets to ensure the cache validity and to
update local information if necessary.
Shared Memory Windows. These windows are only valid
for intra-node communication, enabling efficient load and
store accesses. They can be implemented with POSIX shared
memory or XPMEM with constant memory overhead per
core. 5 We implement the intra-node case as a variant of allocated windows, providing identical performance and full
compatibility with shared memory windows.
2. 3. Communication functions
Communication functions map nearly directly to low-level
hardware functions, enabling significant speedups over message passing. This is a major strength of RMA programming.
In foMPI, put and get simply use DMAPP put and get for
remote accesses or local memcpy for XPMEM accesses.
Accumulates either use DMAPP atomics (for common integer
operations on 8 Byte data) or fall back to a simple protocol
that locks the remote window, gets the data, accumulates it
locally, and writes it back. This fallback protocol ensures that
the target is not involved in the communication for true passive mode. It can be improved if we allow buffering (enabling
a space-time trade-off18) and active messages to perform the
remote operations atomically.
We now show novel protocols to implement synchronization modes in a scalable way on pure RDMA networks without
2. 4. Scalable window synchronization
MPI defines exposure and access epochs. A process starts
an exposure epoch to allow other processes access to its
foMPI can be downloaded from