DOI: 10.1145/3264413
Enabling Highly Scalable Remote
Memory Access Programming
with MPI-3 One Sided
By Robert Gerstenberger,* Maciej Besta, and Torsten Hoefler
Abstract
Modern high-performance networks offer remote direct
memory access (RDMA) that exposes a process’ virtual
address space to other processes in the network. The
Message Passing Interface (MPI) specification has recently
been extended with a programming interface called MPI-3
Remote Memory Access (MPI-3 RMA) for efficiently exploiting state-of-the-art RDMA features. MPI-3 RMA enables a
powerful programming model that alleviates many message
passing downsides. In this work, we design and develop
bufferless protocols that demonstrate how to implement
this interface and support scaling to millions of cores with
negligible memory consumption while providing the highest
performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for
RMA functions that enable rigorous mathematical analysis of application performance and facilitate the development of codes that solve given tasks within specified time
and energy budgets. We validate the usability of our library
and models with several application studies with up to half a
million processes. In a wider sense, our work illustrates how
to use RMA principles to accelerate computation- and data-intensive codes.
1. INTRODUCTION
Supercomputers have driven progress in many domains of society by solving challenging and computationally intensive problems in fields such as climate modeling,
weather prediction, engineering, or computational physics.
More recently, the emergence of “Big Data” problems
has resulted in an increasing focus on designing high-performance architectures that are able to process enormous
amounts of data in domains such as personalized medicine, computational biology, graph analytics, and data
mining in general. For example, the recently established
Graph500 list ranks supercomputers based on their ability
to traverse enormous graphs; the results from November
2014 illustrate that the most efficient machines can process up to 23 trillion edges per second in graphs with more
than 2 trillion vertices.
Supercomputers consist of massively parallel nodes,
each supporting up to hundreds of hardware threads in a
single shared-memory domain. Up to tens of thousands
of such nodes can be connected with a high-performance
network, providing large-scale distributed-memory parallelism. For example, the Blue Waters machine has more than
700,000 cores and a peak performance of more than 13 petaflops.
Programming such large distributed computers is far
from trivial: an ideal programming model should tame the
complexity of the underlying hardware and offer an easy
abstraction for the programmer to facilitate the development of high-performance codes. Yet, it should also be able
to effectively utilize the available massive parallelism and
various heterogeneous processing units to ensure the highest
scalability and speedups. Moreover, there has been a growing need for performance modeling support: a rigorous
mathematical analysis of application performance. Such
formal reasoning facilitates developing codes that solve
given tasks within the assumed time and energy budget.
The Message Passing Interface (MPI)11 is the de facto standard API used to develop applications for distributed-memory
supercomputers. MPI specifies message passing as well as
remote memory access semantics and offers a rich set of features that facilitate developing highly scalable and portable
codes; message passing has been the prevalent model so far.
MPI’s message passing specification does not prescribe specific ways to exchange messages and thus enables flexibility in the choice of algorithms and protocols. Specifically,
to exchange messages, senders and receivers may use eager
or rendezvous protocols. In the former, the sender sends a
message without coordinating with the receiver; unexpected
messages are typically buffered. In the latter, the sender waits
until the receiver specifies the target buffer; this may require
additional control messages for synchronization.
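For illustration, the following minimal C sketch (not taken from our implementation; the message sizes and the point at which a library switches protocols are assumptions) shows that the protocol choice is hidden behind the same MPI_Send call: small messages are typically delivered eagerly and buffered if unexpected, whereas large messages usually trigger a rendezvous handshake.

/* Minimal sketch: the same MPI_Send may be carried out with either the
 * eager or the rendezvous protocol, chosen by the MPI library (typically
 * based on message size). Run with at least two processes. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int SMALL = 8;        /* likely eager: buffered at the receiver if unexpected */
    const int LARGE = 1 << 20;  /* likely rendezvous: extra control messages */
    char *buf = malloc(LARGE);

    if (rank == 0) {
        /* Eager: the sender may complete before the receiver posts MPI_Recv. */
        MPI_Send(buf, SMALL, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        /* Rendezvous: the sender typically waits until the receiver
         * specifies the target buffer. */
        MPI_Send(buf, LARGE, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, SMALL, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(buf, LARGE, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

The size at which an implementation switches from eager to rendezvous is a library-specific tunable; the MPI standard does not mandate it.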
Despite its popularity, message passing often introduces
time and energy overheads caused by the rendezvous control
messages or copying of eager buffers; eager messaging may
also require additional space at the receiver. Finally, the
fundamental feature of message passing is that it couples
communication and synchronization: a message both transfers
the data and synchronizes the receiver with the sender.
This may prevent effective overlap of computation and communication.
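By contrast, MPI-3 RMA decouples these two concerns. The following sketch (illustrative only; it uses standard MPI-3 calls but is not code from our library) allocates a window of remotely accessible memory, moves data with a one-sided MPI_Put that requires no matching receive at the target, and synchronizes separately with MPI_Win_fence.

/* Minimal sketch of MPI-3 RMA: data movement (MPI_Put) is separate from
 * synchronization (MPI_Win_fence). */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Expose one integer per process as a window of remotely accessible memory. */
    int *win_mem;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_mem, &win);
    *win_mem = rank;

    MPI_Win_fence(0, win);                 /* synchronization: open access epoch */
    int value = 42 + rank;
    if (nprocs > 1) {
        /* Communication: one-sided put into the neighbor's window; the
         * target process does not participate in this transfer. */
        MPI_Put(&value, 1, MPI_INT, (rank + 1) % nprocs, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);                 /* synchronization: close access epoch */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}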
The original version of this paper was published in the
Proceedings of the Supercomputing Conference 2013 (SC'13),
Nov. 2013, ACM.
* RG performed much of the implementation during an internship at UIUC/NCSA,
while the analysis and documentation were performed during a scientific
visit at ETH Zurich. RG's primary email address is gerstenberger.robert@gmail.com.