version 3, is scalable and has very basic
support for fault tolerance, making it
a great candidate for early exascale
systems. However, among the things that
could benefit very large-scale execution
are improved fault-tolerance support,
better interaction with accelerators, and
XRDS: We hear much lately about the era of exascale computing and the extreme computing power it will deliver. Given the increasing gap between the speed of processors and the speed at which data can be brought in from memory, how should we proceed?
TH: Minimizing and scheduling
communication efficiently will be one
of the major challenges for exascale.
Very large-scale machines already have
relatively low communication bandwidth.
While minimizing communications is
a problem for algorithm developers,
communication scheduling is managed
by the MPI implementation. MPI-3 added several interfaces that enable more powerful communication scheduling, such as nonblocking collective operations and neighborhood collective operations. Implementing these interfaces has spawned interesting research questions that are being tackled by the community.
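As a concrete illustration of the kind of interface MPI-3 added, here is a minimal C sketch (not taken from the interview; the local_work placeholder and buffer contents are illustrative assumptions) that starts a nonblocking all-reduce, performs independent computation while the collective progresses, and only then waits for the result.

#include <mpi.h>
#include <stdio.h>

/* Placeholder for computation that does not depend on the reduction result. */
static void local_work(void) { /* ... */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double local_sum = 1.0, global_sum = 0.0;
    MPI_Request req;

    /* Start the reduction without blocking (an MPI-3 nonblocking collective). */
    MPI_Iallreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* Overlap: independent work proceeds while the collective is in flight. */
    local_work();

    /* Complete the collective before using global_sum. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}

Neighborhood collectives follow the same pattern but restrict communication to a process's neighbors in a communicator with an attached topology.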
XRDS: With the rise of data analytics,
new parallel computing paradigms have
emerged, including Spark and Hadoop.
How do you see MPI competing against
these paradigms in big data applications,
where other considerations (such as code length, fault tolerance, and accessibility to scientists outside computer science) come into play?
TH: MPI originated from within
the HPC community, which runs
large simulation codes that have
been developed over decades on very
large systems. Much of the big data
community moved from single nodes
to parallel and distributed computing
to process larger amounts of data
using relatively short-lived programs
and scripts. Programmer productivity
thus played only a minor role in MPI/
HPC, while it was one of the major
requirements for big data analytics.
While MPI codes are often orders-of-
magnitude faster than many big data
codes, they also take much longer
to develop—which is most often a
good trade-off. I would not want to
spend weeks to write a data-analytics
application to peek at some large
dataset. Yet when I am using tens of
thousands of cores running the same
code, I would probably like to invest in
improving the execution speed. So I
don’t think the models are competing;
they’re simply designed for different
uses and can certainly learn much from each other.
XRDS: MPI supports I/O operations, but
many applications of interest today
are in the big data regime where large
chunks of data have to be read/written
from/to disk. What is the status of MPI I/O? Will anything new be part of the upcoming MPI-4 standard?
TH: MPI I/O was introduced nearly
two decades ago to improve the handling
of large datasets in parallel settings.
It is used successfully in many large
applications and I/O libraries, including
HDF5. It could also be used to improve large block I/O in MapReduce applications, though this remains to be demonstrated in practice. I do not know of any major new innovations in I/O planned for MPI-4.
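For concreteness, here is a minimal C sketch of the MPI I/O style of access (the file name, block size, and offset scheme are illustrative assumptions, not details from the interview): every rank writes its own block of one shared file through a collective call, which lets the MPI library coordinate and aggregate the requests.

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank fills a local block; the size is arbitrary for illustration. */
    enum { N = 1024 };
    double block[N];
    for (int i = 0; i < N; i++) block[i] = (double)rank;

    /* Open a single shared file across all ranks. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: each rank targets a disjoint offset in the file. */
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(fh, offset, block, N, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

Libraries such as HDF5 build on exactly this interface, so applications often obtain parallel I/O through them rather than calling MPI I/O directly.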
XRDS: Compute accelerators like GPUs
play an important role in modern data
analytics and machine learning. How has
MPI responded to this change?
TH: MPI predates the time when
the use of accelerators became
commonplace. However, when
accelerators are used in distributed-memory settings (such as computer
clusters), MPI is the common way to
program them. The current model, often
called “MPI+X” (such as MPI+CUDA), combines traditional MPI with accelerator programming models (such as CUDA, OpenACC, and OpenMP) in a simple way. In this model, MPI communication is performed by the CPU. Yet this can be inefficient and inconvenient. We recently proposed a programming model called distributed CUDA, or dCUDA, to perform communication from within a CUDA compute kernel [3]. This allows use of the powerful GPU warp scheduler for hiding communication latency. In general, integrating accelerators and communication functions is an interesting research topic. Even a model called MPI+MPI, which combines MPI at different hierarchy levels, seems very useful in practice, depending, of course, on the application [4].

We thank Torsten Hoefler for sharing his knowledge, as well as Geoffrey Dillon and Pedro Lopes for their comments on an earlier version of this interview. I also thank Efstratios Gallopoulos of the University of Patras for bringing to my attention the book review [1] by Parlett of the University of California, Berkeley. Any errors are solely the responsibility of the author.

References
1. Parlett, B. N. Review of The Matrix Eigenvalue Problem: GR and Krylov Subspace Methods by D. Watkins and Numerical Methods for General and Structured Eigenvalue Problems by D. Kressner. SIAM Review 52, 4 (2010), 771–791.
2. Hoefler, T. and Gottlieb, S. Parallel zero-copy algorithms for fast Fourier transform and conjugate gradient using MPI datatypes. In Proceedings of the 17th European MPI Users' Group Meeting Conference on Recent Advances in the Message Passing Interface (Stuttgart, Germany, Sept. 12–15, 2010), 131–141.
3. Gysi, T., Baer, J., and Hoefler, T. dCUDA: Hardware supported overlap of computation and communication. In Proceedings of the ACM/IEEE Supercomputing Conference (Salt Lake City, UT, Nov. 13–18). ACM Press, New York, 2016.
4. Hoefler, T., Dinan, J., Buntinas, D., Balaji, P., Barrett, B., Brightwell, R. et al. MPI+MPI: A new hybrid approach to parallel programming with MPI plus shared memory. Computing 95, 12 (2013), 1121–1136.

Vasileios Kalantzis is a Ph.D. candidate in the Computer Science and Engineering Department of the University of Minnesota, Minneapolis, MN. His research interests are in scientific computing, more specifically, numerical linear algebra and parallel computing. He is a member of SIAM, ILAS, and ACM.

© 2017 ACM 1528-4972/17/03 $15.00