OCTOBER 2018 | VOL. 61 | NO. 10 | COMMUNICATIONS OF THE ACM 111
explicitly or rely on synchronization side effects of other
functions (e.g., allreduce).
Global Synchronization. Global synchronization is performed in applications based on the Bulk Synchronous Parallel (BSP) model. In MPI, it is offered by fences. It can be directly compared to Fortran 2008 Coarrays' sync all and UPC's upc_barrier, which also synchronize the memory at all processes. Figure 4b compares the performance of foMPI with Cray's MPI-2.2, UPC, and Fortran 2008 Coarrays implementations. The performance function for foMPI's fence implementation is Pfence = 2.9µs ⋅ log2(p).
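As a quick sanity check, the fence model can be evaluated directly; the sketch below only plugs process counts into the stated formula:

```python
import math

# foMPI fence latency model from the text: P_fence(p) = 2.9 us * log2(p)
def fence_latency_us(p: int) -> float:
    return 2.9 * math.log2(p)

print(f"{fence_latency_us(8):.1f} us")      # 8 processes
print(f"{fence_latency_us(32768):.1f} us")  # 32,768 processes
```

The logarithmic growth means scaling from 8 to 32,768 processes only increases the predicted fence latency by a factor of five.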
General Active Target Synchronization (PSCW). This mode may accelerate codes where the communication graph is static or changes infrequently, for example, stencil computations. Only MPI offers PSCW. Figure 4c shows the performance for Cray MPI-2.2 and foMPI when synchronizing a ring where each process has exactly two neighbors (k = 2). An ideal implementation would exhibit constant time. We observe systematically growing overheads in Cray's MPI, as well as system noise (due to network congestion, OS interrupts and daemons, and others) on runs with >1,000 processes with foMPI. We model the performance for varying numbers of neighbors; foMPI's PSCW synchronization costs with k off-node neighbors are Ppost = Pcomplete = 350ns ⋅ k, Pstart = 0.7µs, and Pwait = 1.8µs (without noise).
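Summing the four PSCW phases gives a rough per-epoch cost; treating the phases as additive on one process is an assumption made only for this sketch (post/wait execute at the target, start/complete at the origin):

```python
# foMPI PSCW cost model from the text, k = number of off-node neighbors.
# All constants converted to microseconds.
def pscw_epoch_us(k: int) -> float:
    post = complete = 0.350 * k  # 350 ns * k each
    start, wait = 0.7, 1.8
    return post + start + complete + wait

print(f"{pscw_epoch_us(2):.1f} us")  # ring topology (k = 2)
```

For the ring benchmark (k = 2), this estimate stays under 4µs per epoch, consistent with the near-constant behavior an ideal implementation should show.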
Passive Target Synchronization. Finally, we evaluate lock-based synchronization, which can be utilized to develop high-performance distributed-memory variants of shared-memory lock-based codes. The performance of lock/unlock is constant in the number of processes, as ensured by our protocols, and is thus not graphed. The performance functions are Plock,excl = 5.4µs, Plock,shrd = Plock_all = 2.7µs, Punlock,shrd = Punlock_all = 0.4µs, Punlock,excl = 4.0µs, Pflush = 76ns, and Psync = 17ns.
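From these constants, the cost of a complete passive-target epoch can be estimated; the particular lock/flush/unlock sequences below are hypothetical compositions for illustration, not measurements from the paper:

```python
# Passive-target cost model from the text (values in microseconds).
P = {"lock_excl": 5.4, "lock_shrd": 2.7, "unlock_shrd": 0.4,
     "unlock_excl": 4.0, "flush": 0.076, "sync": 0.017}

# One exclusive epoch: lock, flush pending operations, unlock.
epoch_excl = P["lock_excl"] + P["flush"] + P["unlock_excl"]
# One shared epoch: lock, flush, unlock.
epoch_shrd = P["lock_shrd"] + P["flush"] + P["unlock_shrd"]
print(f"exclusive: {epoch_excl:.3f} us, shared: {epoch_shrd:.3f} us")
```

The shared epoch is roughly three times cheaper than the exclusive one, which favors reader-heavy access patterns.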
We demonstrated the performance of our protocols and implementation using microbenchmarks, comparing with other RMA and message-passing codes. The exact performance models can be utilized to design and optimize parallel applications; however, this is outside the scope of this paper. To demonstrate the usability and performance of our design for real codes, we continue with a large-scale application study.
Overlapping Computation. Overlapping computation with communication is a technique in which computation progresses while waiting for communication to finish, reducing the number of idle CPU cycles.
Here, we measure how much of such overlap can be achieved with the compared libraries and languages. The benchmark calibrates a computation loop to consume slightly more time than the latency. It then places the computation between communication and synchronization and measures the combined time. The ratio of overlapped computation is then computed from the measured communication, computation, and combined times. Figure 3b shows the ratio of overlapped communication for Cray's MPI-2.2, UPC, and foMPI.
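The ratio can be computed as follows; the exact formula the benchmark uses is not given in the text, so the hidden-fraction definition below is an assumption:

```python
def overlap_ratio(t_comm: float, t_comp: float, t_combined: float) -> float:
    """Fraction of the communication time hidden behind computation.

    No overlap:      t_combined == t_comm + t_comp -> 0.0
    Perfect overlap: t_combined == t_comp          -> 1.0
    (assumes t_comp >= t_comm, as calibrated by the benchmark)
    """
    hidden = t_comm + t_comp - t_combined
    return max(0.0, min(1.0, hidden / t_comm))

# 10 us communication, 12 us computation, 13 us measured together:
print(overlap_ratio(10.0, 12.0, 13.0))  # 0.9
```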
Message Rate. This benchmark is similar to the latency benchmark. However, it times the start of 1,000 transactions without synchronization to determine the overhead for starting a single operation. Figure 3c presents the results for the inter-node case. Here, injecting a single 8-byte operation costs only 416ns.
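The quoted 416ns per injected operation translates directly into an achievable per-process message rate; the arithmetic below is just that conversion:

```python
overhead_ns = 416.0              # measured cost to start one 8-byte put
ops_per_sec = 1e9 / overhead_ns  # one second divided by the per-op overhead
print(f"{ops_per_sec / 1e6:.2f} million operations/s")
```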
Atomics. As the next step, we analyze the performance of various atomics that are used in a broad class of lock-free and wait-free codes. Figure 4a shows the performance of the DMAPP-accelerated MPI_SUM of 8-byte elements, a non-accelerated MPI_MIN, and an 8-byte CAS. The performance functions are Pacc,sum = 28ns ⋅ s + 2.4µs, Pacc,min = 0.8ns ⋅ s + 7.3µs, and PCAS = 2.4µs. The DMAPP acceleration lowers the latency for small operations, while the locked implementation exhibits a higher bandwidth. However, this does not consider the serialization due to the locking.
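The two accumulate models can be compared to locate the crossover between the low-latency accelerated path and the higher-bandwidth locked path; the element count below follows from the fitted constants, not from a separate measurement:

```python
# Latency models from the text, in nanoseconds; s = number of 8-byte elements.
def t_acc_sum(s: float) -> float:   # DMAPP-accelerated MPI_SUM
    return 28.0 * s + 2400.0

def t_acc_min(s: float) -> float:   # locked (non-accelerated) MPI_MIN
    return 0.8 * s + 7300.0

# Crossover: 28 s + 2400 = 0.8 s + 7300  ->  s = 4900 / 27.2
s_cross = (7300.0 - 2400.0) / (28.0 - 0.8)
print(round(s_cross), "elements")  # ~180 elements (~1.4 KB of 8-byte data)
```

Below roughly 180 elements, the accelerated path's lower base latency wins; above it, the locked path's lower per-element cost dominates.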
3.2. Synchronization schemes
Finally, we evaluate synchronization schemes utilized in numerous parallel protocols and systems. The different synchronization modes have nontrivial trade-offs. For example, PSCW performs better for small groups of processes, while fence performs best for groups that are essentially as big as the full group attached to the window. However, the exact crossover point is a function of the implementation and system. While the active target mode notifies the target implicitly that its memory is consistent, in passive target mode the user has to do this
Figure 4. Performance of atomic accumulate operations and synchronization latencies. (a) Atomic Operation Performance, (b) Latency for
Global Synchronization, and (c) Latency for PSCW (Ring Topology).