An acquisition of a shared global lock (MPI_Win_lock_all)
involves only the global lock on the master. The origin
(Process 1) fetches and increases the lock in one atomic
operation. Since no exclusive lock is present, Process 1
can proceed. Otherwise, it would repeatedly (remotely) read
the lock until no writer was present; exponential back-off
can be used to avoid congestion.
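The fetch-and-increase with back-off can be sketched as follows. This is a minimal model, not foMPI's implementation: the lock-word layout (writer registrations in the high 32 bits, reader count in the low 32 bits) is an assumption, and the remote DMAPP atomics on the master are modeled with C11 atomics in shared memory.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed 64-bit global lock word: high 32 bits register writers
 * (exclusive locks), low 32 bits count lock-all readers. */
#define WRITER_MASK 0xFFFFFFFF00000000ULL
#define ONE_READER  1ULL

/* MPI_Win_lock_all (modeled): fetch-and-add the reader part in one
 * atomic operation; if a writer is registered, undo the registration
 * and retry with exponential back-off. */
static void lock_all_acquire(_Atomic uint64_t *glock) {
    unsigned delay = 1;
    for (;;) {
        uint64_t old = atomic_fetch_add(glock, ONE_READER);
        if ((old & WRITER_MASK) == 0)
            return;                            /* no writer: lock held */
        atomic_fetch_sub(glock, ONE_READER);   /* undo registration */
        for (volatile unsigned i = 0; i < delay; ++i)
            ;                                  /* back-off spin */
        if (delay < 1u << 16)
            delay <<= 1;                       /* exponential back-off */
    }
}

/* MPI_Win_unlock_all (modeled): decrease the reader part. */
static void lock_all_release(_Atomic uint64_t *glock) {
    atomic_fetch_sub(glock, ONE_READER);
}
```

In the real protocol the fetch-and-add targets the master process remotely, so a successful acquisition costs exactly one remote atomic operation, as noted below.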
For a local exclusive lock, the origin needs to ensure two
invariants: (1) no shared global lock and (2) no local shared
or exclusive lock can be held or acquired during the local
exclusive lock. For the first invariant, the origin (Process 2)
atomically fetches the global lock from the master and increases
the writer part to register for an exclusive lock. If the fetched
value indicates lock-all accesses, the origin backs off. As
there is no global reader, Process 2 proceeds to the second
invariant and tries to acquire an exclusive local lock on
Process 1 using a compare-and-swap (CAS) with zero (cf.
Ref. 8). It succeeds and acquires the lock. If either of the two
steps fails, the origin backs off and repeats the operation.
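The two invariants can be sketched as a try-lock. Again, the bit layout and the shared-memory C11 atomics are illustrative assumptions standing in for foMPI's remote atomics; a failed attempt returns false so the caller can back off and retry as described above.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: global lock has writer registrations in the high
 * 32 bits and lock-all readers in the low 32 bits; the local lock's
 * exclusive state is its most significant bit. */
#define READER_MASK 0x00000000FFFFFFFFULL
#define ONE_WRITER  (1ULL << 32)
#define EXCL_BIT    (1ULL << 63)

/* One attempt at a local exclusive lock (modeled). */
static bool lock_exclusive_try(_Atomic uint64_t *glock,
                               _Atomic uint64_t *local) {
    /* Invariant 1: register as writer, check for lock-all readers. */
    uint64_t old = atomic_fetch_add(glock, ONE_WRITER);
    if ((old & READER_MASK) != 0) {
        atomic_fetch_sub(glock, ONE_WRITER);  /* deregister, back off */
        return false;
    }
    /* Invariant 2: CAS the local lock with zero (no holders). */
    uint64_t expected = 0;
    if (!atomic_compare_exchange_strong(local, &expected, EXCL_BIT)) {
        atomic_fetch_sub(glock, ONE_WRITER);  /* deregister, back off */
        return false;
    }
    return true;                              /* exclusive lock held */
}
```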
When unlocking (MPI_Win_unlock_all) a shared
global lock, the origin only atomically decreases the
global lock on the master. The unlocking of an exclusive
lock requires two steps: clearing the exclusive bit of the
local lock, and then atomically decreasing the writer
part of the global lock.
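The two unlock steps can be sketched under the same assumed layout (writer part in the high 32 bits of the global lock, exclusive bit as the MSB of the local lock; shared-memory C11 atomics standing in for remote atomics):

```c
#include <stdatomic.h>
#include <stdint.h>

#define ONE_WRITER (1ULL << 32)  /* assumed writer-part increment */
#define EXCL_BIT   (1ULL << 63)  /* assumed exclusive bit */

/* Exclusive unlock (modeled): first clear the exclusive bit of the
 * local lock, then atomically decrease the writer part of the
 * global lock on the master. */
static void unlock_exclusive(_Atomic uint64_t *glock,
                             _Atomic uint64_t *local) {
    atomic_fetch_and(local, ~EXCL_BIT);   /* step 1: clear local bit  */
    atomic_fetch_sub(glock, ONE_WRITER);  /* step 2: release globally */
}
```

This ordering matters: releasing the global writer registration first would let a lock-all acquirer proceed while the local exclusive bit is still set.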
The acquisition or release of a shared local lock (MPI_
Win_lock/MPI_Win_unlock) is similar to the shared global
case, except it targets a local lock.
If no exclusive locks exist, then shared locks (both local
and global) take only one remote atomic operation. The
number of remote requests while waiting can be bounded
by using MCS locks.9 An exclusive lock takes, in the best
case, two atomic communication operations. Unlock operations
always cost one atomic operation, except for the exclusive
case, which needs one extra atomic operation to release the
global lock. The memory overhead for all functions is O(1).
Flush. Flush guarantees remote completion and is
thus one of the most performance-critical functions in
MPI-3 RMA programming. foMPI's flush implementation
relies on the underlying interfaces and simply issues a
DMAPP remote bulk completion and an x86 mfence. All
flush operations (MPI_Win_flush, MPI_Win_flush_local,
MPI_Win_flush_all, and MPI_Win_flush_all_local) share the
same implementation and add only 78 CPU instructions (on
x86) to the critical path.
3. DETAILED PERFORMANCE MODELING AND EVALUATION
We now analyze the performance of our protocols and
implementation and compare it to Cray MPI's highly tuned
point-to-point as well as its relatively untuned one-sided
communication. In addition, we compare foMPI with
two major HPC PGAS languages: UPC and Fortran 2008
Coarrays, both specially tuned for Cray systems. We execute
all benchmarks on the Blue Waters supercomputer,
using Cray XE6 nodes. Each node contains four 8-core
AMD Opteron 6276 (Interlagos) 2.3 GHz CPUs and is
connected to other nodes through a 3D-torus Gemini network.
Additional results can be found in the original SC13 paper.
3.1. Communication
Comparing latency and bandwidth between RMA and
point-to-point communication is not always fair since RMA
communication may require extra synchronization to notify the
target. For all RMA latency results we ensure remote completion
(the data is committed in remote memory) but no
synchronization. We analyze synchronization costs separately
in Section 3.2.
Latency and Bandwidth. We start with the analysis of
latency and bandwidth. The former is important in various
latency-constrained codes such as interactive graph
processing frameworks and search engines. The latter is
representative of a broad class of communication-intensive
applications such as graph analytics engines or distributed
key-value stores.
We measure point-to-point latency with standard ping-pong
techniques. Figure 3a shows the latency for varying
message sizes for inter-node put. Due to the highly optimized
fast path, foMPI has >50% lower latency than the other
PGAS models while achieving the same bandwidth for
larger messages. The performance functions (cf. Figure 1)
are: Pput = 0.16ns ⋅ s + 1µs and Pget = 0.17ns ⋅ s + 1.9µs.
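These linear models are easy to evaluate for a given message size. The helper below is illustrative; it assumes the constant terms are microseconds (consistent with the single-digit-microsecond small-message latencies shown in Figure 3a), with s in bytes and the result in microseconds.

```c
/* Latency models from the text: 0.16 ns (put) / 0.17 ns (get) per
 * byte plus a constant startup term, evaluated in microseconds. */
static double p_put_us(double s) { return 0.16e-3 * s + 1.0; }
static double p_get_us(double s) { return 0.17e-3 * s + 1.9; }
```

For example, a 4 KiB put is predicted at about 1.66 µs, still dominated by the startup term; the per-byte term only takes over for larger messages, where all transports converge to the same bandwidth.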
[Figure 3 plots: (a) Latency Inter-Node Put (latency [µs] vs. message
size [Bytes]); (b) Overlap Inter-Node (overlap [%] vs. message size
[Bytes]); (c) Message Rate Inter-Node (million messages/s vs. message
size [Bytes]). Series: foMPI MPI-3, Cray UPC, Cray MPI-2.2, Cray MPI-1,
Cray Fortran 2008; DMAPP protocol changes are marked.]
Figure 3. Microbenchmarks: (a) Latency comparison for put with DMAPP
communication. Note that message passing (MPI-1) implies remote
synchronization while UPC, Fortran 2008 Coarrays, and MPI-2.2/3 only
guarantee consistency. (b) Communication/computation overlap for put
over DMAPP; Cray MPI-2.2 has much higher latency up to 64 KB (cf. (a))
and thus allows higher overlap. (c) Message rate for put communication.