A more detailed explanation can be found in our SC13 paper.
Lock Synchronization. We now sketch a low-overhead and
scalable strategy to implement shared global, shared process-local, and exclusive process-local locks on RMA systems (the
MPI specification does not allow exclusive global locks). These
mechanisms allow processes and memories to be synchronized
at very fine granularity. We utilize a two-level lock hierarchy:
one global lock variable (at a designated process, called
master) and p local lock variables (one lock on each process).
Each local lock variable is used to implement a reader-writer lock that allows one writer (exclusive lock), but many
readers (shared locks). The highest order bit of the variable
indicates a write access; the other bits are used to count the
number of shared locks (cf. Ref. 8). The global lock variable
is split into two parts; they count the number of processes
holding a shared global lock in the window and the number
of exclusively locked processes, respectively. These variables
enable all lock operations to complete in O(1) steps if a lock
can be acquired immediately; they are pictured in Figure 2a.
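As a rough illustration of how such a lock word might be manipulated with standard MPI-3 atomics, consider the following sketch. It is our own reconstruction, not the implementation described here: WRITE_BIT, lock_win, and the helper names are assumptions, and the lock window is assumed to be already accessible for atomics (for example, via MPI_Win_lock_all).

/* Sketch (not the actual implementation) of the local reader-writer
 * lock word: highest-order bit marks a writer, lower bits count readers. */
#include <mpi.h>
#include <stdint.h>

#define WRITE_BIT (UINT64_C(1) << 63)

/* Acquire a shared (reader) lock on the lock word at process `rank`. */
static void acquire_shared(MPI_Win lock_win, int rank) {
    uint64_t add = 1, sub = (uint64_t)-1, prev;
    for (;;) {
        /* atomically increment the reader count and read the old value */
        MPI_Fetch_and_op(&add, &prev, MPI_UINT64_T, rank, 0, MPI_SUM, lock_win);
        MPI_Win_flush(rank, lock_win);
        if (!(prev & WRITE_BIT)) return;   /* no writer present: success   */
        /* a writer holds the lock: undo the increment and retry */
        MPI_Fetch_and_op(&sub, &prev, MPI_UINT64_T, rank, 0, MPI_SUM, lock_win);
        MPI_Win_flush(rank, lock_win);
    }
}

/* Acquire an exclusive (writer) lock: succeed only if the word is zero. */
static void acquire_exclusive(MPI_Win lock_win, int rank) {
    uint64_t zero = 0, bit = WRITE_BIT, prev;
    do {
        MPI_Compare_and_swap(&bit, &zero, &prev, MPI_UINT64_T, rank, 0, lock_win);
        MPI_Win_flush(rank, lock_win);
    } while (prev != 0);                   /* retry until readers drain    */
}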
Figure 2b shows an example lock scenario for three processes. We omit a detailed description of the protocol for
lack of space (the source code is available online); we
describe a locking scenario to illustrate the core idea behind
the protocol. Figure 2c shows a possible execution schedule
for the scenario from Figure 2b. Please note that we permuted
the order of processes to (1, 0, 2) instead of the intuitive (0, 1, 2)
to minimize overlapping lines in the figure.
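The figure's source code is not reproduced here; in plain MPI-3 the scenario could be expressed roughly as in the following sketch (win, the rank layout, and the omitted work are assumptions of ours).

/* Rough MPI-3 rendering of the scenario in Figure 2b (a sketch). */
if (rank == 1) {
    /* shared global lock on the whole window */
    MPI_Win_lock_all(0, win);
    /* ... communication and computation ... */
    MPI_Win_unlock_all(win);
} else if (rank == 2) {
    /* exclusive lock on process 1 only */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
    /* ... exclusive access to process 1's window memory ... */
    MPI_Win_unlock(1, win);
}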
A process starts an exposure epoch to allow other processes access to its memory. To access exposed memory at a remote target, the
origin process has to be in an access epoch. Processes can
be in access and exposure epochs simultaneously. Exposure
epochs are only defined for active target synchronization (in
passive target, window memory is always exposed).
Fence. MPI_Win_fence, called collectively by all processes,
finishes the previous exposure and access epoch and opens
the next exposure and access epoch for the whole window.
All remote memory operations must be committed before
leaving the fence call. We use an x86 mfence instruction
(XPMEM) and DMAPP bulk synchronization (gsync) followed
by an MPI barrier to ensure global completion. The asymptotic memory bound is O(1) and, assuming a good barrier
implementation, the time bound is O(log p).
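For concreteness, a typical fence-synchronized epoch in application code looks roughly like the following generic MPI-3 usage sketch (win, the ranks, and the transferred value are assumptions; this is not the library's internal code).

/* Fence-synchronized access epoch: every process participates in both
 * fences; the put is globally visible after the second fence returns. */
MPI_Win_fence(0, win);                     /* open access/exposure epoch    */
if (rank == 0) {
    double value = 42.0;
    MPI_Put(&value, 1, MPI_DOUBLE,         /* origin buffer                 */
            1, 0, 1, MPI_DOUBLE, win);     /* target rank 1, displacement 0 */
}
MPI_Win_fence(0, win);                     /* close epoch, complete all ops */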
General Active Target Synchronization. This mode (also
called “PSCW”) synchronizes a subset of processes of a window and thus enables synchronization at a finer granularity
than is possible with fences. Exposure (MPI_Win_post/
MPI_Win_wait) and access epochs (MPI_Win_start/MPI_
Win_complete) can be opened and closed independently.
A group argument is associated with each call that starts an
epoch; it states all processes participating in the epoch. The
calls have to ensure correct matching: if a process i specifies a process j in the group argument of the post call, then
the next start call at process j with i in the group argument
matches the post call.
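A minimal PSCW usage sketch, assuming a window win and an origin/target pair of ranks 0 and 1, might look as follows; it illustrates the matching rule, not the implementation (group frees and error handling omitted).

/* PSCW sketch: rank 0 (origin) writes into rank 1's (target) window. */
MPI_Group world_grp, peer_grp;
MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

if (rank == 1) {                           /* target: exposure epoch        */
    const int origin = 0;
    MPI_Group_incl(world_grp, 1, &origin, &peer_grp);
    MPI_Win_post(peer_grp, 0, win);        /* expose memory to rank 0       */
    MPI_Win_wait(win);                     /* block until rank 0 completes  */
} else if (rank == 0) {                    /* origin: access epoch          */
    const int target = 1;
    double value = 42.0;
    MPI_Group_incl(world_grp, 1, &target, &peer_grp);
    MPI_Win_start(peer_grp, 0, win);       /* matches rank 1's post         */
    MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    MPI_Win_complete(win);                 /* end access epoch, commit put  */
}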
Since our RMA implementation cannot assume buffer
space for remote operations, it has to ensure that all processes in the group argument of the start call have issued a
matching post before the start returns. Similarly, the wait
call has to ensure that all matching processes have issued
complete. Thus, calls to MPI_Win_start and MPI_Win_wait
may block, waiting for the remote process. Both synchronizations are required to ensure integrity of the accessed
data during the epochs. The MPI specification forbids
matching configurations where processes wait cyclically (deadlocks).
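To illustrate the rule, the following hypothetical two-process pattern is such a forbidden configuration: each start waits for a post that its peer would only issue after its own start returned. Here grp_with_0 and grp_with_1 denote assumed single-member groups containing rank 0 and rank 1.

/* Erroneous cyclic matching (sketch): with a blocking MPI_Win_start, as in
 * the implementation described here, both processes block in start and the
 * posts below are never issued. */
if (rank == 0) {
    MPI_Win_start(grp_with_1, 0, win);     /* waits for rank 1's post */
    MPI_Win_post(grp_with_1, 0, win);
} else if (rank == 1) {
    MPI_Win_start(grp_with_0, 0, win);     /* waits for rank 0's post */
    MPI_Win_post(grp_with_0, 0, win);
}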
We now describe a scalable matching protocol with a
time and memory complexity of O(k) if each process has at
most k neighbors across all epochs. We assume k is known
to the protocol. We start with a high-level description: process i that posts an epoch announces itself to all processes
j1, ..., jl in the group argument by adding i to a list local to
the processes j1, ..., jl. Each process j that tries to start an
access epoch waits until all processes i1, ..., im in the group
argument are present in its local list. The main complexity
lies in the scalable storage of this neighbor list, needed for
start, which requires a remote free-storage management
scheme. The wait call can simply be synchronized with a
completion counter. A process calling wait will not return
until the completion counter reaches the number of processes in the specified group. To enable this, the complete
call first guarantees remote visibility of all issued RMA
operations (by calling mfence or DMAPP’s gsync) and then
increases the completion counter at all processes of the
specified group.
If k is the size of the group, then the number of operations issued by post and complete is O(k) and zero for start
and wait. We assume that k ∈ O(log p) in scalable programs.
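The following is a rough reconstruction of such a protocol under our own assumptions: a dedicated synchronization window sync_win holding a completion counter, a tail index, and a k-slot post list; a pointer local to the process's own window memory; the window locked with MPI_Win_lock_all; and resetting of counters and list between epochs omitted. It is a sketch, not the actual implementation.

/* Sketch of the matching protocol (our reconstruction).  Each process owns
 * sync_win with k + 2 long integers: index CTR = completion counter,
 * index TAIL = next free post-list slot, indices LIST..LIST+k-1 = post list
 * (0 = empty, rank+1 otherwise). */
#include <mpi.h>

enum { CTR = 0, TAIL = 1, LIST = 2 };

/* post: announce ourselves to every process in group[0..n-1] */
static void post_epoch(MPI_Win sync_win, int me, const int *group, int n) {
    for (int i = 0; i < n; i++) {
        long one = 1, slot, entry = me + 1;
        /* atomically claim a free slot in the target's post list */
        MPI_Fetch_and_op(&one, &slot, MPI_LONG, group[i], TAIL, MPI_SUM, sync_win);
        MPI_Win_flush(group[i], sync_win);
        /* write our rank into the claimed slot */
        MPI_Accumulate(&entry, 1, MPI_LONG, group[i], LIST + slot, 1, MPI_LONG,
                       MPI_REPLACE, sync_win);
        MPI_Win_flush(group[i], sync_win);
    }
}

/* start: block until every process in group appears in our local post list */
static void start_epoch(MPI_Win sync_win, const long *local,
                        const int *group, int n, int k) {
    for (int i = 0; i < n; i++) {
        int found = 0;
        while (!found) {
            MPI_Win_sync(sync_win);              /* refresh the local view  */
            for (int s = 0; s < k && !found; s++)
                found = (local[LIST + s] == group[i] + 1);
        }
    }
}

/* complete: make issued RMA operations visible, then bump each exposing
 * process's completion counter */
static void complete_epoch(MPI_Win data_win, MPI_Win sync_win,
                           const int *group, int n) {
    MPI_Win_flush_all(data_win);                 /* remote visibility       */
    for (int i = 0; i < n; i++) {
        long one = 1, prev;
        MPI_Fetch_and_op(&one, &prev, MPI_LONG, group[i], CTR, MPI_SUM, sync_win);
        MPI_Win_flush(group[i], sync_win);
    }
}

/* wait: spin until the counter equals the number of accessing processes */
static void wait_epoch(MPI_Win sync_win, const long *local, int n) {
    do { MPI_Win_sync(sync_win); } while (local[CTR] < n);
}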
[Figure 2 (not reproduced): every process holds a local lock word (shared counter plus exclusive bit); the master process additionally holds the global lock word (shared counter plus exclusive counter). Both lock types are 64-bit integers with reserved bit ranges that can be modified atomically. The source-code panel issues MPI_Win_lock(EXCL, 1) calls overlapped with communication and computation; the schedule panel shows Proc 2 acquiring an exclusive lock on Proc 1 while Proc 1 acquires and releases a shared global lock on the whole window.]

Figure 2. Example of lock synchronization. (a) Data structures, (b) source code, and (c) a possible schedule.