Figure 6. A multithreaded process built from one space per thread, with a master space managing synchronization and memory.
this way. To synchronize with a child and collect its results,
the parent calls Get with the Merge option, which merges
all changes that the child made to shared memory since its snapshot was taken back into the parent. If both parent
and child—or the child and other children whose changes
the parent has collected—have concurrently modified
the same byte since the snapshot, the kernel detects and
reports this conflict.
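The merge performed by Get with the Merge option can be pictured as a byte-granularity three-way comparison against the child's fork-time snapshot. A minimal sketch of that logic (merge_bytes and its types are illustrative names, not the kernel's actual code):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of byte-granularity merge: for each byte, compare
 * the child's and parent's current copies against the snapshot taken
 * when the child forked.  A byte both sides changed is a conflict. */
typedef enum { MERGE_OK, MERGE_CONFLICT } merge_result;

merge_result
merge_bytes(const unsigned char *snap, const unsigned char *child,
            unsigned char *parent, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int child_changed  = child[i]  != snap[i];
        int parent_changed = parent[i] != snap[i];
        if (child_changed && parent_changed)
            return MERGE_CONFLICT;   /* kernel detects and reports this */
        if (child_changed)
            parent[i] = child[i];    /* propagate the child's write */
    }
    return MERGE_OK;
}
```

Non-conflicting writes from several children can be folded into the parent by applying this comparison once per collected child.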
Our runtime also supports barriers, the foundation of
data-parallel programming models like OpenMP.
As each thread in a group arrives at a barrier, it calls Ret to
stop and wait for the parent thread managing the group.
The parent calls Get with Merge to collect each child’s
changes before the barrier, then calls Put with Copy and
Snap to resume each child with a new shared memory snapshot containing all threads’ prior results. While the private workspace model conceptually extends to nonhierarchical synchronization,3 our prototype’s strict space hierarchy currently limits synchronization flexibility, an issue we intend to address in the future. Any synchronization abstraction may be emulated at some cost, as described in the next section.
An application can choose which parts of its address
space to share and which to keep thread-private. By placing
thread stacks outside the shared region, all threads can reuse
the same stack area, and the kernel wastes no effort merging stack data. Thread-private stacks enable a child thread to
inherit its parent’s stack and run “inline” in the same C/C++
function as its parent, as in Figure 1. If threads wish to pass
pointers to stack-allocated structures, however, then they
may locate their stacks in disjoint shared regions. Similarly,
if the file system area is shared, then the threads share a common file descriptor namespace as in Unix. Excluding the file
system area from shared space and using normal file system
reconciliation (Section 4.2) to synchronize it yields thread-private file tables.
4.5. Emulating legacy thread APIs
For legacy code already parallelized using nondeterministic
synchronization, Determinator’s runtime can emulate the
standard pthreads API via deterministic scheduling.8 In this case, the process’s initial master space never runs application code directly, but acts as a scheduler supervising one
child space per application thread. The scheduler uses the
kernel’s instruction limit feature (Section 3.2) to quantize
each thread’s execution. After allowing each thread to run
independently for one quantum, the scheduler uses Merge
to collect the thread’s shared memory changes, then restarts
the thread from the merged state for another quantum.
To emulate pthreads synchronization, a thread can end
its quantum prematurely to interact with the scheduler.
Each mutex, for example, always has an “owner” thread,
whether or not the mutex is locked. The owner can lock
and unlock the mutex without scheduler interactions, but
another thread needing the mutex must invoke the scheduler to obtain ownership. At the current owner’s next
quantum, the scheduler “steals” the mutex’s ownership
if the mutex is unlocked, otherwise placing the locking
thread on the mutex’s queue to be awoken once the mutex is unlocked.
While deterministic scheduling offers compatibility with legacy code, it has drawbacks. The master space,
required to serialize synchronization operations, may limit
scalability unless execution quanta are large. Large quanta
can increase the time threads waste waiting to interact with one another, however, such as to steal an unlocked mutex.
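The mutex-ownership protocol above can be modeled as a small state machine. A toy single-waiter simulation (all names are illustrative; the real runtime keeps a full wait queue and runs this logic inside the scheduler space):

```c
#include <assert.h>

/* Toy model of the emulated pthreads mutex: every mutex has an owner
 * even when unlocked.  The owner locks/unlocks without the scheduler;
 * a non-owner must register with the scheduler, which steals ownership
 * at the owner's next quantum boundary if the mutex is unlocked. */
typedef struct {
    int owner;   /* thread allowed to lock/unlock without the scheduler */
    int locked;
    int waiter;  /* -1 if none; the real runtime keeps a queue */
} dmutex;

/* Fast path: succeeds only for the current owner. */
static int dmutex_trylock(dmutex *m, int tid)
{
    if (m->owner == tid && !m->locked) { m->locked = 1; return 1; }
    return 0;
}

/* Slow path: a non-owner asks the scheduler for ownership. */
static void dmutex_request(dmutex *m, int tid) { m->waiter = tid; }

/* Run by the scheduler at the owner's quantum boundary. */
static void scheduler_quantum_end(dmutex *m)
{
    if (m->waiter >= 0 && !m->locked) {
        m->owner = m->waiter;  /* steal ownership for the waiting thread */
        m->waiter = -1;
    }                          /* if still locked, the waiter stays queued */
}
```

Because stealing happens only at quantum boundaries, a waiter's delay grows with the quantum size, which is the scalability tension described above.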
Further, since the deterministic scheduler may preempt
a thread and propagate shared memory changes at any
point in application code, the programming model remains
nondeterministic. In contrast to our private workspace
model, if one thread runs “x = y” while another runs “y =
x” under the deterministic scheduler, the result may be
repeatable but is no more predictable to the programmer
than before. While rerunning a program with exactly identical inputs will yield identical results, if an input change
or code recompilation affects the length of any instruction
sequence, this change may cascade into a different execution schedule and trigger schedule-dependent, if not timing-dependent, bugs.
5. Evaluation
This section evaluates the Determinator prototype, first informally, then examines single-node and distributed parallel processing performance and, finally, code size.
Determinator is written in C with small assembly fragments, and currently runs on the 32-bit x86 architecture.
Since our focus is on parallel compute-bound applications,
Determinator’s I/O capabilities are currently limited to text-based console I/O and a simple Ethernet-based protocol for
space migration (Section 3.3). The prototype has no persistent disk storage: the runtime’s shared file system abstraction
currently operates in physical memory only.
5.1. Experience using the system
We find that a deterministic programming model simplifies debugging of both applications and user-level