Since our current focus is on emulating familiar abstractions and not on developing storage systems, Determinator’s
file system currently provides no persistence: it effectively
serves only as a temporary file system.
Our runtime mimics a weakly consistent, replicated file
system,
24 by maintaining a complete file system replica in
each process’s address space, as shown in Figure 5. When a
process creates a child via fork(), the child inherits a copy
of the parent’s file system in addition to the parent’s open file
descriptors. Individual open/close/read/write operations in a process use only that process’s file system replica,
so different processes’ replicas may diverge as they modify
files concurrently. When a child terminates and its parent
collects its state via wait(), the parent’s runtime uses file
versioning to propagate the child’s changes into the parent.
If a parallel make forks several compiler processes, for
example, each child writes its output .o file to its own file
system replica. The parent’s runtime merges these .o files
into the parent’s file system as the parent wait()s on each
child. This copying and reconciliation is not as inefficient as
it may appear, due to the kernel’s copy-on-write optimizations. Replicating a file system image among many spaces
copies no physical pages until user-level code modifies them,
so all copies of identical files consume only one set of pages.
As in any weakly consistent file system, processes that
perform unsynchronized, concurrent writes to the same file
may cause conflicts. When our runtime detects a conflict, it
simply discards one copy and sets a conflict flag on the file;
subsequent attempts to open() the file result in errors.
This behavior is intended for batch compute applications
for which conflicts indicate a bug, whose solution is to fix
the bug and rerun the job. Interactive use would demand a
policy that avoids losing data. The user-level runtime could
implement stronger consistency and avoid unsynchronized
concurrent writes, at higher synchronization costs.
4. 3. input/output and logging
Since unprivileged spaces can access external I/O devices
only indirectly via parent/child interaction within the space
hierarchy, our user-level runtime treats I/O as a special case
of file system synchronization. In addition to regular files, a
process’s file system contains special I/O files, such as console input and output files. Unlike Unix device special files,
figure 5. each process maintains a private replica of the shared file
system, using file versioning to reconcile replicas at synchronization
points.
File
System
File
System
Determinator Kernel
Root
Process
File System
Synchronization
Child
Process
File
System
Job Outputs Job Inputs
Child
Process
Determinator’s I/O files actually hold data in the process’
file system image: for example, the console input file accumulates all characters that the process has received from
the console, and the console output file holds all characters
written to the console. In the current prototype, console
or log files can eventually “fill up” and become unusable,
though a suitable garbage-collection mechanism could
address this flaw.
When a process does a console read(), the C library first
returns unread data already in the process’s local console
input file. When no more data is available, instead of returning an end-of-file condition, the process calls Ret to synchronize with its parent and wait for more console input (or in
principle any other form of new input) to become available.
When the parent does a wait() or otherwise synchronizes
with the child, it propagates any new input to the child.
When the parent has no new input for any waiting children,
it forwards all their input requests to its parent, and ultimately to the kernel via the root process.
When a process does a console write(), the runtime
appends the new data to its internal console output file as
it would append to a regular file. The next time the process
synchronizes with its parent, file system reconciliation
propagates these writes toward the root process, which
forwards them to the kernel’s I/O devices. A process can
request immediate output propagation by explicitly calling fsync().
Reconciliation handles “append-only” writes differently
from other writes, enabling concurrent writes to console
or log files without conflict. During reconciliation, if both
the parent and child have made append-only writes to the
same file, reconciliation appends the child’s latest writes to
the parent’s copy of the file, and vice versa. Each process’s
replica thus accumulates all processes’ concurrent writes,
though different processes may observe these writes in a different order. Unlike Unix, rerunning a parallel computation
from the same inputs with and without output redirection
yields byte-for-byte identical console and log file output.
4. 4. shared memory multithreading
Shared memory multithreading is popular despite its nondeterminism, in part because parallel code need not pack
and unpack messages: threads simply compute “in-place”
on shared variables and structures. Since Determinator
gives user spaces no physically shared memory other than
read-only sharing via copy-on-write, emulating shared memory involves distributed shared memory (DSM) techniques.
Adapting the private workspace model of Section 2. 2 to
thread-level shared memory reuses ideas from early parallel Fortran machines6 and release-consistent DSM systems,
1
although these prior designs did not offer determinism.
Our runtime uses the kernel’s Snap and Merge operations (Section 3. 2) to emulate shared memory in the private workspace model, using fork/join synchronization.
To fork a child, the parent thread calls Put with the Copy,
Snap, Regs, and Start options to copy the shared part of its
memory into a child space, save a snapshot of that memory
state in the child, and start the child running, as illustrated
in Figure 6. The master thread may fork multiple children