runtime code, since user-space bugs are always reproducible. Conversely, when we do observe nondeterministic
behavior, it can result only from a kernel (or hardware) bug,
immediately limiting the search space.
Because Determinator’s file system holds a process’s output until the next synchronization event (often the process’s
termination), each process’s output appears as a unit even if
the process executes in parallel with other output-generat-ing processes. Further, different processes’ outputs appear
in a consistent order across runs, as if run sequentially.
While race detection tools exist,
21 we find it convenient
that Determinator detects conflicts under “normal-case”
execution, without requiring a special tool. Determinator’s
model makes conflict detection as standard as detecting a
division by zero or illegal memory access.
A subset of Determinator doubles as PIOS, “Parallel
Instructional Operating System,” used in Yale’s OS course
for the past 2 years.
14 While determinism is not a primary
course topic, we found Determinator/PIOS to be a useful
tool for teaching operating systems in the multicore era,
due to its simple design and minimal kernel API. Partly
derived from and inspired by JOS,
17 PIOS includes an
instructional framework where students fill in missing
pieces of a “skeleton.” Students work in small groups to
reimplement all of Determinator’s core features: multiprocessor scheduling, virtual memory with copy-on-write
and Snap/Merge, user-level threads with fork/join, the
user-space file system with versioning and reconciliation,
and cross-node space migration. In their final projects, students extend the OS with features such as graphics, pipes,
and remote shells. The course is challenging and incurs a
heavy programming load, but anonymous student reviews
have been highly positive due to its educational value and
perceived relevance.
Speedup over 1 CPU
Figure 7 shows the performance of several shared-memory
parallel benchmarks on Determinator, relative to the same
benchmarks using standard pthreads on 32-bit Ubuntu Linux
9. 10. The coarse-grained benchmarks md5, matmult, qsort,
blackscholes, and fft perform comparably to nondeterministic
execution under Linux, while the fine-grained lu benchmarks
show a higher performance cost.
Figure 8 shows each benchmark’s speedup relative to
single-threaded execution on Determinator. The “
embarrassingly parallel” md5 and blackscholes scale well,
matmult and fft level off after four processors (but still perform
comparably to Linux), and the remaining benchmarks
scale poorly.
5. 3. Distributed computing performance
To evaluate cross-node space migration (Section 3. 3), we
changed the md5 and matmult benchmarks to distribute workloads across up to 32 uniprocessor nodes, while
preserving Determinator’s shared memory programming
model. Lacking a cluster suitable for native testing, we ran
Determinator under QEMU,
5 on a shared-use Linux cluster.
Figure 9 shows parallel speedup versus single-node
execution in Determinator, on a log-log scale. The md5-
circuit benchmark serially migrates to each node to fork
5. 2. single-node multicore performance
Since Determinator runs user-level code “natively” on the
hardware instead of rewriting user code,
8 its performance
is comparable to that of existing systems when running
coarse-grained parallel code, but incurs higher costs on
fine-grained parallel code because of the system calls, context switches, and virtual memory operations required at
synchronization events.
figure 8. Determinator parallel speedup relative to its own single-CPu
performance.
12
12
10
10
8
6
4
2
2
0
0
ideal
md5
blacks choles
matmult
fft
qsort
lu_cont
lu_noncont
86
Number of CPUs
4
figure 9. speedup of deterministic shared-memory benchmarks on
varying-size distributed clusters.
figure 7. Determinator performance relative to pthreads under Linux
using parallel benchmarks.
Speedup over Linux
2. 5
2
1. 5
1
0.5
0
ideal
md5-tree
md5-circuit
matmult-tree
md5
matmult
qsort
blackscholes
Benchmark
0.1
1
1 CPU
12 CPUs
10