table 3: exceptional events that may affect chunk construction.
Do not truncate
a Chunk
1. Interrupts
2. Traps
truncate a Chunk
Deterministically nondeterministically
1. Reach limit number of 1. Cache overflow
instructions attempt
2. Uncached accesses 2. Repeated chunk
(e.g., I/O initiation) collision
3. Special system
instructions
and induce chunk squashes. Consequently, we need to be
careful that chunks are still replayed deterministically.
Table 3 lists the exceptional events that might affect
chunk construction during the initial execution. A full
description of these events and the actions taken when they
occur is presented in Montesinos et al.
11 At a high level, there
are events that do not truncate the chunk, events that truncate it deterministically, and events that truncate it nondeterministically. The latter are the only ones that induce the
logging of an entry in the CS log. Such events are the attempt
to overflow the cache and repeated chunk collision. Overall,
as described in Montesinos et al.,
11 even in the presence of
all these types of exceptional events, DeLorean’s replay is
deterministic.
3. 6. evaluation
We used the SESC simulator14 to evaluate DeLorean. We
simulated a chip multiprocessor with eight cores clocked
at 5GHz. We ran the SPLASH- 2 applications as well as
SPECjbb2000 and SPECweb2005. In our evaluation, we
estimated DeLorean’s log size and its performance during
recording and replay. In this section, we show a summary of
the evaluation presented in Montesinos et al.
11
Figure 9 shows the size of the PI and CS logs in OrderOnly
in bits per kilo-instruction. We evaluate DeLorean configurations with standard chunk sizes of 1,000, 2,000, and 3,000
instructions. For each of them, we report the size of both
logs with and without compression. In the figure, the CS log
contribution is stacked atop the PI log’s. The SP2-G.M. bars
correspond to the geometric mean of SPLASH- 2.
The figure shows that our preferred 2,000-inst. OrderOnly
configuration uses on average only 2.1b (or 1.3b if compressed) per kilo-instruction to store both the PI and CS
logs. For comparison purposes, the estimated average size of
the compressed Memory Races Log in R TR under Sequential
figure 9: size of the Pi and Cs logs in OrderOnly. the numbers under
the bars are the standard chunk sizes in instructions.
CS Log (uncompressed) CS Log (compressed)
PI Log (uncompressed) PI Log (compressed)
Log size (bits/kilo-inst)
5
4
3
2
1
0
1000 2000 3000
SP2-G.M
1000 2000 3000
sjbb2k
1000 2000 3000
sweb2005
figure 10: size of the Cs log in PicoLog. Recall that PicoLog has no
Pi log. the numbers under the bars are the standard chunk sizes in
instructions.
CS Log (uncompressed) CS Log (compressed)
0.5
Log size (bits/kilo-inst)
0.4
0.3
0.2
0.1
0
1000 2000 3000
SP2-G.M
1000 2000 3000
sjbb2k
1000 2000 3000
sweb2005
Consistency (SC) from Xu et al.
17 is 8b per kilo-instruction.
We call this system Basic RTR and use it as a reference,
although we note that the set of applications measured here
and in Xu et al.
17 are different. This means that these compressed logs use only 16% of the space that we estimate is
needed by the compressed Memory Races Log in Basic RTR.
Figure 10 shows the size of the CS log in PicoLog. Recall
that PicoLog has no PI log. We see that the CS log needs
0.37b or fewer per kilo-instruction in all cases—even without compression. Our preferred 1,000-instruction PicoLog
configuration generates a compressed log with an average
of only 0.05b per kilo-instruction. To put this in perspective,
it implies that, if we assume an IPC of 1, the combined effect
of all eight 5GHz processors is to produce a log of only about
20GB per day.
Finally, we consider the speed of DeLorean during recording and replay. It can be shown that OrderOnly introduces
negligible overhead during recording, and that it enables
replay, on average, at 82% of the recording speed. Under
PicoLog, recording and replay speeds decrease, on average,
to 86% and 72%, respectively, of the recording speed under
OrderOnly.
4. COnCLusiOn
This paper presented two novel hardware-based approaches for deterministic replay of multiprocessor executions,
namely Wisconsin Rerun and Illinois DeLorean. Both approaches seek to enable deterministic replay by focusing
on recording how long threads execute without interacting.
Rerun makes few changes to standard multicore hardware,
while DeLorean promises much smaller log sizes and higher
replay speeds. Future work includes improving Rerun’s replay speed, generalizing DeLorean’s hardware design alternatives, and making the original multithreaded executions
more deterministic.
acknowledgments
We thank Norman Jouppi and David Patterson for suggesting
this article and Norman Jouppi for writing the Perspective.
Hower and Hill thank those acknowledged in the Rerun
paper, including NSF grants CCR-0324878, CNS-0551401,
and CNS-0720565. Hill has a significant financial interest in Sun Microsystems. Montesinos, Ceze, and Torrellas
acknowledge the support provided by NSF under grants
CCR-0325603 and CNS-0720593 and Intel and Microsoft for
funding this work under the Universal Parallel Computing
Research Center.