need to record its actual size because the uncached load
will reappear in the replay and truncate the chunk at the
same place. There are, however, a few events that truncate
a currently running chunk and are not deterministic. When
one such event occurs, the CS log adds an entry with: ( 1)
what chunk gets truncated (its position in the sequence of
chunks committed by the processor) and ( 2) its size. With
this information, the exact chunking can be reproduced
Consequently, OrderOnly generates a PI log with only processor IDs and very small per-processor CS logs. For the large
majority of chunks, steps
10 in Figure 7 are skipped.
Our second DeLorean execution mode, called PicoLog,
builds on OrderOnly and additionally eliminates the need for
a PI log by “predefining” the chunk commit interleaving during both initial execution and replay. This is accomplished by
enforcing a given commit policy—e.g., pick processors round-robin, allowing them to commit one chunk at a time. It needs
only the tiny per-processor CS log discussed for OrderOnly.
Thus, the Memory Interleaving Log is largely eliminated. The
dra wback is that, by delaying the commit of completed chunks
until their turn, PicoLog may slow down execution and replay.
Looking at Figure 7, PicoLog skips steps
7 and, typically,
10. The arbiter grants commit permission to processors according to a predefined order policy, irrespective of the
order in which it receives their commit requests. Note, however, that a processor does not stall when requesting commit
permission; it continues executing its next chunk(s). 3
Table 2 shows the PI and CS logs in each of the two execution modes and Order&Size.
33. 3. DeLorean implementation
Our DeLorean implementation uses a machine that supports a chunk-based execution environment with a generic
network and an arbiter. It augments it with the three typical mechanisms for replay: the Memory Interleaving Log
(consisting of the PI and CS logs), the input logs, and system
checkpointing (Figure 8).
The input logs are similar to those in previous replay
schemes. As shown in Figure 8, they include one shared
log (DMA log) and two per-processor logs (Interrupt and I/O
logs). The DMA acts like another processor in that, before it
updates memory, it needs to get commit permission from
the arbiter. Once permission is granted, the DMA log logs
the data that the DMA writes to memory. The per-processor
table 2: Pi and Cs logs in each execution mode.
Log entry When
Log entry When
98 COmmuniCatiOns Of the aCm | may 2009 | VoL. 52 | no. 6
figure 8: Overall DeLorean system implementation.
Structures also found in other multiprocessor replay proposals
DIR + MEM
Proc + Caches
Node N- 1
Proc + Caches
Interrupt log stores, for each interrupt, the time it is received,
its type, and its data. Time is recorded as the processor-local
chunkID of the chunk that initiates execution of the interrupt handler. The per-processor I/O log records the values
obtained by I/O loads. Like in previous replay schemes,
DeLorean includes system checkpointing support.
3. 4. DeLorean replay
During replay, processors must execute the same chunks
and commit them in the same order. In Order&Size, each
processor generates chunks that are sized according to its
CS log, while in OrderOnly and PicoLog, processors use the
CS log only to recreate the chunks that were truncated nondeterministically. In Order&Size and OrderOnly, the arbiter
enforces the commit order present in the PI log.
As an example, consider the log generated during initial
execution as shown in Figure 7. During replay, suppose that
P1 finishes its chunk before P0, and the arbiter receives message
1. The arbiter checks its PI log (or its predefined
order policy in PicoLog) and does not grant permission to
commit to P1. Instead, it waits until it receives the request
from P0 (message
1). At that point, it grants permission to
commit to P0 (
3) and propagates its commit (
5). The rest of
the operation is as in the initial execution but without logging. In addition, processors use their CS log to decide when
to finish each chunk (Order&Size) or those chunks truncated
nondeterministically during the initial execution (OrderOnly
Thanks to our chunk-based substrate, during replay all
processors execute concurrently. Moreover, each processor
fully reorders and overlaps its memory accesses within a
chunk. Chunk commit involves a fast check with the arbiter. 3 The processor overlaps such check with the computation of its next chunk.
33. 5. exceptional events
In DeLorean, the same instruction in the initial and the
replayed execution must see exactly the same full-system
architectural state. On the other hand, it is likely that structures that are not visible to the software such as the cache and
branch predictor will contain different state in the two runs.
Unfortunately, chunk construction is affected by the
cache state—through cache overflow that requires finishing
the chunk—and by the branch predictor—through wrong-path speculative loads that may cause spurious dependences