To each L2 cache bank, Rerun also adds a “memory”
timestamp register, MTS (e.g., 4B). This register holds the
maximum of all timestamps for victimized blocks that map
to its bank. A victimized block is one replaced from an L1
cache, and its timestamp is the timestamp of the core at the
time of victimization.
Finally, coherence response messages—data, acknowledgements, and writebacks—carry logical timestamps.
Book-keeping state, such as a per-core pointer to the end of
its log, is not shown.
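As a concrete illustration of the hardware state described above, the following C++ sketch models how an L2 bank's MTS register could be maintained. The structure and function names are illustrative assumptions, not the paper's design; the text only specifies the max-update behavior on victimization.

#include <algorithm>
#include <cstdint>

struct L2Bank {
    uint32_t mts = 0;  // "memory" timestamp register, MTS (e.g., 4B)

    // Invoked when a writeback/victimization message from an L1 block that
    // maps to this bank arrives, carrying the victimizing core's timestamp.
    void on_l1_victimization(uint32_t core_timestamp) {
        mts = std::max(mts, core_timestamp);  // MTS holds the running maximum
    }
};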
Rerun Operation: During execution, Rerun monitors the no-conflict equation by comparing the addresses of incoming coherence requests to those in RF and WF. When a conflict is detected, Rerun writes the tuple <TS, REFS> to a per-thread log, then begins a new episode by resetting REFS, WF, and RF, and by incrementing the local timestamp TS according to the algorithm in Section 2.1.
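The episode-ending path just described can be sketched as follows. This is a simplified model: RF and WF appear as exact address sets rather than the hashed filter signatures Rerun uses, and the timestamp advance is shown as a Lamport-style max-plus-one update standing in for the precise algorithm of Section 2.1.

#include <algorithm>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct EpisodeLogEntry { uint32_t ts; uint32_t refs; };  // the logged <TS, REFS> tuple

struct CoreEpisodeState {
    std::unordered_set<uint64_t> rf, wf;  // read/write filters (idealized as exact sets)
    uint32_t refs = 0;                    // memory references in the current episode
    uint32_t ts = 0;                      // this core's logical timestamp
    std::vector<EpisodeLogEntry> log;     // per-thread episode log

    // Track the current episode's own reads and writes.
    void on_local_access(uint64_t addr, bool is_write) {
        ++refs;
        if (is_write) wf.insert(addr); else rf.insert(addr);
    }

    // An incoming remote write conflicts with any local read or write;
    // an incoming remote read conflicts only with a local write.
    bool conflicts(uint64_t addr, bool remote_write) const {
        return wf.count(addr) > 0 || (remote_write && rf.count(addr) > 0);
    }

    void on_coherence_request(uint64_t addr, bool remote_write, uint32_t sender_ts) {
        if (conflicts(addr, remote_write)) {
            log.push_back({ts, refs});         // log <TS, REFS>
            rf.clear(); wf.clear(); refs = 0;  // begin a new episode
            ts = std::max(ts, sender_ts) + 1;  // Lamport-style advance (assumption)
        }
    }
};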
By gracefully handling virtualization events, Rerun
allows programmers to view logs as per thread, rather
than per core. At a context switch, the OS ends the core’s
current episode by writing REFS and TS state to the log.
When the thread is rescheduled, it begins a new episode
with reset WF, RF, and REFS, and a timestamp equal to the
max of the last logged TS for that thread and the TS of the
core on which the thread is rescheduled. Similarly, Rerun
can handle paging by ensuring that TLB shootdowns end
episodes.
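A minimal sketch of the reschedule rule above, under illustrative structure and function names; only the max of the thread's last logged TS and the destination core's TS is prescribed by the text.

#include <algorithm>
#include <cstdint>

struct ThreadReplayState { uint32_t last_logged_ts; };  // from the thread's log
struct Core { uint32_t ts; };                           // destination core's timestamp

// New-episode timestamp when a descheduled thread is placed on dest_core.
// WF, RF, and REFS are reset separately when the new episode begins.
uint32_t timestamp_for_rescheduled_thread(const ThreadReplayState& thread,
                                          const Core& dest_core) {
    return std::max(thread.last_logged_ts, dest_core.ts);
}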
Rerun also ends episodes when implementation resources
are about to be exhausted. Ending episodes just before 64K
memory references, for example, allows REFS to be logged
in 2B.
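For example, with a 16-bit REFS counter the resource check reduces to the following sketch (the constant and function names are illustrative):

#include <cstdint>

// A 16-bit REFS counter: ending the episode just before it would overflow
// keeps every logged REFS field at 2 bytes.
constexpr uint32_t kMaxRefsPerEpisode = (1u << 16) - 1;

bool must_end_episode(uint32_t refs) {
    // Checked before counting the next memory reference of the episode.
    return refs >= kMaxRefsPerEpisode;
}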
2.3. Evaluation
Methods: We evaluate the Rerun recording system using the Wisconsin GEMS [10] full-system simulation infrastructure. The simulator configuration matches the baseline shown in Table 1 with the addition of Rerun hardware support. Experiments were run using the Wisconsin Commercial Workload Suite [1]. We tested Rerun with these workloads and a microbenchmark, racey, that uses number theory to produce an execution whose outcome is highly sensitive to memory race ordering (available at www.cs.wisc.edu/~markhill/racey.html).
Rerun Performance: Figure 4 shows the performance of Rerun on all four commercial workloads. Rerun achieves an uncompressed log size of about 4B logged per 1000 instructions. Importantly, we observe only modest variation in log size across the workloads, leading us to believe that Rerun can perform well under a variety of memory access patterns.
We show the relative performance of Rerun in comparison to the prior state of the art in memory race recording in Figure 5. Rerun achieves a log size comparable to the most efficient prior recorder (RTR [17]), but does so with a fraction of the hardware cost (~0.2KB per core vs. 24KB per core). Like RTR, and unlike Strata [12], Rerun scales well as the number of cores in the system increases, due, in part, to the fact that Rerun and RTR both write thread-local log entries rather than a global entry with a component from each thread.
Figure 4: Rerun absolute log size (bytes per kilo-instruction for apache, jbb, oltp, zeus, and the average).
Figure 5: Hardware cost comparison to RTR and Strata (bytes per kilo-instruction for Rerun, RTR, and Strata on 2p, 4p, 8p, and 16p systems).
3. DeLorean
Illinois DeLorean [11] is a new approach to deterministic replay that exploits the opportunities afforded by a new execution substrate: one where processors continuously execute large blocks of instructions atomically, separated by register checkpoints [3, 5, 9, 15]. In this environment, to capture a multithreaded execution for deterministic replay, DeLorean only needs to log the total order in which blocks from different processors commit.
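A minimal sketch of this logging idea, assuming a centralized arbiter that observes block commits; the data layout and names are illustrative, not DeLorean's actual mechanism or log format.

#include <cstdint>
#include <vector>

struct CommitArbiter {
    std::vector<uint8_t> commit_log;  // one processor ID per committed block

    // Called, in commit order, as each processor's block of instructions
    // passes conflict detection and commits.
    void record_commit(uint8_t processor_id) {
        commit_log.push_back(processor_id);
    }
};

// During replay, blocks are re-executed and forced to commit in exactly the
// logged order, reproducing the original interleaving.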
This approach has several advantages. First, it reduces log size substantially compared to previous schemes, by at least roughly an order of magnitude. Second,
DeLorean can replay at a speed comparable to that of the
initial execution. Finally, in an aggressive operation mode,
where DeLorean predefines the commit order of the blocks