Figure 1: An example of efficient race recording using (a) an explicit transitive reduction and (b) independent regions. In (a), solid lines between threads are races written to the log, while dashed lines are those races implied through transitivity.
[Figure 1 content: panels (a) and (b) depict the same two-thread execution, in which one thread performs the writes F = 1, A = 5, B = 6, and F = 0 while the other reads r1 = F, r2 = A, and r3 = B.]
implied through the combination of previously logged races
and sequential program semantics. Figure 1a illustrates a
transitive reduction. Inter-thread races between instructions
accessing locations A and B, respectively, are not logged since
they are implied by the recorded race for location F.
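The reduction can be sketched in a few lines of Python. This is a hypothetical illustration, not Xu et al.'s actual recorder: it assumes races arrive in the order their sinks execute, and it checks only direct logged edges between the same pair of threads rather than full transitive paths.

```python
def implied(edge, logged):
    """An inter-thread race edge (src_thread, src_op, dst_thread, dst_op)
    is implied if some already-logged edge between the same thread pair,
    combined with per-thread program order, forces the same ordering:
    the candidate's source precedes the logged source (so <= lo) and the
    logged sink precedes the candidate's sink (ldo <= do)."""
    st, so, dt, do = edge
    return any(lt == st and ld == dt and so <= lo and ldo <= do
               for (lt, lo, ld, ldo) in logged)

def record(races):
    """Log only races that are not implied by earlier log entries."""
    log = []
    for r in races:  # races in the order they are observed
        if not implied(r, log):
            log.append(r)
    return log

# Hypothetical run in the spirit of Figure 1a: once the race on the flag
# (i's op 3 -> j's op 1) is logged, the races on A (op 1 -> op 2) and
# B (op 2 -> op 3) are implied and dropped from the log.
log = record([("i", 3, "j", 1), ("i", 1, "j", 2), ("i", 2, "j", 3)])
```

After `record` runs, `log` holds only the single flag race, mirroring how the figure's dashed edges never reach the log.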
While both the original16 and follow-on17 work by Xu et al. were successful in achieving efficient log compression (∼1B/1000 instructions executed), they required a large amount of hardware state, on the order of an additional L1 cache per core, in order to do so. Subsequent work by Narayanasamy et al.12 on the Strata race recorder reduced this hardware requirement but, as results in Hower and Hill6 show, may not scale well as the number of hardware contexts in a system increases. This is largely because Strata writes global information to its log entries that contain a component from each hardware thread context in the system.
A key observation, discovered independently by the
authors of this paper at the Universities of Illinois and
Wisconsin, is that by focusing on regions of independence,
rather than on individual dependencies, an efficient and
scalable memory race recorder can be made without sacrificing logging efficiency. Figure 1b illustrates this notion by
breaking the execution of Figure 1a into an ordered series of
independent execution regions. Because intra-thread dependencies are implicit and do not need to be recorded, the execution in Figure 1b can be completely described by the three
inter-thread dependencies, which is the same amount of
information required after the transitive reduction shown in
Figure 1a.
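To see why an ordered series of independent regions suffices for replay, consider a minimal sketch in Python. It is hypothetical, not either system's actual log format: each log entry is assumed to carry a thread ID, a region length, and a scalar timestamp that totally orders regions (regions with equal timestamps, which real recorders replay concurrently, are not handled here).

```python
def replay(region_log, thread_ops):
    """Replay an execution from a region log. Each entry is
    (thread, n_ops, timestamp). Regions run whole, in timestamp order;
    operations inside a region need no ordering information because the
    region is independent (conflict-free) with respect to other threads."""
    cursors = {t: 0 for t in thread_ops}  # next op index per thread
    order = []
    for t, n, _ts in sorted(region_log, key=lambda entry: entry[2]):
        start = cursors[t]
        order.extend(thread_ops[t][start:start + n])
        cursors[t] = start + n
    return order

# Hypothetical two-thread execution: three ops per thread, one region each,
# with the writer's region ordered before the reader's.
thread_ops = {"i": ["A=5", "B=6", "F=1"], "j": ["r1=F", "r2=A", "r3=B"]}
region_log = [("i", 3, 1), ("j", 3, 2)]
```

Here a single ordering edge per region replaces per-race log entries, which is what makes the approach scale with thread count.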
The authors of this paper have developed two different
systems, called Rerun6 and DeLorean,11 that both exploit the
same independence observation described above. These
systems, presented in the same session of ISCA 2008, exemplify different trade-offs in terms of logging efficiency and
implementation complexity. Rerun can be implemented
with small modifications to existing memory system architectures but writes a larger log than DeLorean. DeLorean
can achieve a greater log size reduction and a higher replay
speed but requires novel hardware to do so.
94 Communications of the ACM | May 2009 | Vol. 52 | No. 6
2. RERUN
Wisconsin Rerun6 exploits the concept of episodic race
recording to achieve efficient logging with only small modifications to existing memory system architectures. The Rerun
race recorder does not interfere with a running program in
any way; it is an impartial observer of a running execution,
and as such avoids artificially perturbing the execution
under observation.
2.1. Episodic memory race recording
This section develops insights behind Rerun. It motivates
Rerun with an example, gives key definitions, and explains
how Rerun establishes and orders episodes.
Motivating Example and Key Ideas: Consider the execution in Figure 2 that highlights two threads i and j executing on a multicore system. Dynamic instructions 1–4 of thread i happen to execute without interacting with instructions running concurrently on thread j. We call these instructions, collectively labeled E1, an episode in thread i's execution. Similarly, instructions 1–3 of thread j execute without interaction and constitute an episode E2 for thread j. As soon as a thread's episode ends, a new episode begins. Thus, every instruction execution is contained in an episode, and episodes cover the entire execution (right side of Figure 2).
Rerun must solve two subproblems in order to ensure that enough episodic information is recorded to enable deterministic replay of all memory races. First, it must determine when an episode ends, and, by extension, when the next one begins. To remain independent, an episode E must end when another thread issues a memory reference that conflicts with references made in episode E. Two memory accesses conflict if they reference the same memory block, are from different threads, and at least one is a write. For example, episode E1 in Figure 2 ends because thread j accesses the variable F that was previously written (i.e., F is in the write set of E1). Formally, for all combinations of episodes E and F
Figure 2: An example of episodic recording. Dashed lines indicate episode boundaries. In the blown-up diagram of threads i and j, the shaded boxes show the state of the episode as it ends, including the read and write sets, memory reference counter, and the timestamp. The shaded box in the last episode of thread i shows the initial episode state.
[Figure 2 content: among several executing threads, thread i runs 1: F = 1, 2: r1 = A, 3: B = 23, 4: F = 0, forming episode E1 with state R: {A}, W: {B, F}, REFS: 4, Timestamp: 43. Thread j runs 1: D = r7, 2: r1 = F, 3: r2 = B, forming episode E2 with state R: {B, F}, W: {D}, REFS: 3, Timestamp: 44. Thread i's next episode begins with initial state R: {}, W: {}, REFS: 0, Timestamp: 44.]
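The episode state shown in Figure 2 (read and write sets, reference counter, timestamp) can be modeled in a few lines. The sketch below is a hypothetical serial software model, not Rerun's hardware: it observes every access globally and uses a simplified Lamport-clock update in which an ended episode's successor, and the conflicting thread, both advance past the ended episode's timestamp.

```python
class Episode:
    """Per-thread episode state: read/write sets over memory blocks,
    a memory reference counter, and a scalar timestamp."""
    def __init__(self, timestamp):
        self.reads, self.writes = set(), set()
        self.refs = 0
        self.timestamp = timestamp

    def conflicts_with(self, block, is_write):
        # Conflict: same block, different threads, at least one write.
        return block in self.writes or (is_write and block in self.reads)

class Recorder:
    def __init__(self, threads):
        self.current = {t: Episode(0) for t in threads}
        self.log = []  # entries of (thread, refs, timestamp)

    def access(self, thread, block, is_write):
        for t, ep in self.current.items():
            if t != thread and ep.conflicts_with(block, is_write):
                # End t's episode: log its length and timestamp.
                self.log.append((t, ep.refs, ep.timestamp))
                self.current[t] = Episode(ep.timestamp + 1)
                me = self.current[thread]
                me.timestamp = max(me.timestamp, ep.timestamp + 1)
        ep = self.current[thread]
        (ep.writes if is_write else ep.reads).add(block)
        ep.refs += 1
```

Driving this model with the accesses of Figure 2 (thread i writes F, reads A, writes B, writes F; then thread j writes D and reads F and B) ends i's episode at j's read of F, logging an entry with REFS of 4, analogous to the shaded E1 box.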