from different processors, DeLorean generates only a very
tiny log—although there is a performance cost. While
DeLorean’s execution substrate is not standard in today’s
hardware systems, the required changes are mostly concentrated in the memory system.
3. 1. the DeLorean idea
There have been several proposals for multiprocessors
where processors continuously execute blocks of consecutive dynamic instructions atomically and in isolation. 3, 5, 9, 15
In this environment, the updates made by a block of instructions (or Chunk) only become visible when the chunk commits.
When two chunks running concurrently on two different processors conflict—there is a data dependence across the two
chunks—the hardware typically squashes and retries one the
chunks. Moreover, after a chunk completes execution, there
is an optimized global commit step in an arbiter module that
informs the relevant processors that the chunk is committed.
The net effect is that the interleaving between the memory
accesses of different processors appears to occur only at chunk
boundaries.
In such environment, recording the execution for replay
simply involves logging the total sequence of chunk commits. This has two very important consequences for replay
systems. The first one is that the memory ordering log is
now very small. Indeed, rather than recording individual
dependences or groups of them like in all past proposals,
the log in a chunk-based system only needs to record the
total order in which chunks from different processors commit. This means that each log entry is short (the ID of the
committing processor, if all chunks have the same size),
and that the log is updated infrequently (chunks are thousands of instructions long).
The second consequence is that, because the memory
accesses issued by a processor inside a chunk are not visible
to the rest of the processors until the chunk commits, such
accesses can be fully reordered and overlapped. This means
that both execution and replay under DeLorean proceed at
a high speed.
DeLorean naturally combines multiple data dependences
between two or more processors into a single entry in the
log that records the memory interleaving—the Memory
Interleaving Log. An example is shown in Figure 6a, where
figure 6: Combining multiple dependences into a single log entry.
Chunk Dependence
Memory
P1 P2
interleaving
Log
P1 ID
(a)
Memory
P1 P2 P3 P4
interleaving
Log
P2 ID
Time
(b)
all the dependences between the accesses in the chunks
executed by processors P1 and P2 (shown with arrows in
the figure) are combined into a single entry in the log. The
figure also shows that such log entry is simply P1’s ID. In a
second example shown in Figure 6b, multiple dependences
across several processors are summarized in a single log
entry. Specifically, the single log entry inserted when the
chunk from P2 commits is enough to summarize the three
dependences.
33. 2. DeLorean execution modes
DeLorean provides two main execution modes, namely
OrderOnly and PicoLog. To understand them, we start by
describing a naive, third execution mode called Order&Size. In
Order&Size, each log entry contains the ID of the processor committing the chunk and the chunk size—measured in number
of retired instructions. During execution, an arbiter module
(a simple state machine that enforces chunk commit order3)
logs the sequence of committing processor IDs in a Processor
Interleaving (PI) log. At the same time, processors record the
size of the chunk they commit in a per-processor Chunk Size
(CS) log. The combination of a single PI log and per-processor
CS logs constitutes the Memory Interleaving Log.
Figure 7 shows DeLorean’s operation in Order&Size mode.
During the initial execution, when a processor such as P0 or P1
finishes a chunk, it sends a request-to-commit message to the
arbiter (steps
1 and
2). Such messages contain the processor
IDs plus Bloom-filter signatures that summarize the memory
footprint of the chunks3 (sig in the figure). Suppose that the
arbiter grants permission to P0 first (step
3). In this case, the
arbiter logs P0’s ID (
4) and propagates the commit operation
to the rest of the machine (
5). While this is in progress, if the
arbiter determines that both chunks can commit in parallel,
it sends a commit grant message to P1 (
6), logs P1’s ID (
7), and
propagates the commit (
8). As each processor receives commit permission, it logs the chunk size (
9 and
10).
Our first DeLorean execution mode, called OrderOnly,
omits logging chunk sizes by making “chunking”—i.e.,
the decision of when to finish a chunk—deterministic.
DeLorean accomplishes this by finishing chunks when a
fixed number of instructions have been committed. In reality, certain events truncate a currently running chunk and
force it to commit before it has reached its “expected” size.
This is fine as long as the event reappears deterministically
in the replay. For example, consider an uncached load to an
I/O port. The chunk is truncated but its log entry does not
figure 7: DeLorean’s operation.
Commit
58
Proc P0
sig, P0's ID
1
ok
3
9
Arbiter
Chunk
size
CS log
4
P1's ID
P0's ID
PI log
Directory + all caches
Proc P1
2
sig, P1's ID
ok
6
10
7
CS log
Chunk
size