When an Alloc request arrives from the mutator, the Stack
Top register is used to remove a pointer to a free object from
the Free Stack, and the stack pointer is decremented. If the
stack pointer falls below a certain level (we typically use 25%),
then a garbage collection is triggered by raising the GC signal
which is connected to the root snapshot engine (Figure 2).
The address popped from the Free Stack is returned to
the mutator on the Addr Alloc’d port. It is also used to set
the object’s entry in the Used Map, to 01, meaning “freshly
allocated” (and thus “black”). A value of 00 means “free”, in
which case the object is on the Free Stack.
When tracing is completed, sweeping begins in the next
cycle. Sweeping is a simple linear scan. The Sweep Pointer
is initialized to 1 (since slot 0 is null), and on every cycle
(except when preempted by allocation) the sweep pointer is
presented to both the Mark Map and the Used Map.
If an object is marked, its Used Map entry is set to 10.
If an object is not marked and its used map entry is 10 (the
and gate in the figure) then the used map entry is reset to
00. Although only 3 states are used, the particular choice of
bit pattern avoids unneeded logic in the critical path. The
resulting signal is also used to control whether the current
Sweep Pointer address is going to be freed. If so, it is pushed
onto the Free Stack and also output on the Addr to Clear
port, which is connected to the mark engine so that the data
values being freed are zeroed out.
Note that since clearing only occurs during sweeping, there
is no contention for the Pointer Memory port in the trace engine
between clearing and marking. Furthermore, an allocation
and a free may happen in the same cycle: the top-of-stack is
accessed using read-before-write mode and returned as the
Addr Alloc’d, and then the newly freed object is pushed back.
When an object is allocated, its entry in the Mark Map
is not set (otherwise an extra interlock would be required).
This means that the tracing engine may encounter newly
allocated objects in its marking pipeline (via newly installed
pointers in the heap), albeit at most once since they will then
be marked. This also affects WCET analysis, as we will see in
the next section.
values are non-null, they are fed through the MUX and thence
to the marking step.
The write barrier is implemented by using port A of the
Pointer Memory BRAM in read-before-write mode. When the
mutator writes a pointer, the old value is read out first and placed
into the Barrier Reg. This is subsequently fed through the MUX
and marked (the timing and arbitration is discussed below).
Given the three BRAMs involved in the marking process, processing one pointer requires 3 cycles. However, the
marking engine is implemented as a 3-stage pipeline, so it is
able to sustain a throughput of one pointer per cycle.
trace engine pairing. For objects with two pointers, two
trace engines are paired together to maximize resource
usage (this is not shown in the figure). Since each trace
engine only uses one port of the mark map, both engines
can mark concurrently.
The next item to mark is always taken from the longer
queue. When there is only one item to enqueue, it is placed
on the shorter queue. Using this design, we provision each
of the 2 queues to be of size 3 N/8 + R (where R is the maximum number of roots), which guarantees that the queues
will never overflow.
On each cycle, one pointer is removed from the queues,
and the two pointers in the object retrieved are examined
and potentially marked and enqueued.
To minimize interference due to write barrier, our design
has two write barrier registers. The write barrier values are not
processed until a pair is formed and coupled with the fact that
there are two mark queues, the mark engines can make progress every other cycle even if the application is performing one
write per cycle.
trace termination and WCet effects. The termination
protocol for marking is simple: once the last item from the
mark queues is popped (both mark queues become empty),
it takes 2 or 3 cycles for the trace engine to finish the current
pipeline. If the two pointers returned by the heap are null,
then the mark process is terminated in the second cycle as
there is no need to read the mark bits in this case. Otherwise
the mark bit for the non-null pointers are read to ensure that
both pointers are marked, in which case the mark phase is
terminated in the third cycle.
Write barrier values arriving after the first cycle of termination can be ignored, since by the snapshot property they
would either have to be newly allocated or else discovered by
tracing the heap.
However, note that some (realistic) data structures, in
particular linked lists, will cause a pathological behavior, in
which a pointer is marked, removed from the queue, which
will appear empty, and then 2 cycles later, the next pointer
from the linked list will be enqueued. So while the pipeline
can sustain marking one object per cycle, pipeline bubbles
will occur which reduce that throughput.
4. 4. Sweeping
Once tracing is complete, the sweep phase begins, in which
memory is reclaimed. The high-level design is shown in
Figure 4. The sweep engine also handles allocation requests
and maintains the stack of pointers to free memory (Free
Stack). The Mark Map here is the same Mark Map as in Figure 3.
figure 4. free stack and sweeping engine.
Address to Free