hard-wired to 000 (or “null”).
To free an object, the pointer is presented to the memory
manager (Addr to Free). The Stack Top register is used as
the address for the Free Stack BRAM set on port B, in write
mode, with the data value Addr to Free. Then the Stack Top
register is incremented. This causes the pointer to the freed
object to be pushed onto the Free Stack.
In order to read or write a field in the Pointer Memory,
the Addr to Read/Write is presented, and, if writing, a Pointer
to Write. This uses port A of the BRAM set in either read or
write mode, returning a value on the Pointer Value port in the
Note that this design, by taking advantage of the dual-porting of the BRAMs, allows a read or write to proceed in
parallel with an allocate or free.
threaded free list. A common software optimization
would be to represent the free objects not as a stack of
pointers, but as a linked list threaded through the unused
objects (that is, a linked list through the first pointer field).
Since the set of allocated and free objects are mutually
exclusive, this space optimization is essentially free modulo
cache locality effects.
However, in hardware, this causes resource contention on
the BRAM set containing the first pointer (since it is doing
double duty). Thus parallelism is reduced: read or write
operations on the first pointer cannot be performed in the
same cycle as malloc or free, and the latter require two
cycles rather than one. Our choice in using a stack is deliberate as the modest increased use of space is far less costly
than the potential resource contention.
4. GaRBaGe coLLectoR DeSiGn
We now describe the implementation of both a stop-the-world and a fully concurrent collector in hardware. In software, the architectures of these two styles of collector are
quite different. In hardware, the gap is much smaller, as the
same fundamental structures and interfaces are used.
The concurrent collector has a few extra data structures
(implemented with BRAMs) and requires more careful
allocation of BRAM ports to avoid contention, but these
features do not negatively affect its use by the stop-the-world
collector. Therefore, we will present the concurrent col-
lector design, and merely mention here that the stop-the-
world variant omits the shadow register(s) from the root
engine, the write barrier register and logic from the trace
engine, and the used map and logic from the sweep engine.
Our design comprises three separate components, which
handle the atomic root snapshot, tracing, and sweeping.
4. 1. Background: Yuasa’s snapshot algorithm
Before delving into the details of our implementation, we
describe Yuasa’s snapshot algorithm16 which is the basis
of our implementation. While the mechanics in hardware
are quite different, it is interesting to note that implementing in hardware allows us to achieve a higher degree of concurrency and determinism than state-of-the-art software
algorithms, but without having to incorporate more sophisticated algorithmic techniques developed in the interim.
The fundamental principle of the snapshot algorithm is
that when collection is initiated, a logical snapshot of the heap
is taken. The collector then runs in this logical snapshot, and
collects everything that was garbage at snapshot time.
In Yuasa’s original algorithm, the snapshot consisted of
the registers, stacks, and global variables. This set of pointers was gathered synchronously (since then, much research
has been devoted to avoiding the need for any global synchronization at snapshot time or during phase transitions2).
Once the roots have been gathered, the mutator is allowed
to proceed and the collector runs concurrently, marking the
transitive closure of the roots.
If the mutator concurrently modifies the heap, its only
obligation is to make sure that the collector can still find
all of the objects that existed in the heap at snapshot time.
This is accomplished by the use of a write barrier: before any
pointer is overwritten, it is recorded in a buffer and treated
as a root for the purposes of collection.
Objects that are freshly allocated during a collection are
not eligible for collection (they are “allocated black” in the
parlance of collector literature).
The advantages of the snapshot algorithm are simplicity
and determinism. Since it operates on a logical snapshot at
an instant in time, the invariants of the algorithm are easy to
describe. In addition, termination is simple and deterministic, since the amount of work is bounded at the instant that
4. 2. Root snapshot
The concurrent collector uses the snapshot-at-the-beginning
algorithm described above. Yuasa’s original algorithm
required a global pause while the snapshot was taken by
recording the roots; since then real-time collectors have
endeavored to reduce the pause required by the root snapshot. In hardware, we are able to completely eliminate the
snapshot pause by taking advantage of the parallelism and
synchronization available in the hardware.
The snapshot must take two types of roots into account:
those in registers, and those on the stack. Figure 2 shows the
Address to Clear
figure 1. memory module design for malloc/free interface, showing
a single pointer field. BRams are in yellow, with the dual ports (a/B)
shown. ovals designate pointer fields; those in blue are in use.