state. On the Xilinx FPGAs, which we use in this work, LUTs
and flip-flops are grouped into slices, which is the standard
unit in which resource consumption is reported for FPGAs.
Particularly important to this work are the discrete mem-
ory blocks, called Block RAMs (BRAMs), available on the
FPGA. BRAMs are dual-ported specialized memory structures
embedded within the FPGA for resource-efficient implemen-
tation of random- and sequential-access memories.
The Xilinx Virtex-5 LX330T15 device that we use in this
paper has a total BRAM capacity of 1.45MB. A single BRAM in
a Virtex-5 FPGA can store up to 36Kb of memory. An important
feature of BRAM is that it can be organized in various form fac-
tors (addressable range and data width combinations). For
example, a 36Kb BRAM can be configured as 36K × 1 (36K 1-bit
locations), 16K × 2, 8K × 4, 4K × 9, 2K × 18, or 1K × 36 memory.
A 36Kb BRAM can also be used as two independent 18Kb
BRAMs. Larger logical memory structures can be built by
cascading multiple BRAMs. Any logical memory structure in
the design which is not a multiple of 18Kb would lead to quan-
tization (or, in memory system parlance, “fragmentation”).
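The quantization cost of a given logical memory can be estimated with a short calculation. The sketch below is our own illustration, not the synthesis tools' actual packing algorithm: it tries each 18Kb aspect ratio (half of each 36Kb configuration listed above) and tiles the logical memory with the cheapest one.

```python
import math

# 18Kb BRAM aspect ratios (depth x width): half of each 36Kb
# configuration listed above (16K x 1, 8K x 2, 4K x 4, 2K x 9, 1K x 18).
ASPECTS_18KB = [(16384, 1), (8192, 2), (4096, 4), (2048, 9), (1024, 18)]

def brams_needed(depth, width):
    """Fewest 18Kb BRAMs that tile a logical (depth x width) memory,
    using a simple rectangular-tiling model of cascading."""
    return min(math.ceil(depth / d) * math.ceil(width / w)
               for d, w in ASPECTS_18KB)

def wasted_bits(depth, width):
    """Quantization loss: BRAM capacity allocated minus capacity used."""
    return brams_needed(depth, width) * 18 * 1024 - depth * width
```

For example, a 1K × 36 logical memory packs perfectly into two 18Kb BRAMs with no waste, while a 16K × 1 memory consumes a whole BRAM even though it uses only 16Kb of it.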
The quantization effect can be considerable depending on
the logical memory structure in the design (and is explored in
Section 7). A BRAM can be used as a true dual ported (TDP) RAM
providing two fully independent read-write ports. Furthermore,
each port supports the following three modes for the read-write
operation: read-before-write (previously stored data is available
on the read bus), read-after-write (newly written data is available
on the read bus), and no-change (read data bus output remains
unchanged). Our collector makes significant use of read-before-write for features like the Yuasa-style write barrier.16
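These three per-port modes can be summarized with a small behavioral model. The sketch below is ours; the mode names follow the Xilinx attribute values (READ_FIRST for read-before-write, WRITE_FIRST for read-after-write, NO_CHANGE), and it models a single port of a single memory, ignoring timing and dual-port collision rules.

```python
# Behavioral sketch (ours) of one BRAM port's synchronous read-write modes.
READ_FIRST, WRITE_FIRST, NO_CHANGE = "READ_FIRST", "WRITE_FIRST", "NO_CHANGE"

class BramPort:
    def __init__(self, mem, mode):
        self.mem, self.mode, self.dout = mem, mode, 0

    def clock(self, addr, we=False, din=0):
        """One clock edge: optionally write, then update the read bus."""
        if we:
            old = self.mem[addr]
            self.mem[addr] = din
            if self.mode == READ_FIRST:
                self.dout = old         # previously stored data on read bus
            elif self.mode == WRITE_FIRST:
                self.dout = din         # newly written data on read bus
            # NO_CHANGE: read bus keeps its prior value
        else:
            self.dout = self.mem[addr]  # plain synchronous read
        return self.dout
```

Read-before-write is what lets a single access both fetch the old value of a field and overwrite it, which is exactly the pattern a Yuasa-style barrier needs.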
BRAMs can also be configured for use as FIFOs rather
than as random access memories; we make use of this
feature for implementing the mark queues in the tracing
phase of the collector.
FPGAs are typically packaged on boards with dedicated
off-chip DRAM and/or SRAM which can be accessed via a
memory controller synthesized for the FPGA. Such memory
could be used to implement much larger heap structures.
However, we do not consider use of DRAM or SRAM in this
paper because we are focusing on high-performance designs
with highly deterministic (single cycle) behavior.
3. MEMORY ARCHITECTURE
The memory architecture—that is, the way in which object
fields are laid out in memory, and the free list is maintained—is common to our support of both malloc/free and
garbage-collected abstractions. In this section we describe
our memory architecture as well as some of the alternatives,
and discuss the trade-offs qualitatively. Some trade-offs are
explored quantitatively in Section 7.
Since memory structures within an FPGA are typically and
of necessity far more uniform than in a conventional software heap, we organize memory into one or more miniheaps,
in which objects have a fixed size and “shape” in terms of
division between pointer and data fields.
3.1. Miniheap interface
Each miniheap has an interface for allocating objects (and
freeing them, when explicit memory management is used), and
operations for reading and writing individual data fields.
In this article, we consider only miniheaps with one or two
pointer fields and one or two data fields. This is sufficient for
implementing stacks, lists, queues, and tree data structures.
FPGA modules for common applications like packet process-
ing, compression, etc. are covered by such structures.
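As a concrete software illustration of this interface, the sketch below models a miniheap with one pointer field and one data field. The class and method names are our own, and the Python lists stand in for the per-field BRAM sets described in Section 3.2.

```python
class Miniheap:
    """Sketch (ours) of a miniheap holding fixed-shape objects with
    one pointer field and one data field, named by integer handles."""
    def __init__(self, size):
        self.ptr  = [0] * size   # one BRAM set per field in hardware
        self.data = [0] * size
        self.free = list(range(size - 1, 0, -1))  # slot 0 reserved for null

    def alloc(self):
        return self.free.pop() if self.free else 0  # 0 = allocation failed

    def free_obj(self, ref):
        self.free.append(ref)

    def read_ptr(self, ref):      return self.ptr[ref]
    def write_ptr(self, ref, v):  self.ptr[ref] = v
    def read_data(self, ref):     return self.data[ref]
    def write_data(self, ref, v): self.data[ref] = v

# Building a two-element linked list [10 -> 20] on the miniheap:
h = Miniheap(8)
tail = h.alloc(); h.write_data(tail, 20); h.write_ptr(tail, 0)
head = h.alloc(); h.write_data(head, 10); h.write_ptr(head, tail)
```

The usage lines show why one pointer field and one data field already suffice for lists, stacks, and queues; a second pointer field adds trees.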
Our design allows an arbitrary number of data fields.
Increasing the number of pointer fields is straightforward
for malloc-style memory. However, for garbage collected
memory, the extension would require additional logic. We
believe this is relatively straightforward to implement (and
include details below) but the experimental results in this
paper are confined to one- and two-pointer objects.
3.2. Miniheap with malloc/free
There are many ways in which the interface in Section 3.1 can
be implemented. Fundamentally, these represent a time/
space trade-off between the number of available parallel operations and the amount of hardware resources consumed.
For FPGAs, one specifies a logical memory block with a
desired data width and number of entries, and the synthesis
tools attempt to allocate the required number of individual
Block RAMs as efficiently as possible, using various packing
strategies. We refer to the BRAMs for such a logical memory
block as a BRAM set.
In our design we use one BRAM set for each field in the
object. For example, if there are two pointer fields and one
data field, then there are three BRAM sets.
The non-pointer field has a natural width associated with its
data type (for instance 32 bits). However, for a miniheap of size
N, the pointer fields need only be ⌈log2 N⌉ bits wide. Because
data widths on the FPGA are completely customizable, we use
precisely the required number of bits. Thus a larger miniheap
will increase in size not only because of the number of entries,
but because the pointer fields themselves become larger. As in
software, the pointer value 0 is reserved to mean “null”, so a
miniheap of size N can really only store N − 1 objects.
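These sizing rules are easy to state as formulas; the helper names below are our own illustration.

```python
import math

def ptr_bits(n):
    """Pointer width for a miniheap of N slots: ceil(log2(N)) bits."""
    return math.ceil(math.log2(n))

def capacity(n):
    """Address 0 encodes null, so a miniheap of size N holds N - 1 objects."""
    return n - 1

def bits_per_object(n, n_ptr, n_data, data_width=32):
    """Per-object storage: pointer fields scale with the miniheap size,
    data fields keep the natural width of their type."""
    return n_ptr * ptr_bits(n) + n_data * data_width
```

Growing a miniheap from 1024 to 2048 slots widens each pointer field from 10 to 11 bits, which is the second growth effect noted above.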
A high-level block diagram of the memory manager is
shown in Figure 1. It shows the primary data and control
fields of the memory module, although many of the signals have been elided to simplify the diagram. For clarity
of presentation it shows a single object field, of pointer
type (Pointer Memory), which is stored in a single BRAM
set. A second BRAM set (Free Stack) is used to store a
stack of free objects. Using the read-before-write property of BRAMs we are able to implement the two functions
(alloc and free) using port B of the BRAM set (Free Stack),
leaving port A unused.
For an object with f fields, there are f BRAM sets with
associated interfaces for the write and read values (but not
an additional address port). There is only a single free stack,
regardless of how many fields the object has.
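The alloc/free discipline on the Free Stack can be sketched in software as follows. This is our own model of the stack logic only; it does not capture the single-cycle port behavior or the accompanying write to the Pointer Memory.

```python
class FreeStack:
    """Sketch (ours) of the Free Stack BRAM set plus its stack-top
    register: alloc pops a free address, free pushes one back."""
    def __init__(self, size):
        # Initially every non-null address (1 .. size-1) is free.
        self.stack = list(range(1, size))   # Free Stack BRAM set contents
        self.top = len(self.stack)          # stack-top register

    def alloc(self):
        if self.top == 0:
            return 0                        # 0 = null: miniheap exhausted
        self.top -= 1                       # decrement, then read (port B)
        return self.stack[self.top]

    def free(self, addr):
        self.stack[self.top] = addr         # write the freed address, then
        self.top += 1                       # increment the stack top
```

In hardware, the decrement-then-read of alloc and the write-then-increment of free each complete in a single access to port B, which is why port A can be left unused.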
The Alloc signal is a one-bit signal used to implement
the malloc operation. A register is used to hold the value
of the stack top. Assuming it is non-zero, it is decremented
and then presented on port B of the Free Stack BRAM
set, in read mode. The resulting pointer to a free field
is then returned (Addr Alloc’d), but is also fed to port B
of the Pointer Memory, in write mode with the write value