And Then There Were None:
A Stall-Free Real-Time Garbage
Collector for Reconfigurable
By David F. Bacon, Perry Cheng, and Sunil Shukla
The end of frequency scaling has driven architects and
developers to parallelism in search of performance.
However, general-purpose MIMD parallelism can be inefficient and power-hungry, with power rapidly becoming the
limiting factor. This has led the search for performance to
non-traditional chip architectures like GPUs and other more
radical architectures. The most radical computing platform of all is reconfigurable hardware, in the form of Field-Programmable Gate Arrays (FPGAs).
FPGAs are now available with over one million programmable logic cells and 8MB of on-chip “Block RAM,” providing a massive amount of bit-level parallelism combined
with single-cycle access to on-chip memory. Furthermore,
because that memory is distributed over the chip in distinct
36Kb units, the potential internal bandwidth is very high.
However, programming methodology for FPGAs has
lagged behind their capacity, consequently reducing their
applicability to general-purpose computing. The common
languages in this domain are still hardware description languages (VHDL and Verilog) in which the only abstractions
are bits, arrays of bits, registers, wires, and so on. The entire
approach to programming them is oriented around the
synthesis of a chip that happens to be reconfigurable, as
opposed to programming a general-purpose device.
Recent research has focused on raising the level of abstraction and programmability to that of high-level software-based
programming languages: the Scala-based Chisel framework, 4
C#-based Kiwi project10 and the Liquid Metal project, which
has developed the Lime language3 based on Java. However,
until now, whether programmers are writing in low-level HDLs
or high-level languages like Chisel and Lime, use of dynamic
memory management has only just begun to be explored, 8, 14
and use of garbage collection has been nonexistent.
In this article, we present a garbage collector synthesized
directly to hardware, capable of collecting a heap of uniform
objects completely concurrently. These uniform heaps (mini-
heaps) are composed entirely of objects of a fixed shape. Thus,
the size of the data fields and the location of pointers of each
object are fixed. We trade some flexibility in the memory layout
for large gains in collector performance. In the FPGA domain,
this trade-off makes sense: due to the distributed nature of
the memory, it is common to build pipelined designs where
each stage of the pipeline maintains its own internal data
structures that are able to access their local block RAM in
parallel with other pipeline stages. Furthermore, fixed data
layouts can provide order-of-magnitude better performance
because they allow designs which deterministically process
one operation per clock cycle.
Algorithmically, our collector is a Yuasa-style snapshot-
at-the-beginning collector, 16 with a linear sweep. By taking
advantage of hardware structures like dual-ported memories,
the ability to simultaneously read and write a register in a sin-
gle cycle, and to atomically distribute a control signal across
the entire system, we are able to develop a collector that
never interferes with the mutator. Furthermore, the mutator
has single-cycle access to memory. Arbitration circuits delay
some collector operations by one cycle in favor of mutator
operations, but the collector can keep up with a mutator even
when it performs a memory operation every cycle (allocations
are limited to one every other cycle).
Our stall-free collector can be used directly with programs hand-written in hardware description languages.
Alternatively, it can be part of a hardware “runtime system”
used by high-level language3, 10 systems including dynamic
memory allocation. For evaluation purposes, we have also
implemented a stop-the-world variant and a malloc/free-style system. Using a very allocation-intensive application,
the three collectors are compared in terms of memory usage,
clock frequency, throughput (cycles and wall-clock time),
and application stalls. We also present analytic closed-form
worst-case bounds for the minimum heap size required for
0-stall real-time behavior, which are empirically validated.
Further details, such as the energy consumption and additional benchmarking, are available in the original paper. 6
2. fPGa BacKGRounD
FPGAs are programmable logic devices consisting of many dis-
crete resources units that are enabled and connected based on
the user application. Resources include look-up tables (LUTs)
which can be used to implement combinational logic, and
flip-flops which can be used to implement sequential logic or
The original version of this paper was published in
Proceedings of the 33rd ACM SIGPLAN Conference on
Programming Language Design and Implementation
(PLDI’ 12), June 2012, ACM.