Note that both μ and α can only be averaged over a time window guaranteed to be less than or equal to the phases which
they influence; m is a safe window size.
The largest remaining inaccuracy is due to pipeline stalls
during marking, for which worst- and average-case behavior can be very different. We therefore let B be the number
of pipeline stalls (0 ≤ B ≤ 2m), so an even more precise
bound on marking is T″M = m + B + 3 (and also improving
T″X = αT″M).
For a linked list, B = 2m; for three linked lists each with its
own root, B = 0. We hypothesize that for the heap considered
as a forest without back-edges, B is bounded by twice the number of levels of width 1 plus the number of levels of width 2
(a width-1 level leaves two of the three pipeline stages empty and a width-2 level leaves one; when the width is 3 or greater,
there is enough parallelism to keep the 3-stage pipeline full and avoid stalls).
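This hypothesis can be phrased as a small check. The sketch below is ours, not part of the collector's toolchain; the per-level weighting (2 stalls for a width-1 level, 1 for a width-2 level) is our interpretation, chosen so the bound agrees with the linked-list figures quoted above (B = 2m for one list, B = 0 for three parallel lists):

```python
def stall_bound(level_widths):
    """Hypothesized bound on mark-pipeline stalls B for a heap that forms
    a forest without back-edges.  A width-1 level leaves two of the three
    pipeline stages empty, a width-2 level leaves one; width >= 3 supplies
    enough parallelism to keep the 3-stage mark pipeline full."""
    return sum(2 if w == 1 else 1 if w == 2 else 0 for w in level_widths)

# Sanity checks against the bounds quoted in the text:
m = 100
assert stall_bound([1] * m) == 2 * m   # one linked list of m nodes: B = 2m
assert stall_bound([3] * m) == 0       # three parallel lists: all levels width 3
```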
Using these application-specific estimates, we are then
able to bound the worst-case execution time (WCET) of collection (equation (3)).
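The refined per-phase bounds are simple enough to state as helpers (Python for illustration; the function names and example figures are ours, and combining the pieces into a total follows equation (1)):

```python
def mark_bound(m: int, B: int) -> int:
    """T''_M = m + B + 3: only the m live objects are marked, with B
    pipeline stalls and 3 cycles to terminate the phase."""
    return m + B + 3

def blacken_bound(alpha: float, m: int, B: int) -> float:
    """T''_X = alpha * T''_M: objects allocated during marking must be
    blackened, at the application's allocation rate alpha."""
    return alpha * mark_bound(m, B)

# A wide heap (B = 0) with m = 1000 live objects and alpha = 0.25:
# marking is bounded by 1003 cycles rather than the application-oblivious
# 3N + 3, and blackening by a quarter of that.
assert mark_bound(1000, 0) == 1003
assert blacken_bound(0.25, 1000, 0) == 250.75
```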
5.2. Minimum Heap Size
Once the worst-case execution time for collection is known,
we can solve for the minimum heap size in which the collector can run with real-time behavior (zero stalls). Note that if
a heap size is deliberately chosen below this minimum, the
allocator may experience an out-of-memory condition. As a
starting point, m objects must be available for the live data.
While a collection taking time Tmax takes place, another
αTmax objects can be allocated. However, there may also be
αTmax objects of floating garbage from the previous cycle when a collection starts. Thus the minimum heap size is
Nmin = m + 2αTmax   (4)
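Equation (4) can be written as a helper (Python for illustration; the name and the choice to round up to a whole object count are ours, and note that in practice Tmax itself depends on N, which the solving step below resolves):

```python
import math

def min_heap_size(m: int, alpha: float, t_max: int) -> int:
    """Equation (4): Nmin = m + 2*alpha*Tmax.  Space for the live data,
    plus objects allocated during one collection, plus floating garbage
    left over from the previous cycle (each bounded by alpha * Tmax)."""
    return m + math.ceil(2 * alpha * t_max)

# E.g. m = 1000 live objects, alpha = 0.25, Tmax = 5000 cycles:
assert min_heap_size(1000, 0.25, 5000) == 3500
```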
and if we denote the non-size-dependent portion of Tmax
from equation (3) by τ, so that Tmax is linear in the heap size N,
then we can solve equation (4) for Nmin in closed form.
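The solve step can be made explicit under one assumption: equation (3) splits Tmax as Tmax = τ + cN, where c stands in for whatever coefficient equation (3) assigns to the heap size. Substituting into equation (4) and setting N = Nmin gives a sketch of the algebra:

```latex
N_{\min} = m + 2\alpha T_{\max}
         = m + 2\alpha\,(\tau + c\,N_{\min})
\quad\Longrightarrow\quad
N_{\min}\,(1 - 2\alpha c) = m + 2\alpha\tau
\quad\Longrightarrow\quad
N_{\min} = \frac{m + 2\alpha\tau}{1 - 2\alpha c}.
```

The closed form is valid only while 2αc < 1, i.e., while the allocation rate is low enough that a heap of finite size suffices.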
6. Experimental Methodology
Since we have implemented the first collector of this kind, we
cannot leverage a preexisting set of benchmarks for evaluation. Here, we present the results of a very allocation-intensive microbenchmark, a double-ended queue (deque), a structure
that occurs commonly in applications. In our HDL implementation, the doubly-linked list can be modified by pushes
and pops at either the front or the back. The workload consists of
5. Real-Time Behavior
First of all, we note that since the design of our real-time
collector allows mutation and collection to occur unconditionally together in a single cycle, the minimum mutator utilization (or MMU7) is 100% unless insufficient resources are
dedicated to the heap.
Furthermore, unlike software-based collectors,5,11 the
system is fully deterministic because we can analyze the
worst-case behavior down to the (machine) cycle.
Given that R is the maximum number of roots and N is the size
of the heap, the worst-case time (in cycles) for garbage collection is
T = TR + TM + TW + TX + TS + TA   (1)
where TR is the time to snapshot the roots, TM is the time (in
cycles) to mark, TW is the time lost to write barriers during
marking, TX is the time lost to blackening newly allocated
objects during marking, TS is the time to sweep, and TA is
the time lost to allocations during sweeping.
In the worst case, without any knowledge of the application, these quantities are
TR = R + 2   TM = 3N + 3   TW = 0   TX = 0   TS = N
The reasoning for these quantities follows. During the snapshot phase, we can place one root into the mark queue every
cycle, plus one cycle each to start and finish the phase, accounting for R + 2. During marking, there could be N objects in
the heap, configured as a linked list which caused the mark
pipeline to stall for two cycles on each object, plus 3 cycles to
terminate. Sweeping is unaffected by application characteristics, and always takes N cycles. Preemption of the collector
by mutator write barriers (TW) does not factor into the worst-case analysis because the write barrier work is overlapped
with the collector stalls. Extra mark operations to blacken
newly allocated objects (TX) also simply fill stall cycles.
Our design allows an allocation operation in every cycle,
but allocation preempts the sweep phase, meaning that
such an allocation rate can only be sustained in short
bursts. The largest sustainable allocation rate is 0.5;
otherwise the heap would be exhausted before sweeping completed. Thus TA = N and
Tworst = R + 5N + 5   (2)
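The arithmetic behind equation (2) can be checked with a short script (Python for illustration; the function name and parameter values are ours):

```python
def worst_case_cycles(R: int, N: int) -> int:
    """Worst-case GC time from equation (1): sum of the per-phase bounds.

    R: maximum number of roots; N: heap size in objects.
    """
    T_R = R + 2      # snapshot: one root per cycle, plus one cycle each to start and finish
    T_M = 3 * N + 3  # mark: N objects, two stall cycles each, plus termination
    T_W = 0          # write-barrier work overlaps mark-pipeline stalls
    T_X = 0          # blackening newly allocated objects also fills stall cycles
    T_S = N          # sweep: always N cycles
    T_A = N          # allocations preempting sweep, at the maximum sustainable rate 0.5
    return T_R + T_M + T_W + T_X + T_S + T_A

# The sum collapses to the closed form of equation (2), R + 5N + 5:
assert worst_case_cycles(64, 1024) == 64 + 5 * 1024 + 5
```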
5.1. Application-Specific Analysis
Real-time analysis typically takes advantage of at least
some application-specific knowledge. This is likely to be
particularly true of hardware-based systems. Fortunately,
the structure of such systems makes it more likely that
such factors can be quantified to a high degree of precision, for example by looking at operations per clock cycle
in the synthesized design.
Let μ be the average number of mutations per cycle
(μ ≤ 1), α be the average number of allocations per cycle
(α < 0.5), and m be the maximum number of live data objects
in the heap at any one time (m < N). Then we can more precisely estimate
T′W = μT′M and T′X = αT′M, where T′M = 3m + 3 (only the m live
objects need be marked, with at most two stall cycles each).