December 2013 | Vol. 56 | No. 12 | Communications of the ACM 107
is noticeable but not a major factor, ranging from 11–31%
for Malloc and 11–53% for garbage collection. As before,
fragmentation decreases at larger heap sizes. Some wastage
can be avoided by choosing heap sizes more carefully (not
necessarily powers of 2), since BRAMs are available in 18Kb
blocks. However, some fragmentation loss is inherent in
the quantization of BRAMs as they are chained together to
form larger memories.
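To make the quantization loss concrete, the following sketch estimates how many 18Kb BRAMs a miniheap needs under the two-pointer-plus-32-bit object layout used in these experiments. This is an idealized model, not the actual synthesis flow: the assumption that each field occupies its own memory, padded up to whole BRAMs, is ours.

```python
import math

BRAM_BITS = 18 * 1024  # BRAMs are available in 18Kb blocks

def brams_needed(heap_objects: int) -> int:
    """Idealized BRAM count for a miniheap whose objects hold
    two pointers and one 32-bit data field.  Pointer width is
    log2 of the heap size; each field is modeled as a separate
    memory rounded up to whole BRAMs (a simplifying assumption)."""
    ptr_bits = max(1, math.ceil(math.log2(heap_objects)))
    field_widths = [ptr_bits, ptr_bits, 32]
    return sum(math.ceil(w * heap_objects / BRAM_BITS) for w in field_widths)

def wasted_fraction(heap_objects: int) -> float:
    """Fraction of the allocated BRAM bits lost to quantization."""
    ptr_bits = max(1, math.ceil(math.log2(heap_objects)))
    used = (2 * ptr_bits + 32) * heap_objects
    return 1 - used / (brams_needed(heap_objects) * BRAM_BITS)
```

Under this model, a 1024-object heap needs 4 BRAMs but leaves over a quarter of their bits unused, illustrating why small heaps fragment proportionally more.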
Finally, Figure 6 shows the synthesized clock frequency at
different design points. Here we see a significant effect from
the more complex logic for garbage collection: even though
it consumes relatively little area, clock frequency for garbage
collection is noticeably slower (15–39%) than Malloc across
all design points. On the other hand, the difference between
STW and RTGC is small with RTGC often faster. Regardless
of the form of memory management, clock frequency
declines as the heap becomes larger. However, the overall
clock rate may very well be constrained by the application
logic rather than the collector logic, as we will see below.
7.1. Throughput
So far we have discussed the costs of memory management
in the absence of applications; we now consider what happens when the memory manager is “linked” to the Deque
microbenchmark. Unlike the previous section, where we
concentrated on the effects of a wide range of memory sizes
on static chip resources, here we focus on a smaller range of
sizes using a trace with a single maximum live data set of
m = 8192 as described previously. We then vary the heap
size N from m to 2m at fractional increments of m/10. As
we make memory scarce, the resolution is also increased
to m/100 to show how the system behaves under very tight
conditions. Each design point requires a full synthesis of
the hardware design, which can affect the frequency, power,
and execution time.
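The heap-size sweep described above can be generated as follows. This is a sketch of the parameter sweep, not the authors' tooling; the exact boundary of the m/100 refinement region is not stated in the text, so the cutoff of N < 1.2m used here is an assumption.

```python
def design_heap_sizes(m: int = 8192) -> list[int]:
    """Heap sizes N swept from m to 2m: coarse steps of m/10
    across the whole range, refined to m/100 where memory is
    tightest (assumed here to be N < 1.2m)."""
    coarse = [m + k * m // 10 for k in range(11)]    # m, 1.1m, ..., 2m
    fine = [m + k * m // 100 for k in range(20)]     # m, 1.01m, ..., 1.19m
    return sorted(set(coarse + fine))
```

Each of these sizes, combined with a memory-management scheme, yields one design point requiring a full hardware synthesis.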
Figure 7 shows the throughput of the benchmark as the
heap size varies for all three schemes. To understand the
interaction of the various effects, we examine throughput
not only in cycles but also, since the synthesized clock
frequencies vary, in physical time.
The Deque benchmark shows a different behavior. With
much higher allocation and mutation rates (a = 0.07 and m =
0.13), it is much more sensitive to collector activity. As seen
in Figure 7(b), even at heap size N = 2m, STW consumes
noticeably more cycles, rising to almost double the cycles
a pseudorandom sequence of such operations while keeping
the maximum amount of live data close to but no more than
8192. Because there is almost no computation performed on
the list elements, this benchmark is very allocation-intensive.
Moreover, the linear nature of the data structure prevents the
collector from taking advantage of graph fanout. These two
factors combine to present a great challenge to the collector.
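A generator for such a trace might look like the following. This is a sketch under the paper's description only; the push/pop bias, seed, and operation names are our assumptions, not the authors' benchmark code.

```python
import random

def deque_trace(n_ops: int, max_live: int = 8192, seed: int = 1):
    """Yield a pseudorandom sequence of deque operations
    ('push_front', 'push_back', 'pop_front', 'pop_back'),
    keeping the number of live nodes close to, but never
    above, max_live."""
    rng = random.Random(seed)
    live = 0
    for _ in range(n_ops):
        # Bias toward pushes until the live set nears the cap;
        # at the cap, a pop is forced, and at zero, a push is.
        push = live == 0 or (live < max_live and rng.random() < 0.6)
        if push:
            live += 1
            yield rng.choice(("push_front", "push_back"))
        else:
            live -= 1
            yield rng.choice(("pop_front", "pop_back"))
```

Because every operation either allocates or frees a list node and no work is done on the elements, the trace keeps the allocator under continuous pressure, as the text notes.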
It is important to note that because of the very high degree
of determinism in the hardware, and in our collector implementation, such micro-benchmarks can provide a far more
accurate picture of performance than in typical evaluations
of CPU-based collectors running in software. There are no
cache effects, no time-slicing, and no interrupts. Because
these higher order effects are absent, the performance behavior presented to the mutator by the collector and vice versa
is completely captured by the memory management API at a
cycle-accurate level. We validate this experimentally by showing that the estimates for collection time and minimum real-time heap size (from Section 5.1) are highly accurate.
A given micro-benchmark can be paired with one of the
three memory management implementations (Malloc,
stop-the-world GC, and real-time GC). Furthermore, these
are parameterized by the size of the miniheap, and for the
collectors, the trigger at which to start collection (although
for most purposes, we simply trigger when free space falls
below 25%). We call these design points.
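As a sketch, a design point can be modeled as a (manager, heap size, trigger) triple, with the default trigger firing when free space falls below 25% of the heap. The function and names here are illustrative, not the authors' tooling.

```python
from itertools import product

MANAGERS = ("Malloc", "STW", "RTGC")

def design_points(heap_sizes, trigger_fraction=0.25):
    """Enumerate (manager, heap_size, trigger) design points.
    Malloc has no collector, so its trigger is None; for the
    two collectors, collection starts once free space drops
    below trigger_fraction of the heap."""
    points = []
    for mgr, n in product(MANAGERS, heap_sizes):
        trigger = None if mgr == "Malloc" else int(n * trigger_fraction)
        points.append((mgr, n, trigger))
    return points
```

For an 8192-object heap, the collectors would thus begin collecting once fewer than 2048 slots remain free.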
Our experiments are performed using the Xilinx ISE
13.4 synthesis tools on a Xilinx Virtex-5 LX330T,15 which is
the largest FPGA in the Virtex-5 LXT product line. The
LX330T has 51,840 slices and 11,664Kb (1.45MB) of Block
RAM. Fabricated in 65nm technology, the chip is theoretically capable of being clocked at up to 550MHz, but realistic designs generally run between 100 and 300MHz.
We begin by examining the cost, in terms of static resources,
of the three memory managers: malloc/free (“Malloc”), stop-the-world collection (“STW”), and real-time concurrent
collection (“RTGC”). For these purposes we synthesize
the memory manager in the absence of any application.
This provides insight into the cost of the memory management itself, and also provides an upper bound on the performance of actual applications (since they can only use
more resources or cause the clock frequency to decline).
We evaluate design points at heap sizes (in objects) from
1K to 64K in powers of 2. For these purposes we use an object
layout of two pointers and one 32-bit data field. For brevity,
we omit detailed usage of non-BRAM logic resources. It
suffices to note that logic consumption is under 1% for all
three variants at all heap sizes.
Figure 5 shows BRAM consumption. Because we have
chosen powers of 2 for heap sizes, the largest heap size
only uses 60% of the BRAM resources (one is of course free
to choose other sizes). At the smaller heap sizes, garbage
collectors consume up to 80% more BRAMs than Malloc.
However, at realistic heap sizes, the figure drops to 24%.
In addition, RTGC requires about 2–12% more memory
than STW since it requires the additional 2-bit wide Used
Map to cope with concurrent allocation. Fragmentation
Figure 5. Block RAM usage, including fragmentation. (Plotted over heap sizes of 1024–65536 objects.)