Other benchmarks, such as 164.gzip, 197.parser, 181.mcf, and 188.ammp, suffer from dependence waiting caused by large sequential segments. Finally, HCCv3 must sometimes add a large number of wait and signal instructions (i.e., create many sequential segments) to increase TLP, as seen for 164.gzip, 197.parser, 181.mcf, and 256.bzip2.
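To make the cost of these wait and signal instructions concrete, the sketch below shows one way a sequential segment inside a parallelized loop can be bracketed by such synchronization, using C11 atomics and pthreads. The names helix_wait/helix_signal and the round-robin mapping of iterations to cores are illustrative stand-ins, not HELIX-RC's actual runtime interface, which performs this hand-off in hardware through the ring cache; every extra sequential segment adds one such wait/signal pair per iteration.

```c
/* Minimal sketch (not the paper's runtime API): iterations of a hot loop are
 * distributed round-robin across cores, and the sequential segment is bracketed
 * by a wait/signal pair so its bodies execute in iteration order. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define CORES 4
#define ITERS 16

static atomic_long seg_a_turn = 0;   /* next iteration allowed into segment A */
static long shared_sum = 0;          /* state carried across iterations (actual dependence) */

static void helix_wait(atomic_long *turn, long iter) {
    while (atomic_load_explicit(turn, memory_order_acquire) != iter) { /* spin */ }
}
static void helix_signal(atomic_long *turn, long iter) {
    atomic_store_explicit(turn, iter + 1, memory_order_release);
}

static void *worker(void *arg) {
    long core = (long)arg;
    for (long iter = core; iter < ITERS; iter += CORES) {
        long local = iter * iter;              /* independent work: runs in parallel */

        helix_wait(&seg_a_turn, iter);         /* sequential segment A begins */
        shared_sum += local;                   /* loop-carried update, in iteration order */
        helix_signal(&seg_a_turn, iter);       /* sequential segment A ends */
    }
    return NULL;
}

int main(void) {
    pthread_t t[CORES];
    for (long c = 0; c < CORES; c++) pthread_create(&t[c], NULL, worker, (void *)c);
    for (long c = 0; c < CORES; c++) pthread_join(t[c], NULL);
    printf("sum = %ld\n", shared_sum);         /* deterministic: 1240, sum of squares 0..15 */
    return 0;
}
```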
7. RELATED WORK
To compare HELIX-RC to a broad set of related work, Table 1 summarizes parallelization schemes proposed for non-numerical programs, organized by the type of communication decoupling they implement (register vs. memory) and the type of dependences they target (actual vs. false). HELIX-RC covers the entire design space and is the only scheme that decouples memory accesses from computation for actual dependences.
7.1. Multiscalar register file
Multiscalar processors19 extract both Instruction Level Parallelism (ILP) and TLP from an ordinary application. While a ring cache's structure resembles a Multiscalar register file, there are fundamental differences. The Multiscalar register file holds a fixed and relatively small number of shared elements that must be known at compile time. Furthermore, it cannot handle memory updates by simply mapping memory onto a fixed number of registers, because it lacks a replacement mechanism. In contrast, the ring cache requires no compile-time knowledge to handle an arbitrary number of elements shared between cores (i.e., memory locations allocated at runtime), and it can readily handle register updates by demoting a register to a memory location. In other words, HELIX-RC proposes to use a distributed cache to handle both register and memory communication between cores.
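As an illustration of the register-handling point, the sketch below shows a loop-carried value demoted from a register to a memory location, so that memory-level forwarding (such as a ring cache) can observe its updates. The code models only the compiler-visible transformation; the variable and function names are illustrative and not taken from HELIX-RC, and volatile is used only to force the memory traffic in this single-threaded sketch.

```c
/* Sketch of demoting a register-resident loop-carried value to a memory home. */
#include <stdio.h>

/* Original form: the running value 'acc' lives in a register. */
static long sum_register_form(const long *a, int n) {
    long acc = 0;                       /* register-resident loop-carried value */
    for (int i = 0; i < n; i++)
        acc += a[i];
    return acc;
}

/* Transformed form: the same value gets a memory home, so each update is a
 * store that shared-memory forwarding hardware could observe and propagate. */
static volatile long acc_home;          /* memory home for the shared value */

static long sum_memory_form(const long *a, int n) {
    acc_home = 0;
    for (int i = 0; i < n; i++)
        acc_home += a[i];               /* load-modify-store on a memory cell */
    return acc_home;
}

int main(void) {
    long a[] = {3, 1, 4, 1, 5, 9, 2, 6};
    int n = (int)(sizeof a / sizeof a[0]);
    printf("%ld %ld\n", sum_register_form(a, n), sum_memory_form(a, n)); /* 31 31 */
    return 0;
}
```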
7.2. Cache coherence protocols
The ring cache addresses an entirely different set of communication demands than those served by cache coherence protocols. Coherence protocols target relatively small amounts of data shared infrequently between cores. Hence, cores can afford to communicate lazily, but the resulting communication almost always lies on the critical sequential chain. In contrast, the ring cache targets frequent, time-critical data sharing between cores.
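The following toy model, with made-up latencies rather than measurements from the paper, illustrates why on-demand, coherence-style sharing places core-to-core latency on the critical sequential chain, whereas eager forwarding can hide most of it behind each iteration's independent work.

```c
/* Toy critical-path model: lazy (fetch on demand) vs. eager (push on produce).
 * All numbers are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    const double seq_work     = 5.0;    /* cycles spent in the sequential segment */
    const double indep_work   = 40.0;   /* cycles of independent work per iteration */
    const double c2c_latency  = 100.0;  /* on-demand cross-core transfer, coherence-style */
    const double ring_latency = 5.0;    /* eager forwarding latency, ring-cache-style */

    /* Lazy: the consumer misses and waits for the shared value every iteration. */
    double lazy_chain = seq_work + c2c_latency;

    /* Eager: the value is pushed as soon as it is produced, so only the part of
       the latency not covered by independent work stays exposed. */
    double exposed     = ring_latency > indep_work ? ring_latency - indep_work : 0.0;
    double eager_chain = seq_work + exposed;

    printf("critical chain per iteration: lazy %.0f cycles, eager %.0f cycles\n",
           lazy_chain, eager_chain);    /* 105 vs 5 in this toy setting */
    return 0;
}
```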
7.3. On-chip networks
Sequential segments. While more splitting offers higher TLP (more sequential segments can run in parallel), it also requires more synchronization at run time. Hence, the high synchronization cost of conventional multicores discourages aggressive splitting of sequential segments. In contrast, the ring cache enables aggressive splitting to maximize TLP.
To analyze the relationship between splitting and TLP, we computed the number of instructions that execute concurrently for the following two scenarios: (i) conservative splitting, constrained by a contemporary multicore processor with a high synchronization penalty (~100 cycles), and (ii) aggressive splitting for HELIX-RC, with the low-latency communication (<10 cycles) provided by the ring cache. To compute TLP independently of both communication overhead and core pipeline advantages, we used a simple abstract model of a multicore system that has no communication cost and executes one instruction at a time. Using the same set of loops chosen by HELIX-RC and used in Figure 5, TLP increased from 6.4 to 14.2 instructions with aggressive splitting. Moreover, the average number of instructions per sequential segment dropped from 8.5 to 3.2.
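The sketch below reproduces the spirit of this abstract model: cores retire one instruction per cycle with zero communication cost, iterations are distributed round-robin, and each sequential segment must execute in iteration order. The body shapes and resulting numbers are illustrative assumptions, not the measurements reported above; the point is that splitting one large segment into several smaller ones shortens the serialized chain and raises TLP.

```c
/* Abstract TLP model: 16 ideal cores, 1 instruction per cycle, no communication
 * cost. A loop body is 'indep' independent instructions followed by 'nsegs'
 * sequential segments of 'seg_len' instructions each. Shapes are made up. */
#include <stdio.h>

#define CORES 16
#define ITERS 1024
#define MAX_SEGS 8

static double model_tlp(long indep, int nsegs, long seg_len) {
    long core_free[CORES] = {0};
    long seg_done[MAX_SEGS] = {0};
    long finish = 0;

    for (long i = 0; i < ITERS; i++) {
        int c = (int)(i % CORES);
        long t = core_free[c];
        t += indep;                                  /* independent work */
        for (int s = 0; s < nsegs; s++) {
            if (t < seg_done[s]) t = seg_done[s];    /* wait for iteration i-1 */
            t += seg_len;                            /* run the segment */
            seg_done[s] = t;
        }
        core_free[c] = t;
        if (t > finish) finish = t;
    }
    long insns_per_iter = indep + (long)nsegs * seg_len;
    return (double)(insns_per_iter * ITERS) / (double)finish;
}

int main(void) {
    /* The same 8 serialized instructions per iteration, packaged differently. */
    double conservative = model_tlp(24, 1, 8);   /* one large sequential segment  */
    double aggressive   = model_tlp(24, 4, 2);   /* split into four small segments */
    printf("TLP: conservative %.1f, aggressive %.1f\n",
           conservative, aggressive);            /* roughly 4 vs 16 for these shapes */
    return 0;
}
```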
Coverage. Despite all the loop-level speedups made possible by decoupling communication and aggressively splitting sequential segments, Amdahl's law dictates that program coverage bounds the overall speedup. Prior parallelization techniques have avoided selecting loops with small bodies because communication would slow down their execution on conventional processors.5, 20 Since HELIX-RC does not suffer from this problem, the compiler can freely select small hot loops to cover almost the entirety of the original program.
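The arithmetic behind this coverage argument is just Amdahl's law; the sketch below evaluates it for two illustrative coverage values (assumptions, not the paper's measurements) to show how quickly overall speedup collapses when coverage drops.

```c
/* Amdahl's-law arithmetic: overall speedup = 1 / ((1 - c) + c / s), where c is
 * the fraction of execution covered by parallelized loops and s is the speedup
 * inside them. Values below are illustrative only. */
#include <stdio.h>

static double overall(double coverage, double loop_speedup) {
    return 1.0 / ((1.0 - coverage) + coverage / loop_speedup);
}

int main(void) {
    double loop_speedup = 16.0;                   /* assumed speedup inside parallelized loops */
    printf("coverage 70%%: %.1fx overall\n", overall(0.70, loop_speedup));  /* ~2.9x  */
    printf("coverage 99%%: %.1fx overall\n", overall(0.99, loop_speedup));  /* ~13.9x */
    return 0;
}
```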
6.3. Analysis of overhead
To understand areas for improvement, we categorize every
overhead cycle preventing ideal speedup. Figure 7 shows the
results of this categorization for HELIX-RC, again implemented on a 16-core processor.
Most importantly, the small fraction of communication overhead suggests that HELIX-RC successfully eliminates the core-to-core latency of data transfers in most benchmarks. For several benchmarks, notably 175.vpr, 300.twolf, 256.bzip2, and 179.art, the major source of overhead is the low number of iterations per parallelized loop (low trip count). While many hot loops are frequently invoked, their low iteration counts (ranging from 8 to 20) leave cores idle.
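The following sketch illustrates the kind of per-cycle bookkeeping such a categorization implies: each cycle is charged to a single bucket, and the buckets are reported as fractions of the total. The category names echo those discussed in the text, while the cycle counts are synthetic placeholders rather than data from Figure 7.

```c
/* Sketch of overhead-cycle accounting: every core cycle is charged to exactly
 * one bucket; percentages of the total give a Figure 7-style breakdown. */
#include <stdio.h>

enum bucket { USEFUL, COMMUNICATION, DEPENDENCE_WAIT, LOW_TRIP_COUNT, SYNC_INSNS, N_CAT };

static const char *name[N_CAT] = {
    "useful", "communication", "dependence waiting", "low trip count", "wait/signal overhead"
};

int main(void) {
    /* Synthetic per-category cycle totals for one hypothetical benchmark. */
    long long cycles[N_CAT] = {9000, 300, 2400, 1800, 500};

    long long total = 0;
    for (int c = 0; c < N_CAT; c++) total += cycles[c];

    for (int c = 0; c < N_CAT; c++)
        printf("%-20s %5.1f%%\n", name[c], 100.0 * (double)cycles[c] / (double)total);
    return 0;
}
```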
Figure 7. Breakdown of overheads that prevent achieving ideal speedup.