Automatically Accelerating Non-Numerical Programs by Architecture-Compiler Co-Design
By Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones, Gu-Yeon Wei, and David Brooks
Because of the high cost of communication between processors, compilers that parallelize loops automatically have
been forced to skip a large class of loops that are both critical to performance and rich in latent parallelism. HELIX-RC
is a compiler/microprocessor co-design that opens those
loops to parallelization by decoupling communication from
thread execution in conventional multicore architectures.
Simulations of HELIX-RC, applied to a processor with 16
Intel Atom-like cores, show an average of 6.85× performance
speedup for six SPEC CINT2000 benchmarks.

The original paper, "HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs," was published in Proceedings of the International Symposium on Computer Architecture, June 14–18, 2014.
On a multicore processor, the performance of a program
depends largely on how well it exploits parallel threads.
Some computing problems are solved by numerical programs that are either inherently parallel or easy to parallelize. Historically, successful parallelization tools have been
able to transform the sequential loops of such programs
into parallel form, boosting performance significantly.
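As a point of reference (a toy example, not taken from the paper or from any benchmark), the loop below is the kind of numerical kernel that classic parallelizers handle well: every iteration touches only its own array elements, so iterations can be distributed across cores with essentially no inter-core communication.

```c
/* Toy illustration: an embarrassingly parallel (DOALL) numerical loop.
 * Iterations are independent, so a parallelizer -- or the OpenMP
 * annotation below -- can run them on different cores in any order. */
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    #pragma omp parallel for          /* no loop-carried dependences */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```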
Most software, however, is still sequentially designed and
largely non-numerical, with irregular control and data flow.
Because manual parallelization of such software is error-prone and time-consuming, automatic parallelization of
non-numerical programs remains an important open problem.
The last decade has seen impressive steps toward a solution, but when targeting commodity processors, existing
parallelizers still leave much of the latent parallelism in
loops unrealized.5 The larger loops in a program can be so
hard to analyze accurately that apparent dependences often
flood communication channels between cores. Smaller
loops are more amenable to accurate analysis, and our work
shows that there is a lot of parallelism between the iterations of small loops in non-numerical programs represented
by SPECint2000 benchmarks.4 But even after intense optimization, small loops typically include loop-carried dependences, so their iterations cannot be entirely
independent—they must communicate. Because the iterations of a small loop are short (25 clock cycles on average for
SPECint2000), their communications are frequent.
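A toy C loop (illustrative only, not drawn from the SPECint2000 sources) makes the problem concrete: the body is just a few instructions, yet both the traversal pointer and the accumulator are loop-carried, so a core running iteration i+1 needs values produced by the core running iteration i before it can make progress.

```c
/* Toy illustration of a small loop with loop-carried dependences,
 * typical of pointer-chasing code in non-numerical programs. */
#include <stdio.h>

struct node { int key; struct node *next; };

static int count_matches(struct node *head, int key) {
    int count = 0;                                        /* loop-carried accumulator */
    for (struct node *n = head; n != NULL; n = n->next)   /* loop-carried pointer */
        if (n->key == key)
            count++;
    return count;
}

int main(void) {
    /* tiny hand-built list: 0 -> 1 -> 1 */
    struct node n2 = { 1, NULL };
    struct node n1 = { 1, &n2 };
    struct node n0 = { 0, &n1 };
    printf("%d\n", count_matches(&n0, 1));                /* prints 2 */
    return 0;
}
```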
On commodity processors, communication relies on the
memory system and is reactive, triggered only when one
core asks for data from another. The resulting delay is longer than the average duration of an iteration, and it is hard to overlap with computation, especially when the variance of durations is high, as in non-numerical workloads. The benefits of automatic loop parallelization therefore saturate at small numbers of cores for commodity processors.
Lowering the latency of inter-core communication would
help, but it can only go so far if communication remains
reactive. We therefore propose a proactive solution, in which
the compiler and an architectural extension called ring cache
cooperate to overlap communication with computation and
lower communication latency. The compiler identifies data
that must be shared between cores, and the ring cache circulates that data as soon as it is generated.
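The sketch below is a minimal software analogy of that idea, not the actual ring-cache interface: two pthreads stand in for adjacent cores running consecutive iterations, and a hand-rolled one-slot mailbox (the mailbox_t type and its put/get helpers are ours, purely for illustration) stands in for the ring cache. The producing thread deposits the loop-carried value the moment it is computed and then continues with independent work, so by the time the consuming thread needs the value it is usually already waiting, and communication overlaps with computation instead of stalling it.

```c
/* Illustrative sketch only: proactive forwarding of a loop-carried value
 * between two threads that emulate adjacent cores. */
#include <pthread.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int             value;
    int             full;              /* set once the value is deposited */
} mailbox_t;

static mailbox_t box = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };

static void mailbox_put(mailbox_t *m, int v) {   /* proactive: push when produced */
    pthread_mutex_lock(&m->lock);
    m->value = v;
    m->full  = 1;
    pthread_cond_signal(&m->ready);
    pthread_mutex_unlock(&m->lock);
}

static int mailbox_get(mailbox_t *m) {           /* consumer: read when needed */
    pthread_mutex_lock(&m->lock);
    while (!m->full)
        pthread_cond_wait(&m->ready, &m->lock);
    int v = m->value;
    pthread_mutex_unlock(&m->lock);
    return v;
}

static void *core0(void *arg) {                  /* runs iteration i */
    (void)arg;
    int carried = 42;                            /* stands in for a loop-carried value */
    mailbox_put(&box, carried);                  /* forward it as soon as it exists */
    /* ...the rest of iteration i (independent work) overlaps with the send */
    return NULL;
}

static void *core1(void *arg) {                  /* runs iteration i+1 */
    (void)arg;
    /* independent work of iteration i+1 could run here first */
    int carried = mailbox_get(&box);             /* value is typically already there */
    printf("iteration i+1 received %d\n", carried);
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, core0, NULL);
    pthread_create(&t1, NULL, core1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```

A reactive scheme, by contrast, would start the transfer only when the second core asked for the value, putting the full round-trip latency on the critical path of every short iteration.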
To demonstrate this idea, we have developed HELIX-RC, a
co-design incorporating a parallelizing compiler and a
simulated chip multiprocessor extended with ring cache.
The HELIX-RC compiler builds on the original HELIX
code parallelizer for commodity multicore processors.5
Because it relies on invariants of the code produced by the
compiler, ring cache is a lightweight, non-invasive extension of conventional multicore architecture. Because it
facilitates proactive, low-latency inter-core communication, ring cache allows HELIX-RC to outperform HELIX by 3×.
2. OPPORTUNITIES AND CHALLENGES OF SMALL LOOPS
2.1. Opportunities
Prior loop parallelization techniques have avoided selecting
loops with small bodies because communication would
slow down execution on conventional processors.5, 20 On average, such techniques yield only about 60% coverage by parallelized loops for non-numerical programs. Excluding small loops limits the overall speedup of such programs to less than 3× no matter how many cores are available because, by Amdahl's law, coverage bounds the speedup a program can gain through parallelization.
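As a back-of-the-envelope check of that bound (our arithmetic, not a figure from the paper): with parallelized-loop coverage c and N cores, Amdahl's law gives

```latex
\[
\text{speedup}(N) \;=\; \frac{1}{(1 - c) + c/N} \;\le\; \frac{1}{1 - c},
\qquad
c = 0.6 \;\Rightarrow\; \text{speedup} \le \frac{1}{0.4} = 2.5\times,
\]
```

so 60% coverage caps the achievable speedup at 2.5× regardless of core count, consistent with the sub-3× limit above.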
Because the intricacy of control and data flow scales
down with code size, small loops are easier than larger ones
for a compiler to analyze, which reduces the proportion of
data dependences that must be accommodated at run time
because of conservative assumptions about possible pointer
aliases. As a result, the optimized bodies of small loops yield
relatively independent iteration threads.5 So there could be