but also increases the geometric mean of speedups for
CFP2000 benchmarks from 11.4× to almost 12×.
We now turn our attention to understanding where the
speedups come from.
Communication. Speedups obtained by HELIX-RC come
from decoupling both synchronization and data communication from computation in loop iterations, which significantly reduces communication overhead, allows the
compiler to split sequential segments into smaller blocks,
and cuts down the critical path of the generated parallel
code. Figure 6 compares the speedups gained by multiple
combinations of decoupling synchronization, register-, and
memory-based communication. As expected, fast register
transfers alone do not provide much speedup, since most in-register dependences can be satisfied by re-computing the shared variables involved.4 Instead, most of the speedups come from decoupling communication for both synchronization and memory-carried actual dependences. To the best of our knowledge, HELIX-RC is the only solution that accelerates all three types of transfers for actual dependences.
Simulated ring cache. We extended XIOSim to simulate
the ring cache as described in Section 5. We used the following configuration: a 1 KB, 8-way set-associative array; one-word data bandwidth; five-signal bandwidth; single-cycle adjacent-core latency; and two cycles of core-to-ring-node injection latency, to minimally impact the already delay-critical path from the core to the L1 cache. A sensitivity analysis of these parameters, as well as an evaluation of the ring cache in out-of-order cores, can be found elsewhere.4 We use a simple bit
mask as the hash function to distribute memory addresses
to their owner nodes. To avoid triggering the cache coherence protocol, all words of a cache line have the same owner.
Lastly, XIOSim simulates changes made to the core to route memory accesses either to the attached ring node or to the L1 cache.
Benchmarks. We use 10 out of the 15 C benchmarks from
the SPEC CPU2000 suite: 4 floating point (CFP2000) and 6
integer benchmarks (CINT2000). For engineering reasons,
the data dependence analysis that HCCv3 relies on4 requires
either too much memory or too much time to handle the
rest. This limitation is orthogonal to the results described in
Compiler. We extended the Intermediate Language
Distributed Just-In-Time (ILDJIT) compilation framework,3 version 1.1, to use LLVM 3.0 for backend machine code generation. We generated both single- and multi-threaded versions of the benchmarks. The single-threaded programs are
the unmodified versions of benchmarks, optimized (O3) and
generated by LLVM. This code outperforms GCC 4.8.1 by 8%
on average and under-performs ICC 14.0.0 by 1.9%. The
multi-threaded programs were generated by HCCv3 and the
HELIX compiler (i.e., compiler-only solution) to run on ring-cache-enhanced and conventional architectures, respectively. Both compilers produce code automatically and do
not require any human intervention. During compilation,
they use SPEC training inputs to select the loops to parallelize.
Measuring performance. We compute speedups relative
to sequential simulation. Both single- and multi-threaded
runs use reference inputs. To make simulation feasible, we simulate multiple representative phases of 100 M instructions each.
6.2. Speedup analysisc
In our 16-core processor evaluation system, HELIX-RC
boosts the performance of sequentially designed programs
(CINT2000), assumed not to be amenable to parallelization.
Figure 5 shows that HELIX-RC raises the geometric mean of
speedups for these benchmarks from 2.2× for a compiler-only solution to 6.85×.
HELIX-RC not only maintains the performance of a compiler-only solution on numerical programs (SPEC CFP2000),
Figure 5. HELIX-RC triples the speedup obtained by a compiler-only solution for SPEC INT benchmarks. Speedups are relative to
sequential program execution.
(Benchmarks shown: 164.gzip, 175.vpr, 197.parser, 300.twolf, 181.mcf, 256.bzip2, and their INT geomean; 183.equake, 179.art, 188.ammp, 177.mesa, and their FP geomean; overall geomean.)
Figure 6. Breakdown of benefits of decoupling synchronization and communication from computation. (Legend: decoupled reg. communication; decoupled reg. comm. and synch.; decoupled reg. and memory comm.; HELIX-RC, that is, decoupled all communication.)
c As an aside, automatic parallelization features of ICC led to a geomean
slowdown of 2.6% across SPEC CINT2000 benchmarks, suggesting ICC
cannot parallelize non-numerical programs.
These speedups are possible even with a cache coherence latency of conventional processors (e.g., 75 cycles).