Enforcing memory ordering: SEED uses a primarily software approach to enforce memory ordering. When the compiler identifies dependent (or potentially aliasing) instructions, they are serialized in the program through explicit tokens. In this
example, the stores of n_val can conflict with the load from
the next iteration (e.g., when the linked list contains a loop),
and therefore, memory dependence edges are added.
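As a concrete illustration (a sketch only, not SEED's actual token encoding), the following Python fragment models a dependence graph in which the store produces an explicit ordering token that the potentially aliasing load of the next iteration must consume before it can fire:

# Sketch: software-enforced memory ordering via an explicit token.
# The store to n_val produces a token; the possibly aliasing load of
# the next iteration must receive that token before it may issue.
class Node:
    def __init__(self, op, deps=()):
        self.op, self.deps, self.done = op, list(deps), False
    def ready(self):
        return all(d.done for d in self.deps)

store_nval = Node("store n_val")                     # iteration i
token      = Node("memory-order token", [store_nval])
load_next  = Node("load next->n_val", [token])       # iteration i+1

# Dataflow firing rule: a node executes only when all of its inputs,
# including the ordering token, have arrived.
for n in (store_nval, token, load_next):
    assert n.ready()
    n.done = True
    print("fired:", n.op)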
Executing compound instructions: To mitigate communication overheads, the compiler groups primitive instructions (e.g., adds, shifts, switches) into subgraphs that execute on compound functional units (CFUs). Each subgraph executes logically atomically. The example program contains four subgraphs, mapped to two CFUs.
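To make the grouping concrete, here is a minimal sketch of how dependent primitives might be merged into compound subgraphs; the dependence chains and the size cap (matching the 2–5-operation CFUs described in Section 3.3) are illustrative, not the compiler's actual algorithm:

# Sketch: grouping chains of dependent primitive instructions into
# compound subgraphs of at most MAX_CFU_OPS operations; each subgraph
# later maps to one CFU and executes logically atomically.
MAX_CFU_OPS = 5

consumer = {"ld": "add", "add": "cmp", "cmp": "br",      # chain 1
            "mul": "shl", "shl": "st"}                   # chain 2
heads = ["ld", "mul"]          # primitives with no in-region producer

subgraphs = []
for head in heads:
    group, cur = [head], head
    while cur in consumer and len(group) < MAX_CFU_OPS:
        cur = consumer[cur]
        group.append(cur)
    subgraphs.append(group)

print(subgraphs)   # [['ld', 'add', 'cmp', 'br'], ['mul', 'shl', 'st']]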
3.3. SEED microarchitecture
SEED achieves high instruction parallelism and simplicity by using eight distributed computation units. Each of these SEED units is organized around one CFU, and the units communicate with one another over a network, as shown in Figure 6.
Compound functional unit (CFU): CFUs are composed of a fixed network of primitive FUs (adders, multipliers, logical units, switch units, etc.), with any unused portions of the CFU bypassed. Long-latency instructions (e.g., loads) can be buffered and passed by subsequent instructions. Our design uses the CFU mix from existing work,7 where CFUs contain 2–5 operations. CFUs that contain memory units issue load and store requests to the host's memory management unit. Load requests access a store buffer for store-to-load forwarding.
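As a rough sketch of the forwarding path (structure sizes and interfaces are assumed for illustration), a load first checks the buffered, not-yet-committed stores before going to memory:

# Sketch: store-to-load forwarding through the store buffer.
class StoreBuffer:
    def __init__(self):
        self.pending = []                    # (addr, value) in program order
    def store(self, addr, value):
        self.pending.append((addr, value))
    def load(self, addr, memory):
        for a, v in reversed(self.pending):  # youngest matching store wins
            if a == addr:
                return v                     # forwarded, no memory access
        return memory[addr]                  # otherwise go to the MMU/cache

memory = {0x40: 7}
sb = StoreBuffer()
sb.store(0x40, 9)
print(sb.load(0x40, memory))                 # 9, forwarded
print(sb.load(0x48, {0x48: 3}))              # 3, from memory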
Instruction management unit (IMU): The IMU has three
responsibilities. First, it stores up to 32 compound instructions, each with a maximum of four operands, for up to four dynamic loop iterations (equivalent to a 1024-entry
instruction window). Second, it selects instructions with
ready operands for execution on the CFU, giving priority to
the oldest instruction. Third, the IMU routes incoming values from the network to appropriate storage locations based
on the incoming instruction tag.
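A minimal sketch of the select step follows (field names and values are illustrative): instructions wake up as operands arrive, and the oldest ready instruction issues to the CFU. One consistent reading of the 1024-entry figure is 8 units × 32 compound instructions × 4 in-flight iterations.

# Sketch: oldest-first selection of ready compound instructions.
instructions = [
    {"age": 0, "need": 2, "have": 1},   # still waiting on one operand
    {"age": 1, "need": 3, "have": 3},   # all operands arrived
    {"age": 2, "need": 1, "have": 1},   # also ready, but younger
]

ready = [i for i in instructions if i["have"] == i["need"]]
issue = min(ready, key=lambda i: i["age"])        # priority to the oldest
print("issue instruction with age", issue["age"])  # -> age 1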
Communication: The output distribution unit (ODU) is responsible for distributing output values and their destination packets (SEED unit + instruction location + iteration offset) to the bus network, and for buffering them during bus conflicts. A bus interconnect forwards output packets from the ODU to the SEED-unit IMUs that consume the corresponding operands. As a consequence, dependent instructions communicating over the bus cannot execute in back-to-back cycles, a limitation of distributed dataflow.
4. SEED COMPILER DESIGN
The two main responsibilities of the compiler are determining which regions to specialize and scheduling instructions
into CFUs inside SEED regions.
Region selection: The compiler must find or create fully inlined nested-loop regions that are small enough to fit within SEED's operand/instruction storage. In addition, the inner loop should be unrolled for instruction parallelism. An Amdahl-tree-based approach can be used to select regions. The compiler should also avoid regions where the OOO core (through control speculation) or the SIMD units would have performed better. One approach is to use simple heuristics, for example, avoiding control-critical regions. A dynamic approach can be more flexible; for example, training online predictors to give a runtime performance estimate based on per-region statistics. Related work explores this in detail,16,18 and this work simply uses a static oracle scheduler (see Section 5).
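As a simplified sketch of the static selection step (not the Amdahl-tree algorithm itself; thresholds and candidate data are invented for illustration), regions that fit the storage budget and are not control-critical could be chosen greedily by dynamic coverage:

# Sketch: greedy, heuristic region selection.
MAX_STATIC_INSTRS = 1024        # fit SEED's instruction storage
MAX_BRANCH_MISS_RATE = 0.05     # skip control-critical regions

candidates = [
    {"name": "loopA", "static": 800,  "coverage": 0.35, "miss": 0.01},
    {"name": "loopB", "static": 2000, "coverage": 0.30, "miss": 0.02},
    {"name": "loopC", "static": 300,  "coverage": 0.15, "miss": 0.20},
]

selected = [c for c in candidates
            if c["static"] <= MAX_STATIC_INSTRS
            and c["miss"] <= MAX_BRANCH_MISS_RATE]
selected.sort(key=lambda c: c["coverage"], reverse=True)
print([c["name"] for c in selected])   # ['loopA']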
Instruction scheduling: The instruction scheduler forms compound instructions and assigns them to units. Its job is to reduce communication cost by creating large compound instructions, while ensuring that combining instructions does not artificially lengthen the critical path. To navigate this trade-off, we use integer linear programming, specifically extending a general scheduling framework for spatial architectures17 with the ability to model instruction bundling.
5. EVALUATION METHODOLOGY
For evaluating SEED, OOO core specialization techniques,
and the other designs we compare to, we employ a TDG-based modeling methodology.
15 We use Mc-PAT11 with 22nm
technology to estimate power and area. Von Neumann core
configurations are given in Table 1.
The benchmarks we chose are from SPECint and Mediabench,10 representing a variety of control and memory irregularity, as well as some regular benchmarks. To eliminate the effect of compiler/runtime heuristics on when to use which architecture, we use an oracle scheduler, which relies on previous runs to decide when to use the OOO core, SEED, or SIMD.
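The oracle's decision rule can be pictured as a per-region argmin over cycle counts measured in earlier runs (the region names and numbers below are invented):

# Sketch: oracle scheduler picking the fastest architecture per region.
profiled_cycles = {
    "regionA": {"ooo4": 120_000, "seed": 70_000, "simd": 200_000},
    "regionB": {"ooo4": 50_000,  "seed": 65_000, "simd": 45_000},
}

choice = {region: min(cycles, key=cycles.get)
          for region, cycles in profiled_cycles.items()}
print(choice)   # {'regionA': 'seed', 'regionB': 'simd'}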
6. EVALUATING DATAFLOW SPECIALIZATION
To understand the potential and trade-offs of dataflow specialization, we explore the prevalence of the required program structure, per-region performance, and overall heterogeneous-core benefits.
6.1. Program structure
Nested loop prevalence: Figure 8 shows cumulative distributions of dynamic instruction coverage with varying dynamic region granularity, assuming a maximum of 1024 instructions per region.
Considering regions with a duration of 8K dynamic instructions or longer (x-axis), nested loops can cover 60% of total
instructions, whereas inner loops cover only 20%. Nested
loops also greatly increase the region duration for a given
percentage coverage (1K–64K for 40% coverage).
Compound instruction prevalence: Figure 9 is a histogram of the per-benchmark compound instruction sizes that the compiler created, showing an average of 2–3 instructions per compound. This
Table 1. Von Neumann core configurations.
Little (IO2): Dual issue, 1 load/store port.
Medium (OOO2): 64-entry ROB, 32-entry IW, LSQ: 16 ld/20 st,
Big (OOO4): 168-entry ROB, 48-entry IW, LSQ: 64 ld/36 st, 2 ld/st ports, speculative scheduling.
Common: x86 ISA, 256-bit SIMD, 2-way 32KiB I$, 64KiB L1D$ (4-cycle latency), 8-way 2MB L2$ (22-cycle hit latency), 2GHz.