The performance implications can be seen in the example in Figure 3(a), which has a single control decision, labeled if. Figure 3(b) shows the program instruction order for one iteration of this code, assuming the left branch was taken. Figure 3(c) shows the ideal schedule of these instructions on an ideal machine (each instruction takes one cycle, with unlimited issue width). The key to the ideal execution is both executing the control-dependent instructions (c, d) before the control decision is resolved, and executing many instructions in parallel.
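To make the example concrete, the following hypothetical C fragment has the same shape as Figure 3(a); the paper gives no source code, so the operations and the statement labels (a–j, if) are assumptions chosen to match the figure: one control decision, a control-dependent chain (c, d), and an independent chain (h, i) that join at j.

```c
/* Hypothetical C fragment with the same shape as Figure 3(a).
 * The paper gives no source code; the operations and statement
 * labels (a-j, if) are assumptions chosen to match the figure. */
void kernel(int n, const int *x, const int *y, int *z, int k) {
    for (int t = 0; t < n; ++t) {
        int a = x[t] * 2;       /* a                               */
        int b = a + k;          /* b: produces the condition       */
        int r;
        if (b > 0) {            /* if: the control decision        */
            int c = a - 3;      /* c: control-dependent            */
            r = c * c;          /* d: control-dependent            */
        } else {
            r = a + 7;          /* e: off-path when left is taken  */
        }
        int h = y[t] << 1;      /* h: independent of the if        */
        int i = h + 5;          /* i: depends only on h            */
        z[t] = r + i;           /* j: joins the two chains         */
    }
}
```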
A Von Neumann OOO machine has the advantage of speculative execution, but the disadvantage is the complexity of implementing hardware that issues multiple instructions per cycle (issue width) when dependences are determined dynamically. Therefore, Figure 3(d) shows how a dual-issue OOO machine takes five cycles: there is not enough issue bandwidth to execute both d and h before the third cycle.
A dataflow processor can easily be designed for high issue width because dependences are explicitly encoded in the program representation. However, we assume here that the dataflow processor does not perform speculation, because of the difficulty of recovering when a precise order is not maintained. Therefore, in the dataflow processor's schedule in Figure 3(e), c and d must execute after the if.
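To make "dependences explicitly encoded in the program representation" concrete, here is a minimal sketch in C of a producer-names-its-consumers instruction format, in the style of classic dataflow ISAs; the fields and the deliver helper are illustrative assumptions, not SEED's actual encoding.

```c
/* Minimal sketch of an explicit-dataflow instruction format,
 * assuming a producer-names-its-consumers encoding in the style
 * of classic dataflow ISAs; the fields are illustrative and are
 * NOT SEED's actual encoding. */
typedef struct {
    int opcode;
    int target[2];        /* consumer instruction indices (-1 = none) */
    int target_slot[2];   /* which operand buffer of each consumer    */
    int needed;           /* operands required before firing          */
    int arrived;          /* operands delivered so far (runtime)      */
} df_inst;

/* A producer writes its result directly into a consumer's operand
 * buffer; readiness is a counter compare, not an associative search
 * of a large instruction window. Returns 1 when the consumer fires. */
static inline int deliver(df_inst *consumer, int operand_buf[2],
                          int slot, int value) {
    operand_buf[slot] = value;
    return ++consumer->arrived == consumer->needed;
}
```

Because the consumer targets are fixed at compile time, waking up dependents is a direct write and a counter compare rather than an associative search of the instruction window, which is why wide issue comes comparatively cheaply to dataflow designs.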
Although the example suggests that the benefits of control speculation and wide issue are similar, in practice the differences can be stark, which we can demonstrate with slight modifications to the example. If we add several instructions to the critical path of the control decision (between b and the if), the OOO core can hide them through control speculation. If we instead add more parallel instructions, the explicit-dataflow processor can execute them in parallel, whereas they may be serialized on the OOO Von Neumann machine. Explicit-dataflow can also be beneficial when the if is unpredictable, since mispredictions serialize the OOO machine anyway.
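These trade-offs can be reproduced with a toy list scheduler. The sketch below, under stated assumptions (the dependence graph of the hypothetical fragment above, unit latency, greedy program-order issue), yields four cycles for the ideal machine, five for the dual-issue OOO (which happens to match Figure 3(d), though the paper's exact dependence graph differs), and six for the non-speculative dataflow machine. Lengthening the b-to-if chain stretches only the non-speculative schedule, while adding independent chains saturates only the width-limited one.

```c
/* Toy list scheduler contrasting the three machines of Figure 3,
 * using the assumed dependence graph of the fragment above (unit
 * latency, greedy program-order issue). This is a sketch of the
 * trade-off, not the paper's evaluation methodology. */
#include <stdio.h>

enum { A, B, C, D, IF, H, I, J, NINST };
/* data dependences: dp[i] lists the producers of i (-1 = none) */
static const int dp[NINST][2] = {
    {-1, -1},  /* a  */
    { A, -1},  /* b  */
    { A, -1},  /* c  */
    { C, -1},  /* d  */
    { B, -1},  /* if */
    {-1, -1},  /* h  */
    { H, -1},  /* i  */
    { D,  I},  /* j  */
};
/* control dependence: without speculation, c and d wait for the if */
static const int cdep[NINST] = {0, 0, 1, 1, 0, 0, 0, 0};

static int schedule(int width, int speculate) {
    int done[NINST] = {0};              /* cycle issued; 0 = pending */
    int cycle = 0, remaining = NINST;
    while (remaining > 0) {
        ++cycle;
        int issued = 0;
        for (int i = 0; i < NINST && issued < width; ++i) {
            if (done[i]) continue;
            int ready = 1;
            for (int k = 0; k < 2; ++k) /* producers done in an earlier cycle? */
                if (dp[i][k] >= 0 && !(done[dp[i][k]] && done[dp[i][k]] < cycle))
                    ready = 0;
            if (!speculate && cdep[i] && !(done[IF] && done[IF] < cycle))
                ready = 0;              /* branch must resolve first */
            if (ready) { done[i] = cycle; ++issued; --remaining; }
        }
    }
    return cycle;
}

int main(void) {
    printf("ideal    (wide issue, speculative):    %d cycles\n", schedule(NINST, 1));
    printf("OOO-2    (2-issue,    speculative):    %d cycles\n", schedule(2, 1));
    printf("dataflow (wide issue, no speculation): %d cycles\n", schedule(NINST, 0));
    return 0;
}
```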
Technology scaling trends mean that on-chip power is more limited than area; this creates "dark silicon," portions of the chip that cannot be kept active due to power constraints. The two major implications are that energy efficiency is the key to improving scalable performance, and that it becomes rational to add specialized hardware that is only in use when profitable.
With such a hardware organization, many open questions arise: Are the benefits of fine-grained interleaving of execution models significant enough? How might one build a practical, small-footprint dataflow engine capable of serving as an offload engine? Which types of general-purpose processor (GPP) cores can get substantial benefits? Why are certain types of program regions suitable for explicit-dataflow execution?
To answer these questions, we make the following contributions. Most importantly, we identify (and quantify) the potential of switching between OOO and explicit-dataflow execution at a fine grain. Next, we develop a specialization engine for explicit-dataflow (SEED) by combining known dataflow-architecture techniques and specializing the design for the program characteristics where explicit-dataflow excels, as well as for simple and common program structures (loops and nested loops). We evaluate the benefits through a design-space exploration, integrating SEED into little (in-order), medium (OOO2), and big (OOO4) cores. Our results demonstrate large energy benefits of over 1.5x, and speedups of 1.67x, 1.33x, and 1.14x across the little, medium, and big cores, respectively. Finally, our analysis illuminates the relationship between workload properties and dataflow profitability: code with high memory parallelism, high instruction parallelism, and control noncriticality is highly profitable for dataflow execution. These are common properties of many emerging workloads in machine learning and data processing.
2. UNDERSTANDING VON NEUMANN/DATAFLOW SYNERGY
Understanding the trade-offs between a Von Neumann
machine, which reorders instructions implicitly, and a dataflow machine, which executes instructions in dependence
order, can be subtle. Yet, the trade-offs have profound implications. We attempt to distill the intuition and quantitative
potential of a heterogeneous core as follows.
2.1. Intuition for execution model affinity
The intuitive trade-off between the two execution models is that explicit-dataflow is more easily specializable for high issue width and a large instruction window (due to the lack of any need to discover dependences dynamically), whereas an implicit-dataflow architecture is more easily specializable for speculation (due to its maintenance of precise state for all dynamic instructions in total program order).
[Figure 3. Von Neumann and dataflow execution models. Panels: (a) control flow graph (basic blocks with instructions a-j and the control decision if); (b) original program order; (c) ideal schedule; (d) abstract 2-issue OOO schedule; (e) abstract dataflow schedule. Annotations: Von Neumann enables efficient control speculation, with the control dependence removed through speculation; dataflow enables efficient instruction parallelism; the OOO gains advantage if instructions are added to the control-critical path, while dataflow gains advantage if more independent instructions are added.]
[Figure: (a) logical architecture, in which an OOO core and an explicit-dataflow engine share the cache hierarchy and exchange live values (labels: OOO core, explicit-dataflow, live-vals, vector, cache hierarchy); (b) architecture preference over time, showing applications alternating between the two execution models at granularities of thousands to millions of instructions.]