Figure 11. Overall performance and energy benefit. (Scatter of relative energy, 0.4–1.6, vs. relative performance, 1.0–2.2; core types: Little (IO2), Medium (OOO2), Big (OOO4); designs: Host GPP Core, +SEED, +Cons-Cores, +In-place Loop; annotated points: 0.85× perf./2.3× en. eff. and 1.14× perf./1.54× en. eff.)
and microarchitectural approaches. For the little, medium, and big cores, SEED provides 1.65×, 1.33×, and 1.14× speedup, and 1.64×, 1.7×, and 1.53× energy efficiency, respectively. The energy benefits come primarily from the prevalence of regions where dataflow execution can match the host core's performance; this occurs 71%, 64%, and 42% of the time for the little, medium, and big Von Neumann cores, respectively.
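To see how such region-level affinity composes into whole-program numbers, here is a minimal Amdahl-style sketch (our own illustration with hypothetical fractions and per-engine ratios, not the paper's evaluation methodology):

```python
# Amdahl-style aggregation: each region runs on whichever engine it has
# affinity for, and whole-program speedup/energy follow from the mix.
# All fractions and per-engine ratios below are hypothetical placeholders.

def overall_speedup(regions):
    # Total time is the sum of each region's time fraction divided by
    # the speedup of the engine chosen for it.
    return 1.0 / sum(frac / speedup for frac, speedup, _ in regions)

def overall_energy(regions):
    # Energy is additive: each region's baseline energy share, scaled by
    # the relative energy of the chosen engine.
    return sum(frac * energy for frac, _, energy in regions)

# (time fraction, speedup on chosen engine, relative energy on chosen engine)
regions = [
    (0.42, 1.0, 0.4),  # offloaded to dataflow: matches host perf., lower energy
    (0.58, 1.0, 1.0),  # remains on the host core
]

print(f"speedup {overall_speedup(regions):.2f}x, "
      f"energy {overall_energy(regions):.2f}x of baseline")
```

With these placeholder numbers, offloading 42% of execution at equal performance but 0.4× energy yields no slowdown and roughly a 25% energy reduction, which illustrates why energy benefits dominate whenever dataflow execution merely matches the host's performance.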
Understanding disruptive trade-offs: Perhaps more interesting are the disruptive changes that explicit-dataflow specialization introduces for computer architects. First, the OOO2+SEED is actually reasonably close in performance to an OOO4 processor on average, within 15%, while reducing energy by 2.3×. Additionally, our estimates suggest that an OOO2+SEED occupies less area than an OOO4 GPP core. Therefore, a hybrid dataflow system introduces an interesting path toward a high-performance, low-energy microprocessor: start with an easier-to-engineer, modest OOO core, and add a simple, non-general-purpose dataflow engine.
An equally interesting trade-off is to add a hybrid dataflow unit to a larger OOO core: SEED+OOO4 has much higher energy efficiency (1.54×) along with an additional performance improvement of 1.14×. This is a significant leap in energy efficiency, especially considering the difficulty of improving efficiency for complex, irregular workloads such as SpecINT.
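As a back-of-the-envelope check on the first trade-off (the energy-delay-product framing here is ours, applied to the headline numbers above):

```python
# OOO2+SEED vs. OOO4, using the figures quoted in the text.
perf = 0.85              # ~15% slower than OOO4
energy = 1 / 2.3         # 2.3x better energy efficiency
edp = energy / perf      # relative energy-delay product (lower is better)
print(f"relative EDP: {edp:.2f}")  # ~0.51, i.e., about 2x better than OOO4
```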
Overall, all cores achieve significant energy benefits, little and medium cores achieve significant speedups, and big cores receive a modest performance improvement.
7. DISCUSSION
Dataflow specialization is a broadly applicable principle for
both general-purpose processors and accelerators. We outline our view on the potentially disruptive implications in
these areas as well as potential future directions.
7.1. General-purpose cores
In this work, we showed how a dataflow processor can more efficiently take over and execute certain phases of application workloads, based on their properties. This can be viewed visually, as shown in Figure 12, where we show architecture affinity for programs along dimensions of control and memory regularity. Figure 12(a) shows how prior programmable specialization techniques focus on only a narrow range of workloads: for example, SIMD can speed up only highly regular program phases (region 1).
Figure 12(b) shows how dataflow specialization further cuts into the space of programs that traditional architectures are best at. Specifically, when the OOO processor's issue width and instruction window size limit the achievable ILP (region 3), explicit-dataflow processors can exploit this through distributed dataflow, as well as provide more efficient execution under control unpredictability (region 4). Beyond these region types, dataflow specialization can be applied to create engines that target other behaviors, such as repeatable control (region 5), or to further improve highly regular regions by combining dataflow with vector communication (region 1).
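To make this taxonomy concrete, the sketch below classifies a program phase into the engine each region of Figure 12 favors; the thresholds, feature names, and interface are illustrative assumptions of ours, not part of SEED:

```python
def choose_engine(ctrl_regularity, mem_regularity,
                  window_limited_ilp, repeatable_control):
    """Map a program phase to an engine, following Figure 12's regions.
    Inputs are hypothetical profile-derived features in [0, 1] or booleans."""
    if ctrl_regularity > 0.9 and mem_regularity > 0.9:
        return "vector + dataflow"          # region 1: highly regular phases
    if window_limited_ilp:
        return "explicit dataflow"          # region 3: ILP beyond the OOO window
    if ctrl_regularity < 0.5:
        return "explicit dataflow"          # region 4: unpredictable control
    if repeatable_control:
        return "repeatable-control engine"  # region 5: repeatable control
    return "host Von Neumann core"          # remaining phases favor the host

# A control-unpredictable phase lands on the dataflow engine (region 4):
print(choose_engine(0.3, 0.7, window_limited_ilp=False, repeatable_control=False))
```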
Future directions: The disruptive potential of exploiting common program phase behavior using a heterogeneous dataflow execution model has significant implications, leading to several important directions:
• Reduced importance of aggressive out-of-order: Dataflow engines that can exploit high-ILP phases can reduce the need for aggressive, power-inefficient out-of-order cores. As a corollary, a modest-complexity, loosely coupled core should in principle require less design effort than a complex OOO core. This could lower the cost of entry into the general-purpose core market, increasing competition and spurring innovation.
• Radical departure from status quo: The simple and modular integration of engines targeting different behaviors, combined with microarchitecture-level dynamic compilation for dataflow ISAs,22 can make such designs practical. This opens the potential of exploring designs with radically different microarchitectures and software interfaces, ultimately opening a larger and more exciting design space.
• An alternative secure processor: An open question is how to build future secure processors that are immune to attacks such as Meltdown and Spectre.9 One approach is to simply avoid speculation; this work shows that an in-order core plus SEED may lose only around 20% performance on average with respect to an OOO core alone, at much lower energy.
7.2. Accelerators
In contrast to general-purpose processors, accelerators are purpose-built chips integrated at a coarse grain with computing systems, for workloads important enough to the market to justify their design and manufacturing cost. A persistent challenge facing accelerator design is that, in order to achieve the desired performance and energy efficiency, accelerators often sacrifice generality and programmability, using application- or domain-specific software interfaces. Their architecture and microarchitecture are narrowly tailored to the particular domain and problem being solved.
The principle of heterogeneous Von Neumann/dataflow
architectures can help to create a highly efficient accelerator without having to give up on domain-generality. Inspired
by the insights here, we demonstrated that domain-specific
accelerators rely on fundamentally common specialization
principles: specialization of computation, communication,