Heterogeneous Von Neumann/Dataflow Microarchitectures
By Tony Nowatzki, Vinay Gangadhar, and Karthikeyan Sankaralingam
General-purpose processors (GPPs), which traditionally rely
on a Von Neumann-based execution model, incur burdensome power overheads, largely due to the need to dynamically extract parallelism and maintain precise state. Further,
it is extremely difficult to improve their performance without increasing energy usage. Decades-old explicit-dataflow
architectures eliminate many Von Neumann overheads, but
have not been successful as stand-alone alternatives because
of poor performance on certain workloads, due to insufficient control speculation and communication overheads.
We observe a synergy between out-of-order (OOO) and
explicit-dataflow processors, whereby dynamically switching
between them according to the behavior of program phases
can greatly improve performance and energy efficiency. This
work studies the potential of such a paradigm of heterogeneous execution models, by developing a specialization
engine for explicit-dataflow (SEED) and integrating it with
a standard out-of-order (OOO) core. When integrated with
a dual-issue OOO, it becomes both faster (1.33×) and dramatically more energy efficient (1.70×). Integrated with an
in-order core, it becomes faster than even a dual-issue OOO,
with twice the energy efficiency.
As transistor scaling trends continue to worsen, power
limitations make improving the performance and energy
efficiency of general-purpose processors (GPPs) ever more
intractable. The status quo approach of scaling processor structures consumes too much power to be worth the
marginal improvements in performance. On top of these
challenges, a series of recent microarchitecture level vulnerabilities (Meltdown and Spectre9) exploit the underlying
techniques which modern processors already rely on for
exploiting instruction-level parallelism (ILP).
Fundamental to these issues is the Von Neumann execution
model adopted by modern GPPs. To make the contract between
the program and the hardware simple, a Von Neumann
machine logically executes instructions in the order specified
by the program, and dependences are implicit through the
names of storage locations (registers and memory addresses).
However, this has the consequence that exploiting ILP
effectively requires sophisticated techniques. Specifically,
it requires (1) dynamic discovery of register/memory dependences, (2) speculative execution past unresolved control-flow
instructions, and (3) maintenance of precise program state at each dynamic instruction should it need to
be recovered (e.g., on an exception or a context switch).
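To make the first requirement concrete, the following sketch (not from the paper; instruction format and tag names are illustrative) shows how register renaming discovers true dependences at runtime: each destination gets a fresh physical tag, so reads point at their actual producer and false dependences on reused register names disappear.

```python
def rename(instructions):
    """instructions: list of (dest, src1, src2) architectural register names.
    Returns renamed tuples whose sources name the producing physical tag."""
    rat = {}       # register alias table: architectural reg -> physical tag
    next_tag = 0
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the most recent producer's tag (a true dependence),
        # or the architectural name itself if nothing in this window wrote it.
        s1 = rat.get(src1, src1)
        s2 = rat.get(src2, src2)
        # The destination gets a fresh tag, eliminating false (WAW/WAR)
        # dependences on the reused architectural name.
        rat[dest] = f"p{next_tag}"
        next_tag += 1
        renamed.append((rat[dest], s1, s2))
    return renamed

prog = [("r1", "r2", "r3"),   # r1 = r2 op r3
        ("r4", "r1", "r2"),   # truly depends on the first instruction
        ("r1", "r5", "r6")]   # reuses r1: only a false dependence
print(rename(prog))
# The third instruction receives a new tag (p2), so it no longer
# appears to depend on the first.
```

This bookkeeping (alias table, free tags, and the recovery state it implies) is exactly the hardware cost the text refers to.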
The above techniques are the heart of modern Von Neumann out-of-order (OOO) processors, and each technique
requires significant hardware overhead (register renaming, instruction wakeup, reorder-buffer maintenance, speculation
recovery, etc.). In addition, instruction-by-instruction execution incurs considerable energy overheads in pipeline
processing (fetch, decode, commit, etc.). As for security, the classes of vulnerabilities known as Meltdown and
Spectre all exploit speculative execution in one form or another, adding another reason to find an alternative.
The original version of this paper is entitled “Exploring the Potential of Heterogeneous Von Neumann/Dataflow Execution Models” and was published in ISCA 2015.
Interestingly, there exists a well-known class of architectures that mitigates much of the above, called explicit-dataflow
(e.g., Tagged Token Dataflow,3 WaveScalar20). Figure 1
shows that the defining characteristic of this execution
model is how it encodes both control and data dependences
explicitly, and the dynamic instructions are ordered by these
dependences rather than a total order. Thus, a precise
program state is not maintained at every instruction. The
benefit is extremely cheap exploitation of instruction-level
parallelism in hardware, because no dynamic dependence construction is required.
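The dataflow firing rule can be made concrete with a small interpreter (an illustrative sketch, not the paper's hardware; the instruction format is invented for this example). Each instruction names its consumers explicitly, and it fires as soon as all of its operand tokens have arrived; no total order and no dynamic dependence discovery are needed.

```python
import operator

def run_dataflow(instrs, initial):
    """instrs: {name: (fn, arity, [(consumer, slot), ...])}.
    initial: {(name, slot): value} operand tokens injected from outside.
    Returns {name: produced value} once the graph has drained."""
    operands = {name: {} for name in instrs}
    for (name, slot), value in initial.items():
        operands[name][slot] = value
    ready = [n for n, (fn, k, _) in instrs.items() if len(operands[n]) == k]
    fired = {}
    while ready:
        name = ready.pop()                 # any ready instruction may fire
        fn, arity, consumers = instrs[name]
        result = fn(*[operands[name][i] for i in range(arity)])
        fired[name] = result
        for consumer, slot in consumers:   # forward the token explicitly
            operands[consumer][slot] = result
            if len(operands[consumer]) == instrs[consumer][1]:
                ready.append(consumer)
    return fired

# (a+b) * (c-d): the add and subtract are unordered with respect to each
# other; the multiply fires once both of its tokens arrive.
instrs = {
    "add": (operator.add, 2, [("mul", 0)]),
    "sub": (operator.sub, 2, [("mul", 1)]),
    "mul": (operator.mul, 2, []),
}
values = run_dataflow(instrs, {("add", 0): 1, ("add", 1): 2,
                               ("sub", 0): 7, ("sub", 1): 4})
print(values["mul"])  # (1+2) * (7-4) = 9
```

Note what is absent: there is no alias table, no reorder buffer, and no notion of "the" precise state between two instructions, which is exactly why this model is cheap and also why recovery is hard.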
However, explicit-dataflow architectures show no signs of replacing conventional GPPs, for at least three reasons. First,
control speculation is limited by the difficulty of implementing efficient dataflow-based squashing. Second, the
latency cost of explicit data communication can be prohibitive.2 Third, compilation challenges for general workloads
have proven hard to surmount.5 Although a dataflow-based execution model may help many workloads, it can also significantly hamper others.
Unexplored opportunity: What is unexplored so far is the
fine-grained interleaving of explicit-dataflow with Von
Neumann execution—that is, the theoretical and practical limits of being able to switch with low cost between an
explicit-dataflow hardware/ISA and a Von Neumann ISA.
Figure 2(a) shows a logical view of such a heterogeneous
architecture, and Figure 2(b) shows the capability of this
architecture to exploit fine-grain (thousands to millions of
instructions) application phases. This is interesting now, as
[Figure 1. Von Neumann execution model: precise instruction order is maintained, and instructions can be locally reordered only after dynamically discovering dependences. Dataflow execution model: instructions are ordered by their dependences.]
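The fine-grained switching opportunity can be reduced to a per-phase scheduling decision. The toy model below is only a sketch of that decision (the cost numbers and the energy-delay-product policy are hypothetical, not measurements from the paper): each phase runs on whichever engine is estimated to be more efficient for its behavior.

```python
def pick_engine(phase):
    """phase: estimated cycles and energy for the phase on each engine.
    Returns the engine minimizing energy-delay product (a stand-in
    policy; a real design could weigh performance and energy differently)."""
    edp_ooo = phase["ooo_cycles"] * phase["ooo_energy"]
    edp_df = phase["dataflow_cycles"] * phase["dataflow_energy"]
    return "dataflow" if edp_df < edp_ooo else "ooo"

# A control-heavy phase, where dataflow loses without cheap speculation,
# vs. a regular compute loop, where dataflow wins on energy and speed.
branchy = {"ooo_cycles": 100, "ooo_energy": 50,
           "dataflow_cycles": 180, "dataflow_energy": 40}
loop = {"ooo_cycles": 100, "ooo_energy": 50,
        "dataflow_cycles": 90, "dataflow_energy": 25}
print(pick_engine(branchy), pick_engine(loop))  # ooo dataflow
```

For phases of thousands to millions of instructions, as in Figure 2(b), a decision of this shape can be made often enough to track program behavior while keeping switching cost negligible.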