This opportunity has already led to
a surge of architecture innovation, attracting many competing architectural
philosophies:
GPUs. Nvidia GPUs use many cores, each with large register files, many hardware threads, and caches [4];
TPUs. Google TPUs rely on large two-dimensional systolic multipliers and software-controlled on-chip memories [17];
FPGAs. Microsoft deploys field-programmable gate arrays (FPGAs) in its data centers, tailoring them to neural network applications [10]; and
CPUs. Intel offers CPUs with many cores enhanced by large multi-level caches and one-dimensional SIMD instructions, the kind of FPGAs used by Microsoft, and a new neural network processor that is closer to a TPU than to a CPU [19].
In addition to these large players, dozens of startups are pursuing their own proposals [25]. To meet growing demand, architects are interconnecting
hundreds to thousands of such chips to
form neural-network supercomputers.
This avalanche of DNN architectures makes for interesting times in
computer architecture. It is difficult to
predict in 2019 which (or even if any) of
these many directions will win, but the
marketplace will surely settle the competition just as it settled the architectural debates of the past.
Open Architectures
Inspired by the success of open source
software, the second opportunity in
computer architecture is open ISAs.
To create a “Linux for processors,” the field needs industry-standard open
ISAs so the community can create
open source cores, in addition to individual companies owning proprietary
ones. If many organizations design
processors using the same ISA, the
greater competition may drive even
quicker innovation. The goal is to
provide processors for chips that cost
from a few cents to $100.
The first example is RISC-V (called “RISC Five”), the fifth RISC architecture developed at the University of California, Berkeley [32]. RISC-V has a community that maintains the architecture under the stewardship of the RISC-V Foundation (http://riscv.org/). Being
open allows the ISA evolution to occur
in public, with hardware and software
experts collaborating before decisions
are finalized. An added benefit of an
open foundation is that the ISA is unlikely to expand primarily for marketing reasons, which is sometimes the only explanation for extensions of proprietary instruction sets.
RISC-V is a modular instruction set.
A small base of instructions runs the full
open source software stack, followed by
optional standard extensions designers
can include or omit depending on their
needs. This base includes 32-bit address
and 64-bit address versions. RISC-V can
grow only through optional extensions;
the software stack still runs fine even if
architects do not embrace new extensions. Proprietary architectures generally require upward binary compatibility, meaning that when a processor company adds a new feature, all future processors
must also include it. Not so for RISC-V, where all enhancements are optional and can be omitted if not needed by an
application. Here are the standard extensions so far, using initials that stand
for their full names:
M. Integer multiply/divide;
A. Atomic memory operations;
F/D. Single/double-precision floating-point; and
C. Compressed instructions.
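This modularity is visible to software. On a core that omits the M extension, for instance, there is no hardware multiply instruction, so the compiler lowers multiplication to a call into a runtime routine (GCC's libgcc provides __mulsi3 for exactly this purpose). The C sketch below, our own illustration rather than actual library code, shows the kind of shift-and-add loop such a routine performs:

#include <stdint.h>

/* Illustrative software multiply for a base RV32I core that lacks
   the M extension.  A real toolchain supplies an equivalent routine
   (for example, libgcc's __mulsi3); this sketch is ours.  Each loop
   iteration consumes one bit of the multiplier. */
uint32_t soft_mul(uint32_t a, uint32_t b) {
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1)
            product += a;  /* add the shifted multiplicand */
        a <<= 1;           /* multiplicand times two */
        b >>= 1;           /* advance to the next multiplier bit */
    }
    return product;        /* equals a * b modulo 2^32 */
}

The same source program therefore runs whether or not a core implements M; only performance changes, which is precisely why the software stack survives omitted extensions.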
A third distinguishing feature of
RISC-V is the simplicity of the ISA.
While not readily quantifiable, here are
two comparisons to the ARMv8 architecture, as developed by the ARM company:
Fewer instructions. RISC-V has many fewer instructions. There are 50 in the base that are surprisingly similar in number and nature to those of the original RISC-I [30]. The remaining standard extensions (M, A, F, and D) add 53 instructions, and C adds another 34, for a total of 137. ARMv8 has more than 500; and
Fewer instruction formats. RISC-V
has many fewer instruction formats,
six, while ARMv8 has at least 14.
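The payoff of format regularity is easy to see in a decoder. In all six RISC-V formats, the opcode and, where present, the register specifiers sit at fixed bit positions, so every field can be extracted with unconditional shifts and masks. A minimal C sketch (the field positions follow the RISC-V specification; the struct and function names are our own):

#include <stdint.h>

/* Fields common to the six RISC-V formats (R, I, S, B, U, J).
   Because their bit positions never move, a decoder can pull them
   all out before it even knows which format it is looking at. */
typedef struct {
    uint32_t opcode; /* bits  6..0  : selects format/operation  */
    uint32_t rd;     /* bits 11..7  : destination register      */
    uint32_t funct3; /* bits 14..12 : minor opcode              */
    uint32_t rs1;    /* bits 19..15 : first source register     */
    uint32_t rs2;    /* bits 24..20 : second source register    */
    uint32_t funct7; /* bits 31..25 : minor opcode (R-type)     */
} rv_fields;

static rv_fields decode(uint32_t insn) {
    rv_fields f = {
        .opcode = insn         & 0x7f,
        .rd     = (insn >> 7)  & 0x1f,
        .funct3 = (insn >> 12) & 0x07,
        .rs1    = (insn >> 15) & 0x1f,
        .rs2    = (insn >> 20) & 0x1f,
        .funct7 = (insn >> 25) & 0x7f,
    };
    return f;
}

With 14 or more formats, a decoder generally needs format-dependent multiplexing before it can locate the register fields at all.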
Simplicity reduces the effort to both
design processors and verify hardware
correctness. As RISC-V's targets range
from data-center chips to IoT devices,
design verification can be a significant
part of the cost of development.
Fourth, RISC-V is a clean-slate design, starting 25 years later, letting its architects learn from the mistakes of its predecessors.

As an example of a domain-specific architecture, consider the Google TPU, deployed in data centers in 2015 to accelerate neural-network inference. The TPU is organized quite differently from a general-purpose processor. The main
computational unit is a matrix unit,
a systolic array [22] structure that provides 256 × 256 multiply-accumulates every clock cycle. The combination of 8-bit precision, highly efficient systolic structure, SIMD control, and dedication of significant chip area to this function means the number of multiply-accumulates per clock cycle is approximately 100× what a general-purpose single-core CPU can sustain.
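To make the systolic organization concrete, the toy C model below (our own simplification, not a cycle-accurate TPU model) treats each iteration of the outer loop as one beat of an N × N grid of multiply-accumulate cells: in hardware, all N × N cells fire in parallel on each beat, which is how a 256 × 256 array retires 65,536 multiply-accumulates per clock.

#include <stdint.h>

#define N 4   /* the TPU's matrix unit is 256 x 256 */

/* Toy model of a systolic matrix unit (our own sketch): weights sit
   in the cells, activations stream through, and on each "beat" (the
   k loop) every cell performs one 8-bit multiply into a 32-bit
   accumulator, N*N MACs per beat in the real hardware. */
void systolic_matmul(const int8_t act[N][N],
                     const int8_t wgt[N][N],
                     int32_t acc[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            acc[i][j] = 0;
    for (int k = 0; k < N; k++)          /* one beat of the array  */
        for (int i = 0; i < N; i++)      /* in hardware, these two */
            for (int j = 0; j < N; j++)  /* loops are spatial      */
                acc[i][j] += (int32_t)act[i][k] * (int32_t)wgt[k][j];
}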
Rather than caches, the TPU uses a local memory of 24 megabytes, approximately double that of a 2015 general-purpose CPU with the same power dissipation. Finally, both the activation memory and the weight memory (including a FIFO structure that holds weights) are linked through user-controlled high-bandwidth memory
channels. Using a weighted arith-
metic mean based on six common
inference problems in Google data
centers, the TPU is 29× faster than a
general-purpose CPU. Since the TPU
requires less than half the power, it
has an energy efficiency for this work-
load that is more than 80× better than a
general-purpose CPU.
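For readers unfamiliar with the metric: a weighted arithmetic mean weights each benchmark's speedup by that workload's share of the datacenter mix, rather than averaging the six speedups equally. The numbers in this sketch are invented placeholders (the article reports only the aggregate result); the formula is the point:

#include <stdio.h>

int main(void) {
    /* Hypothetical per-workload TPU speedups and datacenter shares;
       placeholders, not Google's published data. */
    double speedup[6] = { 41.0, 25.4, 18.5, 35.0, 20.2, 14.1 };
    double share[6]   = { 0.30, 0.25, 0.15, 0.15, 0.10, 0.05 };
    double mean = 0.0;
    for (int i = 0; i < 6; i++)
        mean += share[i] * speedup[i];   /* weighted arithmetic mean */
    printf("weighted mean speedup: %.1fx\n", mean);  /* 29.4x here */
    return 0;
}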
Summary
We have considered two different approaches to improving program performance through more efficient use of hardware technology: first, by
improving the performance of modern
high-level languages that are typically
interpreted; and second, by building domain-specific architectures that greatly
improve performance and efficiency
compared to general-purpose CPUs.
DSLs are another example of how to improve the hardware/software interface
that enables architecture innovations
like DSAs. Achieving significant gains
through such approaches will require
a vertically integrated design team that
understands applications, domain-specific languages and related compiler technology, computer architecture
and organization, and the underlying
implementation technology. The need
to vertically integrate and make design
decisions across levels of abstraction
was characteristic of much of the early
work in computing before the industry
became horizontally structured. In this new era, vertical integration has become more important, and teams that can examine and make complex trade-offs and optimizations will be advantaged.