to the processor efficiently. Examples of DSLs include Matlab, a language for operating on matrices; TensorFlow, a dataflow language used for programming DNNs; P4, a language for programming software-defined networks; and Halide, a language for image processing that specifies high-level transformations.
The challenge when using DSLs is
how to retain enough architecture independence that software written in
a DSL can be ported to different architectures while also achieving high
efficiency in mapping the software
to the underlying DSA. For example,
the XLA system translates TensorFlow to heterogeneous processors that use Nvidia GPUs or Tensor Processing Units (TPUs).40 Balancing portability among DSAs along with efficiency is
an interesting research challenge for
language designers, compiler creators,
and DSA architects.
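As a rough illustration of that portability, the sketch below assumes TensorFlow 2.x and its jit_compile flag, which asks XLA to compile the decorated function; the layer shapes and names are purely illustrative. The dense-matrix operation is written once at a high level, and the compiler decides how to map it onto whatever CPU, GPU, or TPU backend is available.

    import tensorflow as tf

    # Requesting XLA compilation of this function; the same high-level graph
    # can then be lowered to a CPU, an Nvidia GPU, or a TPU backend.
    @tf.function(jit_compile=True)
    def dense_layer(x, w, b):
        # The DSL makes the dense-matrix operation explicit, so the compiler
        # can map it directly onto the target's matrix hardware.
        return tf.nn.relu(tf.matmul(x, w) + b)

    x = tf.random.normal([128, 256])
    w = tf.random.normal([256, 64])
    b = tf.zeros([64])
    y = dense_layer(x, w, b)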
Example DSA: TPU v1. As an example
DSA, consider the Google TPU v1, which
was designed to accelerate neural net
inference.17,18 The TPU has been in
production since 2015 and powers applications ranging from search queries
to language translation to image recognition to AlphaGo and AlphaZero, the
DeepMind programs for playing Go and
Chess. The goal was to improve the performance and energy efficiency of deep
neural net inference by a factor of 10.
As shown in Figure 8, the TPU or-
ganization is radically different from a
DSAs. DSAs may also use VLIW ap-
proaches to ILP rather than specula-
tive out-of-order mechanisms. As men-
tioned earlier, VLIW processors are a
poor match for general-purpose code15
but for limited domains can be much
more efficient, since the control mech-
anisms are simpler. In particular, most
high-end general-purpose processors
are out-of-order superscalars that re-
quire complex control logic for both
instruction initiation and instruction
completion. In contrast, VLIWs per-
form the necessary analysis and sched-
uling at compile-time, which can work
well for an explicitly parallel program.
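The toy sketch below (a greedy list scheduler with made-up operation names and a made-up bundle width) is only meant to illustrate the idea: when dependences are known at compile time, the compiler itself can pack independent operations into fixed-width instruction words, the job an out-of-order superscalar would otherwise do in hardware.

    def bundle(ops, deps, width=4):
        # Pack operations into fixed-width "long instruction words": an op may
        # issue only once everything it depends on has issued in an earlier
        # word and a slot is still free in the current word.
        bundles, done, remaining = [], set(), list(ops)
        while remaining:
            word = []
            for op in list(remaining):
                if deps.get(op, set()) <= done and len(word) < width:
                    word.append(op)
                    remaining.remove(op)
            if not word:
                raise ValueError("cyclic dependences")
            done |= set(word)
            bundles.append(word)
        return bundles

    # a and b are independent; c needs both; d needs c.
    print(bundle(["a", "b", "c", "d"], {"c": {"a", "b"}, "d": {"c"}}))
    # [['a', 'b'], ['c'], ['d']]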
Second, DSAs can make more effective use of the memory hierarchy. Memory accesses have become much more
costly than arithmetic computations,
as noted by Horowitz.16 For example,
accessing a block in a 32-kilobyte cache
involves an energy cost approximately
200× higher than a 32-bit integer add.
This enormous differential makes
optimizing memory accesses critical
to achieving high energy efficiency.
General-purpose processors run code
in which memory accesses typically exhibit spatial and temporal locality but
are otherwise not very predictable at
compile time. CPUs thus use multilevel
caches to increase bandwidth and hide
the latency of relatively slow, off-chip
DRAMs. These multilevel caches often
consume approximately half the energy
of the processor but avoid almost all
accesses to the off-chip DRAMs that require approximately 10× the energy of a
last-level cache access.
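To make the differential concrete, the back-of-the-envelope arithmetic below simply plugs in the figures quoted above (the exact ratios vary by process and cache size); even one cache access per arithmetic operation leaves almost all of the energy in the memory system.

    # Illustrative only: normalize a 32-bit integer add to 1 unit of energy.
    add_energy = 1.0
    cache_energy = 200.0 * add_energy   # ~200x for a 32-kilobyte cache access
    fraction_in_memory = cache_energy / (cache_energy + add_energy)
    print(f"{fraction_in_memory:.1%} of the energy goes to the memory access")  # ~99.5%
    # An off-chip DRAM access costs roughly another 10x on top of a last-level
    # cache access, which is why avoiding DRAM traffic matters so much.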
Caches have two notable disadvantages:
• When datasets are very large. Caches simply do not work well when datasets are very large and also have low temporal or spatial locality; and
• When caches work well. When caches work well, the locality is very high, meaning, by definition, most of the cache is idle most of the time.
In applications where the memory-
access patterns are well defined and
discoverable at compile time, which
is true of typical DSLs, programmers
and compilers can optimize the use of memory better than dynamically allocated caches can. DSAs thus usu-
ally use a hierarchy of memories with
movement controlled explicitly by the
software, similar to how vector pro-
cessors operate. For suitable applica-
tions, user-controlled memories can
use much less energy than caches.
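A rough NumPy sketch of this style of explicit data movement appears below; the tile size and the use of plain array copies as stand-ins for on-chip buffers are illustrative assumptions, not how any particular DSA is programmed. The point is that the schedule of loads, computation, and stores is fixed by the software rather than discovered by a cache.

    import numpy as np

    def tiled_matmul(a, b, tile=64):
        # Software-managed blocking: copy one tile of each operand into small
        # local buffers (standing in for scratchpad memory), do all the work
        # that reuses those tiles, then write the finished output tile back.
        n, k = a.shape
        _, m = b.shape
        c = np.zeros((n, m), dtype=a.dtype)
        for i in range(0, n, tile):
            for j in range(0, m, tile):
                acc = np.zeros((min(tile, n - i), min(tile, m - j)), dtype=a.dtype)
                for p in range(0, k, tile):
                    a_tile = a[i:i + tile, p:p + tile].copy()  # "load" into local buffer
                    b_tile = b[p:p + tile, j:j + tile].copy()
                    acc += a_tile @ b_tile                     # compute on resident tiles
                c[i:i + tile, j:j + tile] = acc                # "store" the output tile
        return c

    a = np.random.randn(200, 300)
    b = np.random.randn(300, 100)
    assert np.allclose(tiled_matmul(a, b), a @ b)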
Third, DSAs can use less precision
when it is adequate. General-purpose
CPUs usually support 32- and 64-bit in-
teger and floating-point (FP) data. For
many applications in machine learn-
ing and graphics, this is more accuracy
than is needed. For example, in deep
neural networks (DNNs), inference
regularly uses 4-, 8-, or 16-bit integers,
improving both data and computation-
al throughput. Likewise, for DNN train-
ing applications, FP is useful, but 32
bits is enough and 16 bits often works.
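As a small illustration of the arithmetic involved, the sketch below quantizes float32 weights to 8-bit integers with a single symmetric per-tensor scale; real accelerators and frameworks differ in the details (per-channel scales, zero points, calibration), so treat the scheme as an assumption for illustration.

    import numpy as np

    def quantize_int8(w):
        # Symmetric linear quantization: one scale maps the float range onto
        # the signed 8-bit range [-127, 127].
        scale = np.max(np.abs(w)) / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    w = np.random.randn(256, 64).astype(np.float32)
    q, scale = quantize_int8(w)

    # Inference can then do the matrix arithmetic in 8-bit integers and rescale
    # the accumulated results afterward.
    w_restored = q.astype(np.float32) * scale
    print("max quantization error:", np.max(np.abs(w - w_restored)))

Eight-bit operands also move four times as many values per byte of memory traffic as 32-bit floats, which is where the gains in both data and computational throughput come from.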
Finally, DSAs benefit from targeting
programs written in domain-specific
languages (DSLs) that expose more
parallelism, improve the structure and
representation of memory access, and
make it easier to map the application efficiently to a domain-specific processor.
Domain-Specific Languages
DSAs require targeting of high-level operations to the architecture, but trying
to extract such structure and information from a general-purpose language
like Python, Java, C, or Fortran is simply too difficult. Domain-specific languages (DSLs) enable this process and
make it possible to program DSAs efficiently. For example, DSLs can make
vector, dense matrix, and sparse matrix operations explicit, enabling the
DSL compiler to map the operations
Figure 9. Agile hardware development methodology. (Figure labels: C++, FPGA, Tape-In, Tape-Out, Big Chip Tape-Out, ASIC Flow.)