High-level languages with dynamic typing and storage management emphasize programmer productivity. Unfortunately, such languages are typically interpreted and execute very inefficiently. Leiserson et al.24 used a small example, performing matrix multiply, to illustrate this inefficiency. As Figure 7 shows, simply rewriting the code in C from Python, a typical high-level, dynamically typed language, increases performance 47-fold. Using parallel loops running on many cores yields a factor of approximately 7; optimizing the memory layout to exploit caches yields a further factor of 20; and a final factor of 9 comes from using the hardware extensions for single-instruction, multiple-data (SIMD) parallelism that perform 16 32-bit operations per instruction. All told, the final, highly optimized version runs more than 62,000× faster on a multicore Intel processor than the original Python version. This is, of course, a small example, and one for which programmers might be expected to use an optimized library. Although it exaggerates the usual performance gap, there are likely many programs for which factors of 100 to 1,000 could be achieved.

An interesting research direction concerns whether some of the performance gap can be closed with new compiler technology, possibly assisted by architectural enhancements. Although efficiently translating and implementing high-level scripting languages like Python is difficult, the potential gain is enormous. Achieving even 25% of the potential gain could result in Python programs running tens to hundreds of times faster. This simple example illustrates how great the gap is between modern languages emphasizing programmer productivity and traditional approaches emphasizing performance.
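To make the optimization ladder concrete, the sketch below shows its first and last rungs in C; it illustrates the ideas, not the benchmark code Leiserson et al. actually measured. Note that the factors compound multiplicatively: 47 × 7 × 20 × 9 ≈ 59,000, and the unrounded measurements (the parallel-loop factor is somewhat above 7) yield the reported 62,000×.

    /* Rung 1: the straightforward C rewrite of the Python triple loop
       (the 47x step). Matrices are n x n, stored row-major. */
    void matmul_naive(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
    }

    /* Rungs 2-4, sketched: parallelize the outer loop across cores
       (OpenMP), and use an i-k-j loop order so the inner loop streams
       through B and C row by row, which exploits caches and lets the
       compiler auto-vectorize the inner loop with SIMD instructions. */
    void matmul_tuned(int n, const double *A, const double *B, double *C) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                C[i*n + j] = 0.0;
            for (int k = 0; k < n; k++) {
                double a = A[i*n + k];
                for (int j = 0; j < n; j++)
                    C[i*n + j] += a * B[k*n + j];
            }
        }
    }

Compiled with, say, gcc -O3 -fopenmp -march=native, the second version gives the compiler the parallelism and memory layout it needs to emit the wide SIMD operations described above.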
Domain-specific architectures. A more hardware-centric approach is to design architectures tailored to a specific problem domain that offer significant performance (and efficiency) gains for that domain, hence the name "domain-specific architectures" (DSAs): processors that are programmable and often Turing-complete but tailored to a specific class of applications. In this sense, they differ from application-specific integrated circuits (ASICs), which are often used for a single function with code that rarely changes. DSAs are often called accelerators, since they speed up part of an application relative to executing the entire application on a general-purpose CPU. Moreover, DSAs can achieve better performance because they are more closely tailored to the needs of the application; examples include graphics processing units (GPUs), neural network processors used for deep learning, and processors for software-defined networks (SDNs). DSAs can achieve higher performance and greater energy efficiency for four main reasons:
First and most important, DSAs exploit a more efficient form of parallelism for the specific domain. For example, single-instruction, multiple-data (SIMD) parallelism is more efficient than multiple-instruction, multiple-data (MIMD) parallelism because it needs to fetch only one instruction stream and its processing units operate in lockstep.
Although SIMD is less flexible than MIMD, it is a good match for many DSAs.
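This first reason lends itself to a small illustration. The sketch below, assuming an x86 processor with AVX2 (eight 32-bit lanes per instruction; AVX-512 provides the 16 lanes the text mentions), shows why SIMD is cheaper per operation: one fetched and decoded instruction drives all lanes in lockstep, where MIMD would fetch a separate instruction stream per core.

    #include <immintrin.h>
    #include <stddef.h>

    /* Scalar baseline: one 32-bit addition per fetched instruction. */
    void add_scalar(float *c, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* SIMD version: each fetched instruction performs eight 32-bit
       additions, with the eight lanes operating in lockstep. */
    void add_simd(float *c, const float *a, const float *b, size_t n) {
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
        }
        for (; i < n; i++)    /* scalar remainder */
            c[i] = a[i] + b[i];
    }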
Figure 8. Functional organization of Google Tensor Processing Unit (TPU v1). [Block diagram; main units: PCIe host interface; DDR3 interfaces and weight FIFO (weight fetcher); unified buffer (local activation storage); systolic matrix multiply unit (64K operations per cycle, fed at 165 GiB/s); accumulators; activation and normalize/pool units.]
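The "64K per cycle" in the figure refers to the matrix multiply unit's 256 × 256 = 65,536 multiply-accumulate cells. As a rough functional sketch of what those cells compute (the arithmetic only, not the systolic dataflow, and assuming TPU v1's 8-bit operands and 32-bit accumulators):

    #include <stdint.h>

    enum { N = 256 };   /* the matrix unit is a 256 x 256 grid of MAC cells */

    /* One step of the matrix multiply unit, computed serially: 8-bit
       activations times 8-bit weights, summed into the 32-bit
       accumulators shown in the figure. The hardware performs all
       64K multiply-accumulates of this loop nest in a single cycle. */
    void mxu_step(const int8_t weight[N][N], const int8_t act[N],
                  int32_t acc[N]) {
        for (int col = 0; col < N; col++) {
            int32_t sum = 0;
            for (int row = 0; row < N; row++)
                sum += (int32_t)act[row] * (int32_t)weight[row][col];
            acc[col] += sum;
        }
    }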