in 2004 forced the industry to switch from a single energy-hogging processor per microprocessor to multiple efficient processors or cores per chip.
A law that is just as true today as when Gene Amdahl introduced it in 1967 demonstrates the diminishing returns from increasing the number of processors. Amdahl's Law says the theoretical speedup from parallelism is limited by the sequential part of the task; if 1/8 of the task is serial, the maximum speedup is 8× the original performance, even if the rest is easily parallelized and the architect adds 100 processors.
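In equation form (the standard statement of Amdahl's Law, spelled out here only to make the arithmetic explicit), where s is the serial fraction and N the number of processors:

\[
\text{Speedup}(N) = \frac{1}{\,s + \dfrac{1 - s}{N}\,},
\qquad
\text{Speedup}(100)\Big|_{s = 1/8} = \frac{1}{\tfrac{1}{8} + \tfrac{7/8}{100}} \approx 7.5,
\qquad
\lim_{N \to \infty} \text{Speedup}(N) = \frac{1}{s} = 8 .
\]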
Figure 1 indicates the effect of these three laws on processor performance for the past 40 years. At the current rate, performance on standard processor benchmarks will not double before 2038.
Since transistors are not getting much better (reflecting the end of Moore's Law), the peak power per mm² of chip area is increasing (due to the end of Dennard scaling), but the power budget per chip is not increasing (due to electromigration and mechanical and thermal limits), and chip designers have already played the multicore card (which is limited by Amdahl's Law), architects now widely believe the only path left for major improvements in performance-cost-energy is domain-specific architectures.17 Such architectures do only a few tasks, but they do them extremely well.
The synergy between the large datasets in the cloud and the numerous
computers that power it has enabled
remarkable advancements in machine
learning, especially in DNNs. Unlike
some domains, DNNs are broadly applicable. DNN breakthroughs include
reducing word error rates in speech recognition by 30% over traditional approaches, the biggest gain in 20 years;11 cutting the error rate in an image-recognition competition ongoing since 2011 from 26% to 3.5%;16,22,34 beating a human champion at Go;32 improved search
ranking; and many more. A DNN architecture can benefit from a narrow focus
yet still have many applications.
Neural networks target brain-like
functionality and are based on a simple
artificial neuron—a nonlinear function
(such as max(0,value)) of a weighted
sum of the inputs. These artificial neurons are collected into layers, with the
outputs of one layer becoming the inputs of the next layer in the sequence.
The “deep” part of DNN comes from
going beyond a few layers, as the large
datasets in the cloud allow more accurate models to be built by using extra
and larger layers to capture higher-level
patterns or concepts, and GPUs provide
enough computing to develop them.
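To make this concrete, here is a minimal Python sketch, with arbitrary illustrative sizes and random weights rather than any real workload, of such a neuron and of layers whose outputs feed the next layer:

```python
import numpy as np

def relu(value):
    # The nonlinear function max(0, value) described above.
    return np.maximum(0.0, value)

def neuron(inputs, weights, bias):
    # An artificial neuron: a nonlinear function of a weighted sum of its inputs.
    return relu(np.dot(weights, inputs) + bias)

def layer(inputs, weight_matrix, biases):
    # A layer of neurons; each row of weight_matrix holds one neuron's weights.
    return relu(weight_matrix @ inputs + biases)

# "Deep" just means chaining layers: the outputs of one layer are the inputs of the next.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)                               # input vector
stack = [(rng.standard_normal((16, 8)), np.zeros(16)),   # layer 1: 8 inputs -> 16 outputs
         (rng.standard_normal((4, 16)), np.zeros(4))]    # layer 2: 16 inputs -> 4 outputs
for W, b in stack:
    x = layer(x, W, b)
print(x)  # activations of the final layer
```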
The two phases of a DNN are called
training (or learning) and inference (or
prediction) and refer to development vs.
production. Training a DNN takes days,
but a trained DNN can infer or predict in
milliseconds. The developer chooses the
number of layers and the type of DNN, and training determines the weights.
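As a toy illustration of the two phases, not a depiction of any production system, the sketch below fits the weights of a one-layer model by gradient descent (training) and then reuses the frozen weights for cheap predictions (inference):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: inputs X and targets y generated from weights the model must recover.
X = rng.standard_normal((256, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

# Training (learning): repeatedly adjust the weights to shrink the prediction error.
w = np.zeros(4)
for step in range(500):
    pred = X @ w
    grad = 2.0 * X.T @ (pred - y) / len(X)   # gradient of the mean squared error
    w -= 0.1 * grad                          # gradient-descent update

# Inference (prediction): the learned weights are now fixed; each prediction
# is a single cheap pass over the new input.
x_new = rng.standard_normal(4)
print(x_new @ w)
```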
Figure 1. Following Hennessy and Patterson,17 we plotted highest SPECCPUint performance per year for 32-bit and 64-bit processor cores over the past 40 years; the throughput-oriented SPECCPUint_rate reflects a similar profile, with plateauing delayed a few years.
[Plot annotations: CISC 2X/2.5 years; RISC 2X/1.5 years; End of Dennard Scaling ⇒ Multicore 2X/3.5 years (23%/year); Amdahl's Law ⇒ 2X/6 years (12%/year); End of the Line ⇒ 2X/20 years (3%/year). X-axis: 1985-2015.]
Table 1. Six DNN applications (two per DNN type) representing 95% of the TPU's workload, as of July 2016.

Name    LOC     FC   Conv  Vector  Pool  Total   Nonlinear function   Weights   TPU Ops/Weight Byte   TPU Batch Size
MLP0    100      5     —      —      —      5    ReLU                   20M           200                  200
MLP1    1,000    4     —      —      —      4    ReLU                    5M           168                  168
LSTM0   1,000   24     —     34      —     58    sigmoid, tanh          52M            64                   64
LSTM1   1,500   37     —     19      —     56    sigmoid, tanh          34M            96                   96
CNN0    1,000    —    16      —      —     16    ReLU                    8M         2,888                    8
CNN1    1,000    4    72      —     13     89    ReLU                  100M         1,750                   32
The columns are the DNN name; the number of lines of code; the types and number of layers in the DNN (FC is fully connected; Conv is convolution; Vector is binary element-wise operations; Pool is pooling, which does nonlinear downsizing on the TPU); the nonlinear function; the number of weights; the operational intensity; the batch size; and TPU application popularity, as of July 2016. One MultiLayer Perceptron (MLP) is RankBrain;9 one long short-term memory (LSTM) is a subset of GNM Translate;37 one convolutional neural net (CNN) is Inception, and the other CNN is DeepMind AlphaGo.19,32 ReLU stands for Rectified Linear Unit and is the function max(0,value).