and the K80 run at just 42% and 37%,
respectively, of the highest throughput
achievable for MLP0 if the response
time limit is relaxed. These bounds
affect the TPU as well but at 80% operate much closer to the TPU’s greatest
MLP0 throughput. Compared to CPUs
and GPUs, the single-threaded TPU
has none of the sophisticated microarchitectural features that consume
transistors and energy to improve the
average case, but not the 99th percentile case; that is, there are no caches,
branch prediction, out-of-order execution, multiprocessing, speculative prefetching, address coalescing,
multithreading, context switching,
and so forth. Minimalism is a virtue of
Table 4 reports the bottom line of
relative inference performance per die,
including the host server overhead for
the two accelerators vs. the CPU, show-
ing the weighted mean of the relative
performance for the six DNN applica-
tions, suggesting the K80 die is 1. 9× the
speed of a Haswell die, that the TPU die
is 29. 2× as fast, and thus the TPU die is
15. 3× as fast as the GPU die.
When buying computers by the thousands, cost-performance trumps performance. The best cost metric in a
datacenter is total cost of ownership
(TCO). The actual price an organization (such as Google) might pay for
thousands of chips depends on negotiations among the companies involved.
For business confidentiality reasons,
we are unable publish such price information or data that might let them
be deduced. However, power is correlated with TCO, and we are allowed
to publish Watts per server, so we use
performance/Watt as our proxy for per-formance/TCO here. In this section,
we compare whole servers rather than
Figure 4 reports the mean performance/
Watt for the K80 GPU and TPU relative to
the Haswell CPU. We present two different calculations of performance/Watt.
The first—“total”—includes the power consumed by the host CPU server
when calculating performance/Watt
for the GPU and TPU. The second—
“incremental”—subtracts the host CPU
server power from the GPU and TPU.
For total-performance/Watt, the K80
server is 2. 1× that of Haswell. For incre-mental-performance/Watt, when Haswell server power is omitted, the K80
server is 2. 9× that of Haswell. The TPU
server delivers 34× better total-performance/Watt than Haswell, making the
TPU server 16× the performance/Watt
of the K80 server. The relative incremen-tal-performance/Watt—Google’s justification for a custom ASIC—is 83 for the
TPU, thus lifting the TPU to 29× the performance/Watt of the GPU.
Evaluation of an Alternative
Like an FPU, the TPU coprocessor is
relatively easy to evaluate, so we created
a performance model for our six applications. The differences between the
model results and the hardware performance counters average less than 10%.
We used the performance model
to evaluate a hypothetical TPU die—
TPU’—that could be designed in the
operational intensity means perfor-
mance is limited by memory bandwidth
rather than by peak compute. Five of
the six applications are happily bump-
ing their heads against the ceiling; the
MLPs and LSTMs are memory-bound,
and CNNs are computation-bound.
The six DNN applications are gener-
ally further below their ceilings for Has-
well and K80 than was the TPU in Figure
3. Response time is the reason. Many
of these DNN applications are parts of
end-user-facing services. Researchers
have demonstrated that even small in-
creases in response time cause custom-
ers to use a service less. While training
may not have hard response-time dead-
lines, inference usually does, or infer-
ence prefers latency over throughput. 28
For example, the 99th percentile response time limit for MLP0 was 7ms, as
required by the application developer.
(The inferences per second and 7ms
latency include the server host time, as
well as the accelerator time.) Haswell
Table 4. K80 GPU die and TPU die performance relative to CPU for the DNN workload.
The weighted mean uses the actual mix of the six apps in Table 1.

DNN    MLP0   MLP1   LSTM0  LSTM1  CNN0   CNN1   Weighted Mean
GPU    2.5    0.3    0.4    1.2    1.6    2.7    1.9
TPU    41.0   18.5   3.5    1.2    40.3   71.0   29.2
Ratio  16.7   60.0   8.0    1.0    25.4   26.3   15.3
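The weighted mean in Table 4 is an arithmetic mean of the per-app speedups, weighted by each application's share of the deployed workload. A minimal sketch of that calculation follows; the TPU speedups are taken from Table 4, but the `mix` fractions below are illustrative placeholders, not the actual Table 1 deployment mix (with the paper's real mix, the result is the 29.2× reported above).

```python
# Sketch of the weighted-mean calculation behind Table 4.
# Per-app TPU-vs-CPU speedups are from Table 4.
tpu_speedup = {"MLP0": 41.0, "MLP1": 18.5, "LSTM0": 3.5,
               "LSTM1": 1.2, "CNN0": 40.3, "CNN1": 71.0}

# Hypothetical deployment mix (fractions of inference traffic, summing to 1);
# NOT the actual mix from Table 1, which is not reproduced in this excerpt.
mix = {"MLP0": 0.40, "MLP1": 0.21, "LSTM0": 0.20,
       "LSTM1": 0.09, "CNN0": 0.05, "CNN1": 0.05}

def weighted_mean(speedup, weights):
    """Arithmetic mean of per-app speedups, weighted by deployment share."""
    return sum(weights[app] * speedup[app] for app in speedup)

print(f"weighted-mean speedup: {weighted_mean(tpu_speedup, mix):.1f}x")
```

A weighted arithmetic mean is the natural choice here because the quantity of interest is aggregate throughput over the real traffic mix, which a plain geometric mean of speedups would not capture.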
Figure 4. Relative performance/Watt (TDP) of the GPU server (blue bar) and the TPU server (red bar) to the CPU server, and of the TPU server to the GPU server (orange bar). TPU′ is an improved TPU using the K80's GDDR5 memory. The green bar shows the improved TPU's performance/Watt ratio to the CPU server, and the lavender bar shows its ratio to the GPU server. "Total" includes host-server power; "incremental" does not.
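The "total" versus "incremental" distinction can be sketched in a few lines. The power and throughput figures below are hypothetical stand-ins chosen for illustration only; they are not the measured values behind Figure 4.

```python
# Sketch of the two performance/Watt definitions used above.
# All numbers are hypothetical, for illustration only.
host_watts = 500.0            # assumed host CPU server power (W)
accel_watts = 290.0           # assumed accelerator card power (W)
inferences_per_sec = 90_000.0  # assumed accelerator throughput

# "Total" charges the accelerator for the whole server, host included.
total_perf_per_watt = inferences_per_sec / (host_watts + accel_watts)

# "Incremental" charges only the accelerator's own power draw.
incremental_perf_per_watt = inferences_per_sec / accel_watts

# Incremental is always the more flattering metric for an accelerator,
# since it divides the same throughput by a smaller denominator.
assert incremental_perf_per_watt > total_perf_per_watt
print(f"total: {total_perf_per_watt:.1f} inf/s/W, "
      f"incremental: {incremental_perf_per_watt:.1f} inf/s/W")
```

This is why the TPU's incremental ratio (83×) exceeds its total ratio (34×): the fixed host-server power dilutes the total metric for every accelerator.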