“This means there is very little performance left on the table,” Catanzaro
says. “There are things that we would
like to do to scale to more GPUs, so
rather than using 8 or 16 GPUs, we
would like to use 128 GPUs, for example.” This translates into a need
for better interconnects, as well as
the ability to move from 32-bit floating point to the higher throughput of 16-bit floating point.
Nvidia’s next-generation GPU, code-named Pascal, may address some of these issues.
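To make the 16-bit point concrete, consider the following minimal CUDA sketch (mine, not Catanzaro's; the kernel name is illustrative). A __half2 packs two FP16 values into one 32-bit register, so a single fused multiply-add instruction does twice the work of its 32-bit counterpart; native FP16 arithmetic requires a GPU of compute capability 5.3 or later.

    #include <cuda_fp16.h>

    // Packed FP16 SAXPY: each __half2 carries two 16-bit floats, so one
    // __hfma2 (fused multiply-add) processes two elements per instruction.
    __global__ void haxpy2(int n2, __half2 a, const __half2 *x, __half2 *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n2)
            y[i] = __hfma2(a, x[i], y[i]);   // y = a*x + y, two lanes at once
    }

On hardware with a fast FP16 path, such packed code can roughly double arithmetic throughput, which is the gain alluded to above.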
Still another obstacle is tighter integration of GPUs with CPUs. Hwu says the two types of processors are not often integrated together, and they usually lack high-bandwidth communication between them. As a result, only a limited number of applications and capabilities run well on these systems.
“You really need to be able to give the GPU a very big task with some amount of data, and then let the GPU crank on it for a while, to make this offloading process worthwhile,” Hwu says.
Current Nvidia GPUs sit on separate chips, usually connected to the CPU via an I/O bus (PCIe); the bus’s limited bandwidth is why large tasks must be sent to the GPU. Future systems will integrate GPUs and CPUs in one tightly coupled package that supports higher bandwidth, lower latency, and cache-coherent memory sharing across CPUs and GPUs.
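A short CUDA sketch (illustrative, not drawn from the article; kernel and sizes are hypothetical) shows what this offload model looks like in practice. With a discrete GPU, every offload pays for an explicit round trip across PCIe; managed memory approximates, in software, the shared coherent memory that integrated packages promise in hardware.

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Discrete-GPU model: two explicit trips across the PCIe bus per
        // offload, so the task must be big enough to amortize the copies.
        float *h = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) h[i] = 1.0f;
        float *d;
        cudaMalloc(&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU
        cudaFree(d);
        free(h);

        // Managed memory: one address space visible to both processors, a
        // software preview of cache-coherent CPU/GPU memory sharing.
        float *m;
        cudaMallocManaged(&m, bytes);
        for (int i = 0; i < n; i++) m[i] = 1.0f;
        scale<<<(n + 255) / 256, 256>>>(m, 2.0f, n);
        cudaDeviceSynchronize();  // wait before the CPU touches m again
        printf("m[0] = %f\n", m[0]);
        cudaFree(m);
        return 0;
    }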
Keutzer expects that over time, as CPUs and GPUs become better integrated, better cache coherence and synchronization between the two types of processors will result. In fact, Nvidia and Intel are both focusing on this space. Keutzer notes a new Intel chip dubbed Knights Landing (KNL) offers unprecedented computing power in a Xeon Phi 72-core supercomputing processor that integrates both CPU and GPU characteristics. The chip also offers 500 gigabytes per second of processor-to-memory bandwidth, which will erode the GPU’s advantage in this area, he says.
Hwu notes each of the KNL chip’s 72 cores can execute “a wide vector instruction (512 bits). When translated into double precision (8 bytes) and single precision (4 bytes), the vector width is 8 and 16 words; in that sense, it has a similar execution model to that of GPUs.”
The programming model for the
KNL chip is the traditional x86 model,
Hwu says, so programmers “need to
write code to either be vectorizable by
the Intel C Compiler, or use the Intel
AVX vector intrinsic library functions.”
The programming model for GPUs
is based on the kernel programming
model, he adds.
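The contrast is easiest to see side by side. The following sketch (mine, with illustrative names, compilable as a single CUDA source file) writes the same SAXPY loop both ways: once with Intel AVX-512 intrinsics, the way a KNL core would run it, and once as a kernel in the GPU programming model.

    #include <immintrin.h>   // AVX-512 intrinsics for the x86 version

    // x86/KNL model: one core steps through the array 16 floats at a
    // time, using explicit 512-bit vector registers and intrinsics.
    void saxpy_avx512(int n, float a, const float *x, float *y)
    {
        __m512 va = _mm512_set1_ps(a);
        int i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512 vx = _mm512_loadu_ps(x + i);
            __m512 vy = _mm512_loadu_ps(y + i);
            _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));  // y = a*x + y
        }
        for (; i < n; i++) y[i] = a * x[i] + y[i];  // scalar remainder
    }

    // Kernel model: thousands of lightweight GPU threads each own one
    // element; the hardware schedules the parallelism implicitly.
    __global__ void saxpy_kernel(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }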
Also, x86 cores have cache coherence for all levels of the cache hierarchy, Hwu says, “whereas GPUs’ first-level caches are not coherent. [Coherence] does come with a cost of reduced memory bandwidth.” However, he says, “for deep learning applications, cache coherence for the first-level cache is not very important for most algorithms.”
Over the next decade, a big wildcard in all of this will be how development cycles play out, Hwu says. He believes Moore’s Law can continue at its present rate for about three more generations. At the same time, he says, it will likely take about three generations for system designers and engineers to move away from mostly discrete CPU and GPU systems to truly integrated designs.
“If Moore’s Law stalls out, it could dramatically impact the future of these systems, and the way people use hardware and software for deep learning and other tasks,” Hwu points out. “Yet, even if we solve the hardware problem, certain deep learning tasks require huge amounts of labeled data. At some point, we will need a breakthrough in generating labeled data in order to do the necessary training, particularly in areas such as self-driving cars.”
Over the next few years, Sutskever says, machine learning will tap GPUs extensively. “As machine learning methods improve, they will extend beyond today’s uses and ripple into everything from healthcare and robotics to financial services and user interfaces. These improvements depend on faster GPUs, which greatly empower machine learning research.”
Adds Catanzaro: “GPUs are a gateway to the future of computing. Deep learning is exciting because it scales as you add more data. At this point, we have a pretty much insatiable desire for more data and the computing resources to solve complex problems. GPU technology is an important part of pushing the limits of computing.”
Samuel Greengard is an author and journalist based in
West Linn, OR.