Discrete GPUs have consistently shown performance
and power-efficiency growth over the
past few hardware generations. This
growth is facilitated by discrete GPUs
residing on a standalone, add-on
peripheral device, giving designers
much greater hardware design flexibility than integrated systems. The
hardware designs of discrete GPUs
rely heavily on a fully dedicated,
multibillion-transistor budget, tight
integration with specialized high-throughput memory, and increased
thermal design power. As a result,
discrete GPUs offer the highest compute performance and compute performance per watt, making them the
computational accelerator of choice
in data centers and supercomputers.
In contrast, hybrid GPUs are allocated only a small fraction of the silicon and power resources available to
discrete processors and thus offer an
order-of-magnitude-lower computing capacity and memory bandwidth.
Discrete architectures have been
so successful that manufacturers continue to migrate functions to the GPU
that previously required CPU-side
code; for example, Nvidia GPUs support nested parallelism in hardware,
allowing invocation of new GPU kernels from GPU code without first stopping the running kernel. Similarly,
modern GPUs provide direct access
to peripheral devices (such as storage
and network adapters), eliminating
the CPU from the hardware data path.
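As a concrete illustration of nested parallelism, the following minimal sketch uses CUDA dynamic parallelism to launch child kernels directly from GPU code; the kernel names and launch geometry are ours, and it assumes a device of compute capability 3.5 or later.

    #include <cuda_runtime.h>

    // Child kernel: processes one chunk of the input (illustrative work).
    __global__ void child(float *chunk, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            chunk[i] *= 2.0f;
    }

    // Parent kernel: each thread launches a nested grid for its chunk,
    // without returning control to the CPU (dynamic parallelism).
    __global__ void parent(float *data, int chunk_size, int num_chunks) {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c < num_chunks)
            child<<<(chunk_size + 255) / 256, 256>>>(data + c * chunk_size,
                                                     chunk_size);
    }

    int main() {
        const int chunk_size = 1 << 10, num_chunks = 64;
        float *data;
        cudaMalloc(&data, sizeof(float) * chunk_size * num_chunks);
        cudaMemset(data, 0, sizeof(float) * chunk_size * num_chunks);

        // One host-side launch; all further launches happen on the GPU.
        // Compile with: nvcc -arch=sm_35 -rdc=true dynpar.cu -lcudadevrt
        parent<<<1, num_chunks>>>(data, chunk_size, num_chunks);
        cudaDeviceSynchronize();
        cudaFree(data);
        return 0;
    }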
Future high-throughput processors5
are expected to enable more efficient
execution of sequential, CPU-style code
on the GPU itself. Indications of this trend are already apparent; for example, the
AMD Graphics Core Next 1.1 used
in all modern AMD GPUs contains a
scalar processing unit. In addition,
Nvidia and IBM announced (
November 2013) a partnership that aims to
integrate Nvidia GPUs and IBM Power
CPUs targeting data-center environments. These trends reinforce the
need for high-level services on GPUs
themselves. Besides making GPUs
easier to program, these services will
naturally exploit emerging hardware
capabilities and avoid performance
and power penalties of switching between the CPU and the GPU to perform I/O calls.
Intel’s Xeon Phi represents an extreme example of GPUs gaining more
CPU-like capabilities. Xeon Phi shares
many conceptual similarities with discrete GPUs (such as slow sequential
performance and fast local memory).
However, it uses more traditional CPU
cores and runs a full Linux operating
system, providing a familiar execution environment for the programs it
executes. Xeon Phi’s software architecture supports standard operating
system services. However, the current
Xeon Phi system does not allow efficient access to host files and the network,
and programmers are encouraged to
follow a more traditional coprocessor programming model, as on GPUs.
The recently announced next processor generation, Knights Landing, is
expected to serve as the main system
CPU, eliminating the host-accelerator
separation. The new processor memory subsystem will include high-bandwidth, size-limited 3D stacked memory. We expect this stacked memory
will have pronounced NUMA properties, though the ideal system stack
design on such memory remains to
be seen. Meanwhile, many aspects of
GPU system abstractions described
here (such as NUMA-aware file cache
locality optimizations) will be relevant
to the coming and future generations
of these processors.
GPU productivity efforts. Recent developments in GPU software make
it much easier for programmers to
accelerate computations on GPUs
without writing any GPU code: commercially available comprehensive
STL-like libraries of GPU-accelerated algorithms,13 efficient domain-specific
libraries,12 and offloading compilers15 that parallelize and execute specially annotated loops on GPUs.
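For instance, a sort and a reduction written with Thrust13 execute entirely as GPU kernels even though the program contains no kernel code. A minimal sketch (the input values are arbitrary):

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        // Fill the input on the host, then copy it to GPU memory.
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i) h[i] = rand() % 1000;
        thrust::device_vector<int> d = h;  // host-to-device transfer

        // Both calls run as GPU kernels; no GPU code is written here.
        thrust::sort(d.begin(), d.end());
        int sum = thrust::reduce(d.begin(), d.end(), 0);

        printf("sum = %d\n", sum);
        return 0;
    }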
These and other GPU productivity
projects use the GPU as a coprocessor and passive consumer of data. Applications that must orchestrate data
movement are cumbersome for programmers to implement because GPU
code cannot perform I/O calls directly.
Systemwide support for operating
system services, as demonstrated by
GPUfs, alleviates this basic constraint
of the programming model and could
benefit many GPU applications, including those developed with the help
of other GPU productivity tools.
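To make the contrast concrete, with GPUfs a kernel can read a file directly from GPU code. The sketch below follows the call names in the GPUfs papers19,20 (gopen, gread, gclose); the header name, the O_GRDONLY flag, and the exact signatures are illustrative rather than definitive.

    #include <gpufs.cu.h>  // illustrative header name; see the GPUfs home page18

    // Each threadblock reads its own chunk of a file directly from GPU
    // code; GPUfs serves the calls from a GPU-resident buffer cache, so
    // no application CPU code sits on the data path.
    __global__ void scan_file(const char *filename, size_t chunk_size) {
        __shared__ unsigned char buf[4096];

        int fd = gopen(filename, O_GRDONLY);  // names per the GPUfs papers
        size_t offset = (size_t)blockIdx.x * chunk_size;
        size_t got = gread(fd, offset, sizeof(buf), buf);

        // ... the block's threads cooperatively scan buf[0..got) ...

        gclose(fd);
    }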
References
1. Bratt, I. HSA Queuing. Hot Chips tutorial, 2013.
2. Han, S., Jang, K., Park, K., and Moon, S. PacketShader:
A GPU-accelerated software router. SIGCOMM
Computer Communication Review 40, 4 (Aug. 2010).
3. Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A.,
Satyanarayanan, M., Sidebotham, R.N., and West, M.J.
Scale and performance in a distributed file system.
ACM Transactions on Computer Systems 6, 1 (Feb. 1988), 51–81.
4. Kato, S., McThrow, M., Maltzahn, C., and Brandt, S.
Gdev: First-class GPU resource management in the
operating system. In Proceedings of the USENIX
Annual Technical Conference (Boston, June 13–15).
USENIX Association, Berkeley, CA, 2012.
5. Keckler, S. W., Dally, W.J., Khailany, B., Garland, M., and
Glasco, D. GPUs and the future of parallel computing.
IEEE Micro 31, 5 (Sept.-Oct. 2011), 7–17.
6. Khronos Group. The OpenCL Specification, 2013.
7. Kim, S., Huh, S., Hu, Y., Zhang, X., Wated, A., Witchel, E.,
and Silberstein, M. GPUnet: Networking abstractions
for GPU programs. In Proceedings of the International
Conference on Operating Systems Design and
Implementation (Broomfield, CO, Oct. 6–8). USENIX
Association, Berkeley, CA, 2014.
8. Kirk, D.B. and Hwu, W.-m. Programming Massively
Parallel Processors: A Hands-On Approach. Morgan
Kaufmann, San Francisco, 2010.
9. Lehavi, D. and Schein, S. Fast RegEx parsing on GPUs.
Presentation at the Nvidia GPU Technology Conference
(San Jose, CA, 2012); http://on-demand.gputechconf.
10. Mostak, T. An Overview of MapD (Massively Parallel
Database). Technical Report. Map-D, 2013.
11. Myer, T.H. and Sutherland, I.E. On the design of
display processors. Commun. ACM 11, 6 (June 1968), 410–414.
12. Nvidia. GPU-Accelerated High-Performance Libraries.
13. Nvidia. Nvidia Thrust library; https://developer.nvidia.
14. Nvidia. Popular GPU-Accelerated Applications.
15. The Portland Group. PGI Accelerator Compilers
with OpenACC Directives; http://www.pgroup.com/
16. Rossbach, C.J., Currey, J., Silberstein, M., Ray, B., and
Witchel, E. PTask: Operating system abstractions to
manage GPUs as compute devices. In Proceedings
of the 23rd ACM Symposium on Operating Systems
Principles (Cascais, Portugal, Oct. 23–26). ACM Press,
New York, 2011, 233–248.
17. Silberschatz, A., Galvin, P.B., and Gagne, G. Operating
Systems Principles. John Wiley & Sons, Inc., New York.
18. Silberstein, M. GPUfs home page; https://sites.google.
19. Silberstein, M., Ford, B., Keidar, I., and Witchel, E.
GPUfs: Integrating file systems with GPUs. In
Proceedings of the 18th International Conference on
Architectural Support for Programming Languages
and Operating Systems (Houston, Mar. 16–20). ACM
Press, New York, 2013, 485–498.
20. Silberstein, M., Ford, B., Keidar, I., and Witchel, E.
GPUfs: Integrating file systems with GPUs. ACM
Transactions on Computer Systems 32, 1 (Feb. 2014).
21. Wikipedia. Seqlock; http://en.wikipedia.org/wiki/Seqlock
Mark Silberstein is an assistant
professor in the Department of Electrical Engineering
at the Technion – Israel Institute of Technology, Haifa, Israel.
Bryan Ford is an associate
professor in the Department of Computer Science at
Yale University, New Haven, CT.
Emmett Witchel is an associate
professor in the Department of Computer Science at the
University of Texas at Austin.
© 2014 ACM 0001-0782/14/12 $15.00