for detailed comments that improved
this article. We also thank the teams at
Google that build, manage, and maintain the systems that contributed to the
insights we have explored here.
1. Alverson, R. et al. The Tera computer system. In
Proceedings of the Fourth International Conference on
Supercomputing (Amsterdam, The Netherlands, June
11–15). ACM Press, New York, 1990, 1–6.
2. Boost C++ Libraries. Boost asio library; http://www.
3. Caulfield, A. et al. Moneta: A high-performance storage
array architecture for next-generation, non-volatile
memories. In Proceedings of the 2010 IEEE/ACM
International Symposium on Microarchitecture (Atlanta,
GA, Dec. 4–8). IEEE Computer Society Press, 2010.
4. Dean, J. and Barroso, L.A. The tail at scale. Commun.
ACM 56, 2 (Feb. 2013), 74–80.
5. Erlang. Erlang User’s Guide Version 8.0. Processes;
6. Fikes, A. Storage architecture and challenges. In
Proceedings of the 2010 Google Faculty Summit
(Mountain View, CA, July 29, 2010); http://www.
7. Golang.org. Effective Go. Goroutines; https://golang.
8. Hennessy, J. and Patterson, D. Computer Architecture:
A Quantitative Approach, Sixth Edition. Elsevier,
Cambridge, MA, 2017.
9. Intel Newsroom. Intel and Micron produce
breakthrough memory technology, July 28,
10. Kanev, S. et al. Profiling a warehouse-scale computer.
In Proceedings of the 42nd International Symposium
on Computer Architecture (Portland, OR, June 13–17).
ACM Press, New York, 2015.
11. Microsoft. Asynchronous Programming with Async and
Await (C# and Visual Basic); https://msdn.microsoft.
12. Nanavati, M. et al. Non-volatile storage: Implications
of the datacenter’s shifting center. Commun. ACM 59, 1
(Jan. 2016), 58–63.
13. Nelson, J. et al. Latency-tolerant software distributed
shared memory. In Proceedings of the USENIX
Annual Technical Conference (Santa Clara, CA, July
8–10). Usenix Association, Berkeley, CA, 2015.
14. Ousterhout, J. et al. The RAMCloud storage system.
ACM Transactions on Computer Systems 33, 3 (Sept.
2015), 7:1–7:55.
15. Smith, B. A pipelined shared-resource MIMD
computer. Chapter in Advanced Computer
Architecture. D.P. Agrawal, Ed. IEEE Computer
Society Press, Los Alamitos, CA, 1986, 39–41.
16. Wikipedia.org. Google n-gram viewer; https://
Luiz André Barroso (firstname.lastname@example.org) is a Google
Fellow and Vice President of Engineering at Google Inc.,
Mountain View, CA.
Michael R. Marty (email@example.com) is a senior
staff software engineer and manager at Google Inc.,
David Patterson (firstname.lastname@example.org) is an
emeritus professor at the University of California,
Berkeley, and a distinguished engineer at Google Inc.,
Mountain View, CA.
Parthasarathy Ranganathan (partha.ranganathan@google.com)
is a principal engineer at Google Inc.,
Mountain View, CA.
Copyright held by the authors.
(vs. performance-per-total-cost-of-ownership in large-scale Web deployments). Consequently, they can keep
processors highly underutilized when,
say, blocking for MPI-style rendezvous
messages. In contrast, a key emphasis
in warehouse-scale computing systems
is the need to optimize for low latencies
while achieving greater utilizations.
As discussed, traditional processor
optimizations to hide latency run out
of instruction-level pipeline parallelism to tolerate microsecond latencies.
System designers need new hardware
optimizations to extend the use of synchronous blocking mechanisms and
thread-level parallelism to the microsecond range.
Context switching can help, albeit at the cost of increased power
and latency. Prior approaches for
fast context switching (such as the Denelcor
HEP [15] and Tera MTA [1] computers) traded
off single-threaded performance, giving
up on latency advantages from locality and private high-level caches and,
consequently, have limited appeal in
a broader warehouse-scale computing
environment where programmers want
to tolerate microsecond events with low
overhead and ease of programmability.
Some languages and runtimes (such as
Go and Erlang) feature lightweight
threads [5, 7] to reduce the memory and context-switch overheads associated with operating system threads. But these systems
fall back to heavier-weight mechanisms
when dealing with I/O. For example, the
Grappa platform [13] builds an efficient
task scheduler and communication
layer for small messages but, in exchange, accepts a
more restricted programming environment and less-efficient performance,
and it also optimizes for throughput. New
hardware ideas are needed to enable context switching across a large number of
threads (tens to hundreds per processor,
though finding the sweet spot is an open
question) at extremely fast latencies
(tens of nanoseconds).
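
As a rough illustration of the lightweight-thread style mentioned above, the following Go sketch runs a couple of hundred goroutines, each making a call that looks synchronous to its caller. It is not taken from the article: `fetch`, `fetchAll`, and the 50-microsecond delay are hypothetical names and values chosen for illustration, and `time.Sleep` only stands in for a real device access. It says nothing about how a runtime actually handles real I/O, which is where the article notes such systems fall back to heavier mechanisms.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetch is a hypothetical stand-in for a microsecond-scale device access.
// It blocks the calling goroutine, so the caller reads as synchronous code.
func fetch(key int) int {
	time.Sleep(50 * time.Microsecond) // assumed latency, for illustration only
	return key * 2
}

// fetchAll issues one blocking fetch per key, each on its own goroutine,
// and returns the results in the same order as the keys.
func fetchAll(keys []int) []int {
	results := make([]int, len(keys))
	var wg sync.WaitGroup
	for i, k := range keys {
		wg.Add(1)
		go func(i, k int) {
			defer wg.Done()
			results[i] = fetch(k) // each goroutine writes only its own slot
		}(i, k)
	}
	wg.Wait() // park the caller until every outstanding fetch has finished
	return results
}

func main() {
	keys := make([]int, 200) // "tens to hundreds" of contexts, per the text
	for i := range keys {
		keys[i] = i
	}
	start := time.Now()
	out := fetchAll(keys)
	fmt.Println("results:", len(out), "elapsed:", time.Since(start))
}
```

Each call site stays a plain blocking call while the runtime multiplexes the waiting goroutines in software. The elapsed time printed by `main` depends on the machine and the runtime scheduler, so it is only a loose indication of how well the waits overlap.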
Hardware innovation is also needed
to help orchestrate communication with
pending I/O, efficient queue management and task scheduling/dispatch, and
better processor state (such as cache)
management across several contexts.
Ideally, future schedulers will have rich
support for I/O (such as being able to
park a thread based on the readiness of
multiple I/O operations). For instance,
Finally, techniques to enable microsecond-scale
devices should not necessarily
seek to keep processor pipelines
busy. One promising solution might
instead be to enable a processor to stop
consuming power while a microsecond-scale
access is outstanding and shift
that power to other cores not blocked
System designers can no longer ignore efficient support for microsecond-scale I/O, as the most useful new
warehouse-scale computing technologies start running at that time scale.
Today’s hardware and system software
make an inadequate platform, particularly given that support for synchronous programming models is deemed
critical for software productivity. Novel
microsecond-optimized system stacks
are needed, reexamining questions
around appropriate layering and abstraction, control and data plane separation, and hardware/software boundaries. Such optimized designs at the
microsecond scale, and corresponding
faster I/O, can in turn enable a virtuous
cycle of new applications and programming models that leverage low-latency
communication, dramatically increasing the effective computing capabilities of warehouse-scale computers.
We would like to thank Al Borchers,
Robert Cypher, Lawrence Greenfield,
Mark Hill, Urs Hölzle, Christos Kozyrakis, Noah Levine, Milo Martin, Jeff
Mogul, John Ousterhout, Amin Vahdat, Sean Quinlan, and Tom Wenisch
a. The x86 monitor/mwait instructions allow privileged software to wait on a single memory word.