How to Waste a Fast Datacenter Network
To better understand how optimizations can target the microsecond regime, consider a high-performance
network. Figure 1 is an illustrative example of how a 2µs fabric can, through
a cumulative set of software overheads,
turn into a nearly 100µs datacenter
fabric. Each measurement reflects the median round-trip latency (measured from the application), with no queueing delays; that is, these are unloaded latencies.
A very basic remote direct memory access (RDMA) operation in a fast datacenter network takes approximately 2µs. An RDMA operation offloads the mechanisms of operation handling and transport reliability to a specialized hardware device. Making it a “two-sided” primitive (involving remote software rather than just remote hardware) adds several more microseconds. Dispatching overhead from a network thread to an operation thread (on a different processor) further increases latency due to processor wake-up and kernel-scheduler activity. Using interrupt-based notification rather than spin polling adds many more microseconds. Adding a feature-filled RPC stack incurs significant software overhead, in excess of tens of microseconds. Finally, using a full-fledged TCP/IP stack rather than the RDMA-based transport brings the total overhead to more than 75µs in this particular experiment.
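To see why the notification style matters, the toy C++ sketch below (not part of the Figure 1 experiment; the scenario and the numbers it prints are illustrative and machine-dependent) compares how quickly a waiting thread observes a completion when it spin-polls a shared flag versus when it blocks and must be woken through the kernel scheduler, used here as a stand-in for interrupt-driven notification.

```c++
// Toy comparison: spin-polling a completion flag vs. blocking and being
// woken by the scheduler. The blocking path models the extra cost the
// article attributes to interrupt-based notification.
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

using Clock = std::chrono::steady_clock;

static long long ns_between(Clock::time_point a, Clock::time_point b) {
  return std::chrono::duration_cast<std::chrono::nanoseconds>(b - a).count();
}

int main() {
  // 1) Spin polling: the waiter burns a core but notices the event quickly.
  {
    std::atomic<bool> done{false};
    Clock::time_point posted;
    std::thread waiter([&] {
      while (!done.load(std::memory_order_acquire)) { /* spin */ }
      std::printf("spin-poll wakeup: %lld ns\n", ns_between(posted, Clock::now()));
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(10));  // waiter is spinning now
    posted = Clock::now();
    done.store(true, std::memory_order_release);
    waiter.join();
  }

  // 2) Blocking notification: the waiter sleeps; waking it goes through the
  //    kernel scheduler, which typically adds on the order of microseconds.
  {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    Clock::time_point posted;
    std::thread waiter([&] {
      std::unique_lock<std::mutex> lock(m);
      cv.wait(lock, [&] { return done; });
      std::printf("blocking wakeup:  %lld ns\n", ns_between(posted, Clock::now()));
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(10));  // waiter is blocked now
    {
      std::lock_guard<std::mutex> lock(m);
      done = true;
      posted = Clock::now();
    }
    cv.notify_one();
    waiter.join();
  }
  return 0;
}
```

On a typical server the blocking path costs a microsecond or more per wakeup, while the spinning path responds within tens of nanoseconds at the price of a fully occupied core; this is the same trade-off behind the interrupt-versus-polling step in Figure 1.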
In addition, there are other more unpredictable, and more non-intuitive, sources of overhead. For example, when an RPC reaches a server where the core is in a sleep state, additional latencies—often tens to hundreds of microseconds—might be incurred to come out of that sleep state (and potentially warm up processor caches). Likewise, various mechanisms (such as interprocessor interrupts, data copies, context switches, and core hops) all add overheads, again in the microsecond range. We have also measured standard Google debugging features degrading latency by up to tens of microseconds. Finally, queueing overheads—in the host, application, and network fabric—can all incur additional latencies, often on the order of tens to hundreds of microseconds. Some of these sources of overhead have a more severe effect on tail latency than on median latency, which can be especially problematic in distributed computations.
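The sleep-state wakeups mentioned above can be bounded on some platforms. As a small, hedged illustration (Linux-specific, requires root privileges, and not a technique the experiment above relies on), the sketch below uses the kernel's PM QoS interface, /dev/cpu_dma_latency, to ask the idle governor to avoid deep sleep states while a latency-sensitive process is running.

```c++
// Sketch: cap CPU idle-state exit latency via Linux PM QoS.
// Writing a 32-bit value (in microseconds) to /dev/cpu_dma_latency and
// keeping the file descriptor open asks the kernel to avoid idle states
// whose exit latency exceeds that value.
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

int main() {
  int fd = open("/dev/cpu_dma_latency", O_WRONLY);
  if (fd < 0) { std::perror("open /dev/cpu_dma_latency"); return 1; }

  int32_t max_exit_latency_us = 0;  // 0 = shallowest idle state only
  if (write(fd, &max_exit_latency_us, sizeof(max_exit_latency_us)) < 0) {
    std::perror("write");
    return 1;
  }

  // The constraint holds only while fd stays open; a real service would
  // hold it for its lifetime. Here we simply sleep to keep it in force.
  sleep(60);
  close(fd);
  return 0;
}
```

Closing the file descriptor (or exiting) releases the constraint and restores normal power management; the cost, of course, is higher idle power on the affected cores.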
Similar observations can be made about overheads for new non-volatile storage. For example, the Moneta project³ at the University of California, San Diego, discusses how the latency of access for a non-volatile memory with a baseline raw access latency of a few microseconds can increase by almost a factor of five due to different overheads across the kernel, interrupt handling, and the data copying itself. As we will see, it is quite easy to take fast hardware and throw away its performance with software designed for millisecond-scale devices.

Is the Microsecond Getting Enough Respect?
A simple comparison, as in the figure here, of the occurrence of the terms “millisecond,” “microsecond,” and “nanosecond” in Google’s n-gram viewer (a tool that charts frequencies of words in a large corpus of books printed from 1800 to 2012, https://books.google.com/ngrams) points to the lack of adequate attention to the microsecond-level time scale. Microprocessors moved out of the microsecond scale toward nanoseconds, while networking and storage latencies have remained in the milliseconds. With the rise of a new breed of I/O devices in the datacenter, it is time for system designers to refocus on how to achieve high performance and ease of programming at the microsecond scale.
Figure: n-gram viewer of ms, µs, and ns (1920 to 2000).