of requests (requests per second), a simple queuing theory model tells us that as service time increases, throughput decreases. In the microsecond regime, when service time is composed mostly of “overhead” rather than useful computation, throughput declines precipitously.
Illustrating the effect, Figure 2 shows efficiency (the ratio of achieved to ideal throughput) on the y-axis and service time on the x-axis. The different curves show the effect of changing “microsecond overhead” values, that is, the amount of time spent on each overhead event, with the line marked “No overhead” representing a hypothetical ideal system with zero overhead. The figure assumes a simple closed queuing model with deterministic arrivals and service times, where service times represent the time between overhead events.
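The shape of these curves can be approximated with a few lines of arithmetic. The sketch below is a minimal illustration, not necessarily the exact model behind Figure 2: it assumes that every service interval of useful work is followed by exactly one overhead event during which the processor does nothing useful, so efficiency is service time divided by service time plus overhead. The function name `efficiency` and the sample service times are hypothetical; the overhead values mirror the "No overhead," 1µs, and 16µs curves.

```python
def efficiency(service_time_us: float, overhead_us: float) -> float:
    """Fraction of ideal throughput achieved, under the assumed model.

    Assumption (not from the article): one overhead event of fixed length
    follows each deterministic service interval, and no useful work
    overlaps with the overhead.
    """
    if service_time_us <= 0:
        raise ValueError("service time must be positive")
    if overhead_us < 0:
        raise ValueError("overhead must be non-negative")
    return service_time_us / (service_time_us + overhead_us)


if __name__ == "__main__":
    sample_service_times_us = (0.5, 1, 10, 100, 1000)  # hypothetical sample points
    for overhead_us in (0.0, 1.0, 16.0):
        cells = ", ".join(
            f"{s:g}us: {efficiency(s, overhead_us):.0%}"
            for s in sample_service_times_us
        )
        print(f"overhead {overhead_us:g}us -> {cells}")
```

Under this assumption, efficiency depends only on the ratio of overhead to service time, which is why a fixed overhead hurts short service times far more than long ones.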
As expected, for short service times, an overhead of just a single microsecond leads to a dramatic reduction in overall throughput efficiency. How likely are such small service times in the real world? Table 2 lists the service times for a production web-search workload, measuring the number of instructions between I/O events when the workload is tuned appropriately. As we move to systems that use fast flash or new non-volatile memories, service times in the range of 0.5µs to 10µs are to be expected. Microsecond overheads can significantly degrade performance in this regime.
At longer service times, sub-microsecond overheads are tolerable and throughput is close to the ideal. Higher overheads in the tens of microseconds, possibly from the software overheads detailed earlier, can nonetheless degrade performance, so system designers still need to optimize for killer microseconds.
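To make "tolerable" concrete, the relationship can be inverted: given a per-event overhead and a target efficiency, how long must the service time be? The sketch below assumes the same simplified relationship, efficiency = service time / (service time + overhead), which is an illustrative assumption rather than the article's stated formula. The function name `min_service_time_us` is hypothetical.

```python
def min_service_time_us(overhead_us: float, target_efficiency: float) -> float:
    """Shortest service time that still meets the target efficiency.

    Derived by solving e = s / (s + o) for s, giving s = o * e / (1 - e).
    Assumption (not from the article): one fixed-length overhead event
    per service interval.
    """
    if overhead_us < 0:
        raise ValueError("overhead must be non-negative")
    if not 0 < target_efficiency < 1:
        raise ValueError("target efficiency must be strictly between 0 and 1")
    return overhead_us * target_efficiency / (1.0 - target_efficiency)


if __name__ == "__main__":
    for overhead_us in (0.5, 1.0, 16.0):
        s = min_service_time_us(overhead_us, 0.9)
        print(f"overhead {overhead_us:g}us needs service time >= {s:.1f}us for 90% efficiency")
```

Under this assumption, reaching 90% efficiency requires a service time about nine times the overhead: roughly 9µs for a 1µs overhead and roughly 144µs for a 16µs overhead.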
Other overheads go beyond the basic mechanics of accessing microsecond-scale devices. A 2015 paper summarizing a multiyear longitudinal study at Google [10] showed that 20%–25% of fleet-wide processor cycles are spent on low-level overheads we call the “datacenter tax.” Examples include serialization and deserialization of data, memory allocation and deallocation, network stack costs, compression, and encryption. The datacenter tax adds to the killer microsecond challenge. A logical question is whether system designers can address reduced processor efficiency by offloading some of the overheads to a separate core or accelerator. Unfortunately, at single-digit microsecond I/O latencies, I/O operations tend to be closely coupled with the main work on the processor.
It is the frequent and closely coupled nature of these processor overheads that is even more significant, as in “death by 1,000 cuts.” For example, if microsecond-scale operations are made infrequently, then conservation of processor performance may not be a concern. Application threads could just busy-poll to wait for the microsecond operation to complete. Alternatively, if these operations are not coupled closely to the main computation on the processor, traditional offload
the hardware and software stack in the context of the microsecond challenge. System design decisions like operating system-managed threading and interrupt-based notification that were in the noise with millisecond-scale designs now have to be redesigned more carefully, and system optimizations (such as storage I/O schedulers targeted explicitly at millisecond scales) have to be rethought for the microsec-
How to Waste a Fast
The other significant negative effect of microsecond-scale events is
on processor efficiency. If we measure processor resource efficiency
in terms of throughput for a stream
Table 3. High-performance computing and warehouse-scale computing systems compared. Though high-performance computing systems are often optimized for low-latency networking, their designs and techniques are not directly applicable to

Workloads
High-Performance Computing: Supercomputing workloads that often model the physical world; simpler, static data structures.
Warehouse-Scale Computing: Large-scale online data-intensive workloads; operate on big data and complex dynamic data structures; response latency critical.

High-Performance Computing: Code touched by fewer programmers; hardware concurrency visible to programmers at compile time.
Warehouse-Scale Computing: Codebase touched by thousands of developers; significant software releases 100 times per year. Automatic scale-out of queries per second.

High-Performance Computing: Focus on highest performance; recent emphasis on performance per Watt. Stranding of resources (such as underutilized processors) acceptable.
Warehouse-Scale Computing: Focus on highest performance per dollar. Significant effort to avoid stranding of resources (such as processor, memory, and power).

High-Performance Computing: Reliability in hardware; often no long-lived mutable data; no encryption.
Warehouse-Scale Computing: Commodity hardware; reliability across
Figure 2. Efficiency degrades significantly at low service times due to microsecond-scale
[Figure 2 plot: x-axis, service time (µs); curves for no overhead, 1µs overhead, and 16µs overhead.]