Figure 1. Building system insights from queueing theory: (a) 99th percentile response time in an M/M/1 model; and (b) 99th percentile queueing time in an M/M/4 model, as a function of the service rate µ.
[Figure: both panels plot tail latency against service rate µ (0.0–1.0) at 10%, 50%, and 99% load.]
Amdahl’s Law describes the speedup of a program when a fraction f of the computation is accelerated by a factor S. Speedup is then defined as

Speedup = 1 / ((1 − f) + f/S).
In a multi-core machine, Amdahl’s Law captures the
benefit from multiple cores in average performance. While
this interpretation is still relevant, it is, by itself, insufficient
for describing tail latency requirements. To bridge the gap
we build upon ideas from queueing theory, which provides
a framework to reason about task-arrival rates, service
times, and end-to-end response times. Simple models (such
as M/M/1 and M/M/k) are particularly attractive for first-
order performance calculations because they can concisely
describe performance in closed-form expressions.
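Before turning to the queueing models, Amdahl’s Law itself lends to a quick numeric check; a minimal sketch (the function name is ours, for illustration only):

```python
def amdahl_speedup(f, s):
    """Amdahl's Law: overall speedup when a fraction f of the
    work is accelerated by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Example: accelerating 90% of the work by 10x yields ~5.26x,
# not 10x -- the unaccelerated 10% dominates.
print(amdahl_speedup(0.9, 10))
```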
M/M/1 model. We start with one of the simplest queueing
models: the M/M/1 queue, modeling a system in which a
single server processes incoming tasks. Tasks arrive under a
Poisson process with rate λ. The service times also follow an
exponential distribution, with rate parameter µ and mean
service time Ts = 1/µ (µ = perf(r) in the main text of the article).
A larger µ means a more powerful server and results in lower
latency. Tasks are processed in a simple first-in-first-out
order. This simple queueing system is stable when µ > λ. In contrast, when µ ≤ λ, the number of queued tasks keeps increasing, leading to instability. The load of the system is defined as ρ = λ/µ. Given these definitions, the mean number of tasks in the system is

E[N] = ρ / (1 − ρ)

where N is a random variable for the number of tasks.
Likewise, the mean task response time (using random variable R) is

E[R] = 1 / (µ − λ)

and the p-th percentile of response time is

R_p = ln(100 / (100 − p)) / (µ − λ).
Figure 1a outlines the 99th percentile of request latency as a
function of the service rate µ. As µ increases, tail latency drops at both low and high load.
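The M/M/1 expressions above can be sketched in a few lines of Python (the function name is ours; times are in the same units as 1/µ):

```python
import math

def mm1_percentile(p, lam, mu):
    """p-th percentile of response time in an M/M/1 queue
    with arrival rate lam and service rate mu (requires mu > lam)."""
    assert mu > lam, "system must be stable"
    return math.log(100.0 / (100.0 - p)) / (mu - lam)

# Example: at 50% load (lam = 0.5, mu = 1.0), the 99th percentile
# is ln(100) / 0.5, roughly 9.2 mean service times.
print(mm1_percentile(99, 0.5, 1.0))
```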
M/M/k model. We now extend the M/M/1 model to a more
realistic system with k equivalent servers in order to model a
multicore machine. Tasks now enter a single, shared queue, from which the servers draw them for processing. As with
the M/M/1 model, tasks arrive under a Poisson process with
arrival rate λ and each server processes tasks with service rate
µ. Closed-form solutions for the mean response time and
response-time percentiles exist but are more complicated
than in the M/M/1 model. Specifically, system load is ρ = λ/(kµ). The probability that a new task must be enqueued is given by Erlang’s C formula

P_Q = [(kρ)^k / (k!(1 − ρ))] / [Σ_{n=0}^{k−1} (kρ)^n / n! + (kρ)^k / (k!(1 − ρ))]

and the mean number of tasks in the system is

E[N] = kρ + P_Q · ρ / (1 − ρ).

The average response time is

E[R] = 1/µ + P_Q / (kµ − λ).

Finally, the p-th percentile of queueing time is

Q_p = max(0, ln(100 · P_Q / (100 − p)) / (kµ − λ)).
Figure 1b outlines how the 99th percentile of queueing time varies with the service rate µ for one and four servers.
Higher service rates correspond to less time spent by requests
in the queue. We use the M/M/k model for analysis of system
trade-offs unless otherwise specified. In the article’s section
on validation, we verify that this model closely reflects real
system behavior. For applications with non-Poisson arrival-and service-time distributions, more general queueing
models may be needed (such as the G/G/k model). 10, 24 For
more complex applications (such as multi-tier services),
system architects would need a more sophisticated analytical
model (such as a queueing network).
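The M/M/k quantities can likewise be sketched numerically; a minimal illustration of Erlang’s C formula and the queueing-time percentile, assuming the standard closed forms (function names are ours):

```python
import math

def erlang_c(k, rho):
    """Erlang's C formula: probability that an arriving task is
    enqueued in an M/M/k system at load rho = lam / (k * mu)."""
    a = k * rho
    tail = a**k / (math.factorial(k) * (1.0 - rho))
    summ = sum(a**n / math.factorial(n) for n in range(k))
    return tail / (summ + tail)

def mmk_queue_percentile(p, k, lam, mu):
    """p-th percentile of queueing time in an M/M/k system."""
    rho = lam / (k * mu)
    assert rho < 1.0, "system must be stable"
    pq = erlang_c(k, rho)
    # If fewer than (100 - p)% of tasks queue at all, the
    # percentile is zero, hence the clamp.
    t = math.log(100.0 * pq / (100.0 - p)) / (k * mu - lam)
    return max(0.0, t)

# Example: four servers at 50% load (lam = 2.0, mu = 1.0).
print(mmk_queue_percentile(99, 4, 2.0, 1.0))
```

Setting k = 1 reduces both functions to the M/M/1 case (Erlang’s C formula collapses to ρ), which is a useful sanity check.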