Figure 4a shows how throughput in
queries per second (QPS) changes for
different latency QoS targets, under the
M/M/N queueing model described in
the sidebar. A throughput of 100 QPS at QoS = 10 Ts means the system sustained 100 QPS while keeping the 99th percentile of request latency within 10 Ts. The x-axis captures the core size, moving from
many small cores on the left side to a
single core of 100BCEs on the right
side. We examine all core sizes from
1BCE up to 100BCEs in increments of
a single resource unit. In configurations with multiple cores, throughput
is aggregated across all cores. The discontinuities in the graph are an artifact
of the limited resource budget and homogeneous design; for example, for
U = 51, an architect can build only a single 51BCE core, leaving 49 resource units unused. The throughput curve for QoS = 10 Ts overlaps with that for 100 Ts for cores larger than 7BCEs, as does the curve for 5 Ts for cores larger than 12BCEs.
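As a rough sketch of how such a sweep can be computed, the following script treats each core as an independent M/M/1 queue (a simplification of the sidebar's M/M/N model) and uses the perf(r) = sqrt(r) cost model described later under "Cost Model," with time normalized so the baseline 1BCE core has mean service time 1. The budget R, the function names, and the normalization are our own illustrative assumptions, not the authors' code; for an M/M/1 queue the response time is exponential with rate µ − λ, so its 99th percentile is ln(100)/(µ − λ), which gives a closed form for the largest sustainable load.

    import math

    R = 100                  # resource budget in BCEs
    P99 = math.log(100)      # 99th-percentile factor of an exponential

    def perf(r):
        # Hill/Marty-style cost model: performance of an r-BCE core
        return math.sqrt(r)

    def qps(u, qos):
        # Aggregate QPS of R//u homogeneous cores of u BCEs each,
        # each modeled as an M/M/1 queue, under a p99 <= qos target
        # (qos in units of the baseline core's mean service time).
        k = R // u                      # cores built; R - k*u BCEs go unused
        mu = perf(u)                    # per-core service rate
        lam = mu - P99 / qos            # largest rate with p99 within target
        return k * max(0.0, lam)        # zero QPS if an idle core misses QoS

    for alpha in (1, 5, 10, 50, 100):   # QoS targets, as multiples of Ts
        best = max(range(1, R + 1), key=lambda u: qps(u, alpha))
        print(f"QoS = {alpha} Ts: best core size {best}BCE, {qps(best, alpha):.1f} QPS")

This toy model reproduces the qualitative shape of Figure 4a, including the zero-QPS regions under strict targets, though the exact crossover points depend on the precise queueing model and normalization used in the sidebar.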
Finding 1. Very strict QoS targets put
a lot of pressure on single-thread per-
formance. When QoS = Ts or 5 Ts, cores
smaller than 22BCEs or 12BCEs, re-
spectively, achieve zero QPS for which
the tail latency satisfies the QoS target.
This happens because the cores are too
weak to handle variability in service
time even in the absence of queue-
ing, and queueing naturally arises as cores operate close to saturation.
This result means that, for services
with extremely low-latency require-
ments (such as in-memory caching
and in-memory distributed storage), 21
architects must focus on improving
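The zero-QPS regime admits a one-line explanation under the model's exponential assumptions. As a sketch (treating each core as an M/M/1 queue with service rate µ(r) for an r-BCE core), the 99th percentile of response time satisfies

    T_{99}(\lambda) \;=\; \frac{\ln 100}{\mu(r) - \lambda} \;\ge\; \frac{\ln 100}{\mu(r)} \approx \frac{4.6}{\mu(r)},

with the lower bound attained as the arrival rate λ → 0. If this zero-load floor already exceeds the QoS target, no arrival rate satisfies the target, and the achievable QPS is zero no matter how lightly the core is loaded.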
Our queueing model (see Figure 1 in the sidebar "Analytical Framework") can provide first-order insights on how design decisions interact with tail latency. As was the case with the previous analyses based on Amdahl's Law, our model has significant implications for processor designs for cloud servers.
While analytical models help draw
first-order insights, they run the risk
of not accurately reflecting the complex operation of a real system. In Figure 2, we show a brief validation study
of the queueing model, as discussed
in the sidebar, with { 1, 4, 8, 16} compute cores against a real instantiation
of memcached, a popular in-memory,
key-value store, with the same number
of cores. We set the arrival rate and mean service time of the queueing model based on the times measured with memcached. When memcached is driven with exponentially distributed input load, its measured request latency closely matches the latency estimated by the queueing model across all core counts and load levels.
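For reference, the mean response time of the M/M/k model used in this comparison follows from the Erlang C formula; the sketch below is our own minimal implementation, and the arrival rate and service rate would be set from the measured memcached times (the numbers in the example are hypothetical, not the measured values).

    import math

    def erlang_c(k, a):
        # Probability that an arrival must wait in an M/M/k queue,
        # with offered load a = lam / mu < k (Erlang C formula)
        s = sum(a**n / math.factorial(n) for n in range(k))
        tail = a**k / math.factorial(k) * k / (k - a)
        return tail / (s + tail)

    def mmk_mean_latency(k, lam, mu):
        # Mean response time (queueing wait plus service) of M/M/k
        a = lam / mu
        assert a < k, "unstable: offered load exceeds capacity"
        return erlang_c(k, a) / (k * mu - lam) + 1.0 / mu

    mu = 1 / 100.0                       # e.g., 100 usec mean service time
    for rho in (0.2, 0.4, 0.6, 0.8):     # per-core utilization, k = 4 cores
        lam = rho * 4 * mu
        print(f"load {rho:.1f}: {mmk_mean_latency(4, lam, mu):.0f} usec")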
Cost Model
Since hardware resources are not infinite, this analysis requires a cost model
that relates resource usage to performance. We use a model similar to the
one used by Hill and Marty11 to extend
Amdahl’s Law to multicore chips. That
is, we assume a given multicore chip is
limited to R base core equivalent (BCE) units. This limitation represents area or power-consumption constraints in the chip design. The BCE is an abstract cost unit that captures processor resources and caches but not shared resources (such as interconnection networks and memory controllers). As in Hill and Marty,11 we assume these shared resources remain fairly constant across the system variations we examine. A baseline core that consumes 1BCE unit achieves performance perf(1) = 1. Chip architects can build more powerful cores by dedicating r ∈ [1, R] resource units to each core to achieve performance perf(r), where perf(r) is the rate parameter µ in our performance model.
Larger cores have higher service rate
µ, which is inversely related to tail latency, as discussed in the sidebar. If
performance increased superlinearly with resources, a few large cores would always be preferable. In practice, perf(r) < r, which creates a trade-off between a few brawny cores and many wimpy ones. By default, we follow Shekhar Borkar3 and use perf(r) = sqrt(r), but we have also investigated how higher roots affect the corresponding insights.
Brawny Versus Wimpy Cores
We first examine a system where all
cores are homogeneous and have
identical cost. An important question
the designer must answer is: Given a
constrained aggregate power or area
budget, should architects build a few
large cores or many small cores? The
answer has been heavily debated in
recent years in both academia and
industry,4,12,14,17,19,22 as it relates to the
introduction of new designs (such as
the ARM server chips and throughput
processors like Xeon Phi).
Assuming the total budget is R =
100BCEs, an architect can build 100
basic cores of 1BCE each, 25 cores of
4BCEs each, one large core of 100BCEs,
or in general R/U cores of U units each,
as shown in Figure 3. We consider
an online service workload with tail
latency quality-of-service (QoS) constraints. QoS is defined as a function of
the mean service time Ts of the 100BCE
machine. For example, a very strict QoS
target would require the 99th percentile
of request latency to be Ts. This means
the time between arrival and completion of 99% of requests must be less than or equal to the machine's mean service time, allowing no tolerance for queueing or service-time variability. More relaxed QoS targets are defined as multiples of Ts: QoS = α Ts, α ∈ {5, 10, 50, 100}.
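A QoS check of this form is simple to express; the sketch below (our own illustration, with hypothetical names) tests whether a trace of measured request latencies meets QoS = α Ts at the 99th percentile.

    import math

    def meets_qos(latencies, ts, alpha, pct=0.99):
        # True if at least pct of requests complete within alpha * ts;
        # latencies are arrival-to-completion times, same unit as ts.
        ordered = sorted(latencies)
        idx = max(0, math.ceil(pct * len(ordered)) - 1)   # rank of the p99 request
        return ordered[idx] <= alpha * ts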
Figure 2. Validation of the queueing model against a real instantiation of an in-memory key-value store (memcached) for {1, 4, 8, 16} cores.
[Figure: request latency (usec) versus load, comparing the M/M/k model (MM1, MM4, MM8, MM16) with measured memcached latency for k = 1, 4, 8, 16.]