Finding 2. At lower latency constraints, architects should look for ways to balance optimizations for single-thread performance and request-level parallelism. At lower QoS targets, a larger set of medium-size cores achieves the best performance; for example, 7BCE cores are optimal for QoS = 10 Ts. For applications with moderate latency requirements (such as Web search and Web servers), architects should seek to balance improvements in single-thread performance (instruction-level parallelism) against improvements in multi-core performance (request-level parallelism); increasing single-thread performance at high cost yields diminishing returns in this case. By contrast, a large pool of wimpy cores (1BCE) is optimal only when applications have no latency constraints, as with long data-mining queries or log-processing requests. With QoS = 100 Ts, applications are essentially throughput-limited and perform best with many wimpy cores.
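This trade-off can be explored with a small queueing simulation. The sketch below is a minimal toy model, not the article's: it assumes Poisson arrivals, exponentially distributed service times, and a single FCFS queue shared by all cores, and borrows only the perf(r) = √r scaling of Figure 3. The function names, the parameter values (µ = 1, R = 100, the grid of core sizes, the QoS multiples), and the bisection procedure are all choices made here for illustration, so the numeric optimum will not match the article's 7BCE figure; what the sketch does show is the same qualitative pattern, with the best core size shrinking as the QoS target loosens.

import heapq, math, random

def p99_latency(lam, mu, c, n=40_000, seed=42):
    # 99th-percentile sojourn time of an M/M/c FCFS queue, by simulation:
    # each arrival is served by the earliest-free of the c cores.
    rng = random.Random(seed)
    free = [0.0] * c                          # next-free time of each core (min-heap)
    t, lat = 0.0, []
    for _ in range(n):
        t += rng.expovariate(lam)             # Poisson arrivals at rate lam
        start = max(t, heapq.heappop(free))   # possibly wait for a core
        done = start + rng.expovariate(mu)    # exponential service at rate mu
        heapq.heappush(free, done)
        lat.append(done - t)
    lat.sort()
    return lat[int(0.99 * n)]

def max_load(r, qos_mult, R=100, mu=1.0):
    # Highest arrival rate an R-BCE budget of r-BCE cores sustains while
    # keeping p99 latency within qos_mult * Ts, where Ts = 1/mu (Figure 3).
    c = R // r                                # number of cores in the budget
    mu_r = mu * math.sqrt(r)                  # perf(r) = sqrt(r): one core's service rate
    lo, hi = 0.0, c * mu_r                    # bisect between zero and total capacity
    for _ in range(12):
        mid = (lo + hi) / 2
        if p99_latency(mid, mu_r, c) <= qos_mult / mu:
            lo = mid
        else:
            hi = mid
    return lo

for q in (1.2, 2, 10, 100):                   # QoS targets as multiples of Ts
    best = max((1, 2, 4, 5, 10, 20, 25, 50, 100),
               key=lambda r: max_load(r, q))
    print(f"QoS = {q:g} Ts: best core size {best} BCE")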
These findings highlight a disparity between optimal system design when optimizing for throughput versus when optimizing for tail latency. For example, in a homogeneous system where throughput is the only performance metric of interest and parallelism is plentiful, the smallest cores achieve the best performance; see the 1BCE cores in Figure 4a. In comparison, when optimizing for throughput under a tail-latency constraint, the optimal design point shifts toward larger cores, unless the latency constraint relaxes significantly.
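To see why the two objectives pull apart, consider a back-of-the-envelope calculation under the model of Figure 3, where a budget of R BCEs is split into R/r cores of r BCEs each, with per-core service rate µ√r. The aggregate service capacity is

\[
\lambda_{\max}(r) = \frac{R}{r}\cdot\mu\sqrt{r} = \frac{R\mu}{\sqrt{r}},
\]

which is maximized at r = 1, so raw throughput always favors the smallest cores. Tail latency, in contrast, is bounded below by the service-time tail even on an idle system. Assuming exponentially distributed service times (an assumption added here for illustration), the 99th percentile of service time alone is ln(100)/(µ√r) ≈ 4.6/(µ√r), so a QoS target of k·Ts = k/µ is feasible only if

\[
\sqrt{r} \;\ge\; \frac{\ln 100}{k} \approx \frac{4.6}{k}.
\]

A tight target of k = 2 already rules out cores smaller than roughly 6BCE, while k = 100 admits even 1BCE cores.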
Finding 3. Limited parallelism also calls for more powerful cores. So far we have assumed all user requests are independent and perfectly parallelizable, though this is rarely the case in practice. Requests often depend on one another and on system issues, such as connection ordering and locks for writes, that cause serialization. The growing trend of breaking complex services down into smaller components (microservices) will only make the problem of request dependencies more common. This brings up the caveat of Amdahl's Law: to what extent are the previous findings accurate when parallelism is limited? Figure 4b shows the case of a reasonable QoS target (10 Ts) with f ∈ {50%, 90%, 99%, 100%}. When, for example, the parallel fraction of the computation f is 90%, 10% of requests are serialized. As a result, while optimal performance was previously achieved with 7BCE cores, the optimal core size now shifts to 25BCEs.
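One way to make the caveat concrete is to extend the earlier queueing sketch with a serial bottleneck. In the toy model below (a construction of ours, not the article's model), a request is parallelizable with probability f and runs on the earliest-available core, while the remaining 1 − f of requests must all pass through core 0, mimicking a lock or an ordering constraint. The parameters remain illustrative, so the exact optima differ from the article's 7-to-25BCE shift, but the optimum moves toward larger cores as f drops in the same way.

import heapq, math, random

def p99_with_serial(lam, mu, c, f, n=40_000, seed=7):
    # p99 sojourn time when only a fraction f of requests is parallelizable:
    # the other (1 - f) must run on core 0, mimicking a lock or ordering constraint.
    rng = random.Random(seed)
    free0, others = 0.0, [0.0] * (c - 1)      # core 0, plus a min-heap of the rest
    t, lat = 0.0, []
    for _ in range(n):
        t += rng.expovariate(lam)
        svc = rng.expovariate(mu)
        if rng.random() < f and others and others[0] < free0:
            start = max(t, heapq.heappop(others))   # parallel: earliest-free core
            done = start + svc
            heapq.heappush(others, done)
        else:
            done = max(t, free0) + svc              # serialized (or core 0 free first)
            free0 = done
        lat.append(done - t)
    lat.sort()
    return lat[int(0.99 * n)]

def max_load(r, f, qos_mult=10.0, R=100, mu=1.0):
    # Bisect for the highest load with p99 <= qos_mult * Ts, as before.
    c, mu_r = R // r, mu * math.sqrt(r)
    lo, hi = 0.0, c * mu_r
    for _ in range(12):
        mid = (lo + hi) / 2
        if p99_with_serial(mid, mu_r, c, f) <= qos_mult / mu:
            lo = mid
        else:
            hi = mid
    return lo

for f in (1.0, 0.99, 0.9, 0.5):
    best = max((1, 2, 4, 5, 10, 20, 25, 50, 100),
               key=lambda r: max_load(r, f))
    print(f"f = {f:4.2f}: best core size {best} BCE under QoS = 10 Ts")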
Limited parallelism also affects throughput-centric systems,11 with more powerful cores outperforming wimpy cores in applications with serial regions. Using Hill and Marty's model11 with a 100BCE budget and 10% serialization, an architect would determine that 10BCE cores are optimal for throughput, a less aggressive increase in core size than when optimizing for latency.
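The 10BCE figure can be checked directly against Hill and Marty's symmetric-multicore speedup formula, using their perf(r) = √r scaling; the sketch below restricts core sizes to even divisors of the 100BCE budget, a simplification added here.

import math

def speedup(f, r, R=100):
    # Hill & Marty's symmetric-multicore speedup with perf(r) = sqrt(r):
    # serial work runs on one r-BCE core, parallel work on all R/r of them.
    perf = math.sqrt(r)
    return 1.0 / ((1.0 - f) / perf + f / (perf * (R / r)))

sizes = (1, 2, 4, 5, 10, 20, 25, 50, 100)     # core sizes dividing the budget evenly
for f in (1.0, 0.99, 0.9, 0.5):
    best = max(sizes, key=lambda r: speedup(f, r))
    print(f"f = {f:4.2f}: best core size {best} BCE, speedup {speedup(f, best):5.1f}x")

For f = 90% this indeed selects 10BCE cores; under the same formula, f = 50% selects the single 100BCE core.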
As parallelism decreases further, more performant cores are needed to drive down tail latency. When 50% of execution is serial, a single 100BCE core is optimal, emphasizing single-thread performance even at high cost. At the same time, some parallelism across cores is still needed: a single 100BCE core performs significantly worse than four 25BCE cores.
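This last comparison can be spot-checked in the same toy model used above, again with made-up parameters (µ = 1, load λ = 8, parallel fraction f = 90%): the four medium cores deliver a markedly lower p99 than the one large core.

import heapq, random

def p99(lam, mu, c, f=0.9, n=60_000, seed=11):
    # Same toy model as above: (1 - f) of requests serialized on core 0.
    rng = random.Random(seed)
    free0, others = 0.0, [0.0] * (c - 1)
    t, lat = 0.0, []
    for _ in range(n):
        t += rng.expovariate(lam)
        svc = rng.expovariate(mu)
        if rng.random() < f and others and others[0] < free0:
            start = max(t, heapq.heappop(others))
            done = start + svc
            heapq.heappush(others, done)
        else:
            done = max(t, free0) + svc
            free0 = done
        lat.append(done - t)
    lat.sort()
    return lat[int(0.99 * n)]

lam = 8.0                                     # illustrative offered load
print("one 100BCE core :", round(p99(lam, mu=10.0, c=1), 2))  # perf = sqrt(100) = 10
print("four 25BCE cores:", round(p99(lam, mu=5.0,  c=4), 2))  # perf = sqrt(25)  = 5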
This finding is in agreement with industry concerns about the performance of small cores with warehouse-scale services.12 The need for high single-thread performance also motivates application- or domain-specific accelerators as a more economical way of improving performance than incremental out-of-order core optimizations.
Figure 3. Homogeneous server configurations for a budget of R = 100 resource units: (a) 100 1BCE cores, each with service time Ts = 1/µ; (b) 25 4BCE cores, each with service time Ts = 1/(µ√4); and (c) one 100BCE core with service time Ts = 1/(µ√100). Each configuration serves requests arriving at rate λ.