Varying the mix of long and short requests changes these findings significantly. The further the ratio of long-to-short requests deviates from the ratio of big-to-small cores, the more homogeneous systems outperform their heterogeneous counterparts. This result means that, for heterogeneous architectures to make sense, the system must closely track the input load and adjust as it changes, a common phenomenon in large-scale online services.18
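As a hypothetical sketch of the kind of load tracking this calls for (the names, smoothing factor, and threshold below are our illustrative choices, not the article's mechanism), an exponentially weighted estimate of the long-request share can be compared against the share of big cores, with sustained drift signaling that the core mix should be rebalanced:

```
# Hypothetical load tracker: EWMA of the long-request fraction,
# compared against the big-core share of the machine.

def make_tracker(big_core_share, alpha=0.1, slack=0.05):
    est = {"long_share": big_core_share}  # start out assuming a match
    def observe(is_long):
        # EWMA update of the observed long-request fraction.
        est["long_share"] += alpha * (float(is_long) - est["long_share"])
        # Flag when the request mix drifts from the core mix.
        return abs(est["long_share"] - big_core_share) > slack
    return observe

observe = make_tracker(big_core_share=0.2)
for is_long in (1, 0, 0, 1, 1, 1, 1, 0, 1, 1):
    if observe(is_long):
        print("request mix drifted from core mix; rebalance")
```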
Finding 5. We have again assumed unlimited request parallelism. Once serialization between requests is introduced, the optimal operating point shifts. Figure 4d shows QPS under various tail-latency QoS targets for increasing values of f ∈ {50%, 90%, 99%, 100%}. Where homogeneity previously outperformed heterogeneous designs for extreme QoS requirements, both very strict and very relaxed, heterogeneity now takes the lead. For example, for a moderate QoS target of 10 Ts and f = 0.9, a single big core achieves optimal performance, compared to the 50:50 mix in Figure 4c. In general, the more limited the parallelism, the further the optimal operating point shifts left, toward more big and fewer small cores. This is in agreement with Hill and Marty's observations,11 with the added implication that latency considerations cause a more rapid shift toward larger cores than when throughput is the only performance metric of interest. For example, when f = 0.9 and the system optimizes only for throughput, two 50BCE cores achieve the best performance under Hill and Marty's model. As before, this result highlights the importance of quantifying the degree of parallelism in interactive applications. It also establishes that, even with limited parallelism, scheduling that takes into account the different capabilities of the available hardware is essential for harnessing the potential of hardware heterogeneity.
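The queueing intuition behind core sizing can be explored with a small simulation. The sketch below is our own illustration, not the article's code or exact parameters: it models each equal-budget configuration as a single FCFS queue served by k cores (an M/M/k system) with Poisson arrivals and exponential service, and assumes per-core performance scales as sqrt(r) for an r-BCE core. Which configuration wins depends on the QoS target and load, mirroring the trade-offs in Figure 4:

```
import heapq
import math
import random

def p99_latency(lam, mu, k, n=200_000, seed=1):
    """Estimate the 99th-percentile response time of an FCFS M/M/k queue."""
    rng = random.Random(seed)
    free = [0.0] * k                        # times at which cores go idle
    heapq.heapify(free)
    t, lat = 0.0, []
    for _ in range(n):
        t += rng.expovariate(lam)           # next Poisson arrival
        start = max(t, heapq.heappop(free)) # FCFS: earliest-free core
        done = start + rng.expovariate(mu)  # exponential service time
        heapq.heappush(free, done)
        lat.append(done - t)
    lat.sort()
    return lat[int(0.99 * len(lat))]

R, TS = 100, 1.0                            # BCE budget, 1BCE service time
for r in (100, 25, 4):                      # per-core size in BCE
    k = R // r
    mu = math.sqrt(r) / TS
    lam = 0.7 * k * mu                      # offered load: 70% utilization
    print(f"{k:3d} x {r:3d}BCE cores: p99 = {p99_latency(lam, mu, k):.2f}")
```

One effect the sketch makes visible: smaller cores lengthen the service time itself (1/mu = 0.5 time units at 4BCE versus 0.1 at 100BCE), so tail targets below that floor become unreachable no matter how many cores are added, which is the queueing-level reason strict QoS pushes toward bigger cores.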
Caching
Architects constantly face the trade-off of spending limited resources on compute or on caching. Larger caches help avoid the long latencies of main memory but draw significant static power and reduce the resources available for compute cores; see Figure 6 for two characteristic configurations. Using the same total budget as before, R = 100, we explore how QPS under a tail-latency constraint changes as a fraction C ∈ [0, 90] of resources goes toward building caches rather than cores. We use 10BCE cores, which benefit applications with moderately strict QoS targets; Figure 4e shows this trade-off. At the leftmost point of the x-axis all resources are dedicated to building cores. At the rightmost point, 90% of resources go toward building caches and the remaining 10% toward building cores, a single 10BCE core in this case. Increasing caching by 10BCE thus results in one fewer core in the system. We assume caches improve service time as a function of sqrt(C), meaning Ts′ = Ts/√C.23
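As a deliberately simplified illustration of this sweep (our assumptions, not the article's exact model), the sketch below treats each 10BCE core as an independent M/M/1 queue fed an equal share of the load, for which the 99th-percentile response time has the closed form ln(100)/(mu − lam). The QoS = 10 Ts target and the max(1, √C) guard for tiny caches are illustrative choices:

```
import math

TS = 1.0                 # baseline service time of a 10BCE core (assumed)
QOS = 10 * TS            # tail-latency target: 99th percentile <= 10 Ts
R, CORE_SIZE = 100, 10

def max_qps(C):
    """Highest total load whose 99th-percentile latency meets QOS."""
    n = (R - C) // CORE_SIZE              # cores left after C cache units
    if n == 0:
        return 0.0
    speedup = max(1.0, math.sqrt(C))      # guard: no slowdown for C < 1
    mu = speedup / TS                     # per-core service rate
    lam = mu - math.log(100) / QOS        # per-core M/M/1 tail bound
    return max(0.0, n * lam)              # load is split evenly over cores

for C in range(0, 100, 10):
    print(f"C = {C:2d} cache units -> max QPS = {max_qps(C):6.2f}")
```

Even this crude version shows the qualitative shape of Figure 4e with these parameters: QPS climbs steeply with the first cache units, peaks near C = 40, and collapses as cores run out; substituting a weaker root for √C shifts the optimum left, as discussed next.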
We validate the selected scaling factor against a real installation of memcached in which the allocated last-level cache partition is adjusted using Intel's Cache Allocation Technology; as the number of active cores increases, the allocated cache capacity decreases. Figure 7 shows that the difference between the analytical model and the real system is, in general, marginal. The findings reported in Figure 4e remain consistent for scaling functions as weak as the seventh root of C, which corresponds to progressively lower benefits from caching, causing the optimal point to shift increasingly to the left.
Finding 6. For services with strict tail-latency requirements that exhibit locality, the benefit from caching is critical to achieving QoS. For strict QoS constraints (such as QoS = Ts), at least C = 20 units are needed to lower the cores' service time enough to achieve QPS under the tail-latency constraint.16,20 Moderately increasing caching resources beyond C = 20 units further improves performance, as larger fractions of the working set fit in the last-level cache;16 that is, more requests enjoy the shorter processing time of caches for the purposes of the queueing model. However, the benefits diminish beyond C = 40, and performance degrades rapidly as compute resources become insufficient.16 Existing server chips dedicate one-third to one-half of their area budget to caches. Our analysis indicates this trend will continue.
Finding 7. For relaxed QoS targets, caching is less critical, since smaller cores are sufficient for achieving the QoS constraints in this case. Although caching is still beneficial, moderate cache provisioning (such as C = 10 to 30 units) yields most of the potential performance benefit. Increasing caching to C = 40 units has no effect on performance, and further increases degrade it. Architects should focus instead on exploiting request parallelism in a way that keeps the large number of smaller cores busy.12,16
Figure 7. Validation of the queueing model against a real instantiation of an in-memory key-value store (memcached) with increasing caching and reduced compute resources. The plot shows latency (usec) versus load for the M/M/k model and memcached at k = 4 with a full cache, k = 8 with a half cache, and k = 16 with no cache.
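For reference, the analytic curves in Figure 7 (the M/M/k lines) can be sketched with the standard Erlang-C formula, which gives the probability that an arriving request must queue; mean response time then follows as C(k, a)/(k·mu − lam) + 1/mu. The 100-usec mean service time below is an assumed placeholder, since the measured memcached service times are not reproduced here:

```
import math

def erlang_c(k, a):
    """Probability an arrival queues in an M/M/k system, a = lam/mu < k."""
    inv = sum(a**i / math.factorial(i) for i in range(k))
    top = a**k / math.factorial(k)
    return top / ((1 - a / k) * inv + top)

def mean_latency(lam, mu, k):
    """Mean response time: expected queueing delay plus one service time."""
    return erlang_c(k, lam / mu) / (k * mu - lam) + 1 / mu

MU = 1 / 100e-6                     # assumed 100-usec mean service time
for k in (4, 8, 16):
    for load in (0.2, 0.5, 0.8, 0.95):
        lam = load * k * MU
        lat = 1e6 * mean_latency(lam, MU, k)
        print(f"k = {k:2d}, load = {load:.2f}: {lat:8.1f} usec")
```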