DOI: 10.1145/3299885
Distributed Strategies for
Computational Sprints
By Songchun Fan,† Seyed Majid Zahedi,† and Benjamin C. Lee
Abstract
Computational sprinting is a class of mechanisms that boost
performance but dissipate additional power. We describe a
sprinting architecture in which many independent chip
multiprocessors share a power supply and sprints are constrained by the chips’ thermal limits and the rack’s power
limits. Moreover, we present the computational sprinting
game, a multi-agent perspective on managing sprints.
Strategic agents decide whether to sprint based on application phases and system conditions. The game produces an
equilibrium that improves task throughput for data analytics
workloads by 4–6× over prior greedy heuristics and performs
within 90% of an upper bound on throughput from a globally
optimized policy.
1. INTRODUCTION
Modern datacenters oversubscribe their power supplies to
enhance performance and efficiency. A conservative datacenter that deploys servers according to their expected power draw will under-utilize provisioned power, operate power supplies at sub-optimal loads, and forgo opportunities for higher performance. In contrast, efficient datacenters deploy more servers than they can power fully and rely on varying computational load across servers to modulate demand for power.4 Such a strategy requires responsive mechanisms for delivering power to the computation that needs it most.
Computational sprinting is a class of mechanisms that
supply additional power for short durations to enhance performance. In chip multiprocessors, for example, sprints
activate additional cores and boost their voltage and frequency.
Although originally proposed for mobile systems,13, 14 sprinting has found numerous applications in datacenter systems. It can accelerate computation for complex tasks or accommodate transient activity spikes.16, 21
The system architecture determines sprint duration
and frequency. Sprinting multiprocessors generate extra
heat, absorbed by thermal packages and phase change
materials (PCMs),14, 16 and require time to release this heat
between sprints. At scale, uncoordinated multiprocessors
that sprint simultaneously could overwhelm a rack or
cluster’s power supply. Uninterruptible power supplies
reduce the risk of tripping circuit breakers and triggering
power emergencies. But the system requires time to
recharge batteries between sprints. Given these physical
constraints in chip multiprocessors and the datacenter
rack, sprinters require recovery time. Thus, sprinting mechanisms couple performance opportunities with management constraints.

The original version of this paper, entitled “The Computational Sprinting Game,” was published in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (2016), ACM, NY.
We face fundamental management questions when servers sprint independently but share a power supply: which processors should sprint, and when should they sprint? The extra performance a processor’s workload derives from sprinting depends on its computational phase. Ideally, the sprinters would be the processors that benefit most from boosted capability at any given time.
Moreover, the number of sprinters would be small enough
to avoid power emergencies, which constrain future
sprints. Policies that achieve these goals are prerequisites
for sprinting to full advantage.
We present the computational sprinting game to manage
a collection of sprinters. The sprinting architecture, which
defines the sprinting mechanism as well as power and cooling constraints, determines rules of the game. A strategic
agent, representing a multiprocessor and its workload,
independently decides whether to sprint at the beginning of
an epoch. The agent anticipates her action’s outcomes,
knowing that the chip must cool before sprinting again.
Moreover, she analyzes system dynamics, accounting for
competitors’ decisions and risk of power emergencies.
We find the equilibrium in the computational sprinting
game, which permits distributed management. In an equilibrium, no agent can benefit by unilaterally deviating from her strategy. The datacenter relies on agents’ incentives to
decentralize management as each agent self-enforces her
part of the sprinting policy. Decentralized equilibria allow
datacenters to avoid high communication costs and
unwieldy enforcement mechanisms in centralized management. Moreover, equilibria outperform prior heuristics.
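The per-epoch decision each agent faces can be sketched as follows. This is an illustrative model only: the threshold policy, the utility estimate, and the fixed cooling period are expository assumptions, not the paper’s exact equilibrium formulation.

```python
# Illustrative sketch of one agent's per-epoch sprint decision.
# The threshold policy and the fixed cooling/recovery model are
# assumptions for exposition; the paper derives strategies formally.

from dataclasses import dataclass

@dataclass
class AgentState:
    cooling: int = 0          # epochs remaining before the chip may sprint again
    recovery_epochs: int = 3  # epochs needed to release heat after a sprint

def decide_sprint(state: AgentState, utility: float, threshold: float) -> bool:
    """Sprint iff the chip is cool and this phase's utility clears the threshold.

    `utility` is the agent's estimated speedup from sprinting in the current
    application phase; `threshold` would encode her anticipation of future
    phases and of competitors' demand for shared power (a scalar stand-in here).
    """
    if state.cooling > 0:          # chip must cool before sprinting again
        state.cooling -= 1
        return False
    if utility >= threshold:
        state.cooling = state.recovery_epochs
        return True
    return False

# A phase with high utility triggers a sprint; the chip then cools.
agent = AgentState()
decisions = [decide_sprint(agent, u, threshold=2.0)
             for u in [1.2, 3.5, 4.0, 1.0, 2.5]]
print(decisions)   # → [False, True, False, False, False]
```

Note how the cooling constraint forces the agent to weigh sprinting now against forgoing a possibly better phase later, which is exactly the strategic tension the game captures.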
2. THE SPRINTING ARCHITECTURE
We present a sprinting architecture for chip multiprocessors in datacenters. Multiprocessors sprint by activating
additional cores and increasing their voltage and frequency.
Datacenter applications, with their abundant task parallelism, scale across additional cores as they become available.
In Figure 1, Spark benchmarks perform 2–7× better on a sprinting multiprocessor but dissipate 1.8× the power. That power produces heat.
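A back-of-the-envelope calculation shows why simultaneous sprints must be limited. The sketch below bounds the number of concurrent sprinters under a shared rack budget; the 1.8× factor comes from the measurements above, while the per-chip nominal wattage and rack budget are hypothetical numbers chosen for illustration.

```python
# Back-of-the-envelope bound on simultaneous sprinters in a rack.
# SPRINT_FACTOR follows the ~1.8x power measurement; NOMINAL_W and
# the budget below are hypothetical values for illustration.

NOMINAL_W = 200.0        # assumed per-chip nominal power draw (W)
SPRINT_FACTOR = 1.8      # sprinting dissipates ~1.8x nominal power

def rack_power(n_chips: int, n_sprinters: int) -> float:
    """Total draw when n_sprinters of n_chips sprint simultaneously."""
    return ((n_chips - n_sprinters) * NOMINAL_W
            + n_sprinters * NOMINAL_W * SPRINT_FACTOR)

def max_safe_sprinters(n_chips: int, budget_w: float) -> int:
    """Largest number of simultaneous sprinters that stays within budget."""
    extra_per_sprinter = NOMINAL_W * (SPRINT_FACTOR - 1.0)
    headroom = budget_w - n_chips * NOMINAL_W
    return max(0, min(n_chips, int(headroom // extra_per_sprinter)))

# An oversubscribed rack: 40 chips, but a budget of only 9.6 kW.
print(max_safe_sprinters(40, budget_w=9600.0))   # → 10
```

With these assumed numbers, only a quarter of the chips can sprint at once; if more do, the rack trips its breaker, which is why uncoordinated sprinting risks power emergencies.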
Sprinters require infrastructure to manage heat and
power. First, the chip multiprocessor’s thermal package
† These authors contributed equally to this work.