Technical Perspective
How Economic Theories Can Help Computers Beat the Heat
By Thomas F. Wenisch

DOI: 10.1145/3299883
To view the accompanying paper, visit doi.acm.org/10.1145/3299885

NEARLY EVERY COMPUTER system today
runs hot … too hot. For over a decade,
thermal constraints have limited the
computational capability of computing
systems of all sizes—from mobile
phones to datacenters. And, for nearly
that long, system designers have cheated
those thermal limits, allowing systems
to burn more power, and produce more
heat, for short periods to deliver bursts
of peak performance beyond what can
be sustained. This idea—running a computer too hot for a short period of time to
get a burst of performance—is called
computational sprinting.
We have likely all experienced computational sprinting on our smartphones;
it turns out that, if all the cores, accelerators, and peripherals on a modern smartphone are turned on at once, the phone
will generate several times more heat
than can be dissipated through its case. If
you play a demanding 3D video game for
more than a few minutes, you might notice the phone get uncomfortably warm.
As the phone heats up, eventually, processing speeds have to slow to keep temperature rise in check. When the phone
cools, its processor can run full-tilt again.
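To make that sprint-then-throttle cycle concrete, here is a minimal toy model (not from the article) of heat building up faster than a phone's case can shed it; every constant below is invented for illustration.

```python
# Toy thermal model of computational sprinting (illustrative constants only).
AMBIENT_C   = 25.0   # assumed ambient temperature
CEILING_C   = 45.0   # assumed skin-temperature limit before throttling
SPRINT_W    = 6.0    # assumed power draw while sprinting
SUSTAIN_W   = 2.0    # assumed throttled (sustainable) power draw
DISSIPATE_W = 2.5    # assumed heat the case can shed each step
DEG_PER_W   = 0.5    # assumed degrees gained per excess watt per step

temp, throttled = AMBIENT_C, False
for step in range(40):
    power = SUSTAIN_W if throttled else SPRINT_W
    temp += DEG_PER_W * (power - DISSIPATE_W)
    if temp >= CEILING_C:
        throttled = True    # too hot: drop to the sustainable speed
    elif throttled and temp <= AMBIENT_C + 5.0:
        throttled = False   # cooled off: free to sprint again
    print(f"t={step:2d}  {temp:5.1f} C  {'throttled' if throttled else 'sprinting'}")
```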
What might be less widely known
is that modern datacenters can play
similar tricks; they oversubscribe both
power delivery and cooling capability
to eke out greater efficiency. Individual
servers may sprint by consuming more
than their fair share of power to maximize performance when their workload
is high. In a datacenter running diverse
workloads, different systems will likely
sprint at different times, and the average demands of the facility will (probably) remain sustainable. But, a local
spike in one server rack might draw too
much power from a particular circuit, risking a circuit breaker trip. Or,
all the cores in a particular server might
run a sustained compute job at full bore
and risk local over-heating. To maximize
efficiency, a datacenter should sprint as
close to its power and thermal limits as
it can … without going over them.
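As a back-of-the-envelope illustration of that balancing act, the sketch below uses made-up numbers (not figures from the paper) to show how a circuit provisioned for a rack's average draw almost always suffices, yet is occasionally exceeded when too many servers happen to sprint at once.

```python
# Toy illustration of power oversubscription on one rack circuit.
# All numbers are assumptions for illustration, not from the paper.
import random

SERVERS        = 40
IDLE_W         = 200.0     # assumed per-server baseline draw
SPRINT_EXTRA_W = 150.0     # assumed extra draw while a server sprints
SPRINT_PROB    = 0.15      # assumed chance a server sprints in a given interval
BREAKER_W      = 10_000.0  # circuit sized between average (~8.9 kW) and peak (14 kW)

random.seed(0)
trials, trips = 10_000, 0
for _ in range(trials):
    sprinters = sum(random.random() < SPRINT_PROB for _ in range(SERVERS))
    draw = SERVERS * IDLE_W + sprinters * SPRINT_EXTRA_W
    trips += draw > BREAKER_W

print(f"expected draw  : {SERVERS * (IDLE_W + SPRINT_PROB * SPRINT_EXTRA_W):.0f} W")
print(f"worst-case draw: {SERVERS * (IDLE_W + SPRINT_EXTRA_W):.0f} W")
print(f"fraction of intervals exceeding the {BREAKER_W:.0f} W circuit: {trips / trials:.3%}")
```

With these assumed numbers the circuit sits well above the expected draw but well below the worst case, so overloads are rare but not impossible, which is exactly the risk that sprint management must weigh.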
Current datacenters must either run
complex, centralized control systems to
allocate power and thermal budgets at
fine granularity, or reserve large guard-bands to avoid power or thermal emergencies. But, because they require frequent communication, centralized systems are prone to failure and notoriously difficult to scale—that communication rapidly becomes a bottleneck. Moreover, workloads benefit to
different degrees at different times from
computational sprinting; judicious use
of scarce power and cooling budgets can
lead to better overall performance. The
challenges of allocating budgets grow
even more daunting in cloud computing
environments, where each cloud tenant
seeks to maximize its own performance
and may have no incentive to cooperate.
Economics has long studied the challenges of allocating scarce resources.
Game theory, in particular, studies
resource allocation among strategic
agents that seek to maximize their individual utility and might even lie about
their preferences to do so.
The authors of the following paper,
Distributed Strategies for Computational
Sprints, bring this rich theory to the
challenge of managing computational sprinting in datacenters. They formulate sprint management as a repeated game:
agents managing individual workloads
are free to choose when to sprint, but must wait for a cool-off period before sprinting again. Moreover, if too many nodes sprint at once, supplemental battery power must be used to avoid tripping circuit breakers; servers connected to that power circuit are not allowed to sprint again until the battery recharges. To “win” in this game, agents must choose to sprint when they achieve the maximum performance benefit, while taking into account the risk that too many concurrent sprinters will cause a circuit to trip.
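As a concrete, if simplified, picture of these dynamics, the sketch below simulates a few rounds of such a game. The threshold policy, cool-off length, trip limit, and recharge time are all illustrative assumptions, not the model or parameters from the paper.

```python
# Toy round-based sketch of the sprinting game described above.
import random

AGENTS, ROUNDS  = 20, 50
THRESHOLD       = 0.7   # assumed benefit level above which an agent sprints
COOLDOWN_ROUNDS = 3     # assumed per-agent cool-off after a sprint
TRIP_LIMIT      = 6     # assumed max concurrent sprinters the circuit tolerates
RECHARGE_ROUNDS = 5     # assumed battery recharge period after a trip

random.seed(1)
cooldown = [0] * AGENTS   # rounds each agent must still wait before sprinting
recharge = 0              # rounds until the shared battery is available again
total_benefit = 0.0

for _ in range(ROUNDS):
    cooldown = [max(0, c - 1) for c in cooldown]   # advance timers
    recharge = max(0, recharge - 1)
    sprinters = []
    for a in range(AGENTS):
        benefit = random.random()   # this round's benefit from sprinting
        if cooldown[a] == 0 and recharge == 0 and benefit > THRESHOLD:
            sprinters.append(a)
            total_benefit += benefit
            cooldown[a] = COOLDOWN_ROUNDS
    if len(sprinters) > TRIP_LIMIT:   # spike: battery covers it, then must recharge
        recharge = RECHARGE_ROUNDS

print(f"total sprint benefit over {ROUNDS} rounds: {total_benefit:.2f}")
```

Under these rules, sprinting too eagerly is self-defeating: frequent spikes force everyone on the circuit to sit idle while the battery recharges, which is why the choice of when to sprint is strategic.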
To optimize the datacenter as a whole, each agent provides a broker with its best estimate of its utility curve—how much benefit it gains from sprinting for various fractions of its execution while taking into account the risks of a circuit breaker trip. The broker then solves for a global equilibrium that maximizes utility, and provides each agent the strategy it should follow to reach that equilibrium. The strength of the underlying economic theory is that agents provably cannot gain an advantage from lying about their utility curve or deviating from their assigned strategy … so, they are incentivized to cooperate.
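One way to picture the broker's job, under simplifying assumptions, is as dividing a circuit-wide "sprint budget" among agents according to the diminishing-returns curves they report. The greedy marginal allocation below is my own illustration, not the paper's mechanism; the agent names, square-root utility curves, weights, and budget are all hypothetical.

```python
# Toy broker: split an assumed circuit-wide sprint budget across agents
# according to the marginal benefit of each additional slice of sprinting.
import math

def make_utility(weight):
    """Assumed diminishing-returns curve: benefit of sprinting a fraction f of execution."""
    return lambda f: weight * math.sqrt(f)

# Hypothetical agents and the utility curves they report to the broker.
reported = {
    "web-frontend":    make_utility(4.0),
    "batch-analytics": make_utility(1.5),
    "video-transcode": make_utility(3.0),
}

BUDGET = 1.2   # assumed total sprint fraction the circuit can sustain in aggregate
STEP   = 0.01  # granularity of the greedy allocation

alloc = {name: 0.0 for name in reported}
for _ in range(round(BUDGET / STEP)):
    # Hand the next slice of budget to whichever agent gains the most from it.
    best = max(
        (n for n in reported if alloc[n] + STEP <= 1.0),
        key=lambda n: reported[n](alloc[n] + STEP) - reported[n](alloc[n]),
        default=None,
    )
    if best is None:
        break
    alloc[best] += STEP

for name, frac in alloc.items():
    print(f"{name:16s} assigned to sprint for {frac:.0%} of its execution")
```

The sketch captures only the shape of the interaction the article describes: agents report their curves once, the broker replies with a strategy, and no fine-grained coordination is needed afterward; the equilibrium analysis and the incentive guarantees come from the paper itself.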
The beauty of this approach is that it provides nearly the effectiveness of perfect centralized control while requiring only simple, infrequent interactions with the broker. Because agents cannot gain an advantage by cheating, this kind of coordination mechanism can be used even among mutually distrusting agents, as in the cloud. More generally, the paper teaches us that, when we consider the myriad resource management challenges that arise in computer systems, we ought to look beyond the confines of our own discipline; economics provides a rich toolset from which all of us can learn.

Thomas F. Wenisch is an associate professor of computer science and engineering at the University of Michigan, Ann Arbor, MI, USA.

Copyright held by author/owner.