parallelism when sprinting powers on cores and tolerates
faults when cooling and recovery power off cores.
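As a rough illustration of task scheduling that adapts to the number of powered cores, the sketch below runs a task dependence graph step by step, dispatching at most as many ready tasks as there are cores available in that step. The function name schedule, the deps mapping, and the per-step core counts are hypothetical and are not taken from the executor described here; the sketch relies only on Python's standard graphlib module.

```python
from graphlib import TopologicalSorter


def schedule(deps, cores_per_step):
    """Greedy list scheduling over a task dependence graph.

    deps maps each task to the set of tasks it depends on (hypothetical input).
    cores_per_step lists how many cores are powered in each step: more
    during a sprint, fewer during cooling or recovery.
    Returns the tasks run in each step. If the core counts run out before
    the graph is finished, the returned timeline is partial.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    ready = []        # tasks whose dependences are satisfied but not yet run
    timeline = []
    for cores in cores_per_step:
        # Collect tasks that became ready; sort for a deterministic order.
        ready.extend(sorted(sorter.get_ready()))
        if not ready:
            break  # nothing left to run: every task has completed
        running, ready = ready[:cores], ready[cores:]
        timeline.append(running)
        for task in running:
            sorter.done(task)  # unlocks tasks that depend on this one
    return timeline


# Hypothetical graph: c needs a and b; d and e both need c.
deps = {"c": {"a", "b"}, "d": {"c"}, "e": {"c"}}
sprinting = schedule(deps, [4, 4, 4, 4])  # extra cores powered on
nominal = schedule(deps, [1] * 10)        # a single core available
```

With four cores the graph finishes in three steps, because independent tasks run side by side; with one core the same graph still completes, only more slowly, which is the sense in which dynamic scheduling tolerates cores being powered off.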
Agents are strategic and selfish entities that act on users’
behalf. They decide whether to sprint by continuously analyzing fine-grained application phases. Because sprints are
followed by cooling and recovery, an agent sprints judiciously and targets application phases that benefit most
from extra capability. Agents use predictors that estimate
utility from sprinting based on software profiles and hardware counters. Each agent represents a user and her application on a chip multiprocessor.
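To make the cost of an ill-timed sprint concrete, the following minimal sketch models an agent that sprints only when its predicted utility exceeds a threshold and that must then sit out a fixed number of cooling epochs. The threshold rule, the Agent class, and the parameters threshold and cooling_epochs are assumptions made for illustration; the text above states only that agents sprint judiciously using predicted utility.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    threshold: float       # assumed rule: sprint only above this predicted utility
    cooling_epochs: int    # assumed number of epochs spent cooling after a sprint
    cooling_left: int = 0  # epochs of cooling still outstanding

    def step(self, predicted_utility: float) -> bool:
        """Return True if the agent sprints in this epoch."""
        if self.cooling_left > 0:
            # Still cooling from an earlier sprint, so sprinting is impossible.
            self.cooling_left -= 1
            return False
        if predicted_utility > self.threshold:
            self.cooling_left = self.cooling_epochs
            return True
        return False


# Hypothetical per-epoch utility predictions for one application.
utilities = [1.5, 3.0, 5.0, 4.0, 2.5, 1.0]
agent = Agent(threshold=2.0, cooling_epochs=2)
decisions = [agent.step(u) for u in utilities]
```

In this run the agent sprints in the second epoch (utility 3.0) and is therefore still cooling when the more valuable phases (5.0 and 4.0) arrive. A higher threshold would have skipped the 3.0 phase and captured the 5.0 phase instead, which is why the choice of phases to target matters.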
Coordination. The coordinator collects profiles from
all agents and assigns tailored sprinting strategies to each
agent. The coordinator interfaces with strategic agents who
may attempt to manipulate system outcomes by misreporting profiles or deviating from assigned strategies.
Fortunately, our game-theoretic mechanism guards against
such behavior.
First, agents will truthfully report their performance profiles. In large systems, game theory provides incentive compatibility, which means that agents cannot improve their
utility by misreporting their preferences. An agent who misreports her profile has little influence on conditions in a
large system. Not only does she fail to affect others, but she also
suffers degraded performance because the coordinator assigns her a poorly suited strategy based on the inaccurate profile.
Second, agents will implement their assigned strategies
because the coordinator optimizes those strategies to produce an equilibrium. In equilibrium, every agent implements her strategy and no agent benefits by deviating from
it. An equilibrium has compelling implications for management overheads. If each agent knows that every other agent
is playing her assigned strategy, she will do the same without
further communication with the coordinator. Global communication between agents and the coordinator is infrequent and occurs only when system profiles change. In
effect, an equilibrium permits the distributed enforcement
of sprinting policies.
Equilibria are especially compelling when compared to
the centralized enforcement of coordinated policies, which
poses several challenges. First, centralized enforcement
requires frequent and global communication as each agent
decides whether to sprint by querying the coordinator at the
start of each epoch. The length of an epoch is short and corresponds to sprint duration. Moreover, without equilibria,
agents with kernel privileges could ignore prescribed policies, sprint at will, and cause power emergencies that harm
all agents.
3. THE SPRINTING GAME
We design a sprinting game to govern power supply and
manage system dynamics. The game divides time into
epochs and asks agents to play repeatedly. Agents represent
chip multiprocessors that share power. Each agent chooses
to sprint independently, pursuing benefits in the current
epoch and estimating repercussions in future epochs. An
agent’s utility from sprinting varies across epochs according
to her application’s phases. Multiple agents can sprint
not trip when fewer than 25% of the chips sprint and definitely
trips when more than 75% of the chips sprint. In other
words, N_min = 0.25N and N_max = 0.75N. We consider circuit
breakers that can be overloaded to 125–175% of rated current
for a 150 s sprint [18, 21].
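The two thresholds can be turned into a simple trip model. The sketch below returns 0 at or below the 25% threshold and 1 at or above the 75% threshold, as stated above; the linear interpolation between the two thresholds is an assumption made here for illustration, and the function name trip_probability and its parameters are hypothetical.

```python
def trip_probability(n_sprinting: int, n_total: int,
                     lo: float = 0.25, hi: float = 0.75) -> float:
    """Probability that the breaker trips given the number of sprinting chips.

    lo and hi are the fractions that define the lower and upper thresholds
    (0.25N and 0.75N in the text). Between them, the probability is
    ASSUMED to rise linearly; the text does not specify this region.
    """
    n_min = lo * n_total
    n_max = hi * n_total
    if n_sprinting <= n_min:
        return 0.0  # too few sprinters to trip the breaker
    if n_sprinting >= n_max:
        return 1.0  # the breaker definitely trips
    return (n_sprinting - n_min) / (n_max - n_min)


# Hypothetical rack of 1000 chips.
p_low = trip_probability(200, 1000)   # below the lower threshold
p_mid = trip_probability(500, 1000)   # halfway between the thresholds
p_high = trip_probability(800, 1000)  # above the upper threshold
```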
Uninterruptible power supplies. When the breaker trips
and resets, power distribution switches from the branch circuit to the uninterruptible power supply (UPS) [7].
The rack
augments power delivery with batteries to complete sprints
in progress. Lead-acid batteries support discharge times of
5–120 min, long enough to support the duration of a sprint.
After completing sprints and resetting the breaker, servers
resume computation on the branch circuit.
Servers are forbidden from sprinting again until UPS batteries are recharged. Sprinting before recovery compromises
server availability and increases vulnerability to power emergencies. Moreover, frequent discharges without recharges
shorten battery life. The average recovery duration, denoted
by Δt_recover, depends on the UPS discharge depth and recharging time. A battery can be recharged to 85% capacity in 8–10×
the discharge time, which corresponds to 8–10× the sprint
duration.
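As a worked example of this rule of thumb, the sketch below computes the range of recovery durations implied by a given sprint duration, assuming the battery discharges for exactly the length of the sprint. The function name recovery_window and its parameters are hypothetical.

```python
def recovery_window(sprint_s: float, lo: float = 8, hi: float = 10):
    """Return (shortest, longest) recharge time in seconds to reach 85%
    capacity, assuming the discharge lasts as long as the sprint and
    recharging takes 8-10x the discharge time."""
    return lo * sprint_s, hi * sprint_s


# A 150 s sprint implies roughly 1200-1500 s (20-25 min) of recovery.
shortest, longest = recovery_window(150)
```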
2.2. Management architecture
Figure 3 illustrates the management framework for a rack
of sprinting chip multiprocessors. The framework supports policies that pursue the performance of sprints
while avoiding system instability. Unmanaged and excessive sprints may trip breakers, trigger emergencies, and
degrade performance at scale. The framework achieves its
objectives with strategic agents and coarse-grained
coordination.
Users and agents. Each user deploys three run-time components: executor, agent, and predictor. Executors provide
clean abstractions, encapsulating applications that could
employ different software frameworks [10]. The executor supports task-parallel computation by dividing an application
into tasks, constructing a task dependence graph, and
scheduling tasks dynamically based on available resources.
Task scheduling is particularly important as it increases
[Figure 3 diagram: a coordinator running Alg 1 receives a profile from, and returns a strategy to, each user; each user runs an executor engine with tasks, an agent, and a predictor.]
Figure 3. Users deploy task executors and agents that decide when
to sprint. Agents send performance profiles to a coordinator and
receive optimized sprinting strategies.