compute π_L(s_i), after which we use Equation (12) to determine an updated value at s_i. After all the samples s_i have been processed in this manner, we have an updated approximation of the value function. We repeat this process until convergence and use the last iteration as the final value function.
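As a concrete sketch of this procedure, the sweep can be written as a short loop. The helper names below (actions, reward, integrate for the transition I_s, and interpolate_value for evaluating the interpolated value function at an arbitrary task state) are hypothetical stand-ins, not part of the original system:

    import numpy as np

    def fitted_value_iteration(samples, actions, reward, integrate,
                               interpolate_value, tol=1e-4, max_iters=200):
        """One sweep per iteration: back up the value at every task-state
        sample and repeat until the values stop changing."""
        V = np.zeros(len(samples))          # start from the all-zero value function V_0
        for _ in range(max_iters):
            V_new = np.empty_like(V)
            for i, s in enumerate(samples):
                # Equation (11): choose the lookahead-optimal action at s_i
                a_best = max(actions(s),
                             key=lambda a: reward(s, a) + interpolate_value(V, integrate(s, a)))
                # Equation (12): back its value up into the sample
                V_new[i] = reward(s, a_best) + interpolate_value(V, integrate(s, a_best))
            if np.max(np.abs(V_new - V)) < tol:   # converged: values no longer change
                return V_new
            V = V_new
        return V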
Temporal value function compression. Unlike graph-based approaches, motion fields let characters be in constant transition between many sources of data. Consequently, we need access to the value function at all motion states,
rather than only at transitions between clips. This fact leads
to a large memory footprint relative to graphs. We offset this
weakness with compression.
For the purpose of compression, we want to think of our
value function as a collection of value functions, each defined
over the space of task parameters. Without compression, we
store one of these value subfunctions V_{m_i}(θ_T) = V(m_i, θ_T) at every database motion state m_i (see Figure 3). Here, we observe that
our motion states were originally obtained from continuous
streams of motion data. At 30 Hz, temporally adjacent motion
states and their value functions are frequently similar; we expect that V_{m_t} changes smoothly over “consecutive” motion states m_t relative to the original clip time. Exploiting this idea, we only store value functions at every Nth motion state and interpolate the value functions for the other database motion states (see Figure 4). We call the database states that store value functions “anchor” motion states. We compute the value function at the ith motion state between two anchors m_0 and m_N as

V_{m_i}(θ_T) = (1 − i/N) V_{m_0}(θ_T) + (i/N) V_{m_N}(θ_T).
We can learn a temporally compressed value function with
a trivially modified form of the algorithm given in Section
4.2.1. Instead of iterating over all task states, we only iterate
over those states associated with anchor motion states.
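A minimal sketch of the lookup this implies, assuming anchors are stored every N states along the original clip and indexed by clip time (the names below are hypothetical):

    def value_between_anchors(V_m0, V_mN, i, N):
        """Linearly blend the value subfunctions stored at the two surrounding
        anchors m_0 and m_N to get the subfunction at the i-th state (0 <= i <= N).
        V_m0 and V_mN are grids of values over the task parameters."""
        w = i / N                      # fractional position between the anchors
        return (1.0 - w) * V_m0 + w * V_mN

For example, with N = 30 the motion state a third of the way between two anchors is assigned two-thirds of the earlier anchor's value subfunction and one-third of the later one's.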
efficiently solve for the correct next action. The trick begins
by defining a value function V(s), a scalar-valued function representing the expected cumulative future reward received
for acting optimally starting from task state s:
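Written out as a sketch (with any discount factor omitted to match the undiscounted form of Equation (12) below):

    V(s) = max_{a_0, a_1, ...}  Σ_{t=0}^{∞} R(s_t, a_t),   where s_0 = s and s_{t+1} = I_s(s_t, a_t).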
We will describe shortly how we represent and precompute the value function, but for the moment notice that we can now rewrite Equation (9) by replacing the infinite future search with a value function lookup:

π_L(s) = argmax_a [ R(s, a) + V(I_s(s, a)) ]    (11)
Now the lookahead policy is only marginally more expensive
to compute than the greedy policy.
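A sketch of the comparison in code, using the same hypothetical helper names as above (actions, reward, integrate, interpolate_value):

    def lookahead_policy(s, V, actions, reward, integrate, interpolate_value):
        """Equation (11): score each action by its immediate reward plus the
        interpolated value of the state it leads to, then take the best."""
        return max(actions(s),
                   key=lambda a: reward(s, a) + interpolate_value(V, integrate(s, a)))

    def greedy_policy(s, actions, reward):
        """Greedy baseline: immediate reward only, no value-function lookup."""
        return max(actions(s), key=lambda a: reward(s, a))

Per candidate action, the only extra work over the greedy policy is a single interpolated value lookup.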
Value function representation and learning. Since there
are infinitely many possible task states, we cannot represent
the value function exactly. Instead, we approximate it by storing values at a finite number of task states s_i and interpolating to estimate the value at other points (Figure 2). We choose these task state samples by taking the Cartesian product of the database motion states m_i and a uniform grid sampling across the problem’s task parameters (see Section 5 for details of the sampling). This sampling gives us high resolution near the motion database states, which is where the character generally stays. In order to calculate the value V(s) of a task state not in the database, we interpolate over neighboring motion states using the similarity weights and over the task parameters multilinearly.
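As a rough sketch, assuming a hypothetical neighbors(m) that returns nearby database motion states with their similarity weights, and a task_grid_interp(table, θ_T) that interpolates multilinearly over the task-parameter grid:

    def value_lookup(s, V_tables, neighbors, task_grid_interp):
        """Estimate V(s) for a task state s = (m, theta_T) not in the database by
        blending the per-motion-state value tables of nearby database states."""
        m, theta_T = s
        total, weight_sum = 0.0, 0.0
        for m_i, w_i in neighbors(m):
            # multilinear interpolation over the task-parameter grid stored at m_i
            total += w_i * task_grid_interp(V_tables[m_i], theta_T)
            weight_sum += w_i
        return total / weight_sum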
Given an MDP derived from a motion field and a task
specification, we solve for an approximate value function in
this form using fitted value iteration.4 Fitted value iteration operates by first noting that Equation (11) can be used to
write the definition of the value function in a recursive form.
We express the value at a task state sample s_i recursively in terms of the value at other task state samples:

V(s_i) = R(s_i, π_L(s_i)) + V(I_s(s_i, π_L(s_i)))    (12)

where π_L(s_i) is as defined in Equation (11) and V(I_s(s_i, a)) is computed via interpolation. We can solve for V(s_i) at each sample state by iteratively applying Equations (11) and (12). We begin with an all-zero value function, V_0(s_i) = 0 for each sample s_i. Then at each s_i, Equation (11) is used to
Figure 2. Action search using a value function. (a) At every state, we
have several possible actions (dashed lines) and their next states
(dashed circles). (b) We interpolate the value function stored at
database states (black points) to determine the value of each next
state. (c) We select the highest value action to perform.
Figure 3. Uncompressed value function. The value functions V_{m_i} are stored at every motion state m_i.
Figure 4. Value function with temporal compression. The value
functions at intermediate motion states are interpolated by the
neighboring “anchor” motion states that have explicitly stored value functions.