Equation 5 is difficult because (as can be seen in Figure 1)
the lack of knowledge of the time-alignment index variables
t introduces a very large set of dependencies between all the
variables. However, when t is known, the optimization prob-
lem in Equation 5 greatly simplifies thanks to context specific
independencies. 10 For instance, knowledge that t k = 3 tells us
1
that y k depends only on z . Thus, when all of the t are fixed, we
13
obtain a simplified model such as the one shown in Figure 2.
In this model we can directly estimate the multinomial
parameters d in closed form; and we have a standard HMM
parameter learning problem for the covariances Σ(·), which
can be solved using the EM algorithm13—often referred to
as Baum–Welch in the context of HMMs. Concretely, for our
setting, the EM algorithm’s E-step computes the pairwise
marginals over sequential hidden state variables by running
a (extended) Kalman smoother; the M-step then uses these
marginals to update the covariances Σ(·).
To also optimize over the time-indexing variables t, we
propose an alternating optimization procedure. For fixed Σ(·)
and d, and for fixed z, we can find the optimal time-indexing
variables t using dynamic programming over the time-index
assignments for each demonstration independently. The
dynamic programming algorithm to find t is known in the
speech recognition literature as dynamic time warping29
and in the biological sequence alignment literature as the
Needleman–Wunsch algorithm. 21 The fixed z we use is the
one that maximizes the likelihood of the observations for
the current setting of parameters t, d, Σ(·).f
In practice, rather than alternating between complete
optimizations over Σ(·), d and t, we only partially optimize
over Σ(·), running only one iteration of the EM algorithm.
Complete details of the algorithm are provided in Coates. 12
5. Con TRoLLeR DesiGn
Using the methods of Sections 3 and 4, we can obtain a
good target trajectory and a high-accuracy dynamics model
for this trajectory using pilot demonstrations. It remains
to develop an adequate feedback controller that will allow
the helicopter to fly this trajectory in reality. Our solution is
based on the DDP algorithm, which we have used in previous work. 1, 2
5. 1. Reinforcement learning formalism and DDP
A reinforcement learning problem (or optimal control prob-
lem) can be described by a quintuple (S, A, T, H, s , R), which
0
is also referred to as a Markov decision process (MDP). Here
S is the set of states; A is the set of actions or inputs; T is the
dynamics model, which is a set of probability distributions
{P t } (P t (s′ | s, u) is the probability of being in state s′ at time
su su
t + 1 given the state and action at time t are s and u); H is the
horizon or number of time steps of interest; s ∈ S is the ini-
0
tial state; R: S × A → ℝ is the reward function.
A policy p = (m , m ,..., m ) is a tuple of mappings from
01 H
states S to actions A, one mapping for each time t = 0, . . . , H.
f
Fixing z means the dynamic time warping step only approximately optimizes the original objective. Unfortunately, without fixing z, the independencies required to obtain an efficient dynamic programming algorithm do
not hold. In practice we find our approximation works very well.
figure 2: example of graphical model when t is known. (shaded
nodes are observed.)
z
0
z
1
z
2
z
3
z
4
...
k
y
0
k
t
0
k
y
1
k
t
1
k
y
2
k
t
2
...
...
The expected sum of rewards when acting according to a
policy p is given by: . The optimal policy p *
for an MDP (S, A, T, H, s , R) is the policy that maximizes the
0
expected sum of rewards. In particular, the optimal policy
is given by
The linear quadratic regulator (LQR) control problem is
a special class of MDP, for which the optimal policy can be
computed efficiently. In LQR the set of states S = ℝn, the set
p
of actions/inputs A = ℝ , the dynamics model is given by
where for all t = 0, . . . , H we have that A ∈ ℝn × n, B ∈ ℝn × p and w is
tt t
a mean zero random variable (with finite variance). The reward
for being in state s and taking action/input u is given by
tt
Here Q , R are positive semidefinite matrices which param-
tt
eterize the reward function. It is well known that the opti-
mal policy for the LQR control problem is a time-varying
linear feedback controller, which can be efficiently com-
puted using dynamic programming. (See, e.g., Anderson7
for details on linear quadratic methods.)
The linear quadratic methods, which in their standard form as given above drive the state to zero, are easily
extended to the task of tracking the desired trajectory s*, . . . , s*
0H
learned in Section 4. The standard formulation (which we
use) expresses the dynamics and reward function as a function of the error state e = s - s rather than the actual state s .
ttt t
(See, e.g., Anderson. 7)
DDP approximately solves general continuous state-space
MDP’s by iteratively approximating them by LQR problems.
In particular, DDP solves an optimal control problem by iterating the following steps:
1. Around the trajectory obtained from running the current policy, compute: (i) a linear approximation to the