the control algorithm must determine an appropriate trade-off between the errors it must inevitably make. It also
complicates our modeling process because we do not know,
a priori, the trajectory that the controller will attempt to fly,
and hence cannot focus our data collection in that region of
state space.
One solution to these problems is to leverage expert demonstrations. By using a trajectory acquired from a demonstration aboard the real helicopter as the target trajectory,
we are guaranteed that our target is a feasible trajectory.
Moreover, our data collection will already be focused on the
proper flight regime, provided that our expert demonstrations cover roughly the same parts of state space each time.
Thus, we expect that our model of the dynamics along the
demonstrated trajectory will be reasonably accurate. This
approach has been used successfully to perform autonomous autorotation landings with our helicopter. 1
While the autorotation maneuver can be demonstrated
relatively consistently by a skilled pilot,b it may be difficult
or impossible to obtain a perfect demonstration that is suitable for use as a target trajectory when the maneuver does
not include a steady-state regime, or involves complicated
adjustments over long periods of time. For example, when
our expert pilot attempts to demonstrate an in-place flip, the
helicopter position often drifts away from its starting point
unintentionally. Thus, when we use this demonstration as
our desired trajectory, the helicopter will repeat the pilot’s
errors. However, repeated expert demonstrations are often
suboptimal in different ways, suggesting that a large number
of demonstrations could implicitly encode the ideal trajectory that the (suboptimal) expert is trying to demonstrate.
In Coates, 12 we proposed an algorithm that approximately extracts this implicitly encoded optimal demonstration from multiple suboptimal expert demonstrations. This
algorithm also allows us to build an improved, time-varying
model of the dynamics along the resulting trajectory suitable for high-performance control. In doing so, the algorithm allows the helicopter to not only mimic the behavior
of the expert but even perform significantly better.
Properly extracting the underlying ideal trajectory from a
set of suboptimal trajectories requires a significantly more
sophisticated approach than merely averaging the states
observed at each time step. A simple arithmetic average of
the states would result in a trajectory that does not obey the
constraints of the dynamics model. Also, in practice, the
demonstrations proceed at different rates, so that
combining states from the same time step in
each trajectory would mix states from different phases of the maneuver.
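The timing problem can be illustrated with a minimal Python sketch on a toy one-dimensional state, not helicopter data. The `demo` helper, the pulse shape, and the fixed four-step shift are all hypothetical choices made for illustration. A constant shift is only the simplest possible alignment and stands in for the more general time warping discussed in this section.

```python
def demo(length, start, width):
    """Toy 1-D demonstration: the state is 1.0 during the maneuver, else 0.0."""
    return [1.0 if start <= j < start + width else 0.0 for j in range(length)]

# Two hypothetical demonstrations of the same maneuver. The second one
# begins the maneuver four time steps later than the first.
early = demo(length=10, start=2, width=3)
late = demo(length=10, start=6, width=3)

# Naive approach: average the states observed at each time step.
naive_average = [(a + b) / 2.0 for a, b in zip(early, late)]

# Assumed alignment for this toy case: shift the late demonstration
# forward by four steps before averaging.
shifted_late = late[4:] + [0.0] * 4
aligned_average = [(a + b) / 2.0 for a, b in zip(early, shifted_late)]

print("naive:  ", naive_average)
print("aligned:", aligned_average)
```

The naive average contains two half-height pulses, a shape that neither demonstration exhibits, while the aligned average recovers the single full-height pulse.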
Following Coates, 12 we propose a generative model that
describes the expert demonstrations as noisy observations of
the unobserved, intended target trajectory, where each demonstration is possibly warped along the time axis. We use an
expectation–maximization (EM) algorithm to infer both the
unobserved, intended target trajectory and a time alignment
of all the demonstrations. The time-aligned demonstrations
provide the appropriate data to learn good local models in
the vicinity of the trajectory—such trajectory-specific local
models tend to greatly improve control performance.
4.1. Basic generative model
From our expert pilot we obtain $M$ demonstration trajectories
of length $N^k$, for $k = 0..M-1$. Each trajectory is a sequence
of states, $s_j^k$, and control inputs, $u_j^k$, composed into a single
state vector:
$$y_j^k = \begin{bmatrix} s_j^k \\ u_j^k \end{bmatrix}, \quad \text{for } j = 0..N^k - 1.$$
Our goal is to estimate a “hidden” target trajectory of length
$H$, denoted similarly:
$$z_t = \begin{bmatrix} s_t^* \\ u_t^* \end{bmatrix}, \quad \text{for } t = 0..H.$$
We use the following notation: $y = \{y_j^k \mid j = 0..N^k - 1,\ k = 0..M - 1\}$,
$z = \{z_t \mid t = 0..H\}$, and similarly for other indexed
variables.
The generative model for the ideal trajectory is given by
an initial state distribution $z_0 \sim \mathcal{N}(\mu_0, \Sigma_0)$ and an approximate
model of the dynamics
$$z_{t+1} = f(z_t) + \omega_t^{(z)}, \quad \omega_t^{(z)} \sim \mathcal{N}(0, \Sigma^{(z)}). \tag{1}$$
The dynamics model does not need to be particularly accurate. In fact, in our experiments, this model is of the form
described in Section 3, trained on a large corpus of data that
is not even specific to the trajectory we want to fly.c In our
experiments (Section 6) we provide some concrete examples
showing how accurately the generic model captures the true
dynamics for our helicopter.
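As a rough illustration of this part of the generative model, the sketch below samples a hidden trajectory by drawing an initial state from a Gaussian and then repeatedly applying an approximate dynamics function plus Gaussian noise. Scalar states are used for brevity. The names `toy_dynamics` and `sample_hidden_trajectory`, and every numeric value, are assumptions made for illustration; they are not the helicopter model of Section 3.

```python
import random

def sample_hidden_trajectory(f, mu0, sigma0, sigma_z, num_steps, rng):
    """Draw z[0] from N(mu0, sigma0^2), then z[t+1] = f(z[t]) + Gaussian noise."""
    z = [rng.gauss(mu0, sigma0)]
    for _ in range(num_steps - 1):
        z.append(f(z[-1]) + rng.gauss(0.0, sigma_z))
    return z

def toy_dynamics(z_t):
    # Hypothetical placeholder dynamics, chosen only so the sketch runs.
    return 0.9 * z_t + 0.1

rng = random.Random(0)  # fixed seed so repeated runs give the same sample
z = sample_hidden_trajectory(
    toy_dynamics, mu0=0.0, sigma0=1.0, sigma_z=0.05, num_steps=50, rng=rng
)
print(len(z), z[:3])
```

Setting both standard deviations to zero turns the sampler into a plain rollout of the dynamics function, which is a convenient sanity check.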
Our generative model represents each demonstration as
a set of independent “observations” of the hidden, ideal trajectory z. Specifically, our model assumes
$$y_j^k = z_{\tau_j^k} + \omega_j^{(y)}, \quad \omega_j^{(y)} \sim \mathcal{N}(0, \Sigma^{(y)}). \tag{2}$$
Here $\tau_j^k$ is the time index in the hidden trajectory to which
the observation $y_j^k$ is mapped. The noise term in the observation
equation captures both inaccuracies in estimating the
observed trajectories from sensor data and errors in
the maneuver that result from the human pilot’s imperfect demonstration.d
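The observation model can be sketched in the same spirit: each observation in a demonstration is the hidden state at some mapped time index, plus noise. In the Python sketch below, the function name `sample_demonstration`, the list name `tau` for the time-index mapping, the Gaussian form of the noise, and the toy values are all assumptions made for illustration. In the actual problem the mapping is unknown and must be inferred.

```python
import random

def sample_demonstration(z, tau, sigma_y, rng):
    """Observation j is the hidden state z[tau[j]] plus Gaussian noise."""
    return [z[t] + rng.gauss(0.0, sigma_y) for t in tau]

# Toy hidden trajectory with scalar states.
z = [0.0, 1.0, 2.0, 3.0, 4.0]

# Two hypothetical time-index mappings: one demonstration lingers on each
# hidden state for two steps, the other skips every second hidden state.
tau_slow = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
tau_fast = [0, 2, 4]

rng = random.Random(0)
y_slow = sample_demonstration(z, tau_slow, sigma_y=0.1, rng=rng)
y_fast = sample_demonstration(z, tau_fast, sigma_y=0.1, rng=rng)
print(len(y_slow), len(y_fast))
```

The two demonstrations have different lengths even though they observe the same hidden trajectory, which is why states cannot be compared across demonstrations by raw time step.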
b
The autorotation maneuver consists of a steady-state “glide” followed by
a short (several second) “flare” before landing. Though the maneuver is not
easy to learn, these components tend not to vary much from one demonstration to the next.
c
The state transition model also predicts the controls as a function of the
previous state and controls. In our experiments we predict $u^*_{t+1}$ as $u^*_t$ plus
Gaussian noise.
d
Even though our observations, $y$, are correlated over time with each other
due to the dynamics governing the observed trajectory, our model assumes
that the observations $y_j^k$ are independent for all $j = 0..N^k - 1$ and $k = 0..M - 1$.