(nonlinear) error state dynamics and (ii) a quadratic
approximation to the reward function.
2. Compute the optimal policy for the LQR problem
obtained in Step 1 and set the current policy equal to
the optimal policy for the LQR problem.
3. Simulate a trial starting from $s_0$ under the current
policy and store the resulting trajectory.
In our experiments, the reward function is already quadratic,
so the only approximation made in the algorithm is the
linearization of the dynamics. To bootstrap the process (i.e.,
to obtain an initial trajectory), we linearize around the target
trajectory in the first iteration.
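The three steps above, together with the bootstrapping rule, can be sketched on a toy problem. This is only an illustration under stated assumptions, not the controller flown on the helicopter: the state is a single scalar error, the dynamics function `step` is invented for the example, and the weights `Q_W` and `R_W`, the step size, and all function names are hypothetical.

```python
import math

DT, HORIZON = 0.05, 40      # assumed: 40 steps of 0.05 s
Q_W, R_W = 1.0, 0.1         # assumed quadratic cost weights (cost = -reward)

def step(x, u):
    """Toy nonlinear error dynamics (an assumption, not the helicopter model)."""
    return x + DT * (math.sin(x) + u)

def linearize(x):
    """Derivatives of step() with respect to state and control at x."""
    return 1.0 + DT * math.cos(x), DT

def solve_lqr(xs, us):
    """Steps 1 and 2: linearize around (xs, us) and run the LQR backward pass."""
    S, s = Q_W, Q_W * xs[-1]    # value function 0.5*S*dx^2 + s*dx at the final time
    gains = [None] * HORIZON
    for t in reversed(range(HORIZON)):
        A, B = linearize(xs[t])
        q_x = Q_W * xs[t] + A * s
        q_u = R_W * us[t] + B * s
        q_xx = Q_W + A * S * A
        q_uu = R_W + B * S * B
        q_ux = B * S * A
        gains[t] = (-q_u / q_uu, -q_ux / q_uu)   # feedforward k, feedback K
        s = q_x - q_ux * q_u / q_uu
        S = q_xx - q_ux * q_ux / q_uu
    return gains

def simulate(s0, xs_nom, us_nom, gains):
    """Step 3: roll out the current policy on the nonlinear dynamics."""
    xs, us = [s0], []
    for t in range(HORIZON):
        k, K = gains[t]
        u = us_nom[t] + k + K * (xs[t] - xs_nom[t])
        us.append(u)
        xs.append(step(xs[t], u))
    return xs, us

def cost(xs, us):
    return 0.5 * (Q_W * sum(x * x for x in xs) + R_W * sum(u * u for u in us))

def run_ddp(s0, iterations=5):
    # Bootstrap: the first linearization is around the target trajectory (all zeros).
    xs, us = [0.0] * (HORIZON + 1), [0.0] * HORIZON
    history = []
    for _ in range(iterations):
        gains = solve_lqr(xs, us)
        xs, us = simulate(s0, xs, us, gains)
        history.append(cost(xs, us))
    return xs, us, history
```

Calling `run_ddp(1.0)` returns the final state and control trajectories along with the cost after each iteration. The sketch omits the line search and regularization that practical DDP implementations usually add.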
The result of DDP is a sequence of linear feedback controllers that are executed in order. Since these controllers
were computed under the assumption of linear dynamics,
they will generally fail if executed from a state that is far from
the linearization point. For aerobatic maneuvers that involve
large changes in orientation, it is often difficult to remain
sufficiently close to the linearization point throughout the
maneuver. Our system, thus, uses DDP in a “receding horizon” fashion. Specifically, we rerun DDP online, beginning
from the current state of the helicopter, over a horizon that
extends 2 s into the future. The resulting feedback controller obtained from this process is always linearized around
the current state and, thus, allows the control system to
continue flying even when it ventures briefly away from the
intended trajectory.
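The receding horizon idea can be illustrated with a small sketch. The 2 s horizon and the 20 Hz rate come from the text; everything else is an assumption made for the example: a scalar toy dynamics function `step`, arbitrary weights, a single injected disturbance, and a planner that is a plain finite-horizon LQR linearized at the current state (it ignores the affine offset of the linearization, a simplification the full DDP solver does not make).

```python
import math

DT = 0.05                      # 20 Hz control rate
HORIZON = round(2.0 / DT)      # 2 s look-ahead, i.e. 40 steps
Q_W, R_W = 1.0, 0.1            # assumed quadratic cost weights

def step(x, u, disturbance=0.0):
    """Toy nonlinear error dynamics (hypothetical, not the helicopter model)."""
    return x + DT * (math.sin(x) + u) + disturbance

def plan_first_gain(x_now):
    """Finite-horizon LQR backward pass with dynamics linearized at x_now.

    Returns the feedback gain for the first step of the plan.
    """
    A, B = 1.0 + DT * math.cos(x_now), DT
    S, K = Q_W, 0.0
    for _ in range(HORIZON):
        bsa = B * S * A
        K = -bsa / (R_W + B * S * B)
        S = Q_W + A * S * A + K * bsa
    return K

def fly(s0, ticks=100, gust_at=30, gust=0.5):
    """Re-plan at every tick and apply only the first control of each plan."""
    x, xs = s0, [s0]
    for t in range(ticks):
        K = plan_first_gain(x)          # plan from the current state
        u = K * x                       # first control of the new plan
        x = step(x, u, gust if t == gust_at else 0.0)
        xs.append(x)
    return xs
```

Because the plan is recomputed from wherever the system actually is, the gust at tick 30 is simply absorbed by the next plan instead of invalidating a precomputed controller sequence.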
5.2. Learning reward function parameters
Our quadratic reward is a function of 21 features (which
are functions of the state and controls), consisting of the
squared error state variables, the squared inputs, and
squared change in inputs. Choosing the parameters for the
reward function (i.e., choosing the entries of the matrices
$Q_t$, $R_t$ used by DDP) is difficult and tedious to do by hand.
Intuitively, the reward parameters tell DDP how to “trade off”
between the various errors. Selecting this trade-off improperly can result in some errors becoming too large (allowing
the helicopter to veer off into poorly modeled parts of the
state space), or other errors being regulated too aggressively
(resulting in large, unsafe control outputs).
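To make the trade-off concrete, the sketch below evaluates a quadratic reward of the kind described above as a negative weighted sum of squared features. It assumes diagonal weights and uses small made-up vectors; the function name, argument names, and numbers are hypothetical and do not reflect the actual 21 features or their values.

```python
def quadratic_reward(error, controls, prev_controls, w_error, w_control, w_delta):
    """Negative weighted sum of squared errors, inputs, and input changes."""
    features = (
        [e * e for e in error]
        + [u * u for u in controls]
        + [(u - p) ** 2 for u, p in zip(controls, prev_controls)]
    )
    weights = list(w_error) + list(w_control) + list(w_delta)
    return -sum(w * f for w, f in zip(weights, features))

# Same situation scored under two weightings: the second penalizes controls
# ten times more heavily, so the same control effort costs far more reward.
mild = quadratic_reward([1.0, 2.0], [0.5], [0.0], [1.0, 1.0], [2.0], [4.0])
strict = quadratic_reward([1.0, 2.0], [0.5], [0.0], [1.0, 1.0], [20.0], [4.0])
```

Raising a weight makes DDP work harder to shrink the corresponding feature, at the expense of the others.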
This problem is more troublesome when using infeasible
target trajectories. For instance, for the aerobatic flips and
rolls performed previously in Abbeel,2 a hand-coded target
trajectory was used. That trajectory was not feasible, since
it assumed that the helicopter could remain exactly fixed
in space during the flip. Thus, there is always a (large) nonzero error during the maneuver. In this case, the particular
choice of reward parameters becomes critical, since they
specify how the controller should balance errors throughout
the flight.
g
The 2 s horizon is a limitation imposed by available computing power. Our
receding horizon DDP controller executes at 20 Hz.
Trajectories learned from demonstration using the methods presented in Section 4, however, are generally quite close
to feasible for the real helicopter. Thus, in contrast to our
prior work, the choice of trade-offs is less crucial when using
these learned trajectories. Indeed, in our recent experiments
it appears that a wide range of parameters work well with trajectories learned from demonstration.h Nonetheless, when
the need to make adjustments to these parameters arises, it
is useful to be able to learn the necessary parameters, rather
than tune them by mere trial and error.
Since we have expert demonstrations of the desired behavior (namely, following the trajectory), we can alleviate the tuning problem by employing the apprenticeship learning via
inverse reinforcement learning algorithm4 to select appropriate parameters for our quadratic reward function. In practice, in early iterations (before convergence) this algorithm
tends to generate parameters that are dangerous to use on
the real helicopter. Instead, we adjust the reward weights by
hand, following the philosophy, but not the strict formulation, of the inverse RL algorithm. In particular, we select the
feature (state error) that differed most between our autonomous flights and the expert demonstrations, and then
increase or decrease the corresponding quadratic penalties
to bring the autonomous performance closer to that of the
expert with each iteration.i Using this procedure, we obtain a
good reward function in a small number of trials in practice.
We used this methodology to successfully select reward
parameters to perform the flips and rolls in Abbeel,2 and
continue to use this methodology as a guide in selecting
reward parameters.
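One iteration of the hand-tuning heuristic described above might look like the following sketch. It is an assumption-laden illustration: the feature names, the use of a raw difference to rank mismatches, and the multiplicative `factor` are all hypothetical choices, since the text only says that the corresponding penalties are increased or decreased.

```python
def adjust_weights(weights, auto_features, expert_features, factor=2.0):
    """Change the penalty on the feature that differs most from the expert.

    Feature values are average squared errors or inputs over a flight.
    Returns the updated weights and the name of the adjusted feature.
    """
    gaps = {name: auto_features[name] - expert_features[name] for name in weights}
    worst = max(gaps, key=lambda name: abs(gaps[name]))
    updated = dict(weights)
    if gaps[worst] > 0:
        updated[worst] *= factor    # autonomous flight exceeds expert: penalize more
    elif gaps[worst] < 0:
        updated[worst] /= factor    # autonomous flight is below expert: relax penalty
    return updated, worst

weights = {"position": 1.0, "control": 1.0}
auto = {"position": 0.2, "control": 3.0}      # made-up autonomous statistics
expert = {"position": 0.5, "control": 1.0}    # made-up expert statistics
new_weights, changed = adjust_weights(weights, auto, expert)
```

In this made-up case the controller uses much larger controls than the expert, so the control penalty is doubled and the next flight is compared against the demonstrations again.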
6. Experimental Results
6.1. Experimental setup
For our experiments we have used two different autonomous helicopters. The experiments presented here were
performed with an XCell Tempest helicopter (Figure 3), but
we have also conducted autonomous aerobatic flights using
a Synergy N9. Both of these helicopters are capable of professional, competition-level maneuvers. We instrumented our
helicopters with a Microstrain 3DM-GX1 orientation sensor.
A ground-based camera system measures the helicopter’s
position. A Kalman filter uses these measurements to track
the helicopter’s position, velocity, orientation, and angular
rate.
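As a rough illustration of how a Kalman filter turns position measurements into position and velocity estimates, the sketch below tracks a single axis with a constant-velocity model. It is far simpler than the filter described above, which also estimates orientation and angular rate from the onboard sensor; the noise values, time step, and function names are assumptions chosen for the example.

```python
def kalman_step(state, cov, z, dt=0.05, q_pos=1e-4, q_vel=1e-2, r_meas=1e-2):
    """One predict/update cycle for a 1-D constant-velocity Kalman filter.

    state = (position, velocity); cov = (P00, P01, P11) of the symmetric
    covariance; z is a position measurement (e.g., from a camera system).
    """
    p, v = state
    p00, p01, p11 = cov
    # Predict: position advances by velocity, uncertainty grows.
    p = p + dt * v
    p00 = p00 + 2.0 * dt * p01 + dt * dt * p11 + q_pos
    p01 = p01 + dt * p11
    p11 = p11 + q_vel
    # Update: correct both position and velocity from the position residual.
    innovation = z - p
    s = p00 + r_meas
    k_pos, k_vel = p00 / s, p01 / s
    p, v = p + k_pos * innovation, v + k_vel * innovation
    p00, p01, p11 = (1.0 - k_pos) * p00, (1.0 - k_pos) * p01, p11 - k_vel * p01
    return (p, v), (p00, p01, p11)

def track_ramp(steps=200, speed=2.0, dt=0.05):
    """Feed noiseless positions of a constant-speed target through the filter."""
    state, cov = (0.0, 0.0), (1.0, 0.0, 1.0)
    for k in range(1, steps + 1):
        state, cov = kalman_step(state, cov, speed * k * dt, dt=dt)
    return state
```

Even though only position is measured, the off-diagonal covariance term lets each position residual correct the velocity estimate as well.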
We collected multiple demonstrations from our expert for
a variety of aerobatic trajectories: continuous in-place flips
and rolls, a continuous tail-down “tic toc,” and an airshow,
which consists of the following maneuvers in rapid sequence:
split-S, snap roll, stall-turn, loop, loop with pirouette, stall-turn with pirouette, “hurricane” (fast backward funnel), knife-edge, flips and rolls, tic-toc, and inverted hover.
We use a large, previously collected corpus of hovering,
horizontal flight, and mixed aerobatic flight data to build a
crude dynamics model using the method of Section 3. This
model and the pilot demonstrations are then provided to
the trajectory learning algorithm of Section 4. Our trajectory
h
It is often sufficient to simply choose parameters that rescale the various
reward features to have approximately the same magnitude.
i
For example, if our controller consistently uses larger controls than the expert but achieves lower position error, we would increase the control penalty
and decrease the position penalty.