In contrast to our setting, though, their algorithms do not
coherently integrate data from multiple (suboptimal) demonstrations by experts. We will nonetheless use similar ideas
in our trajectory learning algorithm.
Our work also has strong connections with recent work on
inverse reinforcement learning, which extracts a reward function from expert demonstrations. See, e.g., Abbeel, 4 Neu, 22
Ng, Ramachandran, Ratliff, 25–27 Syed. 32 We will describe a
methodology roughly corresponding to the inverse RL algorithm of Abbeel4 to tune reward weights in Section 5.2.
3. MODELING
The helicopter state s comprises its position (x, y, z), orientation
(expressed as a unit quaternion q), velocity ($\dot{x}, \dot{y}, \dot{z}$),
and angular velocity ($\omega_x, \omega_y, \omega_z$). The pitch angle of a blade
is changed by rotating it around its long axis, changing the
amount of thrust the blade generates. The helicopter is controlled via a four-dimensional action space:
1. $u_1$ and $u_2$: The lateral (left–right) and longitudinal
(front–back) cyclic pitch controls cause the helicopter
to roll left or right, and pitch forward or backward,
respectively.
2. $u_3$: The tail rotor pitch control changes tail rotor thrust,
controlling the rotation of the helicopter about its
vertical axis.
3. $u_4$: The main rotor collective pitch control changes the
pitch angle of the main rotor’s blades, by rotating the
blades around an axis that runs along the length of the
blade. The resulting amount of upward thrust (generally)
increases with this pitch angle; thus this control
affects the main rotor’s thrust.
By using the cyclic pitch and tail rotor controls, the pilot can
rotate the helicopter into any orientation. This allows the
pilot to direct the thrust of the main rotor in any particular
direction (and thus fly in any particular direction) by rotating the helicopter appropriately.
Following our approach from Abbeel, 3 we learn a model
from flight data that predicts accelerations as a function of the
current state and inputs. Accelerations are then integrated to
obtain the state changes over time. To take advantage of symmetry of the helicopter, we predict linear and angular accelerations in a “body-coordinate frame” (a coordinate frame
attached to the helicopter). In this body-coordinate frame,
the x-axis always points forward, the y-axis always points to
the right, and the z-axis always points down with respect to the
helicopter.
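To illustrate the body-coordinate frame described above, the following minimal sketch rotates a world-frame gravity vector into body coordinates using the orientation quaternion. The scalar-first quaternion layout (w, x, y, z), the body-to-world rotation convention, the z-down world frame, and the names `quat_to_rotation_matrix`, `world_to_body`, and `G_WORLD` are assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def quat_to_rotation_matrix(q):
    """Rotation matrix R (body -> world) for a quaternion q = (w, x, y, z).

    The scalar-first layout and body-to-world direction are assumed conventions.
    """
    w, x, y, z = q / np.linalg.norm(q)  # normalize to guard against drift
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def world_to_body(q, v_world):
    """Express a world-frame vector in body coordinates (x forward, y right, z down)."""
    return quat_to_rotation_matrix(q).T @ v_world

# Assumed world frame with z pointing down, so gravity is +9.81 m/s^2 along z.
G_WORLD = np.array([0.0, 0.0, 9.81])

# Level attitude: body axes coincide with world axes.
level = np.array([1.0, 0.0, 0.0, 0.0])
# Roll of 90 degrees about the forward (x) axis.
rolled = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4), 0.0, 0.0])

g_level = world_to_body(level, G_WORLD)
g_rolled = world_to_body(rolled, G_WORLD)
```

In level flight gravity lies along the body z-axis; after the 90 degree roll it lies along the body y-axis, which is why a model written in body coordinates needs the gravity vector as an explicit, orientation-dependent input.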
In particular, we use the following model:
By our convention, the superscripts b indicate that we
are using body coordinates. We note our model explicitly
encodes the dependence on the gravity vector ($g_x^b, g_y^b, g_z^b$) and
has a sparse dependence of the accelerations on the current
velocities, angular rates, and inputs. The terms $w_x, w_y, w_z$
are zero-mean Gaussian random variables,
which represent the perturbation of the accelerations due to
noise (or unmodeled effects).
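For concreteness, a model with the structure just described could be written as follows. This is a sketch consistent with the coefficient names used in the text ($A_x$, $B_x$, $C_1$), not a verbatim reproduction of the authors' equations: the constant offsets $D_i$, the pairing of each control input with a single axis, and the symbols $w_{\omega_x}, w_{\omega_y}, w_{\omega_z}$ for the angular noise terms are assumptions.

```latex
\begin{align*}
\ddot{x}^b &= A_x \dot{x}^b + g_x^b + w_x, \\
\ddot{y}^b &= A_y \dot{y}^b + g_y^b + D_0 + w_y, \\
\ddot{z}^b &= A_z \dot{z}^b + g_z^b + C_4 u_4 + D_4 + w_z, \\
\dot{\omega}_x^b &= B_x \omega_x^b + C_1 u_1 + D_1 + w_{\omega_x}, \\
\dot{\omega}_y^b &= B_y \omega_y^b + C_2 u_2 + D_2 + w_{\omega_y}, \\
\dot{\omega}_z^b &= B_z \omega_z^b + C_3 u_3 + D_3 + w_{\omega_z}.
\end{align*}
```

Each acceleration depends on only one velocity or angular rate and at most one input, which is the sparse dependence referred to above, and every equation is linear in its unknown coefficients.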
To learn the coefficients, we record data while the helicopter is being flown by our expert pilot. We typically ask
our pilot to fly the helicopter through the flight regimes we
would like to model. For instance, to build a model for hovering, the pilot places the helicopter in a stable hover and
sweeps the control sticks back and forth at varying frequencies to demonstrate the response of the helicopter to different inputs while hovering. Once we have collected this data,
the coefficients (e.g., $A_x$, $B_x$, $C_1$, etc.) are estimated using linear regression.
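The regression step can be sketched for a single axis as follows. The example assumes a vertical-acceleration model of the form (acceleration minus gravity) = $A_z$ × velocity + $C_4$ × collective input + $D_4$, and fits it to synthetic data; the function name `fit_vertical_model`, the offset $D_4$, and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_vertical_model(z_vel, u4, z_acc, g_z):
    """Least-squares fit of z_acc - g_z ~ A_z * z_vel + C_4 * u4 + D_4.

    The model form (including the constant offset D_4) is an assumption.
    """
    # One column per unknown coefficient; the column of ones carries the offset.
    features = np.column_stack([z_vel, u4, np.ones_like(z_vel)])
    # Gravity is known from the orientation, so remove it before fitting.
    target = z_acc - g_z
    coeffs, *_ = np.linalg.lstsq(features, target, rcond=None)
    A_z, C_4, D_4 = coeffs
    return A_z, C_4, D_4

# Synthetic "flight log" with made-up coefficients standing in for recorded data.
rng = np.random.default_rng(0)
n = 2000
z_vel = rng.normal(0.0, 1.0, n)       # body-frame vertical velocity
u4 = rng.uniform(-1.0, 1.0, n)        # collective pitch input (stick sweeps)
g_z = np.full(n, 9.81)                # body-frame gravity component
true_A, true_C, true_D = -0.5, -20.0, -9.0
z_acc = true_A * z_vel + true_C * u4 + true_D + g_z + rng.normal(0.0, 0.1, n)

A_z, C_4, D_4 = fit_vertical_model(z_vel, u4, z_acc, g_z)
```

Sweeping the sticks at varying frequencies matters here because the fit can only separate the effect of each input from the effect of the velocities if the recorded data excites them independently.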
When we want to perform a new maneuver, we can collect data from the flight regimes specific to this maneuver and build a new model. For aerobatic maneuvers, this
involves having our pilot repeatedly demonstrate the desired
maneuver.
It turns out that, in practice, these models generalize
reasonably well and can be used as a “crude” starting point
for performing aerobatic maneuvers. In previous work, 2 we
demonstrated that models of the above form are sufficient
for performing several maneuvers including “funnels” (fast
sideways flight in a circle) and in-place flips and rolls. With
a “crude” model trained from demonstrations of these
maneuvers, we can attempt the maneuver autonomously.
If the helicopter does not complete the maneuver successfully, the model can be re-estimated, incorporating the data
obtained during the failed trial. This new model more accurately captures the dynamics in the flight regimes actually
encountered during the autonomous flight and hence can
be used to achieve improved performance during subsequent attempts.
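The re-estimation step can be illustrated with a toy one-dimensional example. The sketch below assumes that "incorporating the data" means pooling the pilot data with the data from the autonomous trial and refitting by least squares; the nonlinear "true" dynamics, the speed ranges, and the names `fit_linear`, `rmse`, and `make_data` are hypothetical.

```python
import numpy as np

def fit_linear(features, targets):
    """Ordinary least-squares fit; returns the coefficient vector."""
    coeffs, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return coeffs

def rmse(coeffs, features, targets):
    """Root-mean-square prediction error of a fitted linear model."""
    return float(np.sqrt(np.mean((features @ coeffs - targets) ** 2)))

def make_data(rng, v_low, v_high, n=500):
    """Sample a made-up, mildly nonlinear system over one velocity range."""
    v = rng.uniform(v_low, v_high, n)
    acc = -0.5 * v - 0.1 * v * np.abs(v) + rng.normal(0.0, 0.05, n)
    features = np.column_stack([v, np.ones(n)])  # velocity and constant offset
    return features, acc

rng = np.random.default_rng(0)
pilot_X, pilot_y = make_data(rng, -1.0, 1.0)   # regime covered by the pilot
trial_X, trial_y = make_data(rng, 3.0, 5.0)    # regime reached in the failed trial

# "Crude" model from pilot data only, then a refit on the pooled data.
crude = fit_linear(pilot_X, pilot_y)
refit = fit_linear(np.vstack([pilot_X, trial_X]),
                   np.concatenate([pilot_y, trial_y]))

# Compare both models on the regime the autonomous flight actually visited.
error_before = rmse(crude, trial_X, trial_y)
error_after = rmse(refit, trial_X, trial_y)
```

In this toy setting the refit model predicts the trial regime much better than the crude one, because a linear model is only a local approximation and the new data moves that approximation to where the helicopter actually flew.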
The observation that we can leverage pilot demonstrations to safely obtain “reasonable” models of the helicopter
dynamics is the key to our approach. While these models may
not be perfect at first, we can often obtain a good approximation to the true dynamics provided we attempt to model only
a small portion of the flight envelope. This model can then,
optionally, be improved by incorporating new data obtained
from autonomous flights. Our trajectory learning algorithm
(Section 4) exploits this same observation to achieve expert-level performance on an even broader range of maneuvers.
4. TRAJECTORY LEARNING
Once we are equipped with a (rudimentary) model of the
helicopter dynamics, we need to specify the desired trajectory to be flown. Specifying the trajectory by hand, while
tedious, can yield reasonable results. Indeed, much of our
own previous work used hand-coded target trajectories. 2
Unfortunately, these trajectories usually do not obey the
system dynamics—that is, the hand-specified trajectory
is infeasible, and cannot actually be flown in reality. This
results in a somewhat more difficult control problem since