Researchers Use Deep Learning to Train Autonomous Acrobatic Drones


  • Researchers trained acrobatic controllers safely in simulation and deployed them zero-shot, with no fine-tuning, on physical quadrotor drones.

  • For the first time, a vision-based drone with only onboard sensing and computation can autonomously perform agile maneuvers with accelerations of up to 3g.



Acrobatic maneuvers present a challenge for drone actuators, sensors and physical components. While hardware limitations can be resolved using expert-level equipment made for extreme accelerations, reliable state estimation remains a major limiting factor for agile flight. Acrobatic maneuvers produce large angular rates and high speeds, which induce strong motion blur in vision sensors and compromise the quality of state estimation. Additionally, the harsh requirements of fast and precise control at high speeds make it difficult to tune controllers on the real platform, since even minor mistakes can result in catastrophic crashes.

To overcome these challenges, researchers trained the sensorimotor controller policy entirely in simulation by using demonstrations from an optimal controller with access to privileged information. Using a novel simulation-to-reality transfer strategy based on abstraction of both visual and inertial measurements, the policy can be transferred to a real drone without fine-tuning. This method does not require a human expert to provide demonstrations, and it keeps equipment safe from potential crashes. That means even the most challenging maneuvers, including ones that stretch the abilities of expert human pilots, can be trained in simulation.

The team’s approach is the first to learn an end-to-end sensorimotor mapping (from sensor measurements to low-level controls) that can perform high-speed, high-acceleration acrobatic maneuvers on a real physical system. Researchers trained a sensorimotor controller to predict low-level actions from a history of onboard sensor measurements and a user-defined reference trajectory.

Using privileged learning

The sensorimotor policy is represented by a neural network that combines information from different inputs to directly regress thrust and body rates. To cope with the different output frequencies of the onboard sensors, the team designed an asynchronous network that operates independently of the sensor frequencies. This network is trained in simulation to imitate demonstrations from an optimal controller, the privileged expert, which has access to ground-truth state estimates. The sensorimotor controller itself does not access any privileged information, so it can be deployed directly in the physical world.
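
One way to picture the asynchronous design is as a set of per-sensor history buffers that the controller reads at its own rate. The sketch below is a minimal illustration under stated assumptions, not the team's architecture: the class name `SensorBranch`, the sensor rates, the history lengths and the control rate are all hypothetical, and timestamps stand in for real measurements.

```python
from collections import deque


class SensorBranch:
    """Keeps the most recent `history` samples of one sensor stream (hypothetical helper)."""

    def __init__(self, period_ms, history):
        self.period_ms = period_ms
        self.buffer = deque(maxlen=history)  # oldest samples drop out automatically

    def maybe_sample(self, t_ms):
        # The sensor produces a sample only on its own schedule.
        if t_ms % self.period_ms == 0:
            self.buffer.append(t_ms)  # a timestamp stands in for a real measurement

    def window(self):
        # Whatever history is available right now, regardless of sensor rate.
        return list(self.buffer)


def run(duration_ms=100, ctrl_period_ms=20):
    imu = SensorBranch(period_ms=10, history=4)     # assumed 100 Hz stream
    tracks = SensorBranch(period_ms=50, history=2)  # assumed 20 Hz stream
    queries = []
    for t in range(duration_ms + 1):
        imu.maybe_sample(t)
        tracks.maybe_sample(t)
        # The controller runs at its own rate and never waits for a sensor.
        if t % ctrl_period_ms == 0:
            queries.append((t, imu.window(), tracks.window()))
    return queries


queries = run()
for t, imu_window, track_window in queries:
    print(t, imu_window, track_window)
```

In this toy setup the controller is queried every 20 ms and always receives the latest available window from each branch, so a slow sensor simply contributes older samples rather than stalling the control loop.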

By leveraging abstraction, the team was able to bridge the gap between simulation and reality. Rather than operating on raw sensory input, the sensorimotor controller operates on an intermediate representation produced by a perception module. This intermediate representation is more consistent across simulation and reality than raw visual input.

Pulling from a visual-inertial odometry (VIO) system, the team used feature tracks as an abstraction of camera frames. In contrast to camera frames, feature tracks primarily depend on scene geometry, rather than surface appearance. The information contained in the feature tracks is sufficient to infer the ego-motion of the platform up to an unknown scale. Information about the scale can be recovered from the inertial measurements.
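
The scale ambiguity can be illustrated with a small sketch. It assumes an idealized pinhole camera that translates without rotating and looks along the +z axis; the points, camera positions and helper names are made up for illustration. Scaling the scene and the camera motion by the same factor leaves every feature track unchanged, which is why vision alone cannot tell a small, slow scene from a large, fast one.

```python
import math


def project(point, cam_pos):
    """Pinhole projection to normalized image coordinates (no rotation, camera looks along +z)."""
    x, y, z = (p - c for p, c in zip(point, cam_pos))
    return (x / z, y / z)


def feature_tracks(points, cam_positions):
    # One track per 3D point: its image position in each successive camera pose.
    return [[project(p, c) for c in cam_positions] for p in points]


def scaled(vectors, s):
    return [tuple(s * v for v in vec) for vec in vectors]


def tracks_match(a, b, tol=1e-12):
    return all(
        math.isclose(u1, u2, abs_tol=tol) and math.isclose(v1, v2, abs_tol=tol)
        for track_a, track_b in zip(a, b)
        for (u1, v1), (u2, v2) in zip(track_a, track_b)
    )


# Hypothetical scene points and camera trajectory (arbitrary units).
points = [(1.0, 0.5, 4.0), (-2.0, 1.0, 8.0)]
cams = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.5)]

tracks = feature_tracks(points, cams)
# Same scene and same motion, both twice as large.
tracks_scaled = feature_tracks(scaled(points, 2.0), scaled(cams, 2.0))

print(tracks_match(tracks, tracks_scaled))
```

Because the IMU measures acceleration in metric units, combining it with the tracks is what allows the scale to be resolved.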

Simulated and real-world testing

Using the Gazebo simulator, the team trained the policies with an off-policy learning approach: they executed the trained policy, collected rollouts and added them to the training dataset. In simulation, the learned controller, which leverages both inertial measurement unit (IMU) and visual data, provided consistently good performance without a single failure.
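
The collect-rollouts-and-train procedure resembles dataset aggregation in imitation learning. The following is a rough sketch of that idea on a toy one-dimensional system, not the team's training code: the dynamics, the expert gain, the noise level, the iteration counts and all function names are assumptions chosen for illustration. The expert sees the true state, the student sees only a noisy observation, and each round the student flies, the expert labels the visited states, and the student is refit on everything collected so far.

```python
import random

EXPERT_GAIN = 0.8  # hypothetical gain of the privileged expert


def expert_action(true_state):
    # Privileged: the expert sees the ground-truth state.
    return -EXPERT_GAIN * true_state


def observe(true_state, rng):
    # The student only gets a noisy measurement of the state.
    return true_state + rng.gauss(0.0, 0.05)


def rollout(weight, rng, steps=20):
    """Run the student policy u = weight * obs; label each visited state with the expert action."""
    state = rng.uniform(-1.0, 1.0)
    samples = []
    for _ in range(steps):
        obs = observe(state, rng)
        samples.append((obs, expert_action(state)))
        state = state + weight * obs  # toy dynamics x' = x + u, driven by the student
    return samples


def fit(dataset):
    # Closed-form least squares for the one-parameter student u = w * obs.
    num = sum(obs * act for obs, act in dataset)
    den = sum(obs * obs for obs, _ in dataset)
    return num / den


def train(iterations=5, rollouts_per_iter=30, seed=0):
    rng = random.Random(seed)
    dataset, weight = [], 0.0
    for _ in range(iterations):
        for _ in range(rollouts_per_iter):
            dataset.extend(rollout(weight, rng))  # aggregate, never discard
        weight = fit(dataset)                     # retrain on the full dataset
    return weight, len(dataset)


weight, num_samples = train()
print(weight, num_samples)
```

The learned weight ends up close to the negative expert gain, even though the student never observes the true state. The point of the sketch is the loop structure: data comes from states the student actually visits, while the labels come from the privileged expert.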

To validate the importance of input abstraction, the team compared their abstraction-based approach to an image-based network that uses raw camera images instead of feature tracks as visual input. In the training environment, the image-based network had a success rate of only 80% with a 58% higher tracking error than the team’s abstraction-based controller. This could be attributed to the higher sample complexity of learning from raw pixel images. Even more dramatically, the image-based controller failed completely when tested with previously unseen background images. In contrast, the abstraction-based approach maintained a 100% success rate in these conditions.

When deployed to the real world, the learned controllers can fly all maneuvers with no intervention. The results indicate that using all input modalities, including the abstracted visual input in the form of feature tracks, enhances robustness.