OpenFly
A Versatile Toolchain and Large-scale Benchmark for Aerial Vision-Language Navigation

*: Equal Contribution. †: Corresponding Author.

Abstract

Vision-Language Navigation (VLN) aims to guide agents through an environment using both language instructions and visual cues, and it plays a pivotal role in embodied AI. Indoor VLN has been studied extensively, whereas outdoor aerial VLN remains underexplored. A likely reason is that outdoor aerial views cover vast areas, which makes data collection more challenging and has led to a lack of benchmarks. To address this problem, we propose OpenFly, a platform comprising a versatile toolchain and a large-scale benchmark for aerial VLN. First, we develop a highly automated toolchain for data collection, enabling automatic point cloud acquisition, scene semantic segmentation, flight trajectory creation, and instruction generation. Second, based on the toolchain, we construct a large-scale aerial VLN dataset with 100k trajectories, covering diverse heights and lengths across 18 scenes. The corresponding visual data are generated using various rendering engines and advanced techniques, including Unreal Engine, GTA V, Google Earth, and 3D Gaussian Splatting (3D GS). All data exhibit high visual quality. In particular, 3D GS supports real-to-sim rendering, further enhancing the realism of the dataset. Third, we propose OpenFly-Agent, a keyframe-aware VLN model that takes language instructions, current observations, and historical keyframes as input and directly outputs flight actions. Extensive analyses and experiments demonstrate the superiority of the OpenFly platform and OpenFly-Agent. The toolchain, dataset, and code will be open-sourced.

Video

Toolchain Framework

Framework of the automatic data generation platform. Various rendering engines and simulators are first integrated, providing diverse high-quality scenes. Built on these, several interfaces and tools are developed, enabling automated generation of trajectories and instructions.

Model Architecture

The architecture of OpenFly-Agent. Frames at action transitions are selected as keyframes and serve as the observation history; their visual tokens are compressed to reduce the computational burden.
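The keyframe idea in this caption can be illustrated with a minimal Python sketch. It is not the OpenFly-Agent implementation: the action names, the function names `select_keyframes` and `mean_pool_tokens`, and the use of mean pooling as the compression step are all assumptions made for illustration, since the page does not specify how tokens are compressed.

```python
from typing import List, Sequence


def select_keyframes(actions: Sequence[str]) -> List[int]:
    """Return indices of steps whose action differs from the previous step.

    Assumption: an "action transition" is any change between consecutive actions.
    """
    keyframes = []
    for i in range(1, len(actions)):
        if actions[i] != actions[i - 1]:
            keyframes.append(i)
    return keyframes


def mean_pool_tokens(tokens: Sequence[Sequence[float]], group: int) -> List[List[float]]:
    """Compress a token sequence by averaging each run of `group` consecutive tokens.

    Assumption: mean pooling stands in for whatever compression the model uses.
    """
    if group < 1:
        raise ValueError("group must be >= 1")
    pooled = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        dim = len(chunk[0])
        pooled.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return pooled


# Hypothetical action sequence for one flight episode.
actions = ["forward", "forward", "turn_left", "turn_left", "forward", "ascend"]
print(select_keyframes(actions))  # [2, 4, 5]

# Three 2-dimensional tokens pooled in groups of two.
print(mean_pool_tokens([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], group=2))  # [[2.0, 3.0], [5.0, 6.0]]
```

Only the frames at the returned indices would be kept as history, so the history grows with the number of action changes rather than with the number of steps.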

Scene Overview

The OpenFly platform integrates five simulators built on diverse digital assets and simulation tools or plugins: Unreal Engine with UnrealCV, Unreal Engine with AirSim, GTA V with Script Hook V, Google Earth with Google Earth Studio, and 3D Gaussian Splatting with SIBR viewers. Together, these simulators cover 18 high-fidelity scenes and are used to collect large-scale heterogeneous data.

Point Cloud and Instance Segmentation

The OpenFly toolchain collects a point cloud for each scene, which enables semantic segmentation of the entire scene.

Trajectory Generation

The OpenFly platform's toolchain generates diverse collision-free trajectories with variable lengths and altitudes by leveraging scene point clouds and semantic segmentation.
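One way to check that a candidate trajectory is collision-free against a scene point cloud is to voxelize the cloud and sample each flight segment. The sketch below takes that approach. It is an illustrative assumption rather than the toolchain's actual method: the voxel size, the safety margin, and the function names `voxelize`, `segment_is_free`, and `trajectory_is_free` are hypothetical.

```python
import math
from typing import Iterable, Sequence, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxelize(points: Iterable[Point], size: float) -> Set[Voxel]:
    """Map each point of the cloud to the integer index of the voxel containing it."""
    return {tuple(math.floor(c / size) for c in p) for p in points}


def segment_is_free(a: Point, b: Point, occupied: Set[Voxel],
                    size: float, margin: int = 1) -> bool:
    """Sample the segment a->b at half-voxel spacing and reject it if any
    sample lies within `margin` voxels of an occupied voxel."""
    steps = max(1, math.ceil(math.dist(a, b) / (size / 2)))
    offsets = range(-margin, margin + 1)
    for k in range(steps + 1):
        t = k / steps
        vx, vy, vz = (math.floor((a[i] + t * (b[i] - a[i])) / size) for i in range(3))
        for dx in offsets:
            for dy in offsets:
                for dz in offsets:
                    if (vx + dx, vy + dy, vz + dz) in occupied:
                        return False
    return True


def trajectory_is_free(waypoints: Sequence[Point], occupied: Set[Voxel],
                       size: float, margin: int = 1) -> bool:
    """A trajectory is free if every consecutive waypoint pair is free."""
    return all(segment_is_free(a, b, occupied, size, margin)
               for a, b in zip(waypoints, waypoints[1:]))


# Toy scene: a 10 m tall column of points at x=5, y=0 (a stand-in for a building).
cloud = [(5.0, 0.0, float(z)) for z in range(10)]
occupied = voxelize(cloud, size=1.0)

low_path = [(0.0, 0.0, 2.0), (10.0, 0.0, 2.0)]     # flies through the column
high_path = [(0.0, 0.0, 15.0), (10.0, 0.0, 15.0)]  # flies above it
print(trajectory_is_free(low_path, occupied, size=1.0))   # False
print(trajectory_is_free(high_path, occupied, size=1.0))  # True
```

In this sketch, varying the altitude or length of a trajectory only changes the waypoints; the same occupancy test applies.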

Dataset Overview


Successful Examples

The OpenFly platform includes OpenFly-Agent, a keyframe-aware VLN model that takes language instructions, current observations, and historical keyframes as input and directly generates flight actions.

BibTeX

@article{OpenFly,
  author       = {Yunpeng Gao and
                  Chenhui Li and
                  Zhongrui You and
                  Junli Liu and
                  Zhen Li and
                  Pengan Chen and
                  Qizhi Chen and
                  Zhonghan Tang and
                  Liansheng Wang and
                  Penghui Yang and
                  Yiwen Tang and
                  Yuhang Tang and
                  Shuai Liang and
                  Songyi Zhu and
                  Ziqin Xiong and
                  Yifei Su and
                  Xinyi Ye and
                  Jianan Li and
                  Yan Ding and
                  Dong Wang and
                  Zhigang Wang and
                  Bin Zhao and
                  Xuelong Li},
  title        = {OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation},
  journal      = {CoRR},
  volume       = {abs/2502.18041},
  year         = {2025}
}