Example runway detections on VTI flight test data.
On September 19th and 20th, 2023, the FAA hosted an Artificial Intelligence Roadmap Review and Technical Session. Industry and government representatives met to advance the FAA’s Roadmap for Artificial Intelligence (AI) Safety Assurance. A common theme of the wide-ranging discussion can be put simply:
How do you prove AI works?
When faced with this question, many entrants into AI development, especially in the context of aviation, walk into a combination of challenging circumstances including:
Limited amounts of data due to the cost of flight testing
Variable label quality due to manual labeling cost and difficulty
Mixed data provenance (i.e. different data sources / configurations / settings) from historical flight tests
Pressure to show results quickly in new development programs
VTI’s “Simulate First” thesis uses synthetic data to address these challenges. In this article we share our perspective on the benefits of synthetic data in AI software development and highlight a specific internal engineering milestone.
Motivations for “Simulate First”
Synthetic data is a computer-generated representation of the sensor and higher-level scenario data products required for Machine Learning (ML) training. Our synthetic data engines span optical sensors, lidar, radar, GPS, ADS-B, and even ground-based radio navigation signals. By starting with synthetic data, we can carefully control scenario distributions and label quality. Data can then be packaged in middleware formats that mimic real-world sensor packages.
This means that we can produce any relevant data product under any set of conditions. This is especially valuable compared to costly real-world data collection, where environmental conditions or circumstances can be extremely difficult or unsafe to recreate in flight test. The high degree of data control means that VTI focuses on improving the system instead of improving the data. Finally, it allows for intentional exploration of model performance across expected conditions (i.e. model generalization).
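To make "carefully control scenario distributions" concrete, here is a minimal, illustrative sketch of a scenario sampler. The scenario fields, airports, and sampling distributions are hypothetical stand-ins rather than VTI's actual pipeline, and `render_scene` is only a placeholder for whatever renderer and sensor models would consume the sampled scenario to produce imagery and labels.

```python
# Illustrative only: sampling approach scenarios from an explicitly chosen
# distribution, so that coverage of conditions is a design decision rather
# than an accident of data collection. All names/values are hypothetical.
import random
from dataclasses import dataclass

RUNWAYS = {"KSFO": ["28L", "28R"], "KDEN": ["16L", "34R"],
           "KORD": ["10C", "27R"], "KBOS": ["04R", "22L"]}

@dataclass
class ApproachScenario:
    airport: str          # ICAO identifier
    runway: str           # e.g. "28L"
    distance_nm: float    # distance from the runway threshold, nautical miles
    time_of_day: str      # "day", "twilight", or "night"
    visibility_sm: float  # visibility, statute miles

def sample_scenario(rng: random.Random) -> ApproachScenario:
    """Draw one approach scenario from the chosen distributions."""
    airport = rng.choice(list(RUNWAYS))
    return ApproachScenario(
        airport=airport,
        runway=rng.choice(RUNWAYS[airport]),
        distance_nm=rng.uniform(0.5, 10.0),
        time_of_day=rng.choices(["day", "twilight", "night"],
                                weights=[0.6, 0.1, 0.3])[0],
        visibility_sm=rng.uniform(1.0, 10.0),
    )

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        scenario = sample_scenario(rng)
        print(scenario)  # each scenario would be passed to a renderer + sensor models
```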
VTI does not believe that synthetic data is a replacement for targeted flight testing and data collection. But we argue that starting “in sim” leads to faster development and more reliable, testable software at scale.
Synthetic Data At Work
A core capability of VTI’s FlightStack is the recognition of runways in sensor imagery. We therefore evaluated runway detector models across dozens of the busiest airports in the United States and a wide variety of environmental conditions. We then analyzed the results to glean insights into model generalization, data requirements, and potential for performance transfer to real data. This was an important opportunity to test and demonstrate our synthetic data thesis.
Note that runway detection is not as simple as it sounds! Models must deal with huge variation in scale and perspective during approach, in addition to a wide variety of markings, surface lighting, surroundings, and visibility. The video below demonstrates this variation:
The dataset we built to support our goal was ~1 million images with pixel-perfect runway bounding boxes. Our statistical sampling strategy created the equivalent of ~10,000 unique approaches - far beyond the scale of any flight testing effort underway today. VTI’s database and ML framework were used to train models specific to each airport and validate those models against data from every other airport. We wanted to test how a model trained on airport A performed on airports B, C, D, and so on. Standard detection metrics were used to measure performance.
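A minimal sketch of that airport-to-airport cross-validation loop is below. The model and dataset loaders are hypothetical placeholders, and recall at an IoU threshold of 0.5 stands in for whichever standard detection metrics were actually used.

```python
# Sketch: score every airport-specific detector against every airport's
# evaluation data. `load_model` and `load_eval_set` are hypothetical loaders.
from itertools import product

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def recall_at_iou(detections, ground_truth, thresh=0.5):
    """Fraction of labeled runways matched by a detection with IoU >= thresh."""
    hits = sum(1 for gt in ground_truth
               if any(iou(det, gt) >= thresh for det in detections))
    return hits / len(ground_truth) if ground_truth else 1.0

def cross_validate(airports, load_model, load_eval_set):
    """Return {(train_airport, eval_airport): mean per-frame recall}."""
    scores = {}
    for train_airport, eval_airport in product(airports, airports):
        model = load_model(train_airport)       # hypothetical: returns a callable
        frames = load_eval_set(eval_airport)    # hypothetical: list of (image, gt_boxes)
        per_frame = [recall_at_iou(model(image), gt) for image, gt in frames]
        scores[(train_airport, eval_airport)] = sum(per_frame) / len(per_frame)
    return scores
```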
Cross-validation metrics of these airport-specific models allow a graph-embedding visualization to uncover trends. An example from daytime-only data demonstrates the similarity of many airports across the United States that sit in semi-industrial zones at the edge of major cities. Two interesting sets of airports exist within the “cross-validation map”:
Large multi-runway airports (Denver, Chicago, Detroit, Dallas, and Phoenix).
Coastal airports with some portion of the approach near water (San Francisco, San Diego, Boston, and LaGuardia).
It was exciting to see cross-validation results reflect basic realities of the test airports.
Graph embedding of cross-validation performance for airport-specific runway detection models between constituent datasets.
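The article does not specify the embedding method, but one plausible way to build such a "cross-validation map" is to treat airports as graph nodes, weight edges by symmetric cross-airport performance, and apply a force-directed layout. The sketch below uses networkx's spring layout purely as an illustration, with toy scores.

```python
# Illustrative only: embed a cross-validation score matrix as a graph layout
# so that airports whose models transfer well to each other sit close together.
import networkx as nx
import matplotlib.pyplot as plt

def embed_cross_validation(scores, airports):
    """scores: {(train_airport, eval_airport): metric in [0, 1]}."""
    g = nx.Graph()
    g.add_nodes_from(airports)
    for i, a in enumerate(airports):
        for b in airports[i + 1:]:
            # Symmetrize: average of A-trained-on-B and B-trained-on-A performance.
            weight = 0.5 * (scores[(a, b)] + scores[(b, a)])
            g.add_edge(a, b, weight=weight)
    # Heavier edges pull mutually similar airports closer together.
    pos = nx.spring_layout(g, weight="weight", seed=0)
    return g, pos

if __name__ == "__main__":
    airports = ["KSFO", "KDEN", "KORD", "KBOS"]
    # Toy scores for illustration only.
    scores = {(a, b): 0.9 if a == b else 0.6 for a in airports for b in airports}
    g, pos = embed_cross_validation(scores, airports)
    nx.draw_networkx(g, pos, node_color="lightblue")
    plt.show()
```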
We also trained and tested models on “supersets” composed of data from all airports and conditions. Models trained on the superset of airports performed as well as models trained at an individual airport when validated against airport-specific slices of the dataset. We also observed that splitting training sets on time of day did not improve performance over a global model. This was somewhat surprising given the clear differences in visible runway features between the two regimes, such as paint markings in daylight and runway lighting at night. A next step for us will be to quantify performance during twilight hours: does the apparent generalization span this period?
Example daytime case.
Example nighttime case.
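Per-slice comparisons like these can be summarized from a flat table of evaluation results. The sketch below shows one way to line a superset-trained model up against per-airport models by airport and time of day; the column names, model labels, and numbers are purely illustrative rather than VTI's actual results.

```python
# Illustrative only: summarize per-slice detection metrics so that a
# superset-trained model and slice-specific models can be compared side by side.
import pandas as pd

def summarize(results: pd.DataFrame) -> pd.DataFrame:
    """results columns: model, airport, time_of_day, recall (one row per eval run)."""
    return (results
            .groupby(["airport", "time_of_day", "model"])["recall"]
            .mean()
            .unstack("model"))

if __name__ == "__main__":
    rows = [  # toy numbers for illustration only
        {"model": "superset", "airport": "KSFO", "time_of_day": "day",   "recall": 0.94},
        {"model": "per_airport", "airport": "KSFO", "time_of_day": "day",   "recall": 0.93},
        {"model": "superset", "airport": "KSFO", "time_of_day": "night", "recall": 0.91},
        {"model": "per_airport", "airport": "KSFO", "time_of_day": "night", "recall": 0.90},
    ]
    print(summarize(pd.DataFrame(rows)))  # one row per slice, one column per model
```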
Our most surprising outcome was the robust transfer of models trained in simulation to real camera data from a variety of sources (see lead image). But that’s a story for another flight!