Abstract
This paper presents a study on pedestrian detection and count estimation from unmanned aerial system (UAS) imagery using recent YOLO-based object detection models. The objective is to evaluate model performance for identifying humans from aerial perspectives and to develop a customized detector suited for UAS applications. Experimental results show that the model achieved consistent detection accuracy up to 40 m altitude, with near-perfect pedestrian identification and minimal false positives. The results indicate that combining modern artificial intelligence models with UAS-mounted vision systems is robust enough for real-time deployment in crowd monitoring, autonomous surveillance, and search-and-rescue operations.
Introduction
Recent advancements in unmanned aerial systems (UAS) and artificial intelligence (AI) have enabled new opportunities in aerial surveillance and human activity monitoring. Pedestrian detection from UAS imagery has become especially important for public safety, disaster response, and crowd management. With onboard vision systems, UASs offer a broader and more flexible observation capability than ground-based cameras, providing real-time situational awareness across large environments.
Object detection models, particularly those based on the YOLO family, continue to advance in accuracy and real-time efficiency. However, pedestrian detection from aerial views remains challenging due to varying altitudes, lighting changes, occlusion, and small target scales. These factors require models that can generalize well under dynamic aerial conditions while maintaining reliable detection performance.
As illustrated in Figure 1, the experimental setup used a UAS-mounted RGB camera to perform pedestrian detection over team members in a controlled field environment. The goal of this work is to evaluate recent YOLO models for pedestrian identification and counting from aerial imagery. A refined dataset was created from public aerial sources to fine-tune the detector for improved performance. By focusing on realistic flight scenarios and true UAS perspectives, this study aims to support reliable AI-assisted systems for crowd monitoring, search and rescue, and security surveillance.

UAS field test demonstrating real-time pedestrian detection on subjects in an outdoor environment. The yellow arrow highlights the UAS platform, which may have reduced visibility against the sky, while the narrow black arrow indicates that pedestrian counts are being recorded from the aerial perspective.
The key contributions of this work are: (a) the first application and evaluation of the YOLOv12 architecture for aerial pedestrian detection from UAS imagery; (b) a systematic altitude stress-testing protocol that quantifies detection robustness from 10 m to 50 m; (c) a comparative analysis across YOLOv10 and YOLOv12 model families with deployment-aware variant selection based on accuracy–latency trade-offs; and (d) real-time deployment validation on live UAS video feeds with quantitative throughput measurements.
Literature review
Research in UAS-based AI applications highlights the versatility and reliability of modern computer vision systems across diverse operational contexts, many of which inform and strengthen pedestrian detection studies.
Buchelt et al. 1 demonstrated that AI-equipped UASs can effectively classify and monitor ecological features, establishing the potential of computer vision in large-scale real-world environments. Hegde et al. 2 emphasized that the quality of annotated datasets directly determines the reliability of autonomous navigation and detection, a critical insight for pedestrian identification. Kaufmann et al. 3 explored reinforcement learning approaches where UASs outperformed human pilots in real-time navigation, demonstrating the adaptability of AI in dynamic scenarios relevant to pedestrian tracking.
Jung and Choi 4 enhanced YOLOv5 for UAS imagery, improving both efficiency and precision under conditions such as low light or changing altitude, key challenges in human detection from aerial views. Similarly, Piponidis and Theocharides 5 introduced dynamic CNN parameter optimization for varying flight altitudes, offering strategies to balance detection accuracy with computational efficiency on constrained UAS hardware. Zhu et al. 6 proposed TPH-YOLOv5 with a transformer-based prediction head specifically designed for drone-captured scenarios, achieving competitive results on the VisDrone challenge 7 and highlighting the benefit of attention mechanisms for small-object aerial detection.
Kunhoth et al. 8 analyzed YOLO variants on thermal imagery, showing that model choice should align with environmental and dataset characteristics. Shekhar et al. 9 proposed combining supervised learning with clustering to identify anomalies, an approach potentially useful for detecting unusual pedestrian behavior or movement. Sujith et al. 10 addressed underwater detection challenges caused by poor image clarity, which resemble the degradation seen at high aerial altitudes, and demonstrated that hyperspectral imaging and CNN-based feature extraction can significantly enhance detection accuracy, offering insights into improving aerial imagery under degraded visibility.
Papaioannidis et al. 11 proposed a supervised deep learning approach for UAS safety that detects human crowds in aerial images to define “no-fly zones”. Their multitasking CNN combines semantic segmentation with image-to-image translation, achieving faster and more accurate crowd detection than previous methods. Iftikhar et al. 12 explored deep learning-based pedestrian detection for autonomous vehicles and identified both major challenges and improvements. They found that accurate pedestrian detection is hindered in part by low image quality and varying lighting conditions in the camera source, and that LiDAR and combined thermal and RGB imaging improved detection rates, especially at night.
Gomes et al. 13 describe a real-time system for counting people and bicycles using YOLOv5. They addressed power consumption by dynamically adjusting the frame rate based on the speed of the detected object. Zuehlke et al. 14 deploy a vision-based object detection algorithm on a UAS together with proportional navigation for collision avoidance. Integrating image processing into the guidance system enables the UAS to autonomously sense and avoid collisions in both real-world and simulated environments. Chen and Juang 15 present an application of the YOLOv4 model to radiographic testing tasks in aviation maintenance. When searching for defects such as cracks and engine flaws, the model achieved faster detection and higher accuracy than earlier two-stage detectors.
Methodology
The overall workflow of the proposed system is organized into three sequential layers: data acquisition, data processing, and output visualization. As illustrated in Figure 2, the UAS captures real-time RGB video using its onboard camera and transmits the feed to a ground station. The incoming frames are processed using a custom-trained YOLOv12 model to detect and count pedestrians, and the annotated results are displayed to the user through a graphical interface. This modular design ensures smooth integration between aerial sensing, ground-station inference, and real-time visualization for field deployment.

Overview of the proposed real-time YOLOv12-based pedestrian detection system using UAS imagery.
YOLOv12 architecture overview
The YOLOv12 model 16 represents the latest evolution of the YOLO family and follows the standard backbone–neck–head structure. The

YOLOv12 architecture diagram.
Vision model selection
To identify a suitable baseline for aerial pedestrian detection, we conducted a controlled comparison of the YOLOv10 and YOLOv12 model families using RGB imagery captured from the Skydio X10. Eight human subjects were positioned at a fixed horizontal distance, while the UAS altitude was varied from 3 m to 30 m. For each altitude, pre-trained variants (n, s, m, l, x) were evaluated by averaging detection confidence across multiple frames.
The results in Figures 4 to 6 show that YOLOv12 consistently maintains higher and more stable confidence across all altitudes compared to YOLOv10. The YOLOv12x model achieved the strongest performance overall, demonstrating smoother degradation at higher flight levels and better generalization to aerial viewpoints.

Detection confidence of YOLOv10 variants across different altitudes, keeping horizontal distance 3 meters.

Performance of YOLOv12 variants with respect to UAS altitude, keeping horizontal distance 3 meters.

Comparison of best-performing variants from each family (YOLOv10x vs YOLOv12x), keeping horizontal distance 3 meters.
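The per-altitude averaging described above can be sketched as follows. This is a minimal illustration rather than the evaluation script used in the study: the record layout `(variant, altitude_m, confidence)` and the function name `mean_confidence_by_altitude` are assumptions, and the sample values are placeholders, not measured results. The text does not state how missed detections were handled, so this sketch simply averages the detections that are present.

```python
from collections import defaultdict
from statistics import mean


def mean_confidence_by_altitude(records):
    """Average detection confidence for each (variant, altitude) pair.

    `records` is an iterable of (variant, altitude_m, confidence) tuples,
    one per detection per frame. This layout is hypothetical.
    """
    grouped = defaultdict(list)
    for variant, altitude_m, confidence in records:
        grouped[(variant, altitude_m)].append(confidence)
    return {key: mean(values) for key, values in grouped.items()}


# Placeholder values for illustration only (not measured results).
sample = [
    ("yolov12x", 3, 0.90),
    ("yolov12x", 3, 0.80),
    ("yolov12x", 30, 0.60),
    ("yolov10x", 3, 0.70),
]
averages = mean_confidence_by_altitude(sample)
```

Plotting the resulting averages against altitude for each variant would yield curves of the kind shown in Figures 4 to 6.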
To account for on-board deployment constraints, a secondary analysis was conducted using Ultralytics benchmark metrics to compare accuracy, inference latency, and parameter count across the YOLOv12 family. As shown in Figures 7 and 8, larger variants offer marginal accuracy gains but introduce substantial increases in computational cost.

Accuracy versus inference latency for YOLOv12 variants.

Model size versus detection accuracy for YOLOv12 variants.
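One way to make such an accuracy, latency, and size trade-off explicit is sketched below. The selection rule (the smallest variant whose accuracy is within a tolerance of the best accuracy available under a latency budget) is an assumption made for illustration and is not necessarily the procedure followed in this study. The `Variant` class, the `select_variant` function, and all numbers are hypothetical placeholders, not Ultralytics benchmark values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    accuracy: float    # benchmark accuracy score in [0, 1]
    latency_ms: float  # inference time per image
    params_m: float    # parameter count in millions


def select_variant(candidates, max_latency_ms, tolerance=0.0):
    """Pick the smallest variant whose accuracy is within `tolerance`
    of the best accuracy achievable under the latency budget."""
    feasible = [v for v in candidates if v.latency_ms <= max_latency_ms]
    if not feasible:
        raise ValueError("no variant meets the latency budget")
    best_accuracy = max(v.accuracy for v in feasible)
    near_best = [v for v in feasible if v.accuracy >= best_accuracy - tolerance]
    return min(near_best, key=lambda v: v.params_m)


# Placeholder numbers for illustration only.
toy = [
    Variant("n", 0.40, 2.0, 3.0),
    Variant("s", 0.48, 3.0, 9.0),
    Variant("m", 0.52, 5.0, 20.0),
    Variant("x", 0.55, 12.0, 60.0),
]
choice = select_variant(toy, max_latency_ms=6.0, tolerance=0.05)
```

With these toy values, the largest variant is excluded by the latency budget, and the rule prefers the smaller of the two variants whose accuracies fall within the tolerance.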
Based on the combined trade-off between accuracy, stability, inference latency, and real-time feasibility on embedded platforms,
Dataset preparation
The YOLOv12s model was trained and evaluated using the VisDrone dataset, 17 a widely used benchmark containing thousands of annotated aerial images captured under diverse altitudes, lighting conditions, and crowd densities. For this study, aerial imagery was additionally collected using the Skydio X10 platform, whose VT300-L sensor payload is summarized in Table 1.
Specifications of the Skydio X10 VT300-L sensor payload.
To tailor the dataset for pedestrian detection, only the
Dataset split after preprocessing for the pedestrian detection task.
This refined dataset enabled the model to learn pedestrian-specific features from diverse aerial viewpoints, improving generalization under real-world UAS conditions.
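As an illustration of class filtering, the sketch below converts VisDrone-DET annotation lines into single-class YOLO labels. It is not the pipeline used in this study, which relied on the Roboflow platform. The function name `visdrone_to_yolo` is hypothetical, and keeping only category 1 ("pedestrian" in VisDrone-DET) is an assumption, since the exact set of retained classes is not recoverable from the text above.

```python
def visdrone_to_yolo(lines, image_width, image_height, keep_categories=(1,)):
    """Convert VisDrone-DET annotation lines to single-class YOLO labels.

    Each VisDrone line is:
    bbox_left,bbox_top,bbox_width,bbox_height,score,category,truncation,occlusion
    Keeping only category 1 ("pedestrian") is an assumption of this sketch.
    """
    labels = []
    for line in lines:
        fields = line.strip().rstrip(",").split(",")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        left, top, width, height, score, category = (int(f) for f in fields[:6])
        if score == 0 or category not in keep_categories:
            continue  # score 0 marks boxes that are ignored in evaluation
        if width <= 0 or height <= 0:
            continue  # skip degenerate boxes
        # YOLO labels use normalized box centers and sizes.
        x_center = (left + width / 2) / image_width
        y_center = (top + height / 2) / image_height
        labels.append(
            f"0 {x_center:.6f} {y_center:.6f} "
            f"{width / image_width:.6f} {height / image_height:.6f}"
        )
    return labels


example = [
    "100,200,20,40,1,1,0,0",  # pedestrian: kept
    "300,400,80,50,1,4,0,0",  # car: dropped
    "10,10,5,5,0,0,0,0",      # ignored region: dropped
]
yolo_labels = visdrone_to_yolo(example, image_width=1000, image_height=800)
```

Passing `keep_categories=(1, 2)` would also retain the VisDrone "people" category, mapped to the same single output class.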
Model training pipeline
The YOLOv12s model was trained on the VisDrone pedestrian subset 17 using an NVIDIA GeForce RTX 4070 Laptop GPU. Training ran for 287 epochs with early stopping, and the best checkpoint was obtained at epoch 187. Table 3 summarizes the key training configuration.
Training configuration and results for the YOLOv12s detector.
Standard YOLO augmentations (scaling, flips, and mosaic) were applied to improve robustness to varying aerial conditions. Figure 9 shows the convergence of major loss terms, all exhibiting smooth downward trends that indicate stable optimization.

Training loss convergence curves for the YOLOv12s model.
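The reported epoch counts (287 epochs run, best checkpoint at epoch 187) are consistent with patience-based early stopping using a patience of 100 epochs. The patience value is not stated in the text above, so 100 is an assumption chosen because it reproduces those counts. The sketch below shows the stopping rule on a synthetic fitness curve; the function name and the curve are illustrative only.

```python
def run_with_early_stopping(fitness_per_epoch, patience):
    """Return (best_epoch, last_epoch), both 1-indexed.

    Training stops once `patience` epochs have passed without the
    fitness metric improving on the best value seen so far.
    """
    best_epoch, best_fitness = 0, float("-inf")
    last_epoch = 0
    for epoch, fitness in enumerate(fitness_per_epoch, start=1):
        last_epoch = epoch
        if fitness > best_fitness:
            best_epoch, best_fitness = epoch, fitness
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, last_epoch


# Synthetic fitness curve: improves until epoch 187, then plateaus.
curve = [min(epoch, 187) / 187 for epoch in range(1, 1001)]
best, last = run_with_early_stopping(curve, patience=100)
```

Under this rule, the weights saved at the best epoch are the ones carried forward, not those from the final epoch.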
Model evaluation was performed using precision–recall analysis. The resulting curve (Figure 10) indicates an mAP@0.5 of 0.529, confirming strong generalization across diverse aerial viewpoints.

Precision–recall curve.
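For readers unfamiliar with how a precision–recall curve is summarized as a single number, the sketch below computes average precision (AP) as the area under the curve using all-point interpolation. For a single-class detector, mAP@0.5 reduces to the AP of that class at an IoU threshold of 0.5. Evaluation frameworks differ in interpolation details, so this is an illustrative approximation and not the exact computation behind the reported value; the toy curve is not the paper's data.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve with all-point interpolation.

    `recalls` must be sorted in increasing order, with `precisions`
    aligned element by element.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Build the precision envelope: each value becomes the maximum
    # precision at any equal or higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas over each recall interval.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# Toy curve for illustration only.
ap = average_precision([0.2, 0.4, 0.6], [1.0, 0.75, 0.5])
```

In this toy case the detector never exceeds a recall of 0.6, so the interval from 0.6 to 1.0 contributes nothing to the area.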
Comparison with other detection architectures
To contextualize the performance of YOLOv12s, Table 4 compares its results against several established detectors evaluated on the VisDrone pedestrian detection task. Published results for Faster R-CNN, 18 FCOS, 19 and RT-DETR 20 are included alongside the YOLOv12s results obtained in this study.
Comparison of detection architectures on VisDrone pedestrian subset.
Published results are cited from respective works; YOLOv12s results are from this study.
YOLOv12s achieves the highest mAP@0.5 among the compared detectors while maintaining the smallest parameter footprint. Although RT-DETR-L reports higher theoretical throughput on server-grade hardware, YOLOv12s offers a favorable trade-off for resource-constrained UAS deployments where model size and memory are limiting factors.
The normalized confusion matrices in Figure 11 illustrate the model’s performance across all data splits. The training set achieves

Normalized confusion matrices showing YOLOv12s pedestrian detection performance across data splits: (a) training set (91% accuracy), (b) validation set (86% accuracy), and (c) test set (84% accuracy). The model achieves consistent performance across all splits, with the slight decrease from training to test accuracy indicating good generalization without significant overfitting.
Results and conclusion
Field testing and real-world evaluation
To assess real-world generalization beyond the VisDrone benchmark, the trained YOLOv12s model was tested on aerial RGB footage collected using a

Qualitative field results of the trained YOLOv12s pedestrian detector on unseen Skydio X10 UAS imagery (confidence threshold
Across all locations, the model accurately detected pedestrians with stable bounding boxes and counts that closely matched ground truth. Its consistent performance under changing illumination and mild occlusions demonstrates suitability for real-time aerial surveillance, search-and-rescue, and crowd monitoring tasks.
For real-time deployment, the Skydio X10 remote controller was connected to a laptop running the trained YOLOv12s model over the live video feed. During inference, the system sustained a frame rate of approximately 30–45 FPS on an NVIDIA RTX 4070 Laptop GPU, with all detections filtered at a confidence threshold of 0.70. Table 5 summarizes the deployment configuration.
Real-time deployment metrics during live UAS feed inference.
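The two deployment quantities discussed above, the thresholded pedestrian count and the sustained frame rate, can be measured along the following lines. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the threshold is applied as "greater than or equal to 0.70", and `process_frame` stands in for whatever inference call the ground station runs. The confidence values are placeholders.

```python
import time


def count_pedestrians(confidences, threshold=0.70):
    """Count detections whose confidence meets the threshold."""
    return sum(1 for c in confidences if c >= threshold)


def measure_fps(process_frame, frames):
    """Average frames per second of `process_frame` over `frames`."""
    start = time.perf_counter()
    processed = 0
    for frame in frames:
        process_frame(frame)
        processed += 1
    elapsed = time.perf_counter() - start
    return processed / elapsed if elapsed > 0 else float("inf")


# Placeholder confidences for one frame (illustration only).
frame_confidences = [0.95, 0.71, 0.69, 0.40]
count = count_pedestrians(frame_confidences)
# Here the "frame" is just a list of confidences standing in for an image.
fps = measure_fps(count_pedestrians, [frame_confidences] * 100)
```

In a real deployment, the timed call would include frame decoding, model inference, and drawing of the annotated output, since all three affect the frame rate seen by the operator.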
Altitude stress testing and detection robustness
To assess robustness across different aerial viewpoints, an altitude stress test was performed using the Skydio X10 at

Altitude stress-testing of the custom-trained YOLOv12s pedestrian detector at increasing heights (confidence threshold
Detection performance of the trained YOLOv12s pedestrian detector at varying flight altitudes, as shown in Figure 13.
The trained YOLOv12s model maintained perfect detection performance up to 40 m. At 50 m, it missed two pedestrians due to reduced pixel density and shadow interference. Overall, the model demonstrated strong robustness across typical UAS operating heights, with mild degradation only at higher altitudes.
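The effect of altitude on pixel density can be made concrete with a simple pinhole-camera calculation. The sketch below assumes a nadir-pointing camera, and the focal length and pixel pitch are hypothetical values, not the Skydio X10 specification. For an oblique view, the slant range to the subject would replace the altitude.

```python
def ground_sample_distance(altitude_m, focal_length_mm, pixel_pitch_um):
    """Metres of ground covered by one pixel (nadir pinhole camera)."""
    return altitude_m * (pixel_pitch_um * 1e-6) / (focal_length_mm * 1e-3)


def extent_in_pixels(extent_m, altitude_m, focal_length_mm, pixel_pitch_um):
    """Number of pixels spanned by an object of size `extent_m`."""
    gsd = ground_sample_distance(altitude_m, focal_length_mm, pixel_pitch_um)
    return extent_m / gsd


# Hypothetical camera values, not the Skydio X10 specification.
FOCAL_MM, PITCH_UM = 10.0, 2.0
px_10m = extent_in_pixels(0.5, 10.0, FOCAL_MM, PITCH_UM)
px_50m = extent_in_pixels(0.5, 50.0, FOCAL_MM, PITCH_UM)
```

Whatever the camera parameters, the pixel extent of a subject scales inversely with distance, so a pedestrian at 50 m spans one fifth as many pixels along each axis as at 10 m, which is in line with the missed detections observed at the highest altitude.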
Crowd density stress testing
In addition to altitude variation, the detector was evaluated under high-density crowd conditions to assess robustness in scenarios with significant pedestrian overlap and occlusion. Figure 14 shows a representative example of the model maintaining stable detection performance despite extreme visual clutter.

Crowd density stress test: custom-trained YOLOv12s pedestrian detector applied to a high-occlusion, densely populated outdoor event scenario (119 detections).
Limitations and future directions
Although the trained YOLOv12s detector performed well across most aerial settings, its accuracy declined beyond 40 m due to smaller pedestrian scales and shadow ambiguity. Environmental factors, including uneven lighting, reflections, and partial occlusions, also introduced occasional errors. In addition, the model was trained on a single RGB dataset and tested mainly in daylight, limiting its robustness in low-light conditions.
Future improvements will explore multi-modal sensing with RGB–thermal fusion, training larger YOLOv12 variants for better altitude and lighting generalization, and incorporating temporal cues through lightweight tracking to enable smoother real-time aerial monitoring.
Conclusion
This study presented a real-time pedestrian detection framework using the YOLOv12s model for aerial imagery captured by UAS platforms. Trained on a refined VisDrone subset and tested across diverse real-world settings, the detector maintained strong accuracy up to 40 m altitude, with only minor degradation at higher elevations due to reduced object scale and shadow effects. Despite these challenges, the system preserved stable performance and real-time inference on lightweight hardware.
Overall, the trained YOLOv12s model offers an effective balance between speed, accuracy, and efficiency for UAS-based perception, making it suitable for applications such as crowd monitoring, autonomous surveillance, and search-and-rescue. Future work will investigate multi-sensor fusion, temporal consistency methods, and fine-tuning of larger YOLOv12 variants to improve robustness in more complex aerial environments.
Footnotes
Acknowledgments
The research was funded through the Kennesaw State University Office of Research. The authors acknowledge the use of the Roboflow platform for dataset preparation and annotation, and the Ultralytics framework for model training and evaluation. AI-based writing assistance tools were used to improve grammar, clarity, and formatting during manuscript preparation. All technical content, results, and interpretations remain the sole responsibility of the authors.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research was funded through the Kennesaw State University Office of Research.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
