AI-assisted pedestrian detection and enumeration from aerial perspectives

Abstract

This paper presents a study on pedestrian detection and counting from unmanned aerial system (UAS) imagery using recent YOLO-based object detection models. The objective is to evaluate model performance for identifying humans from aerial perspectives and to develop a customized detector suited for UAS applications. Experimental results show that the fine-tuned model maintained consistent detection accuracy up to 40 m altitude, with near-perfect pedestrian identification and minimal false positives. These results indicate that combining modern artificial intelligence models with UAS-mounted vision systems is a viable basis for real-time crowd monitoring, autonomous surveillance, and search-and-rescue operations.

Keywords

Pedestrian detection, UAS, YOLO, deep learning, aerial imagery, computer vision, object detection

Introduction

Recent advancements in unmanned aerial systems (UAS) and artificial intelligence (AI) have enabled new opportunities in aerial surveillance and human activity monitoring. Pedestrian detection from UAS imagery has become especially important for public safety, disaster response, and crowd management. With onboard vision systems, UASs offer a broader and more flexible observation capability than ground-based cameras, providing real-time situational awareness across large environments.

Object detection models, particularly those based on the YOLO family, continue to advance in accuracy and real-time efficiency. However, pedestrian detection from aerial views remains challenging due to varying altitudes, lighting changes, occlusion, and small target scales. These factors require models that can generalize well under dynamic aerial conditions while maintaining reliable detection performance.

As illustrated in Figure 1, the experimental setup used a UAS-mounted RGB camera to perform pedestrian detection over team members in a controlled field environment. The goal of this work is to evaluate recent YOLO models for pedestrian identification and counting from aerial imagery. A refined dataset was created from public aerial sources to fine-tune the detector for improved performance. By focusing on realistic flight scenarios and true UAS perspectives, this study aims to support reliable AI-assisted systems for crowd monitoring, search and rescue, and security surveillance.

Figure 1.

UAS field test demonstrating real-time pedestrian detection on subjects in an outdoor environment. The yellow arrow highlights the UAS platform, which may have reduced visibility against the sky, while the narrow black arrow indicates that pedestrian counts are being recorded from the aerial perspective.

The key contributions of this work are: (a) the first application and evaluation of the YOLOv12 architecture for aerial pedestrian detection from UAS imagery; (b) a systematic altitude stress-testing protocol that quantifies detection robustness from 10 m to 50 m; (c) a comparative analysis across YOLOv10 and YOLOv12 model families with deployment-aware variant selection based on accuracy–latency trade-offs; and (d) real-time deployment validation on live UAS video feeds with quantitative throughput measurements.

Literature review

Research in UAS-based AI applications highlights the versatility and reliability of modern computer vision systems across diverse operational contexts, many of which inform and strengthen pedestrian detection studies.

Buchelt et al.¹ demonstrated that AI-equipped UASs can effectively classify and monitor ecological features, establishing the potential of computer vision in large-scale real-world environments. Hegde et al.² emphasized that the quality of annotated datasets directly determines the reliability of autonomous navigation and detection, a critical insight for pedestrian identification. Kaufmann et al.³ explored reinforcement learning approaches where UASs outperformed human pilots in real-time navigation, demonstrating the adaptability of AI in dynamic scenarios relevant to pedestrian tracking.

Jung and Choi⁴ enhanced YOLOv5 for UAS imagery, improving both efficiency and precision under conditions such as low light or changing altitude, key challenges in human detection from aerial views. Similarly, Piponidis and Theocharides⁵ introduced dynamic CNN parameter optimization for varying flight altitudes, offering strategies to balance detection accuracy with computational efficiency on constrained UAS hardware. Zhu et al.⁶ proposed TPH-YOLOv5 with a transformer-based prediction head specifically designed for drone-captured scenarios, achieving competitive results on the VisDrone challenge⁷ and highlighting the benefit of attention mechanisms for small-object aerial detection.

Kunhoth et al.⁸ analyzed YOLO variants on thermal imagery, showing that model choice should align with environmental and dataset characteristics. Shekhar et al.⁹ proposed combining supervised learning with clustering to identify anomalies, an approach potentially useful for detecting unusual pedestrian behavior or movement. Sujith et al.¹⁰ addressed underwater detection, where poor water clarity degrades imagery much as altitude does in aerial views, and demonstrated that hyperspectral imaging and CNN-based feature extraction can significantly enhance detection accuracy, offering insights into improving aerial imagery under degraded visibility.

Papaioannidis et al.¹¹ proposed a supervised deep learning approach for UAS safety that detects human crowds in aerial images to define “no-fly zones”. Their multitask CNN combines semantic segmentation with image-to-image translation, achieving faster and more accurate crowd detection than previous methods. Iftikhar et al.¹² reviewed deep learning-based pedestrian detection for autonomous vehicles and identified both open challenges and promising improvements. Accurate detection remained difficult partly because of low-quality images and varying lighting conditions at the camera source, whereas adding LiDAR and thermal sensing to RGB imaging improved detection rates, especially at night.

Gomes et al.¹³ describe a real-time system for counting people and bicycles using YOLOv5. They addressed power consumption by dynamically adjusting the frame rate according to the speed of the detected object. Zuehlke et al.¹⁴ deploy a vision-based object detection algorithm on a UAS together with proportional navigation for collision avoidance; integrating image processing into the guidance system enables the UAS to sense and avoid collisions autonomously in both simulated and real-world environments. Chen and Juang¹⁵ apply the YOLOv4 model to radiographic testing tasks in aviation maintenance, such as locating cracks and engine flaws, where the model was faster and more accurate than earlier two-stage detectors.

Methodology

The overall workflow of the proposed system is organized into three sequential layers: data acquisition, data processing, and output visualization. As illustrated in Figure 2, the UAS captures real-time RGB video using its onboard camera and transmits the feed to a ground station. The incoming frames are processed using a custom-trained YOLOv12 model to detect and count pedestrians, and the annotated results are displayed to the user through a graphical interface. This modular design ensures smooth integration between aerial sensing, ground-station inference, and real-time visualization for field deployment.

Figure 2.

Overview of the proposed real-time YOLOv12-based pedestrian detection system using UAS imagery.
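The three layers described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Detection` tuple layout, the `FrameResult` container, and the `run_pipeline` function are hypothetical names introduced here, and the detector is passed in as a plain callable so that any model wrapper could be substituted. The default confidence threshold of 0.70 is the value reported for live deployment later in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical layout for one detected pedestrian: (x1, y1, x2, y2, confidence).
Detection = Tuple[float, float, float, float, float]


@dataclass
class FrameResult:
    """Detections kept for one frame, ready to be drawn by a display layer."""
    frame_index: int
    detections: List[Detection]

    @property
    def count(self) -> int:
        # Pedestrian count is simply the number of detections that survived filtering.
        return len(self.detections)


def run_pipeline(
    frames: Iterable[object],
    detect: Callable[[object], List[Detection]],
    conf_threshold: float = 0.70,
) -> Iterator[FrameResult]:
    """Acquisition -> processing -> output, one frame at a time."""
    for index, frame in enumerate(frames):      # data acquisition layer
        raw = detect(frame)                     # data processing layer (model inference)
        kept = [d for d in raw if d[4] >= conf_threshold]
        yield FrameResult(index, kept)          # consumed by the visualization layer
```

Because the function is a generator, the display layer can consume results frame by frame without buffering the whole video stream.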

YOLOv12 architecture overview

The YOLOv12 model¹⁶ represents the latest evolution of the YOLO family and follows the standard backbone–neck–head structure. The backbone uses the Attentive C2f (AC2f) module to enhance spatial and channel-wise attention during feature extraction. The neck applies an Attention-PAN-FPN (APAN-FPN) for multi-scale fusion, improving the detection of small and partially occluded pedestrians. The detection head employs decoupled branches for classification and bounding box regression, with attention-guided refinement for improved accuracy. Figure 3 shows a high-level overview of the architecture. YOLOv12 was selected for this work due to its improved attention mechanisms and real-time performance on lightweight UAS hardware.

Figure 3.

YOLOv12 architecture diagram.

Vision model selection

To identify a suitable baseline for aerial pedestrian detection, we conducted a controlled comparison of the YOLOv10 and YOLOv12 model families using RGB imagery captured from the Skydio X10. Eight human subjects were positioned at a fixed horizontal distance, while the UAS altitude was varied from 3 m to 30 m. For each altitude, pre-trained variants (n, s, m, l, x) were evaluated by averaging detection confidence across multiple frames.
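The aggregation step in this comparison can be sketched as follows. The paper does not specify how confidences were averaged, so the sketch assumes that all detection confidences from all frames at a given altitude are pooled and averaged per variant; the `FrameRecord` layout and the function name are hypothetical. Under this assumption, frames with no detections contribute nothing to the mean, so missed subjects are not penalized.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# Hypothetical record: (variant name, altitude in metres, confidences detected in one frame).
FrameRecord = Tuple[str, float, List[float]]


def mean_confidence_by_altitude(
    records: Iterable[FrameRecord],
) -> Dict[str, Dict[float, float]]:
    """Pool per-frame confidences and average them per variant and altitude."""
    pooled = defaultdict(lambda: defaultdict(list))
    for variant, altitude_m, confidences in records:
        pooled[variant][altitude_m].extend(confidences)
    return {
        variant: {
            altitude: mean(values)
            for altitude, values in sorted(by_altitude.items())
            if values  # skip altitudes where nothing was detected
        }
        for variant, by_altitude in pooled.items()
    }
```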

The results in Figures 4 to 6 show that YOLOv12 consistently maintains higher and more stable confidence across all altitudes compared to YOLOv10. The YOLOv12x model achieved the strongest performance overall, demonstrating smoother degradation at higher flight levels and better generalization to aerial viewpoints.

Figure 4.

Detection confidence of YOLOv10 variants across different altitudes at a fixed horizontal distance of 3 m.

Figure 5.

Performance of YOLOv12 variants with respect to UAS altitude at a fixed horizontal distance of 3 m.

Figure 6.

Comparison of best-performing variants from each family (YOLOv10x vs YOLOv12x) at a fixed horizontal distance of 3 m.

To account for on-board deployment constraints, a secondary analysis was conducted using Ultralytics benchmark metrics to compare accuracy, inference latency, and parameter count across the YOLOv12 family. As shown in Figures 7 and 8, larger variants offer marginal accuracy gains but introduce substantial increases in computational cost.

Figure 7.

Accuracy versus inference latency for YOLOv12 variants.

Figure 8.

Model size versus detection accuracy for YOLOv12 variants.

Based on the combined trade-off between accuracy, stability, inference latency, and real-time feasibility on embedded platforms, YOLOv12s was selected as the baseline model for fine-tuning and downstream experiments. Notably, YOLOv12s offers significantly lower latency compared to larger variants (Figure 7), making it particularly suitable for real-time UAS applications where rapid inference is critical for responsive detection and tracking.

Dataset preparation

The YOLOv12s model was trained and evaluated using the VisDrone dataset,¹⁷ a widely used benchmark containing thousands of annotated aerial images captured under diverse altitudes, lighting conditions, and crowd densities. For this study, aerial imagery was additionally collected using the Skydio X10 platform, whose VT300-L sensor payload is summarized in Table 1.

Table 1.

Specifications of the Skydio X10 VT300-L sensor payload.

Camera type	Specifications
Narrow RGB Camera	46 mm eq.; 64 MP; f/1.8; 50° FOV
1-inch Wide RGB Camera	20 mm eq.; 50 MP; f/1.95; 93° FOV
Thermal Camera	60 mm eq.; 640 $\times$ 512 px; 41° FOV; <30 mK
Flashlight Module	22 lux at 3 m

To tailor the dataset for pedestrian detection, only the person class was retained, and all other categories were removed. Annotations were converted into the standard YOLO format, and the final split is shown in Table 2.

Table 2.

Dataset split after preprocessing for the pedestrian detection task.

Subset	Images
Training Set	5,366
Validation Set	520
Test Set	1,197
Total	7,083

This refined dataset enabled the model to learn pedestrian-specific features from diverse aerial viewpoints, improving generalization under real-world UAS conditions.
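The class filtering and format conversion described above can be sketched as follows. The sketch assumes the public VisDrone detection annotation layout, in which each line reads `left,top,width,height,score,category,truncation,occlusion` with category 1 denoting pedestrian. The paper does not state which VisDrone categories were mapped to the person class, nor that this routine was used (dataset preparation was done with Roboflow), so the function name and the `keep_categories` default are illustrative assumptions.

```python
from typing import FrozenSet, Iterable, List


def visdrone_to_yolo_person(
    lines: Iterable[str],
    img_width: int,
    img_height: int,
    keep_categories: FrozenSet[int] = frozenset({1}),  # assumption: 1 = pedestrian
) -> List[str]:
    """Convert VisDrone annotation lines to single-class YOLO label lines."""
    labels = []
    for line in lines:
        fields = line.strip().rstrip(",").split(",")
        if len(fields) < 6:
            continue  # skip malformed or empty lines
        left, top, width, height = (float(v) for v in fields[:4])
        category = int(fields[5])
        if category not in keep_categories or width <= 0 or height <= 0:
            continue  # drop every non-person category and degenerate boxes
        # YOLO format: class id, then box centre and size normalized to [0, 1].
        cx = (left + width / 2) / img_width
        cy = (top + height / 2) / img_height
        labels.append(
            f"0 {cx:.6f} {cy:.6f} {width / img_width:.6f} {height / img_height:.6f}"
        )
    return labels
```

Passing `keep_categories=frozenset({1, 2})` would additionally merge the VisDrone "people" category into the same class, which is a design choice the paper leaves open.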

Model training pipeline

The YOLOv12s model was trained on the VisDrone pedestrian subset¹⁷ using an NVIDIA GeForce RTX 4070 Laptop GPU. Training ran for 287 epochs with early stopping, and the best checkpoint was obtained at epoch 187. Table 3 summarizes the key training configuration.

Table 3.

Training configuration and results for the YOLOv12s detector.

Parameter	Value
Model	YOLOv12s
Dataset	VisDrone (pedestrian)
Image size	640 $\times$ 640
Batch size	16
Epochs/best epoch	287 / 187
Learning rate	0.01
Hardware	RTX 4070 Laptop GPU
Training time	$\sim$ 44 hours
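The relationship between the 287 epochs run and the best checkpoint at epoch 187 is what patience-based early stopping would produce with a patience of 100 epochs. That patience value is an inference from the reported numbers, not a setting stated in the paper. The sketch below uses a hypothetical `early_stop` helper and a synthetic fitness curve (not the paper's training data) to show the mechanism.

```python
from typing import Sequence, Tuple


def early_stop(fitness_per_epoch: Sequence[float], patience: int) -> Tuple[int, int]:
    """Return (epochs_run, best_epoch), both 1-indexed."""
    best_fitness = float("-inf")
    best_epoch = 0
    for epoch, fitness in enumerate(fitness_per_epoch, start=1):
        if fitness > best_fitness:
            best_fitness, best_epoch = fitness, epoch
        # Stop once `patience` epochs have passed without a new best.
        if epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(fitness_per_epoch), best_epoch


# Synthetic curve: improves until epoch 187, then never improves again.
curve = list(range(1, 188)) + [0] * 313
print(early_stop(curve, patience=100))  # (287, 187)
```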

Standard YOLO augmentations (scaling, flips, and mosaic) were applied to improve robustness to varying aerial conditions. Figure 9 shows the convergence of major loss terms, all exhibiting smooth downward trends that indicate stable optimization.

Figure 9.

Training loss convergence curves for the YOLOv12s model.

Model evaluation used precision–recall analysis; the curve in Figure 10 yields an mAP@0.5 of 0.529, indicating that the detector generalizes across diverse aerial viewpoints.

Figure 10.

Precision–recall curve.
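For readers unfamiliar with the metric, each point on a precision–recall curve at an IoU threshold of 0.5 can be computed as sketched below; sweeping the confidence cut-off traces the curve, and the area under it is the average precision (which equals mAP@0.5 when there is a single class). This is a simplified greedy matcher written for illustration, with hypothetical function names; it is not the evaluation code used in the study.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    preds: Sequence[Tuple[Box, float]],
    truths: Sequence[Box],
    conf: float,
    iou_thr: float = 0.5,
) -> Tuple[float, float]:
    """Precision and recall for one image at a given confidence cut-off."""
    kept = sorted((p for p in preds if p[1] >= conf), key=lambda p: p[1], reverse=True)
    unmatched = list(truths)
    true_pos = 0
    for box, _ in kept:
        # Greedily match each prediction to its best remaining ground-truth box.
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= iou_thr:
            unmatched.remove(best)
            true_pos += 1
    precision = true_pos / len(kept) if kept else 1.0    # convention when nothing is predicted
    recall = true_pos / len(truths) if truths else 1.0   # convention when nothing is annotated
    return precision, recall
```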

Comparison with other detection architectures

To contextualize the performance of YOLOv12s, Table 4 compares its results against several established detectors evaluated on the VisDrone pedestrian detection task. Published results for Faster R-CNN,¹⁸ FCOS,¹⁹ and RT-DETR²⁰ are included alongside the YOLOv12s results obtained in this study.

Table 4.

Comparison of detection architectures on VisDrone pedestrian subset.

Detector	mAP@0.5	Params (M)	FPS
Faster R-CNN¹⁸	0.291	41.1	12
FCOS¹⁹	0.312	32.0	21
RT-DETR-L²⁰	0.481	32.0	114
YOLOv12s (Ours)	0.529	9.3	35 $^{†}$

Published results are cited from respective works; YOLOv12s results are from this study. $^{†}$ Measured on laptop GPU during live UAS feed inference.

YOLOv12s achieves the highest mAP@0.5 among the compared detectors while maintaining the smallest parameter footprint. Although RT-DETR-L reports higher theoretical throughput on server-grade hardware, YOLOv12s offers a favorable trade-off for resource-constrained UAS deployments where model size and memory are limiting factors.

The normalized confusion matrices in Figure 11 illustrate the model’s performance across all data splits. The training set achieves 91% pedestrian detection accuracy, while the validation and test sets achieve 86% and 84% accuracy, respectively. The modest decrease in accuracy from training to test set indicates that the model generalizes well without significant overfitting, maintaining strong separation between pedestrian and background classes across all evaluation scenarios.

Figure 11.

Normalized confusion matrices showing YOLOv12s pedestrian detection performance across data splits: (a) training set (91% accuracy), (b) validation set (86% accuracy), and (c) test set (84% accuracy). The slight decrease from training to test accuracy indicates good generalization without significant overfitting.

Results and conclusion

Field testing and real-world evaluation

To assess real-world generalization beyond the VisDrone benchmark, the trained YOLOv12s model was tested on aerial RGB footage collected using a Skydio X10 in environments such as rooftops, walkways, and open plazas. These scenes varied in altitude, scale, and lighting, and none were included in the training data. Representative qualitative results are shown in Figure 12.

Figure 12.

Qualitative field results of the trained YOLOv12s pedestrian detector on unseen Skydio X10 UAS imagery (confidence threshold $\geq$ 0.75; all displayed detections have confidence scores at or above this threshold): (a) detection over a university walkway with eight visible pedestrians, (b) pedestrian detection near a building entrance, (c) top-down perspective under reduced contrast and shadow, and (d) mid-altitude view showing stability across height variations.

Across all locations, the model accurately detected pedestrians with stable bounding boxes and counts that closely matched ground truth. Its consistent performance under changing illumination and mild occlusions demonstrates suitability for real-time aerial surveillance, search-and-rescue, and crowd monitoring tasks.

For real-time deployment, the Skydio X10 remote controller was connected to a laptop running the trained YOLOv12s model over the live video feed. During inference, the system sustained a frame rate of approximately 30–45 FPS on an NVIDIA RTX 4070 Laptop GPU, with all detections filtered at a confidence threshold of 0.70. Table 5 summarizes the deployment configuration.

Table 5.

Real-time deployment metrics during live UAS feed inference.

Parameter	Value
Inference hardware	NVIDIA RTX 4070 Laptop GPU
Input resolution	640 $\times$ 640
Frame rate (live feed)	30–45 FPS
Confidence threshold	0.70
Model parameters	9.3 M
Model FLOPs	21.5 G
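A frame-rate figure such as the one in Table 5 is typically obtained by timing inference over a run of frames. The sketch below shows one way to do this; the `measure_fps` helper and its injectable `clock` parameter are hypothetical and are not taken from the authors' code. If the frame source blocks while waiting for the video feed, the result reflects end-to-end throughput and not pure model latency.

```python
import time
from typing import Callable, Iterable


def measure_fps(
    frames: Iterable[object],
    infer: Callable[[object], object],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Average frames per second over one pass through `frames`."""
    start = clock()
    processed = 0
    for frame in frames:
        infer(frame)  # model inference (and any post-processing) happens here
        processed += 1
    elapsed = clock() - start
    return processed / elapsed if elapsed > 0 else float("inf")
```

Making the clock injectable keeps the function easy to check with a deterministic timer, while the default uses the high-resolution wall clock.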

Altitude stress testing and detection robustness

To assess robustness across different aerial viewpoints, an altitude stress test was performed using the Skydio X10 at 10 m, 20 m, 30 m, 40 m, and 50 m. A confidence threshold of 0.75 was applied during inference, so every reported detection has a confidence score of at least 0.75. Higher altitudes introduced smaller pedestrian scales and stronger shadow effects. Representative outputs are shown in Figure 13, and the corresponding detection results are summarized in Table 6.

Figure 13.

Altitude stress-testing of the custom-trained YOLOv12s pedestrian detector at increasing heights (confidence threshold $\geq$ 0.75; all displayed detections have confidence scores at or above this threshold): (a) 10 m (6/6 correct), (b) 20 m (7/7 correct), (c) 30 m (18/18 correct), (d) 40 m (18/18 correct), and (e) 50 m (22/24 correct; minor shadow misdetections).

Table 6.

Trained YOLOv12s pedestrian detection performance at varying flight altitudes as shown in Figure 13.

Altitude (m)	Ground truth	Predicted	Remarks
10	6	6	Perfect detection
20	7	7	Perfect detection
30	18	18	Perfect detection
40	18	18	Perfect detection
50	24	22	Missed two; minor shadow confusion

The trained YOLOv12s model maintained perfect detection performance up to 40 m. At 50 m, it missed two pedestrians due to reduced pixel density and shadow interference. Overall, the model demonstrated strong robustness across typical UAS operating heights, with mild degradation only at higher altitudes.
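The counts in Table 6 can be summarized as a per-altitude and overall detection ratio. The short calculation below uses only the values from the table and treats the predicted count as the number of correct detections, as the Figure 13 caption does; the variable names are illustrative.

```python
# Altitude (m) -> (ground-truth pedestrians, correctly detected pedestrians), from Table 6.
altitude_counts = {10: (6, 6), 20: (7, 7), 30: (18, 18), 40: (18, 18), 50: (24, 22)}

per_altitude = {
    altitude: detected / truth for altitude, (truth, detected) in altitude_counts.items()
}
total_truth = sum(truth for truth, _ in altitude_counts.values())
total_detected = sum(detected for _, detected in altitude_counts.values())
overall = total_detected / total_truth

print(round(per_altitude[50], 3))            # 0.917 (22 of 24 at 50 m)
print(total_detected, total_truth)           # 71 73
print(round(overall, 3))                     # 0.973 across all five altitudes
```

These ratios describe a single representative frame per altitude, so they indicate a trend and should not be read as statistically robust recall estimates.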

Crowd density stress testing

In addition to altitude variation, the detector was evaluated under high-density crowd conditions to assess robustness in scenarios with significant pedestrian overlap and occlusion. Figure 14 shows a representative example of the model maintaining stable detection performance despite extreme visual clutter.

Figure 14.

Crowd density stress test: custom-trained YOLOv12s pedestrian detector applied to a high-occlusion, densely populated outdoor event scenario (119 detections).

Limitations and future directions

Although the trained YOLOv12s detector performed well across most aerial settings, its accuracy declined at 50 m, the highest altitude tested, due to smaller pedestrian scales and shadow ambiguity. Environmental factors, including uneven lighting, reflections, and partial occlusions, also introduced occasional errors. In addition, the model was trained on a single RGB dataset and tested mainly in daylight, limiting its robustness in low-light conditions.
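The effect of altitude on pedestrian scale can be approximated with simple pinhole geometry. The sketch below makes several assumptions that the paper does not state: a nadir (straight-down) view, the 93° field of view of the wide RGB camera in Table 1 applied to the horizontal image axis, the 640-pixel network input width from Table 3, and a hypothetical 0.5 m shoulder width. The function names are illustrative.

```python
import math


def ground_sample_distance(altitude_m: float, fov_deg: float, image_width_px: int) -> float:
    """Metres of ground covered by one pixel for a nadir-pointing camera."""
    footprint_m = 2.0 * altitude_m * math.tan(math.radians(fov_deg) / 2.0)
    return footprint_m / image_width_px


def object_extent_px(
    object_size_m: float,
    altitude_m: float,
    fov_deg: float = 93.0,      # assumption: wide RGB camera, horizontal FOV
    image_width_px: int = 640,  # assumption: network input width
) -> float:
    """Approximate width in pixels of an object seen from directly above."""
    return object_size_m / ground_sample_distance(altitude_m, fov_deg, image_width_px)


for altitude in (10, 20, 30, 40, 50):
    print(altitude, round(object_extent_px(0.5, altitude), 1))
```

Under these assumptions a 0.5 m wide target spans roughly 15 pixels at 10 m but only about 3 pixels at 50 m, a fivefold reduction that is consistent with the degradation observed at the highest altitude. Oblique viewing angles or a narrower lens would change the absolute numbers but not the inverse relationship with altitude.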

Future improvements will explore multi-modal sensing with RGB–thermal fusion, training larger YOLOv12 variants for better altitude and lighting generalization, and incorporating temporal cues through lightweight tracking to enable smoother real-time aerial monitoring.

Conclusion

This study presented a real-time pedestrian detection framework using the YOLOv12s model for aerial imagery captured by UAS platforms. Trained on a refined VisDrone subset and tested across diverse real-world settings, the detector maintained strong accuracy up to 40 m altitude, with only minor degradation at higher elevations due to reduced object scale and shadow effects. Despite these challenges, the system preserved stable performance and real-time inference on lightweight hardware.

Overall, the trained YOLOv12s model offers an effective balance between speed, accuracy, and efficiency for UAS-based perception, making it suitable for applications such as crowd monitoring, autonomous surveillance, and search-and-rescue. Future work will investigate multi-sensor fusion, temporal consistency methods, and fine-tuning of larger YOLOv12 variants to improve robustness in more complex aerial environments.

Footnotes

Acknowledgments

The research was funded through the Kennesaw State University Office of Research. The authors acknowledge the use of the Roboflow platform for dataset preparation and annotation, and the Ultralytics framework for model training and evaluation. AI-based writing assistance tools were used to improve grammar, clarity, and formatting during manuscript preparation. All technical content, results, and interpretations remain the sole responsibility of the authors.

ORCID iDs

Owais Ahmed

Adeel Khalid

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research was funded through the Kennesaw State University Office of Research.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Buchelt, Adrowitzer, Kieseberg, et al. Exploring artificial intelligence for applications of drones in forest ecology and management. Forest Ecol Manag 2024; 551: 121530.

2. Hegde, Ahmed, Nalband. The importance of data annotation for autonomous drone navigation. In: Third international conference on trends in electrical, electronics, and computer engineering (TEECCON), 2024, pp.117–122. Bangalore, India.

3. Kaufmann, Bauersfeld, Loquercio, et al. Champion-level drone racing using deep reinforcement learning. Nature 2023; 620: 982–987.

4. Jung, Choi. Improved YOLOv5: efficient object detection using drone images under various conditions. Appl Sci 2022; 12: 7255.

5. Piponidis, Theocharides. Dynamic CNN parameter exploration for multi-altitude UAV object detection. In: 11th International conference on control, automation and robotics (ICCAR), 2025, pp.510–515.

6. Zhu, Lyu, Wang, et al. TPH-YOLOv5: improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In: IEEE/CVF international conference on computer vision workshops (ICCVW), 2021, pp.2778–2788.

7. Zhu, Wen, et al. VisDrone-DET2019: the vision meets drone object detection in image challenge results. In: IEEE/CVF international conference on computer vision workshops (ICCVW), 2019, pp.213–226.

8. Kunhoth, Alfadhli, Al-Maadeed. Optimizing high-altitude UAV object detection with deep learning. In: IEEE 21st international conference on smart communities (HONET), 2024, pp.103–108.

9. Shekhar, Ajay, Kumar, et al. The object identification and classification methods in a class of objects using AI-based supervised and unsupervised training algorithms. Grenze Int J Eng Technol 2024; 10: 329–334.

10. Sujith, Ganapathy, Sathya Prasanna, et al. Advancing underwater object identification using aerial hyperspectral imaging. In: International conference on power, energy, control and transmission systems (ICPECTS), 2024, pp.1–6.

11. Papaioannidis, Mademlis, Pitas. Autonomous UAV safety by visual human crowd detection using multi-task deep neural networks. In: IEEE international conference on robotics and automation (ICRA), 2021.

12. Iftikhar, Zhang, Asim, et al. Deep learning-based pedestrian detection in autonomous vehicles: substantial issues and challenges. Electronics 2022; 11: 3551.

13. Gomes, Redinha, Lavado, et al. Counting people and bicycles in real time using YOLO on Jetson nano. Energies 2022; 15: 8816.

14. Zuehlke, Prabhakar, Clark, et al. Vision-based object detection and proportional navigation for UAS collision avoidance. In: AIAA Scitech 2019 Forum, 2019.

15. Chen, Juang. YOLOv4 object detection model for nondestructive radiographic testing in aviation maintenance tasks. AIAA J 2021; 60: 1–6.

16. Tian, Doermann. YOLOv12: attention-centric real-time object detectors. arXiv preprint arXiv:2502.12524, 2025.

17. Zhu, Wen, et al. Detection and tracking meet drones challenge. IEEE Trans Pattern Anal Mach Intell 2022; 44: 7380–7399.

18. Ren, Girshick, et al. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 2017; 39: 1137–1149.

19. Tian, Shen, Chen, et al. FCOS: fully convolutional one-stage object detection. In: IEEE/CVF international conference on computer vision (ICCV), 2019, pp.9627–9636.

20. Zhao, et al. DETRs beat YOLOs on real-time object detection. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2024, pp.16965–16974.