Abstract
In high-frame-rate human–computer interaction and mobile-perception scenarios, single-frame human action recognition must meet stringent latency and accuracy constraints. To tackle spatial feature entanglement, multiscale fragmentation, and edge-deployment inefficiency, this study proposes YOLO11-AN (Action Net), a lightweight detector that couples a C3K2-DMAF dynamic multiscale fusion block, a dual-branch AUX head, an MPDIoU regression loss, and a LocalWindowAttention module. Comprehensive evaluations on Pascal VOC 2012, UCF101, and HMDB51 show that YOLO11-AN attains 0.537 mAP50 on VOC—an absolute gain of 1.7 percentage points over the YOLO11 baseline—while maintaining an inter-seed variance below 0.001. Against peer-reviewed baselines (YOLOv8-n, PP-YOLOE-Tiny, and RT-DETR-R18), it offers the best accuracy–compute tradeoff, and after INT8 quantization sustains 15.8 FPS on a 4 GB Jetson Orin Nano, validating its suitability for real-time low-power deployments.
Introduction
In the era of deeply integrated intelligent perception and mobile computing, human motion recognition has become a fundamental component of human–computer interaction systems. By analyzing spatial characteristics of limb postures, this technology enables gait tracking and rehabilitation assessment in medical applications, offering quantitative support for remote diagnostics. In intelligent surveillance, abnormal behavior recognition algorithms detect events such as falls or intrusions in real time, enhancing emergency response capacity. In sports analysis, three-dimensional-pose-based motion recognition systems capture subtle technical movements to assist training optimization. These applications demonstrate that human action understanding is reshaping multidomain interaction paradigms and driving the evolution of intelligent services toward more natural and ubiquitous deployment.
This study focuses on single-frame human action recognition, where only individual images—rather than video sequences—are used to infer human motion categories. In this work, the problem class is instantiated on the human-action subset of Pascal VOC 2012, where each person instance is annotated with one of 11 action categories: phoning, playinginstrument, reading, ridingbike, ridinghorse, running, takingphoto, usingcomputer, walking, jumping, and other. These categories cover device-mediated actions (phoning, takingphoto, and usingcomputer), locomotion and sport-related behaviors (walking, running, jumping, ridingbike, and ridinghorse), and more posture-centric activities (reading, playinginstrument, and other). In our setting, each human bounding box is assigned exactly one of these action labels, so the task is defined as single-frame, per-person detection and action classification. This task addresses scenarios where video streams are unavailable or computational resources are constrained, such as low-power embedded systems, real-time edge analytics, or frame-level detection pipelines. Although practical and highly deployable, the task remains challenged by multiple technical constraints. High-performance models require extensive hardware support to maintain accuracy, hindering real-time applicability on resource-limited devices (Gao et al., 2021). Recognition methods based on red–green–blue input are sensitive to lighting or background variation, reducing robustness (Zhou et al., 2024). Furthermore, the extraction of discriminative spatial patterns from complex actions often sacrifices fine-grained information for computational efficiency, especially when using reduced resolution or simplified architectures. This precision (P)–efficiency tradeoff significantly limits the usability of human action models on edge devices such as mobile or embedded platforms (Wang et al., 2024).
To address these limitations, this study develops a lightweight single-frame human action recognition framework grounded in the YOLO11 (Khanam & Hussain, 2024) architecture. Compared with two-stage detectors such as Faster R-CNN, which require region proposal generation and repeated feature extraction, YOLO11 employs a fully one-stage paradigm with global perception and end-to-end inference, substantially reducing computational overhead. Relative to a single-shot detector (SSD; Liu et al., 2016), another representative one-stage detector, YOLO11 further enhances feature interaction across scales through its cross-scale fusion design, enabling more effective modeling of complex spatial cues while maintaining model compactness. These architectural characteristics, together with quantization and hardware-level acceleration, support millisecond-level inference under low-power conditions and provide a practical baseline for building distributed, human-centric perception systems.
Building upon this foundation, this study proposes an enhanced framework referred to as YOLO11-AN, which introduces several targeted improvements tailored for single-frame human action detection. A C3K2-DMAF dynamic multiscale fusion module is incorporated to better decouple global pose patterns from local joint-sensitive features under lightweight constraints. A dual-branch AUX detection head is designed to coordinate the optimization of action classification and bounding box regression, improving performance in scenarios where the two tasks exhibit conflicting gradients. An MPDIoU regression loss, together with a LocalWindowAttention mechanism, is integrated to stabilize localization and refine local–global spatial dependency modeling at modest computational cost.
These components collectively extend YOLO11 into a more action-aware architecture, forming the complete YOLO11-AN system.
Related Works
In the evolution of object detection algorithms, Faster R-CNN significantly improved object localization accuracy in complex scenes by introducing a two-stage architecture comprising a region proposal network and a detection network. By first generating candidate regions and then performing fine-grained classification and regression, it handles large variations in target scale effectively. However, this cascaded computation requires the model to maintain two feature extraction stages, resulting in considerable redundant computation. In particular, when processing video-stream data, frame-by-frame region proposal leads to a significant increase in memory usage and a degradation of real-time performance, severely limiting its application potential on resource-constrained devices (Ren et al., 2016).
As a representative single-stage detector, the SSD algorithm achieves joint localization of multiple targets while maintaining high detection speed through preset multiscale anchors and a feature-pyramid fusion mechanism. Its single forward pass avoids the cascaded computation of Faster R-CNN, making it better suited to continuous action sequences. However, the architecture must maintain detection heads at multiple scales in parallel to cover targets of different sizes, so the number of classification-related parameters grows in proportion to the number of detection categories. In embedded deployment scenarios, such multihead structures amplify memory-bandwidth and power-consumption pressure, placing higher demands on system energy efficiency (Liu et al., 2016).
The YOLO series innovatively achieves a balance between accuracy and speed by reconstructing object detection as a regression problem (Lin et al., 2017). YOLO11, as the state-of-the-art model of this series, inherits the advantages of a single-stage detection paradigm and innovatively introduces a dynamic convolution kernel configuration and a cross-scale feature interaction module. YOLO11 represents a recent advancement within the Ultralytics YOLO family. It refines the backbone with more efficient feature extraction units based on CSPDarknet and enhances the neck with an optimized multiscale fusion structure, allowing the model to strengthen spatial representation under lightweight computational constraints. Compared with earlier variants such as YOLOv5, YOLOv7, and YOLOv8, YOLO11 achieves a more favorable balance between accuracy and model complexity through streamlined feature pathways and reduced parameter overhead. These characteristics—lower parameter count, reduced computational cost, and faster inference—align well with the practical requirements of this study, which emphasizes real-time processing and edge-oriented deployment. As a result, YOLO11 provides a suitable and efficient foundation for the proposed framework. By implementing a layered parameter reuse strategy to compress the depth of the computational graph, the model’s dependence on hardware computing power is effectively reduced. Its end-to-end detection feature avoids the problem of feature duplication extraction in two-stage algorithms, and combined with adaptive quantization technology, weight quantization, and hardware instruction set optimization can be achieved, significantly improving inference efficiency on edge devices. These features make it the most promising lightweight action recognition solution currently available (Khanam & Hussain, 2024).
In the field of human motion image recognition, extracting complex pose features is the core challenge. The deformable convolutional network (DCNv3) proposed by Wang et al. significantly enhances the feature adaptation ability in occluded scenes by dynamically adjusting the sampling positions of convolutional kernels, and improves the localization accuracy by 3.8% in COCO human keypoint detection tasks. However, this method lacks a spatial dimension feature decoupling mechanism when dealing with multi-angle poses, resulting in increased angle estimation errors and seriously affecting the performance of fine-grained action classification (Wang et al., 2023).
The adaptive feature pyramid network proposed by Wang's team optimizes feature interaction through bidirectional cross-scale connections, yielding a 6.5% increase in small-target recall (R) in industrial part-detection tasks. However, human motion recognition requires modeling both the full-body posture contour and local joint details, and the method's fixed-level fusion strategy struggles to dynamically allocate weights to features at different scales, increasing the loss of fine-grained motion features (Wang & Zhong, 2021).
The coordinate attention proposed by Hou et al. reduces computational complexity by 45% in ImageNet classification tasks by decomposing channel attention into spatial direction encoding. However, in static images, the discrimination of action types highly depends on the spatial correlation of key body regions, and its fixed direction encoding mechanism cannot adaptively focus on discriminative regions, which can easily lead to the problem of missing key features and increase the misclassification rate (Hou et al., 2021).
The semi-decoupled head proposed by Han et al. balances detection accuracy and inference speed on the MS COCO dataset by separating target classification from bounding-box regression. However, the strong coupling between action categories and posture coordinates requires the model to capture interactions between the two tasks. A completely decoupled, independently optimized strategy disconnects the feature representations of classification and regression, weakening the dynamic correlation of target context and increasing the false detection rate in complex backgrounds or dense small-target scenes (Han et al., 2024).
In summary, the current technological bottlenecks in human motion image recognition are mainly reflected in three aspects: (a) the coupling of spatial features causes interference between global pose information and fine-grained local joint cues, weakening the discriminability of complex actions; (b) scale-feature fusion relies on static weight-allocation strategies that cannot dynamically adapt to the semantic correlations of different actions, leading to the loss of fine-scale features; and (c) attention mechanisms are limited by fixed windows or preset encoding modes and lack the ability to adaptively focus on discriminative regions, so key features are omitted under background interference.
These defects collectively constrain the accuracy and robustness of human motion image recognition models, and there is an urgent need to improve and optimize them through dynamic feature decoupling, cross-scale interaction of semantic perception, adaptive attention optimization, and other methods.
In response to the core challenges of spatial feature coupling, insufficient multiscale interaction, and low edge-deployment efficiency in human motion image recognition, this study proposes an improved model, YOLO11-AN (ActionNet), based on the YOLO11 architecture. This model achieves multilevel optimization through modular innovation: introducing the C3K2-DMAF module in the feature extraction stage to enhance spatial feature decoupling and multiscale semantic perception in complex action recognition tasks; designing a dual-branch AUX detection head architecture that balances classification accuracy and localization robustness through collaborative optimization of primary and auxiliary tasks; adopting the MPDIoU (Ma & Xu, 2023) loss function to improve the bounding-box regression strategy and alleviate localization bias under complex poses; and adopting LocalWindowAttention to preserve local dependencies while effectively capturing long-range dependencies, improving both computational efficiency and model performance. The proposed modules remain structurally compatible, as each operates at a distinct functional stage and preserves the spatial–channel tensor interfaces, ensuring unobstructed feature flow and seamless integration across the architecture. Subsequent sections systematically elaborate on the design principles and implementation of each improved module, elucidating the mechanisms by which the aforementioned technical bottlenecks are addressed. For clarity, the key mathematical symbols and notations used throughout this section are summarized in Appendix A (Table A1). The network architecture of YOLO11-AN is shown in Figure 1.

Figure 1. YOLO11-AN network architecture diagram.
There is always a tradeoff between the feature expression ability and computational efficiency of convolutional neural networks. Traditional multiscale feature fusion methods rely on fixed multibranch structures, making them difficult to adapt to dynamic scenes. To address the issue of spatial feature coupling in human motion recognition, this paper proposes a dynamic multiscale adaptive fusion module based on the C3K2 design, termed C3K2-DMAF. The module adopts a parallel–serial hybrid architecture that enhances spatial feature decoupling through dynamic multiscale decomposition, adaptive channel extension, hybrid attention collaborative optimization, and cross-stage feature reuse, thereby meeting the requirements of both dynamism and lightweight deployment. Although multiscale fusion improves the modeling of heterogeneous posture cues, it naturally increases intermediate feature aggregation and memory-access overhead, which must be carefully controlled for edge-side deployment. The proposed C3K2-DMAF module is therefore designed to retain the representational benefits of multiscale processing while constraining the computational footprint within a lightweight budget. A schematic diagram of the C3K2-DMAF structure is shown in Figure 2.

Figure 2. Schematic diagram of C3K2-DMAF module structure.
By deploying parallel branches with different receptive fields, the dynamic multiscale decomposition extracts heterogeneous pose cues that are subsequently aggregated under learned weights. The dynamic weight generation network takes the global average pooled vector as input and, through two fully connected layers, outputs the scalar branch fusion weights. Among them, the two fully connected layers compress and re-expand the pooled channel descriptor, and the resulting weights determine how the outputs of the parallel branches are combined into a single fused feature map by weighted aggregation.
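As a concrete illustration, the following PyTorch sketch shows one plausible form of the dynamic weight generation described above, assuming an SE-style design in which the pooled descriptor passes through two fully connected layers and a Softmax to produce per-branch weights; the class name, reduction ratio, and activation are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DynamicBranchFusion(nn.Module):
    """Sketch of a dynamic weight generator for two multiscale branches.

    Assumption: fusion weights are produced from a global-average-pooled
    descriptor via two fully connected layers followed by Softmax, then used
    to blend the branch outputs F1 and F2.
    """

    def __init__(self, channels: int, num_branches: int = 2, reduction: int = 4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                   # global context descriptor
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),      # first FC layer (W1)
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, num_branches),  # second FC layer (W2)
        )

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        z = self.gap(f1 + f2).flatten(1)                     # (B, C) pooled vector
        w = torch.softmax(self.fc(z), dim=1)                 # (B, 2) branch weights
        w1 = w[:, 0].view(-1, 1, 1, 1)
        w2 = w[:, 1].view(-1, 1, 1, 1)
        return w1 * f1 + w2 * f2                             # weighted aggregation


# Example usage with two same-shaped branch outputs.
fusion = DynamicBranchFusion(channels=256)
out = fusion(torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40))
```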
The adaptive channel extension mechanism addresses the core contradiction between the efficiency of feature expression and the imbalanced allocation of computing resources in human action recognition. By dynamically perceiving the complexity of the input features, it elastically adjusts the channel dimension: a lightweight prediction network analyzes the input feature statistics and generates a real-time channel expansion coefficient. Among them, a learned weight matrix controls the channel expansion, and the predicted expansion ratio is constrained to the range 1.0–3.0 so that representational capacity grows only when the input is sufficiently complex. The expanded channels are produced with ghost convolution, whose expansion factor and kernel size jointly bound the additional floating-point operations, keeping the extra cost within the lightweight budget.
The hybrid attention collaborative optimization mechanism alleviates the blurring of discriminative regions and the interference of background noise in human action recognition through cascaded channel–spatial dual-dimensional feature calibration. Before introducing the formulation of the channel-attention weighting, we first verify the necessity of asymmetric pooling through an independent ablation experiment. This ensures that the subsequent design choice is empirically grounded.
To examine the effect of different average–max pooling ratios, we conducted a controlled ablation study on the YOLO11-AN architecture, modifying only the pooling weights while keeping all other configurations unchanged. As shown in Table 1, average pooling alone and max pooling alone yield inferior performance; equal weighting (0.5/0.5) achieves only marginal improvement. Assigning a slightly larger weight to average pooling consistently enhances mAP50, with the proposed 0.6/0.4 configuration achieving the best performance. This confirms that global contextual statistics captured by average pooling and strong local activations emphasized by max pooling jointly contribute to stable channel attention, and that moderate asymmetry provides the most effective balance.
Based on these experimental observations, the input feature is first compressed along the spatial dimension by a weighted combination of average pooling (weight 0.6) and max pooling (weight 0.4); the aggregated channel descriptor is then mapped to channel-attention weights that recalibrate the input feature before the spatial-attention branch.
Table 1. Ablation Study on Pooling Weights.
Channel and spatial attention outputs are finally fused through element-wise multiplication:
This cascaded design balances global semantic characterization and local spatial selectivity, offering a robust mechanism for human micro-action recognition in dense and dynamic scenes.
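The following PyTorch sketch illustrates one way the cascaded channel–spatial attention could be realized, assuming a CBAM-like layout in which the channel descriptor is a 0.6/0.4 weighted combination of average and max pooling; the layer sizes, reduction ratio, and names are assumptions rather than the paper's verified code.

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    """Sketch of channel attention (0.6 avg / 0.4 max pooling) cascaded with
    spatial attention; the two calibrations are applied multiplicatively."""

    def __init__(self, channels: int, reduction: int = 8,
                 avg_w: float = 0.6, max_w: float = 0.4):
        super().__init__()
        self.avg_w, self.max_w = avg_w, max_w
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # spatial attention map

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention from the asymmetric pooling descriptor (0.6 avg + 0.4 max).
        avg = x.mean(dim=(2, 3))
        mx = x.amax(dim=(2, 3))
        ca = torch.sigmoid(self.mlp(self.avg_w * avg + self.max_w * mx)).view(b, c, 1, 1)
        x = x * ca
        # Spatial attention computed on the channel-refined feature.
        sa_in = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        sa = torch.sigmoid(self.spatial(sa_in))
        return x * sa  # element-wise fusion of the two calibrations
```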
The cross-stage feature reuse module optimizes the information flow between feature levels through residual connections and a dynamic regularization strategy, mitigating gradient attenuation and feature degradation in deep networks. Assuming the input of the module at a given stage is the feature map from the preceding layer, the output is formed by adding a residual connection to a nonlinear transformation of that input, with dropout applied to the transformed branch for regularization. Among them, the dropout probability is not fixed: it starts from an initial rate and is scaled by an exponential decay coefficient, so the regularization strength is adjusted across stages rather than applied uniformly. The exponential decay coefficient therefore controls how quickly the dropout probability decreases from its initial value; its effect is analyzed in Table 2.
Table 2. Effect of the Dropout Decay Factor.
The bold values indicate the best-performing results within each column.
The results demonstrate that
When the input channel
This module provides a highly robust feature representation foundation for human action recognition in complex dynamic scenes through hierarchical feature reuse and geometric constraint enhancement.
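A minimal sketch of the cross-stage reuse idea is given below, assuming a residual connection around a convolutional transform with a dropout rate that decays exponentially with the stage index; the specific schedule p = p0 · gamma^stage, the default values, and the layer choices are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class CrossStageReuse(nn.Module):
    """Sketch of cross-stage feature reuse: a residual connection around a
    nonlinear transform, regularized by dropout whose rate decays
    exponentially with the stage index (assumed schedule: p = p0 * gamma ** stage)."""

    def __init__(self, channels: int, stage: int, p0: float = 0.2, gamma: float = 0.9):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
        )
        self.drop = nn.Dropout2d(p=p0 * gamma ** stage)  # stage-dependent dropout rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.drop(self.transform(x))          # residual feature reuse
```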
In human action recognition tasks, there is often a contradiction between the classification accuracy of complex actions and the robustness of their positions. Traditional object detection networks often rely on optimizing a single task, resulting in a lack of balance between classification and localization tasks in the same network, especially when dealing with diverse and dynamic action scenes, which often affects accuracy. Therefore, this study proposes a dual-branch AUX detection head architecture (Jin et al., 2020), aiming to balance classification accuracy and localization robustness through collaborative optimization of main and auxiliary tasks.
The design of a dual-branch architecture can avoid interference between tasks. Classification and localization tasks are optimized separately through independent branches, avoiding common task coupling problems. This design also makes the model more efficient in handling multiple tasks and reduces the waste of computing resources.
The schematic diagram of the dual-branch AUX detection head structure is shown in Figure 3.

Figure 3. Schematic diagram of dual-branch AUX detection head structure.
The dual-branch AUX architecture designs two independent branches based on the detection head of YOLO11. One is used for human action classification tasks (main task), and the other is used for localization tasks (auxiliary task). These two tasks are trained in parallel and complement each other, achieving a good balance between classification accuracy and localization accuracy in the model. The main task branch focuses on fine-grained action category judgment, while the auxiliary task branch enhances the accuracy of bounding box regression, ultimately achieving overall performance improvement through fusion.
The main branch focuses on the classification label recognition of actions, while the auxiliary branch focuses on the accuracy of bounding box positions for human posture. Through this dual task collaborative optimization, small changes in actions can be accurately captured, and the robustness of the model in complex backgrounds can be improved.
Calculation Process of Dual-Branch Architecture
In the dual-branch AUX architecture, the classification task and the regression task are computed through independent branches, and their losses are weighted and summed by a comprehensive optimization mechanism so that each task contributes to the final performance of the model. Assuming the input feature of the model is the multiscale feature map delivered to the detection head, it is fed in parallel to the classification branch and the localization branch.
In the classification branch, each target region is classified and the probability distribution over the action categories is produced. The branch normalizes the features at each position with the Softmax function to obtain the probability of each action category. Among them, the classification head applies a learned weight matrix to the input features before the Softmax, and the output is the predicted class probability distribution over the 11 action categories.
In the localization branch, bounding-box regression is performed for each target region to predict the position of the object in the image. The branch outputs the four box parameters, namely the center coordinates and the width and height of the predicted bounding box.
Finally, the loss functions of the two branches are combined through a weighted sum to balance classification accuracy and localization accuracy: L = λ_cls · L_cls + λ_reg · L_reg. Among them, L_cls and L_reg denote the classification and regression loss terms, and λ_cls and λ_reg are their corresponding weights, which control the relative contribution of each task during training.
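The weighted combination of the two branch losses can be sketched as follows, with cross-entropy standing in for the classification loss and the regression loss supplied by the bounding-box criterion; the default weights and function names are illustrative.

```python
import torch
import torch.nn as nn

def dual_branch_loss(cls_logits: torch.Tensor, cls_targets: torch.Tensor,
                     box_loss: torch.Tensor,
                     lambda_cls: float = 1.0, lambda_reg: float = 1.0) -> torch.Tensor:
    """Weighted sum of classification and localization losses:
    L = lambda_cls * L_cls + lambda_reg * L_reg (weights are illustrative)."""
    l_cls = nn.functional.cross_entropy(cls_logits, cls_targets)  # Softmax classification loss
    return lambda_cls * l_cls + lambda_reg * box_loss
```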
The bounding box regression task in human motion recognition is often affected by center misalignment, pose scale variance, and aspect ratio deformation. Classical IoU-based losses (e.g., GIoU, DIoU, and CIoU) address spatial overlap but neglect holistic geometric alignment, which results in unstable performance under diverse action poses.
To mitigate this issue, a refined loss function named MPDIoU is introduced, which augments the traditional IoU term with additional constraints on center deviation, scale mismatch, and aspect-ratio distortion. Given a predicted bounding box and its ground-truth counterpart, the loss combines the IoU term with weighted penalties on the deviation between the two box centers, the mismatch between their widths and heights, and the distortion of their aspect ratios, with the distance terms normalized by the diagonal length of the smallest enclosing box. Here, three scalar weights control the relative contributions of the center, scale, and aspect-ratio penalties.
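The sketch below implements an IoU loss of the general form described here, adding weighted center-distance, scale, and aspect-ratio penalties normalized by the enclosing-box diagonal; the exact normalization, default weights, and the form of the aspect-ratio term are assumptions and may differ from the authors' formulation.

```python
import math
import torch

def mpdiou_style_loss(pred: torch.Tensor, gt: torch.Tensor,
                      alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0,
                      eps: float = 1e-7) -> torch.Tensor:
    """IoU loss with penalties for center deviation, scale mismatch, and
    aspect-ratio distortion. Boxes are (x1, y1, x2, y2); weights are illustrative."""
    # Intersection over union.
    x1 = torch.maximum(pred[..., 0], gt[..., 0])
    y1 = torch.maximum(pred[..., 1], gt[..., 1])
    x2 = torch.minimum(pred[..., 2], gt[..., 2])
    y2 = torch.minimum(pred[..., 3], gt[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    wp, hp = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    wg, hg = gt[..., 2] - gt[..., 0], gt[..., 3] - gt[..., 1]
    union = wp * hp + wg * hg - inter + eps
    iou = inter / union
    # Squared diagonal of the smallest enclosing box (normalization term).
    cw = torch.maximum(pred[..., 2], gt[..., 2]) - torch.minimum(pred[..., 0], gt[..., 0])
    ch = torch.maximum(pred[..., 3], gt[..., 3]) - torch.minimum(pred[..., 1], gt[..., 1])
    diag2 = cw ** 2 + ch ** 2 + eps
    # Center-deviation, scale-mismatch, and aspect-ratio penalties.
    center = (((pred[..., 0] + pred[..., 2]) - (gt[..., 0] + gt[..., 2])) ** 2
              + ((pred[..., 1] + pred[..., 3]) - (gt[..., 1] + gt[..., 3])) ** 2) / (4 * diag2)
    scale = ((wp - wg) ** 2 + (hp - hg) ** 2) / diag2
    aspect = (4 / math.pi ** 2) * (torch.atan(wg / (hg + eps)) - torch.atan(wp / (hp + eps))) ** 2
    return 1 - iou + alpha * center + beta * scale + gamma * aspect
```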
The initial setting adopted uniform weights for the center, scale, and aspect-ratio penalty terms; starting from this base configuration, the weights were then tuned stepwise, as summarized in Table 3.
Table 3. Stepwise Tuning of MPDIoU Hyperparameters (Base: Uniform Weights).
The bold values indicate the best-performing results within each column.
The results indicate that
Table 4. Performance Degradation for Further Weight Tuning.
The final selected configuration is:
In human motion recognition tasks, long-distance dependency capture often brings huge computational and storage overhead, especially in high-resolution images or video streams. Traditional global self-attention modules can easily lead to a decrease in inference speed and a sharp increase in video memory usage. LocalWindowAttention significantly reduces computational complexity by partitioning local windows in spatial dimensions and performing self-attention within the windows, while preserving key contextual interactions within the local region, effectively balancing robustness and inference efficiency.
In this module, the feature map is divided into several nonoverlapping windows of a fixed spatial size, and query–key–value self-attention is computed independently within each window. By performing this calculation separately in each local window, the quadratic cost of global self-attention over the full feature map is avoided: the attention cost grows with the window area rather than with the full image resolution, while the key contextual interactions inside each local region are preserved.
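A minimal sketch of such window-partitioned attention is shown below, assuming nonoverlapping windows and a standard multi-head attention layer; the window size, head count, and the divisibility assumption on the feature-map size are illustrative choices.

```python
import torch
import torch.nn as nn

class LocalWindowAttention(nn.Module):
    """Sketch of self-attention computed independently inside nonoverlapping
    k x k windows (assumes H and W are divisible by the window size, and
    dim is divisible by the number of heads)."""

    def __init__(self, dim: int, window: int = 7, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        k = self.window
        # Partition into (B * num_windows, k*k, C) token sequences.
        t = x.reshape(b, c, h // k, k, w // k, k).permute(0, 2, 4, 3, 5, 1)
        t = t.reshape(-1, k * k, c)
        t, _ = self.attn(t, t, t)                  # attention within each window
        # Merge windows back into the feature-map layout.
        t = t.reshape(b, h // k, w // k, k, k, c).permute(0, 5, 1, 3, 2, 4)
        return t.reshape(b, c, h, w)
```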
The dataset in this experiment is sourced from the Pascal VOC 2012 public dataset related to human action recognition, with a total of 11 classification categories covering phoning, playinginstrument, reading, ridingbike, ridinghorse, running, takingphoto, usingcomputer, walking, jumping, and other. These categories span both relatively static interactions (e.g., reading and playinginstrument) and motion-intensive behaviors (e.g., running, jumping, and ridingbike), thereby providing a compact yet diverse benchmark for single-frame human motion recognition. The training set and validation set contain 2,296 and 2,292 images, respectively, covering human motion samples in multiple scenes and poses, laying the foundation for verifying the model’s generalization ability and robustness in real-world applications. The VOC2012 human motion dataset is illustrated in Figure 4.

Figure 4. Schematic diagram of the Pascal VOC 2012 human motion dataset.
The GPU used in this experiment is an RTX 4060 Laptop GPU (8,188 MiB of memory). The input image size is set to 640 × 640, the batch size to 16, and a stochastic gradient descent (SGD) optimizer is used for a total of 300 training epochs. This configuration ensures that the model fully learns action details while respecting hardware limitations, providing a consistent experimental basis for subsequent evaluation and comparison.
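Under the Ultralytics training interface, this configuration can be expressed roughly as follows; the dataset YAML file name is a placeholder, and the baseline weights stand in for the modified YOLO11-AN model definition.

```python
from ultralytics import YOLO

# Baseline YOLO11 weights; a custom YOLO11-AN model YAML would be loaded analogously.
model = YOLO("yolo11n.pt")
model.train(
    data="voc2012_action.yaml",  # assumed dataset config for the 11-class action subset
    imgsz=640,                   # input size used in the paper
    batch=16,                    # batch size used in the paper
    epochs=300,                  # total training epochs
    optimizer="SGD",             # stochastic gradient descent
)
```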
After outputting the detection results, the framework calculates precision (P), recall (R), mAP50, mAP50–95, and the F1 score as evaluation metrics.
P refers to the proportion of targets predicted as positive that are truly positive. Improving P means reducing the false-positive rate, which directly reflects the proportion of falsely detected actions in human motion recognition tasks.
R refers to the proportion of true positive instances that are successfully detected. The higher the R, the more positive samples the model captures and the fewer detections are missed.
The F1 score is the harmonic mean of P and R, used to measure the overall performance of a model in imbalanced scenarios. The higher the F1 score, the more accurate and comprehensive the model's predictions for positive cases.
This experiment measures the performance of the model from multiple perspectives based on the above indicators, comprehensively considering the model’s effectiveness from the aspects of classification accuracy, localization robustness, and adaptability to overlapping areas, providing a basis for subsequent algorithm optimization and practical deployment.
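For reference, the three frame-level metrics follow directly from the detection counts, as in this small helper; the counts used in the example are made up for illustration.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from detection counts.

    P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2PR / (P + R).
    """
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Example: 80 true positives, 20 false positives, 10 missed detections.
print(precision_recall_f1(80, 20, 10))  # -> (0.8, 0.888..., 0.842...)
```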
Figure 5 shows the comparison of training curves between YOLO11-AN and YOLO11 on four indicators: P, R, mAP50, and mAP50–95.

Figure 5. Comparison of experimental results between YOLO11-AN and YOLO11.
Figure 6 shows the comparison of confusion matrices between YOLO11-AN and YOLO11. As shown in the figure, YOLO11-AN has reduced the false detection and missed detection rates in the main action categories, significantly reduced the confusion between action categories, and demonstrated better fine-grained discrimination ability.

Figure 6. Comparison of confusion matrices between YOLO11-AN (a) and YOLO11 (b).
To verify the effectiveness of the improved MPDIoU loss function, we compared it with the loss functions widely used in YOLO11, using YOLO11-AN as the base model. The comparison results are shown in Table 5. Although GIoU is slightly better than MPDIoU on the P metric, MPDIoU outperforms the other IoU-based losses, including GIoU, on the comprehensive metrics R, F1, and mAP, demonstrating its effectiveness in localization accuracy and regression stability.
Table 5. MPDIoU Validity Verification.
The bold values indicate the best-performing results within each column.
The ablation experiment aims to evaluate the individual contribution of each key module in the proposed architecture. Table 6 shows eight experimental configurations. Num1 represents the baseline YOLO11 model. P1–P4 correspond to the four proposed components: C3K2-DMAF module (P1), dual-branch AUX detection head (P2), LocalWindowAttention (P3), and MPDIoU loss function (P4). Num8 integrates all four modules and constitutes the full YOLO11-AN framework.
Table 6. YOLO11-AN Ablation Experiment.
The bold values indicate the best-performing results within each column.
To ensure experimental consistency, all settings—including the dataset splits, input resolution, optimizer, and training schedule described above—were kept identical across the eight configurations; only the presence or absence of the corresponding modules was varied.
To further validate the reliability and stability of the proposed model, we conducted three independent training runs on the Pascal VOC 2012 dataset for both YOLO11 and YOLO11-AN, using different random seeds (17, 87, and 143). The per-run results of mAP50 and mAP50–95 are reported in Table 7.
Table 7. Per-run Results on Pascal VOC 2012 (mAP50 and mAP50–95).
The average performance and corresponding standard deviations are summarized in Table 8. YOLO11-AN consistently outperforms YOLO11 with small variance across runs, indicating stable convergence behavior and statistically reliable performance gains.
Table 8. Mean and Standard Deviation of mAP50 and mAP50–95 Across Three Runs.
The bold values indicate the best-performing results within each column.
These results confirm that the proposed improvements in YOLO11-AN lead to not only higher accuracy but also enhanced training stability. The model exhibits robust performance under varying initialization conditions, supporting its practical applicability in real-world deployment scenarios.
To further assess the reliability of the model’s confidence predictions, a lightweight calibration analysis was performed based on the per-class mAP50 statistics of YOLO11 and YOLO11-AN. As shown in Table 9, YOLO11-AN not only increases the mean mAP50 from 0.520 to 0.537 but also reduces the interclass variance from 0.0283 to 0.0201. In addition, the expected calibration error (ECE) decreases from 0.042 to 0.031, indicating that the predicted confidence scores exhibit a closer correspondence to the empirical detection accuracy. These results collectively suggest that YOLO11-AN produces more stable and consistent confidence estimates across action categories, complementing the improvements observed in overall detection performance.
Table 9. Per-class mAP50 and Calibration Statistics for YOLO11 and YOLO11-AN.
Note. ECE = expected calibration error.
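The ECE values reported above follow the standard binned definition: confidence scores are grouped into bins, and the weighted average of the gap between accuracy and mean confidence is accumulated per bin. A small sketch of that computation, with an assumed 10-bin scheme, is given below.

```python
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """ECE: weighted average over confidence bins of |accuracy - mean confidence|."""
    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()    # empirical accuracy in this bin
            conf = confidences[mask].mean()  # mean predicted confidence in this bin
            ece += mask.mean() * abs(acc - conf)
    return float(ece)
```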
To assess the generalization capability of YOLO11-AN in real-time human motion recognition across diverse action categories and video domains, we conducted experiments on two widely used video action recognition benchmarks: UCF101 and HMDB51. Although these datasets consist of labeled video clips, the YOLO-based framework supports per-frame inference, making it suitable for real-time deployment on video streams.
For experimental consistency and annotation alignment, we extracted the midpoint frame from each video using automated scripts and treated the entire frame as an instance carrying the video’s action label. This strategy simulates the streaming input conditions encountered in frame-by-frame inference for real-world applications.
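The midpoint-frame extraction can be sketched with OpenCV as follows; the function name and file paths are illustrative rather than the exact script used in the study.

```python
import cv2

def extract_midpoint_frame(video_path: str, out_path: str) -> bool:
    """Grab the temporal midpoint frame of a video and save it as an image."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(total // 2, 0))  # seek to the midpoint
    ok, frame = cap.read()
    cap.release()
    if ok:
        cv2.imwrite(out_path, frame)
    return ok
```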
The resulting datasets include 13,320 frames from UCF101 and 6,766 frames from HMDB51. The data were split into training, validation, and test sets.
Table 10 shows the results. On UCF101, YOLO11-AN achieves a mAP50 of 0.951, improving upon YOLO11’s 0.912. On HMDB51, YOLO11-AN reaches 0.902, surpassing YOLO11’s 0.847. These results highlight the strong domain adaptability and robustness of the proposed model in real-time video-based motion recognition scenarios.
Table 10. Generalization Performance on UCF101 and HMDB51 Datasets.
The bold values indicate the best-performing results within each column.
To further evaluate the practical performance of YOLO11-AN on actual video streams, we designed a validation algorithm that loads the trained YOLO11 and YOLO11-AN models and performs per-frame inference on videos from UCF101 and HMDB51. A script was used to calculate the proportion of total video duration in which the predicted category matches the ground truth label, serving as a proxy for temporal prediction consistency and stability.
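A simplified version of this validation procedure is sketched below, treating the fraction of correctly classified frames as a proxy for the correctly predicted duration at a constant frame rate; the label-matching logic is deliberately minimal and the file paths are placeholders.

```python
import cv2
from ultralytics import YOLO

def correct_duration_ratio(model_path: str, video_path: str, true_label: str) -> float:
    """Fraction of frames (a duration proxy at constant FPS) whose
    top-confidence detection matches the ground-truth action label."""
    model = YOLO(model_path)
    cap = cv2.VideoCapture(video_path)
    total, correct = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        total += 1
        result = model(frame, verbose=False)[0]
        if len(result.boxes):
            top = int(result.boxes.conf.argmax())      # highest-confidence detection
            cls_id = int(result.boxes.cls[top])
            if result.names[cls_id] == true_label:
                correct += 1
    cap.release()
    return correct / total if total else 0.0
```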
On UCF101, the YOLO11 model achieved a correct prediction duration ratio of 68.372%, while YOLO11-AN reached 79.754%. On HMDB51, the corresponding values were 60.213% for YOLO11 and 66.892% for YOLO11-AN. These results reinforce that YOLO11-AN not only improves frame-level recognition accuracy but also enhances temporal stability in continuous video inference.
The consistent performance gains of YOLO11-AN across both static frame metrics and dynamic video-level duration analysis affirm its strong generalization ability across action domains and further substantiate its overall effectiveness as a real-time human motion recognition solution.
In this section, YOLO11-AN is compared with SSD and other high-performance algorithms proposed by researchers under the same dataset conditions to evaluate its performance in accuracy, speed, and robustness. Table 11 presents the quantitative results of each algorithm on indicators such as P, R, mAP50, and mAP50–95.
Table 11. Comparison Experiment of Object Detection Algorithms.
The bold values indicate the best-performing results within each column.
The experimental results show that YOLO11-AN outperforms other methods in overall recognition performance while maintaining a high detection speed. This advantage indicates that it has better generalization ability and deployment potential in scenarios with diverse data and complex poses.
To provide stronger peer-reviewed baselines, three compact detectors widely recognized in the literature—RT-DETR-R18 (Zhao et al., 2024), PP-YOLOE-Tiny (Xu et al., 2022), and EfficientDet-D0 (Tan et al., 2020)—were reproduced under experimental settings identical to those used for YOLO11-AN. All models were trained from scratch under the same data configuration, input resolution, and training schedule.
As shown in Table 12, RT-DETR-R18 achieves the highest detection accuracy (mAP50 = 0.540), but at a markedly higher computational cost than YOLO11-AN (mAP50 = 0.537), whose accuracy is only marginally lower.
Table 12. Lightweight Detectors on Pascal VOC 2012: Accuracy and Efficiency Tradeoff.
Moreover, YOLO11-AN surpasses PP-YOLOE-Tiny and EfficientDet-D0 across all evaluated metrics and offers the most favorable tradeoff between accuracy and computational cost. These results confirm that the observed performance advantages of YOLO11-AN are not limited to weaker baselines, but hold even when compared against well-established lightweight models.
To evaluate the practicality of YOLO11-AN in real-world edge computing environments, we deployed and tested the model on the NVIDIA Jetson Orin Nano 4GB development board—an embedded platform widely used for low-power vision applications. For a fair comparison, several representative lightweight models were also reproduced and tested under identical runtime conditions. All models were quantized to INT8 precision, and the input resolution was kept uniform across all models.
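With the Ultralytics export interface, INT8 quantization for TensorRT deployment can be sketched as follows; the checkpoint and calibration-data file names are placeholders.

```python
from ultralytics import YOLO

# Export the trained detector to a TensorRT engine with INT8 precision for the
# Jetson Orin Nano. The weight file and dataset YAML names are illustrative.
model = YOLO("yolo11_an_best.pt")     # assumed trained checkpoint
model.export(
    format="engine",                  # TensorRT engine
    int8=True,                        # INT8 post-training quantization
    data="voc2012_action.yaml",       # calibration data (assumed config)
    imgsz=640,
)
```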
Table 13 summarizes the results. MobileNetV2 and MobileNetV3-Small exhibit low computational complexity, but this efficiency comes at the cost of recognition accuracy.
Table 13. Comparison of Model Effectiveness on the Edge Device.
The bold values indicate the best-performing results within each column.
YOLO11 achieves a balance between accuracy and speed, but its performance remains constrained by limited cross-scale representation and regression stability. In contrast, YOLO11-AN attains the highest mAP50 (0.537) among all tested models, and achieves an inference speed of 15.8 FPS—sufficient for real-time deployment in most human-centered interaction tasks such as security monitoring, patient activity tracking, and smart sports analysis.
Despite a slight increase in parameter count (from 2.46M to 2.86M), YOLO11-AN maintains the same GFLOPs (6.3) as YOLO11, which confirms the structural efficiency of its modular improvements. The added modules—C3K2-DMAF, dual AUX heads, MPDIoU, and LocalWindowAttention—bring tangible performance gains without significantly compromising inference latency or memory footprint. This balance indicates that YOLO11-AN is not only deployable on resource-limited hardware but also superior in ensuring stable and accurate recognition of human motions under edge constraints.
This study proposes YOLO11-AN, a lightweight and action-aware detection framework designed for real-time human motion recognition. By enhancing spatial representation, optimizing task interaction, and improving localization robustness, the proposed model delivers consistent performance gains across diverse datasets and deployment environments. Extensive experiments on Pascal VOC 2012, UCF101, and HMDB51 demonstrate that YOLO11-AN achieves higher accuracy, stronger stability under random initialization, and improved generalization compared with the baseline YOLO11 and a range of representative lightweight detectors. Furthermore, the deployment on Jetson Orin Nano confirms the model’s suitability for low-power edge applications, achieving real-time inference without compromising recognition fidelity. These results highlight the effectiveness of the proposed framework and its potential for integration into practical human–computer interaction and intelligent perception systems. Future work will focus on addressing the remaining limitations observed in the experiments, including exploring more aggressive compression and distillation strategies, enhancing recognition of fine-grained action categories, and conducting broader cross-domain evaluations to further improve robustness and adaptability.
Ethical, Privacy, and Safety Considerations
This study relies exclusively on publicly available and anonymized datasets (Pascal VOC 2012, UCF101, and HMDB51), and no private or personally identifiable information was collected, stored, or processed. All training and inference procedures were executed fully on local GPU hardware without any form of data transmission or cloud-based computation, effectively eliminating risks of data leakage or unauthorized access. The proposed YOLO11-AN model performs only action-category prediction and does not attempt identity recognition, demographic inference, or biometric profiling. Since the datasets contain no sensitive attributes, the study does not introduce real-world bias or safety concerns. The methodological scope and experimental protocol, therefore, adhere to established ethical standards for human-centric computer vision research.
Acknowledgments
We sincerely thank all the people and institutions who provided assistance during this research. We thank our colleagues at the Xifu River 402 Laboratory for their valuable feedback and technical support, and Professor Haitao Meng for his guidance and suggestions during the research and writing of this paper. We also extend special thanks to all the researchers who have used and shared the Pascal VOC 2012 dataset, which played a crucial role in the smooth progress of this study. Finally, we thank all anonymous reviewers who provided constructive feedback during the review process.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Appendix A Notation Table
Table A1. Summary of Key Mathematical Symbols Used in This Paper.

| Symbol | Description |
|---|---|
| | Number of channels in input feature maps |
| | Weights of the fully connected layers for dynamic fusion |
| | Global average pooled vector across channels |
| | Scalar weights for dynamic branch fusion |
| | Output feature maps from the two parallel branches |
| | Fused feature map after weighted aggregation |
| | Weight matrix controlling channel expansion |
| | Predicted expansion ratio (range: 1.0–3.0) |
| | Convolution kernel size |
| | Ghost convolution expansion factor |
| FLOPs | Floating-point operations per inference |
| | Aggregated feature from max and avg pooling (channel-wise) |
| | Spatial attention map |
| | Input/output of the attention module |
| | Feature maps between layer l and layer l + 1 |
| | Nonlinear feature transformation function |
| Dropout | Dropout operation used for regularization |
| | Dropout probability during training |
| | Initial dropout rate |
| | Dropout decay factor |
| | Predicted class probability distribution |
| | Weight matrix in classification head |
| | Classification and regression loss terms |
| | Loss weights for classification and regression |
| | Modified position-dependent IoU loss |
| | Weights for center, scale, and aspect ratio errors |
| IoU | Intersection-over-union between boxes |
| | Predicted and ground-truth box center coordinates |
| | Width and height of predicted box |
| | Width and height of ground-truth box |
| | Diagonal length of enclosing box |
| Q, K, V | Query, key, value in attention mechanism |
| | Embedding dimension in self-attention |
| | Computed self-attention output |
| mAP50 | Mean average precision at IoU threshold = 0.5 |
| mAP50–95 | Mean AP over IoU thresholds from 0.5 to 0.95 |
| F1 | F1 score (harmonic mean of precision and recall) |
