Abstract
To address inadequate multi-scale feature adaptation and the loss of fine-grained details in facial expression recognition, we propose GFB-YOLO (Global Edge Information Transfer Path + Flexible Inception Mixer Bottleneck-YOLO), a dual-path enhancement network built on the YOLOv11 object detection framework. First, we introduce a Flexible Inception Mixer Bottleneck (FIMB) into the backbone network to reconstruct the convolution structure. In addition, a deformable convolution kernel weighting mechanism enables the model’s receptive field to expand adaptively according to the scale of the input face. Second, we design a Global Edge Information Transfer Path (GEITP) that operates in parallel with the detection backbone: a Sobel Edge Generator (SEG) constructs a three-level feature pyramid, and a Detail Fusion Module (DFM) enhances these features for cross-scale edge reinforcement. Experimental results show that the model achieves an expression detection accuracy of 84.6% on the original RAF-DB dataset, a 6.1% improvement over the original YOLOv11, and that it significantly outperforms other models across multiple expressions. Notably, the model shows substantial improvements in recognizing expressions such as fear and disgust, which highlights the effectiveness of the proposed modules in extracting the fine-grained features needed to recognize subtle and easily confused facial expressions. This work overcomes the limitation of traditional expression recognition models, which focus more on classification than on localization, and offers a new approach to real-time emotion computing based on the object detection paradigm.
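To make the idea of a three-level edge pyramid concrete, the following is a minimal sketch of a generic Sobel edge pyramid in plain Python. It is not the paper's SEG: the abstract does not specify how the SEG downsamples or pads, so the 2x2 average pooling, the border clamping, and all function names here (`sobel_magnitude`, `downsample2x`, `edge_pyramid`) are assumptions made purely for illustration.

```python
# Illustrative sketch only: a generic Sobel edge pyramid, NOT the paper's SEG.
# Pooling, border handling, and all names are assumptions.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Return the Sobel gradient magnitude of a 2D grayscale image (list of rows).

    Borders are handled by clamping coordinates (edge replication).
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    pixel = img[yy][xx]
                    gx += SOBEL_X[dy + 1][dx + 1] * pixel
                    gy += SOBEL_Y[dy + 1][dx + 1] * pixel
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


def downsample2x(img):
    """Halve the resolution with 2x2 average pooling (assumes even height and width)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [
            (img[2 * y][2 * x] + img[2 * y][2 * x + 1]
             + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(w)
        ]
        for y in range(h)
    ]


def edge_pyramid(img, levels=3):
    """Compute one Sobel edge map per level, halving the image between levels."""
    pyramid = []
    current = img
    for level in range(levels):
        pyramid.append(sobel_magnitude(current))
        if level < levels - 1:
            current = downsample2x(current)
    return pyramid


# Usage: an 8x8 image with a vertical step edge between columns 3 and 4.
image = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
pyramid = edge_pyramid(image)
sizes = [(len(level), len(level[0])) for level in pyramid]
```

In this toy input the edge response is concentrated in the columns next to the step at every level, which is the kind of multi-scale edge cue a fusion module could then combine with backbone features. In a real detector these maps would be computed on tensors and fused with learned layers; that part is outside the scope of this sketch.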
