Abstract
As the primary operators of vehicles, drivers play a decisive role in traffic safety. Numerous studies have identified driver behavior as a major contributing factor in traffic crashes, with distracted driving among the most frequent and dangerous causes. To mitigate these risks, recent work has focused on real-time driver-monitoring systems built on vision-based models. However, existing models often rely solely on global image features and struggle to balance recognition accuracy with inference efficiency. In this study, we propose a novel pose-guided multilevel fusion network (PG-MFNet). Specifically, driver pose features are introduced to direct attention toward behavior-relevant local regions near body keypoints while also modeling the spatial relationships among body parts. A multilevel fusion strategy then progressively integrates low-level geometric contours, mid-level structural patterns, and high-level semantic cues, enabling comprehensive behavioral understanding from fine-grained detection to global interpretation. Moreover, we introduce a feature conditional attention module that dynamically adjusts class-specific representations based on inter-class differences, enhancing discriminability across behavior classes. Furthermore, to support training under varied real-world scenarios, we construct SAA13, a large-scale dataset that aggregates diverse drivers, driving contexts, and sensor viewpoints from multiple sources. Experimental results show that PG-MFNet achieves 92.16% accuracy at 68.1 frames per second (FPS), outperforming state-of-the-art (SOTA) models in balancing performance and efficiency. These advances offer a practical and scalable solution for real-time distracted-driving detection and driver monitoring, providing reliable behavioral tracking for intelligent transportation systems.
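The progressive fusion idea described above can be sketched in a few lines. The snippet below is only an illustrative toy, not the authors' implementation: all layer widths, weight shapes, and function names (`project`, `multilevel_fuse`) are invented for the example. It shows the general pattern of folding lower-level features into successively higher-level ones before a final classification over behavior classes.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(x, w):
    """Linear projection followed by ReLU (hypothetical fusion layer)."""
    return np.maximum(x @ w, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multilevel_fuse(low, mid, high, w_low, w_mid, w_high):
    """Progressively integrate low-, mid-, and high-level features."""
    f = project(low, w_low)                         # geometric contours
    f = project(np.concatenate([f, mid]), w_mid)    # + structural patterns
    f = project(np.concatenate([f, high]), w_high)  # + semantic cues
    return f

# Toy dimensions (assumed): 16-d low, 32-d mid, 64-d high features,
# and 13 behavior classes (matching the SAA13 naming, purely illustrative).
d = 64
w_low  = rng.standard_normal((16, d)) * 0.1
w_mid  = rng.standard_normal((d + 32, d)) * 0.1
w_high = rng.standard_normal((d + 64, d)) * 0.1
w_cls  = rng.standard_normal((d, 13)) * 0.1

fused = multilevel_fuse(rng.standard_normal(16),
                        rng.standard_normal(32),
                        rng.standard_normal(64),
                        w_low, w_mid, w_high)
probs = softmax(fused @ w_cls)  # probability over the 13 behavior classes
```

In practice the fused vector would feed the conditional attention module before classification; here a plain linear head stands in for it.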
