Abstract
Existing photoelectric detection systems often suffer from inadequate image feature extraction accuracy and suboptimal performance in multimodal information fusion, particularly under complex and dynamic environmental conditions. To address these challenges, this paper presents a novel photoelectric detection framework based on the self-supervised vision model DINOv2 and a cross-modal Transformer architecture with deformable attention. First, DINOv2 is employed to extract rich, global semantic features from visible and infrared images, generating high-fidelity visual representations at a unified scale without reliance on large annotated datasets. Second, a deformable cross-modal attention mechanism within a Transformer-based fusion network is designed to enable adaptive spatial alignment and deep integration of heterogeneous modalities, effectively capturing long-range dependencies and local structural correspondences. Finally, a self-supervised fine-tuning module based on contrastive learning is introduced to enhance feature consistency across modalities and improve the robustness of the fused representation under environmental variations. Experimental results on benchmark multimodal detection tasks demonstrate that the proposed method achieves a target recognition accuracy ranging from 86.4% to 93.2%, with a maximum performance gain of 7.8% over baseline models. Moreover, the cross-modal alignment error is reduced to within 2.7%, indicating superior fusion precision. The proposed framework significantly enhances both detection accuracy and fusion consistency, offering a promising solution for the development of high-performance, robust photoelectric detection systems in real-world scenarios.
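To make the pipeline summarized above concrete, the following is a minimal PyTorch sketch of its three stages: DINOv2 patch-token extraction for both modalities, cross-modal attention fusion, and an InfoNCE-style contrastive consistency objective. It is illustrative only and not the authors' implementation: the torch.hub entrypoint and the ViT-S/14 variant are assumptions, a standard multi-head cross-attention stands in for the paper's deformable attention mechanism, and all module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Sketch of the described pipeline: DINOv2 features from visible and
    infrared images are fused with cross-attention, and an InfoNCE loss
    encourages cross-modal feature consistency (names are illustrative)."""

    def __init__(self, dim=384, num_heads=6, temperature=0.07):
        super().__init__()
        # Frozen self-supervised backbone (ViT-S/14, embed dim 384), shared by both modalities.
        self.backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
        self.backbone.requires_grad_(False)
        # Standard cross-attention used here in place of the paper's deformable variant.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)   # projection head for the contrastive objective
        self.temperature = temperature

    @torch.no_grad()
    def extract(self, images):
        # Per-patch tokens (B, N, dim); H and W must be multiples of the 14-px patch size,
        # and single-channel infrared input is assumed to be replicated to 3 channels.
        return self.backbone.forward_features(images)['x_norm_patchtokens']

    def forward(self, visible, infrared):
        vis_tok = self.extract(visible)     # queries: visible-band tokens
        ir_tok = self.extract(infrared)     # keys/values: infrared tokens
        fused, _ = self.cross_attn(vis_tok, ir_tok, ir_tok)
        fused = self.norm(fused + vis_tok)  # residual fusion of the two modalities

        # Contrastive consistency: pooled embeddings of paired images should match
        # each other and not the other pairs in the batch (InfoNCE).
        z_vis = F.normalize(self.proj(vis_tok.mean(dim=1)), dim=-1)
        z_ir = F.normalize(self.proj(ir_tok.mean(dim=1)), dim=-1)
        logits = z_vis @ z_ir.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        consistency_loss = F.cross_entropy(logits, targets)
        return fused, consistency_loss
```

Keeping the backbone frozen and training only the fusion and projection layers reflects the abstract's claim that no large annotated dataset is required; the deformable variant described in the paper would additionally predict per-query sampling offsets over the infrared token grid rather than attending densely to all tokens.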
