Abstract
Video captioning, which aims to generate natural language descriptions of video content, has made significant progress with the development of pretrained language models and multimodal learning. However, maintaining semantic consistency between generated captions and the underlying video remains a persistent challenge, often leading to inaccurate or contextually misaligned descriptions. This paper proposes a novel cross-modal contrastive learning framework that enhances semantic consistency in video captioning by improving the alignment between visual and textual representations. The proposed method incorporates a dual-branch contrastive learning strategy that refines feature extraction from both the video and text modalities while enforcing a fine-grained semantic matching mechanism. Furthermore, we introduce a semantic consistency loss that penalizes mismatches between generated captions and their corresponding video content. To evaluate the effectiveness of our approach, we conduct extensive experiments on benchmark datasets, including MSR-VTT and ActivityNet Captions. The results demonstrate that our method significantly improves semantic alignment, outperforming state-of-the-art models on BLEU, METEOR, and CIDEr.
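Since only the abstract is available here, the following is a minimal sketch of how a dual-branch contrastive objective with an additional semantic consistency penalty is commonly implemented; the symmetric InfoNCE formulation, the cosine-based penalty, and all names and hyperparameters (temperature, margin, loss weighting) are illustrative assumptions, not the authors' exact losses.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings.

    video_emb, text_emb: (B, D) tensors; matched pairs share the same row index.
    Pulls matched pairs together and pushes apart all in-batch negatives.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)     # video -> text direction
    loss_t2v = F.cross_entropy(logits.T, targets)   # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)

def semantic_consistency_penalty(caption_emb, video_emb):
    """Simple cosine-distance stand-in for a semantic consistency loss:
    penalizes generated-caption embeddings that drift from their paired video."""
    cos = F.cosine_similarity(caption_emb, video_emb, dim=-1)
    return (1.0 - cos).mean()

# Usage sketch: combine with the usual captioning cross-entropy during training.
B, D = 32, 512
video_emb, text_emb, caption_emb = (torch.randn(B, D) for _ in range(3))
aux_loss = symmetric_infonce(video_emb, text_emb) \
         + 0.5 * semantic_consistency_penalty(caption_emb, video_emb)
```

In this sketch the contrastive term aligns the two encoder branches at the batch level, while the consistency term directly penalizes mismatch between a generated caption's embedding and its source video; the 0.5 weighting is an arbitrary placeholder.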
