Abstract
Video captioning, which aims to generate natural language descriptions of video content, has made significant progress with the development of pretrained language models and multimodal learning. However, maintaining semantic consistency between generated captions and the underlying video remains a persistent challenge, often leading to inaccurate or contextually misaligned descriptions. This paper proposes a novel cross-modal contrastive learning framework that enhances semantic consistency in video captioning by improving the alignment between visual and textual representations. The proposed method incorporates a dual-branch contrastive learning strategy that refines feature extraction from both the video and text modalities while enforcing a fine-grained semantic matching mechanism. Furthermore, we introduce a semantic consistency loss that penalizes mismatches between generated captions and their corresponding video content. To evaluate the effectiveness of our approach, we conduct extensive experiments on benchmark datasets, including MSR-VTT and ActivityNet Captions. The results demonstrate that our method significantly improves semantic alignment, outperforming state-of-the-art models on BLEU, METEOR, and CIDEr.
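Since only the abstract is available here, the following is a minimal sketch of how a dual-branch contrastive objective with an additional semantic consistency penalty is commonly implemented; the symmetric InfoNCE formulation, the cosine-based penalty, and all names and hyperparameters (temperature, margin, loss weighting) are illustrative assumptions, not the authors' exact losses.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings.

    video_emb, text_emb: (B, D) tensors; matched pairs share the same row index.
    Pulls matched pairs together and pushes apart all in-batch negatives.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)     # video -> text direction
    loss_t2v = F.cross_entropy(logits.T, targets)   # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)

def semantic_consistency_penalty(caption_emb, video_emb):
    """Simple cosine-distance stand-in for a semantic consistency loss:
    penalizes generated-caption embeddings that drift from their paired video."""
    cos = F.cosine_similarity(caption_emb, video_emb, dim=-1)
    return (1.0 - cos).mean()

# Usage sketch: combine with the usual captioning cross-entropy during training.
B, D = 32, 512
video_emb, text_emb, caption_emb = (torch.randn(B, D) for _ in range(3))
aux_loss = symmetric_infonce(video_emb, text_emb) \
         + 0.5 * semantic_consistency_penalty(caption_emb, video_emb)
```

In this sketch the contrastive term aligns the two encoder branches at the batch level, while the consistency term directly penalizes mismatch between a generated caption's embedding and its source video; the 0.5 weighting is an arbitrary placeholder.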
