Abstract
In recent years, video captioning has become a prominent research focus. Its central challenge is to capture the essential semantic elements of a video, such as objects, actions, and their spatial-temporal relationships, from abundant and redundant visual content. Earlier methods address this challenge by either extracting representative clips across multiple frames (the global level) or locating salient regions within single frames (the local level), but they largely ignore the hierarchical organization of videos, in which identifying representative frames should precede locating informative regions. To overcome this limitation, we propose G2L, a hierarchical attention framework that (1) selects salient clips and frames via differentiable Gumbel Top-K sampling and (2) refines region-level context for caption generation. Extensive experiments on the widely adopted MSVD and MSR-VTT benchmarks show that our method achieves notable improvements over existing state-of-the-art approaches. Ablations confirm that the global-to-local cascade and dual-branch optimization jointly account for the gain.
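To make the selection step concrete, the sketch below illustrates the general Gumbel Top-K idea the abstract refers to: frame saliency logits are perturbed with Gumbel noise, and the top K are picked one at a time with a straight-through softmax so the selection stays differentiable. This is a minimal illustration under our own assumptions, not the paper's implementation; the function name `gumbel_top_k` and the temperature parameter `tau` are hypothetical.

```python
import torch

def gumbel_top_k(scores: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    """Illustrative differentiable top-k selection via the Gumbel Top-K trick.

    scores: (n,) unnormalized saliency logits, one per clip/frame.
    Returns: (k, n) selection matrix, one-hot in the forward pass
    but with soft gradients (straight-through estimator).
    """
    # Perturb logits with i.i.d. Gumbel(0, 1) noise: g = -log(-log(U)).
    u = torch.rand_like(scores)
    perturbed = scores - torch.log(-torch.log(u + 1e-20) + 1e-20)

    rows, logits = [], perturbed.clone()
    for _ in range(k):
        soft = torch.softmax(logits / tau, dim=-1)            # relaxed choice
        idx = soft.argmax(dim=-1, keepdim=True)
        hard = torch.zeros_like(soft).scatter_(-1, idx, 1.0)  # discrete choice
        # Straight-through: hard one-hot forward, soft gradients backward.
        rows.append(hard + soft - soft.detach())
        # Mask the chosen index so later draws are without replacement.
        logits = logits.masked_fill(hard.bool(), float("-inf"))
    return torch.stack(rows)
```

Multiplying the returned matrix by the frame features yields the K selected features, so gradients can flow back into the saliency scorer during training.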
