Abstract
In recent years, video captioning has become a prominent research focus. Its central challenge is to capture the essential semantic elements of a video, such as objects, actions, and their spatial-temporal relationships, from abundant and redundant visual content. Earlier methods address this challenge by either extracting representative clips across multiple frames (the global level) or locating salient regions within single frames (the local level), but they largely ignore the hierarchical organization of videos, in which identifying representative frames should precede locating informative regions. To overcome this limitation, we propose G2L, a hierarchical attention framework that (1) selects salient clips and frames via differentiable Gumbel Top-K sampling and (2) refines region-level context for caption generation. Extensive experiments on the widely adopted MSVD and MSR-VTT benchmarks show that our method achieves notable improvements over existing state-of-the-art approaches. Ablations confirm that the global-to-local cascade and dual-branch optimization jointly account for the gain.
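To make the selection step concrete, the sketch below illustrates the general Gumbel Top-K idea the abstract refers to: frame saliency logits are perturbed with Gumbel noise, and the top K are picked one at a time with a straight-through softmax so the selection stays differentiable. This is a minimal illustration under our own assumptions, not the paper's implementation; the function name `gumbel_top_k` and the temperature parameter `tau` are hypothetical.

```python
import torch

def gumbel_top_k(scores: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    """Illustrative differentiable top-k selection via the Gumbel Top-K trick.

    scores: (n,) unnormalized saliency logits, one per clip/frame.
    Returns: (k, n) selection matrix, one-hot in the forward pass
    but with soft gradients (straight-through estimator).
    """
    # Perturb logits with i.i.d. Gumbel(0, 1) noise: g = -log(-log(U)).
    u = torch.rand_like(scores)
    perturbed = scores - torch.log(-torch.log(u + 1e-20) + 1e-20)

    rows, logits = [], perturbed.clone()
    for _ in range(k):
        soft = torch.softmax(logits / tau, dim=-1)            # relaxed choice
        idx = soft.argmax(dim=-1, keepdim=True)
        hard = torch.zeros_like(soft).scatter_(-1, idx, 1.0)  # discrete choice
        # Straight-through: hard one-hot forward, soft gradients backward.
        rows.append(hard + soft - soft.detach())
        # Mask the chosen index so later draws are without replacement.
        logits = logits.masked_fill(hard.bool(), float("-inf"))
    return torch.stack(rows)
```

Multiplying the returned matrix by the frame features yields the K selected features, so gradients can flow back into the saliency scorer during training.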
