Abstract
In recent years, the wide application of weakly supervised temporal action localization (WTAL) technology has improved the efficiency of video analysis. However, the task still faces numerous challenges, chiefly the lack of precise temporal annotations, which leaves models highly susceptible to contextual background noise and overly reliant on the most prominent action segments, resulting in suboptimal localization. To alleviate this problem, we propose the contrastive learning-based action salience network (CLASNet), which comprises two pivotal modules: a feature contrast separation module (FCSM) and a boundary refinement module (BRM). The FCSM adopts a contrastive learning approach to separate action features from background features, thereby enhancing feature discriminability. The BRM introduces a boundary refinement loss to rectify the temporal boundaries of actions, further improving the precision of temporal localization. Working in concert, these two modules effectively resolve the ambiguity issues in temporal action localization under weak supervision and markedly enhance localization accuracy. Furthermore, CLASNet is versatile and can be integrated into different WTAL frameworks, improving localization performance while preserving the original end-to-end training manner. On three large-scale benchmark action localization datasets, THUMOS14, ActivityNet v1.2, and ActivityNet v1.3, we embed CLASNet into several cutting-edge weakly supervised temporal action localization methods, including CO2-Net, DELU, and ACRNet, for empirical validation. The experimental results show that CLASNet significantly enhances the action localization performance of these methods, offering new perspectives for the advancement of temporal action localization technology.
Introduction
With the continuous growth of online video creators and users, video content output has reached an unprecedented level, bringing video understanding to the forefront of computer vision research. However, the visibility and discriminability of actions in videos are influenced by their diverse characteristics, including type, duration, and frequency. Temporal action localization is therefore a critical task: it accurately identifies essential action information in complex videos, thereby enhancing the effectiveness and efficiency of video understanding. The task also has broad potential applications across domains such as intelligent video analysis (Guo et al., 2021), surveillance (Vishwakarma and Agrawal, 2013), video summarization (Habib et al., 2021), video retrieval (Lin et al., 2022), and driving behavior detection (Gabeur et al., 2020).
Existing methods for temporal action localization can be classified into two categories based on the type of supervision: fully supervised temporal action localization and weakly supervised temporal action localization (WTAL). The majority of current research methods focus on the fully supervised setting, where precise frame-level annotations are available, enabling these methods to achieve high localization accuracy. However, video annotation is a time-consuming, laborious, and error-prone process, and the consistency and quality of the annotations are hard to guarantee. Consequently, the weakly supervised setting has recently garnered more attention from researchers. This task aims to recognize the categories of action instances (e.g., running, swimming, and high jump) in videos and precisely locate the temporal boundaries of each action, given only video-level labels without any additional information (Baraka and Mohd Noor, 2022).
Currently, researchers have conducted comprehensive investigations into weakly supervised temporal action localization, proposing two categories of approaches: attention mechanism-based methods (Dou and Hu, 2023; He et al., 2022; Huang et al., 2022; Lee et al., 2022) and foreground–background modeling-based methods (Lee et al., 2021; Li et al., 2022; Moniruzzaman and Yin, 2023; Yang et al., 2021). Attention mechanism-based methods leverage an attention mechanism to identify the segments most likely to contain actions, enabling the model to concentrate on the keyframes in the video and enhance the precision of action localization. However, these methods have limitations: each video usually contains multiple action segments, and with only video-level annotations the model tends to concentrate on the most distinctive action segments or to mistakenly attend to background segments. To deal with complex backgrounds, researchers proposed foreground–background modeling-based methods, which improve the robustness and accuracy of action localization by differentiating the action foreground from the irrelevant background in the video. However, these methods are still plagued by underlocalization or overlocalization of ambiguous segments. More effective weakly supervised learning methods and models are therefore needed to enhance the discriminability of action features and the precision of action boundaries, and to pave the way for broader applications.
To mitigate the aforementioned challenges, we propose the contrastive learning-based action salience network (CLASNet). Drawing inspiration from the successful deployment of contrastive learning in weakly supervised semantic segmentation, we design a novel feature contrast separation module (FCSM). This module encompasses a temporal attention weight generation mechanism and a contrast separation mechanism. The former leverages a temporal convolutional network (TCN) to expand the receptive field of the model, enabling it to obtain the temporal information of the entire action instance, eliminate the temporal discontinuity caused by the short duration of individual segments, and capture the temporal dependencies among different segments within the receptive field. The latter utilizes the attention weights to separate the features and determine more reliable foreground regions. However, contrastive learning may introduce noisy feature separation, which can degrade the localization performance of the model. Therefore, we introduce a boundary refinement module (BRM) to achieve more precise boundary localization. This module takes as input the attention scores from the feature separation module and the temporal class activation sequence (TCAS) from the WTAL model, and employs a boundary refinement loss to enhance the alignment between the TCAS and the attention scores, which corrects the temporal boundaries and achieves precise temporal localization. Through their collaborative functioning, the two modules effectively resolve the ambiguity issues in temporal action localization under weak supervision and enhance localization accuracy. Furthermore, the proposed CLASNet is a plug-and-play module that can be incorporated into any existing WTAL method, thereby boosting its action localization performance and yielding more accurate localization results.
The rest of the paper is organized as follows. Section 2 discusses related work about weakly supervised temporal action localization. Section 3 describes the proposed method in detail, including the conceptual framework for model construction, the basic modules, and the training strategies. Experimental results and discussions are presented in Section 4. Finally, we conclude the paper in Section 5.
Related Works
In recent years, weakly supervised temporal action localization has gained increasing attention due to its low annotation cost, diverse action categories, and efficient localization. Researchers have actively explored various innovative methods to address the challenges in this domain. Some studies leverage convolutional neural networks to extract video features and then utilize recurrent neural networks to capture temporal features for action localization. However, these methods struggle with long-duration actions and consume considerable computational resources, which hinders accurate action localization in videos. In contrast, humans can usually focus on the segments of interest when watching videos and thus capture action details more precisely. Inspired by this, researchers have incorporated attention mechanisms into weakly supervised temporal action localization. HAM-Net (Jalayer et al., 2022) employs a hybrid attention mechanism that captures the most discriminative frames by considering their relationships. CO2-Net (Hong et al., 2021) introduces a cross-modal consistency network that uses a cross-modal attention mechanism to readjust features by exploiting the global information of the primary modality and the cross-modal local information of the auxiliary modality, reducing information redundancy irrelevant to the task. However, these methods often require additional regularization terms to ensure the distinctiveness or complementarity of scores across branches or attention mechanisms, making it challenging to determine an appropriate number of branches or attention heads for all action categories. To address this issue, researchers attempt to use intervideo information to guide attention allocation. ASM-Loc (He et al., 2022) uses intra- and intersegment attention to model action dynamics and capture temporal dependencies. RSPK (Huang et al., 2022) adopts a framework that effectively summarizes and propagates snippet-level knowledge across videos, using expectation–maximization attention to capture the critical semantics of each video as representative segments. However, due to the complex background in videos, such models still tend to focus on the most distinctive action snippets or mistakenly focus on background snippets.
To overcome the interference of complex backgrounds in videos, researchers have proposed foreground–background modeling-based methods. However, foreground–background separation remains extremely challenging due to the lack of video-level labels for background classes. Lee et al. (2021) apply uncertainty modeling to model background frames and separate action frames from them. BaS-Net (Kim and Cho, 2022) robustly separates the context (such as backgrounds resembling actions) by establishing a reliable background probability, which enables more accurate localization of action time intervals. Moreover, Chen et al. (2023) propose a cascade evidential learning framework that combines multiscale temporal context and knowledge-guided prototype information to gradually collect cascaded and enhanced evidence for separating known actions, unknown actions, and backgrounds. However, these methods may neglect subactions with low identifiability or misclassify background segments resembling actions as actual actions, leading to underlocalization or overlocalization. Some researchers attempt to use contrastive learning to achieve foreground–background separation and enhance feature discriminability. Tang et al. (2023) propose an overconfidence suppression strategy to mitigate the influence of overconfident pseudo-labels; a simplified contrastive learning method is then used to fine-tune the feature representation and increase the separation of foreground and background segments. Li et al. (2022) propose a novel de-noising cross-video contrastive algorithm, which aims to reduce the impact of segmentation errors on positive/negative sample pairs. Furthermore, CoLA (Zhang et al., 2021) exploits contrastive learning to perceive accurate temporal boundaries and avoid the interruption of time intervals, achieving classification and localization of ambiguous segments in videos. However, the absence of precise time-stamped labels may introduce noise from foreground and background segments. We therefore propose CLASNet. First, in the feature contrast separation module, we utilize a TCN (Bai et al., 2018) to capture the temporal information in the video, thereby guiding multihead attention (MHA) to generate reliable attention weights. Then, we achieve foreground–background separation through a contrastive loss, increasing feature discriminability. Meanwhile, we introduce a BRM to solve the noisy separation problem caused by contrastive learning, obtain fine-grained boundary information, and achieve accurate action localization. The BRM bears some similarity to fine-grained temporal contrastive learning (Gao et al., 2022), which distinguishes actions from backgrounds through dynamic programming and contrastive learning and extracts the longest common subsequence across videos; in contrast, our BRM focuses on improving the precision of action boundaries rather than on overall temporal contrastive learning. It should be emphasized that prior work designs fixed WTAL networks whose overall pipelines cannot be flexibly integrated as plugins into other models, whereas our CLASNet is a versatile module capable of improving various WTAL frameworks. Unlike existing studies that enhance localization performance based on preextracted segment-level features, this study emphasizes the general benefits that different WTAL frameworks derive from effective foreground–background separation.
Methodology
Structure Overview
The temporal localization of actions in unedited videos necessitates the multiple-instance learning paradigm (Maron and Lozano-Pérez, 1997). This paradigm views the entire video as a set of foreground action frames and background non-action frames, extracts the TCAS through the classifier, and optimizes it based on the provided video-level label to obtain accurate localization results. However, this process faces several challenges, which can be summarized in two aspects: (a) attention mechanism-based action localization methods tend to prioritize the most discriminative action segments or erroneously emphasize background segments; (b) while foreground–background modeling-based action localization methods enhance the robustness and accuracy of action localization, they suffer from underlocalization or overlocalization of ambiguous segments.
To tackle these challenges, we propose CLASNet, comprising two modules: the FCSM and the BRM. First, the FCSM combines a temporal attention weight generation mechanism with a contrast separation method. The former captures the complete temporal information of action instances and produces reliable attention weights, while the latter enhances feature identifiability by effectively separating action features from background features. Second, the BRM uses a boundary refinement loss to increase the similarity between the TCAS output by the WTAL model and the attention scores obtained from the FCSM. This correction of temporal boundaries addresses the ambiguous localization caused by noisy feature separation during contrast separation, ultimately achieving precise temporal localization.
By seamlessly integrating these two modules, we implement an end-to-end weakly supervised temporal action localization method that tackles ambiguity in weakly supervised environments. When embedded into different WTAL frameworks, it achieves more precise localization performance. Figure 1 illustrates the overall framework of the proposed CLASNet.

Overall framework of the proposed CLASNet. Note. CLASNet = contrastive learning-based action salience network.
TCAS or attention score inaccuracy in video action recognition primarily arises from the insufficient discriminability of segment-level features. To mitigate this issue, we propose an FCSM that captures the temporal information of the complete action instance within the video features and generates reliable cross-modal attention weights through a temporal attention weight generation mechanism and a contrast separation strategy. Based on these attention weights, the video features are divided into action-related and non-action-related features, and contrastive learning is utilized to amplify their differences, yielding a more discriminative feature representation.
Existing approaches for video action recognition make insufficient use of temporal information, leading to limited localization performance. To address this issue, we adopt multilayer dilated convolution to expand the receptive field and capture long-range dependencies among segments within the field. This design enables the model to comprehensively learn and exploit temporal features while leveraging segment information across the entire receptive field for better discrimination of action instances. Specifically, we concatenate the red–green–blue (RGB) feature and the optical flow feature of each snippet and feed the concatenated features into the stacked dilated convolution layers.
The receptive field size is expanded layer by layer as the dilation rate grows, so that the model can cover a complete action instance and capture long-range dependencies among the segments within the receptive field.
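To make this concrete, the following is a minimal PyTorch sketch of stacking 1-D convolutions with exponentially growing dilation over snippet-level features; the module name, channel sizes, kernel size, and dilation schedule are illustrative assumptions rather than the exact configuration of CLASNet.

```python
import torch
import torch.nn as nn

class DilatedTemporalEncoder(nn.Module):
    """Stacks 1-D convolutions with exponentially growing dilation so that each
    snippet aggregates context from a progressively larger temporal window.
    Channel size, kernel size, and layer count are illustrative choices."""
    def __init__(self, in_dim=2048, hid_dim=512, num_layers=3, kernel_size=3):
        super().__init__()
        layers, dim = [], in_dim
        for l in range(num_layers):
            dilation = 2 ** l                                  # 1, 2, 4, ...
            padding = (kernel_size - 1) // 2 * dilation        # preserve temporal length
            layers += [nn.Conv1d(dim, hid_dim, kernel_size,
                                 padding=padding, dilation=dilation),
                       nn.ReLU(inplace=True)]
            dim = hid_dim
        self.net = nn.Sequential(*layers)

    def forward(self, x):                  # x: (B, T, D) concatenated snippet features
        x = x.transpose(1, 2)              # -> (B, D, T) for Conv1d
        x = self.net(x)                    # aggregate long-range temporal context
        return x.transpose(1, 2)           # -> (B, T, hid_dim)

# e.g., B=2 videos, T=750 snippets, D=2048 for concatenated RGB + flow features
feats = torch.randn(2, 750, 2048)
context = DilatedTemporalEncoder()(feats)  # (2, 750, 512)
```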
The utilization of multiple modalities facilitates the acquisition of more comprehensive information than using a single modality. However, integrating different modalities may reduce specific intramodality details. To ensure reliable fusion and maximize the cross-correlation between modality features, we utilize a cross-modal MHA mechanism inspired by previous studies (Ren et al., 2022) to generate attention weights. This mechanism not only enhances the representative capacity of the model but also captures a diverse array of features. Specifically, the features of the two modalities are projected into multiple subspaces, and each attention head attends across modalities to capture complementary information.
After obtaining the multihead input, we employ a learnable weight matrix to aggregate the outputs of all attention heads into the final segment-level attention weights.
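As an illustration of how such weights could be produced, the sketch below lets one modality query the other with multi-head attention and fuses the head outputs into per-snippet attention weights; the query/key assignment, dimensions, and fusion layer are assumptions for illustration and may differ from the actual design.

```python
import torch
import torch.nn as nn

class CrossModalAttentionWeights(nn.Module):
    """Illustrative cross-modal multi-head attention: RGB snippets query the
    optical-flow snippets, head outputs are fused by a learnable projection,
    and a sigmoid yields per-snippet attention weights in [0, 1]."""
    def __init__(self, dim=1024, num_heads=4):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(dim, 1)      # learnable fusion of the attended features

    def forward(self, rgb, flow):          # both: (B, T, dim)
        attended, _ = self.mha(query=rgb, key=flow, value=flow)
        return torch.sigmoid(self.fuse(attended)).squeeze(-1)   # (B, T)

rgb, flow = torch.randn(2, 750, 1024), torch.randn(2, 750, 1024)
attn_weights = CrossModalAttentionWeights()(rgb, flow)          # per-snippet weights
```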
According to the obtained attention weights, we employ the pseudo-label generation method proposed in the literature (Ma et al., 2021) to select the top-k segments that constitute the action segment set, while the segments with the lowest attention weights form the background segment set; a contrastive loss then enlarges the separation between the features of the two sets.
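A minimal sketch of this contrast separation step, assuming a margin-based cosine-similarity loss between the pseudo action set and the pseudo background set (the exact contrastive objective in CLASNet may take a different form), is given below.

```python
import torch
import torch.nn.functional as F

def contrastive_separation_loss(feats, attn, k=8, margin=0.5):
    """Illustrative contrast-separation loss.
    feats: (T, D) snippet features; attn: (T,) attention weights.
    The k snippets with the highest weights form the pseudo action set and the
    k lowest form the pseudo background set; action features are pulled toward
    their own center while background features are pushed away from it."""
    top_idx = attn.topk(k).indices                 # pseudo action snippets
    bot_idx = (-attn).topk(k).indices              # pseudo background snippets
    act = F.normalize(feats[top_idx], dim=-1)
    bkg = F.normalize(feats[bot_idx], dim=-1)
    center = F.normalize(act.mean(0, keepdim=True), dim=-1)
    pos = (act @ center.t()).mean()                # action-to-center similarity
    neg = (bkg @ center.t()).mean()                # background-to-center similarity
    return F.relu(neg - pos + margin)              # enlarge the separation

feats, attn = torch.randn(750, 512), torch.rand(750)
loss_cs = contrastive_separation_loss(feats, attn)
```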

Visualization of the contrast separation.
The FCSM proposed in Section 3.4 inevitably introduces some noise when selecting action and background segments, which may result in the noisy separation of features and consequently harm the action localization performance of the model. To address this issue, we present a BRM. It aims to achieve accurate action localization by performing more fine-grained optimization on the generated incomplete or overcomplete action segments, using the attention scores produced by the FCSM together with the TCAS output by the WTAL model; a boundary refinement loss encourages consistency between the two, thereby correcting the temporal boundaries of actions.
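One plausible form of such a boundary refinement loss, aligning a class-agnostic foreground score derived from the TCAS with the attention score, is sketched below; the concrete formulation used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def boundary_refinement_loss(tcas, attn):
    """Illustrative boundary refinement loss.
    tcas: (T, C) temporal class activation sequence; attn: (T,) attention
    scores in [0, 1]. A foreground score is derived from the TCAS and
    encouraged to agree with the attention score, correcting ambiguous
    boundary snippets."""
    fg_score = torch.softmax(tcas, dim=-1).max(dim=-1).values   # (T,) foreground evidence
    return F.mse_loss(fg_score, attn)                            # align the two cues

tcas, attn = torch.randn(750, 20), torch.rand(750)
loss_br = boundary_refinement_loss(tcas, attn)
```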
To obtain the optimal TCAS, CLASNet adopts a joint optimization strategy with four loss functions for more accurate classification and localization: (a) the baseline model loss of the embedded WTAL method; (b) the sparsity loss on the attention weights; (c) the contrastive separation loss of the FCSM; and (d) the boundary refinement loss of the BRM. The overall objective is a weighted sum of these terms.
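Conceptually, the joint objective can be read as a weighted sum of these terms, as in the hypothetical sketch below; the loss weights and the simple L1-style sparsity term are illustrative assumptions, not the values used in the paper.

```python
import torch

def total_loss(loss_base, loss_cs, loss_br, attn,
               lam_sparse=0.1, lam_cs=1.0, lam_br=1.0):
    """Weighted sum of the baseline loss, attention sparsity, contrastive
    separation loss, and boundary refinement loss (weights are illustrative)."""
    loss_sparse = attn.mean()              # encourage sparse attention weights
    return loss_base + lam_sparse * loss_sparse + lam_cs * loss_cs + lam_br * loss_br

attn = torch.rand(750)
loss = total_loss(torch.tensor(1.2), torch.tensor(0.4), torch.tensor(0.3), attn)
```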
In the action localization stage, CLASNet is used for joint optimization to obtain the optimal TCAS of the video, which represents the probability distribution over action classes for each snippet. Because the TCAS cannot be used directly for action localization, a threshold is first set to determine which action categories occur in the video. Each continuous sequence of snippets whose probability exceeds the localization threshold is then taken as a candidate action proposal. We use the outer–inner contrastive score method (Shou et al., 2018) to assess the confidence of these proposals. Additionally, the attention weights are subjected to multiple thresholds to generate proposals at different scale levels. Finally, Soft-NMS (Bodla et al., 2017) is applied to eliminate overlapping action proposals and derive the final action localization results.
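The inference pipeline can be summarized by the simplified sketch below, which thresholds a per-class activation sequence, scores each contiguous segment with an outer–inner contrast, and suppresses overlaps with Gaussian Soft-NMS; the thresholds, frame rate, snippet stride, and outer-margin size are illustrative values rather than the paper's settings.

```python
import numpy as np

def generate_proposals(cas, thresh, fps=25, stride=16):
    """Group snippets whose activation exceeds `thresh` into contiguous proposals
    and score each with an outer-inner contrast (illustrative values)."""
    keep, proposals, t, T = cas > thresh, [], 0, len(cas)
    while t < T:
        if keep[t]:
            s = t
            while t < T and keep[t]:
                t += 1
            e = t                                       # [s, e) candidate segment
            inner = cas[s:e].mean()
            margin = max(1, (e - s) // 4)               # inflate segment for the outer area
            lo, hi = max(0, s - margin), min(T, e + margin)
            outer = np.concatenate([cas[lo:s], cas[e:hi]])
            score = inner - (outer.mean() if len(outer) else 0.0)
            proposals.append([s * stride / fps, e * stride / fps, score])
        t += 1
    return proposals

def soft_nms(proposals, sigma=0.3, score_thresh=1e-3):
    """Minimal Gaussian Soft-NMS over temporal proposals [start, end, score]."""
    props, kept = [list(p) for p in proposals], []
    while props:
        props.sort(key=lambda p: p[2], reverse=True)
        best = props.pop(0)
        kept.append(best)
        for p in props:
            inter = max(0.0, min(best[1], p[1]) - max(best[0], p[0]))
            union = (best[1] - best[0]) + (p[1] - p[0]) - inter
            iou = inter / union if union > 0 else 0.0
            p[2] *= np.exp(-(iou ** 2) / sigma)         # decay overlapping scores
        props = [p for p in props if p[2] > score_thresh]
    return kept

cas = np.random.rand(750)                               # activation of one action class
final_proposals = soft_nms(generate_proposals(cas, thresh=0.5))
```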
Experiments
To validate the effectiveness and accuracy of CLASNet, we embed it into three cutting-edge weakly supervised temporal action localization methods (CO2-Net, Hong et al., 2021; DELU, Chen et al., 2022; and ACRNet, Ren et al., 2023) and compare them with the current leading WTAL methods. We also conduct extensive ablation experiments on the various components of CLASNet to showcase the effectiveness of each module, and visually analyze the localization results to demonstrate that the proposed model achieves precise localization of ambiguous segments.
Datasets and Implementation Details
To evaluate the performance improvement of CLASNet over different methods on temporal action localization, we conduct experiments on three datasets: THUMOS14 (Idrees, 2017), ActivityNet v1.2 (Heilbron et al., 2015), and ActivityNet v1.3. The THUMOS14 dataset is a subset of the UCF101 dataset, consisting of a validation set of 1,010 videos and a test set of 1,574 videos covering 101 action categories, 20 of which have temporal annotations. This dataset is challenging because some videos contain multiple actions that are hard to distinguish. Following the setting of existing methods, we use the 200 temporally annotated validation videos for model training and the 213 test videos for evaluation. The ActivityNet v1.2 dataset contains 4,819 training videos, 2,383 validation videos, and 2,480 test videos covering 100 action categories. The ActivityNet v1.3 dataset is an extended version of ActivityNet v1.2, containing 10,024 training videos, 4,926 validation videos, and 5,044 test videos covering 200 action categories. Each video contains an average of 1.6 action instances. We implement the proposed CLASNet in PyTorch (Paszke et al., 2019), running on the Windows 10 operating system with a GeForce RTX 3060Ti GPU and an Intel i7-11700K CPU. To ensure a fair comparison, we follow the previous methods (Chen et al., 2022; Hong et al., 2021; Ren et al., 2023) and use two-stream I3D (Kay, 2017) pretrained on the Kinetics-400 dataset (Carreira and Zisserman, 2018) to extract both the RGB and optical flow features. Video snippets are sampled every 16 frames, and the feature dimension of each snippet is 1024. During training on the THUMOS14 dataset, we set the number of sampled segments to 800 and the batch size to 32, and the remaining hyperparameters are tuned empirically.
Evaluation Metrics
We adopt the standard evaluation protocol for temporal action localization, reporting the mean average precision (mAP) under different temporal intersection over union (t-IoU) thresholds on the three benchmark datasets; a higher mAP indicates better WTAL performance. For a fair comparison, we use the benchmark code provided by ActivityNet (Heilbron et al., 2015) to calculate the results. Specifically, the t-IoU thresholds for THUMOS14 are set to 0.1:0.1:0.7, while those for ActivityNet v1.2 and v1.3 are set to 0.5:0.05:0.95, and we also report the average mAP over these thresholds.
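For reference, the t-IoU criterion underlying the mAP computation can be expressed as in the short sketch below; in practice, the reported numbers come from the official ActivityNet benchmark code rather than this simplified example.

```python
def t_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A detection counts as a true positive at threshold 0.5 if it matches a
# previously unmatched ground-truth segment with t-IoU >= 0.5.
print(t_iou((12.0, 18.0), (13.0, 20.0)))   # 0.625 -> true positive at 0.5
```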
Comparison With State-of-the-Art Methods
To verify the performance improvement brought by the proposed CLASNet over different methods, we conduct comparative experiments on the THUMOS14, ActivityNet v1.2, and ActivityNet v1.3 datasets, selecting three recent models, CO2-Net, DELU, and ACRNet, as baselines for retraining. The experimental results are shown in Tables 1 to 3.
Performance Comparisons on THUMOS14 Testing Set.
Note. AVG is the average mAP under the thresholds 0.1:0.1:0.7; "–" means that the corresponding results are not reported in the original papers; * indicates results obtained by retraining the official code on our own machine. Bold values indicate the results we wish to emphasize.
Performance Comparisons on ActivityNet v1.2 Validation Set.
Note. AVG is the average mAP under the thresholds 0.5:0.05:0.95.
Performance Comparisons on ActivityNet v1.3 Validation Set.
Note. AVG is the average mAP under the thresholds 0.5:0.05:0.95.
The experimental results on the THUMOS14 dataset (shown in Table 1) demonstrate a significant enhancement in the performance of the three localization networks under various IoU thresholds when CLASNet is introduced. Notably, when the IoU threshold is set at 0.5, CO2-Net, DELU, and ACRNet achieve absolute mAP improvements of 1.6%, 1.0%, and 2.5%, respectively. This enhancement can primarily be attributed to the ability of CLASNet to capture comprehensive temporal information of action instances through dilated convolution, generate reliable attention weights, enhance the identifiability of action features through contrast separation, and ultimately achieve precise action localization via boundary refinement. The improvement on DELU is smaller because DELU already enhances the identifiability of segment features through background modeling; nevertheless, incorporating our proposed BRM into the DELU architecture enables it to capture more detailed boundary information and improves localization accuracy. These results validate the efficacy of leveraging action–background separation in a weakly supervised setting, yielding WTAL localization performance comparable to that of certain fully supervised approaches.
The experimental results on the ActivityNet v1.2 and v1.3 datasets (shown in Tables 2 and 3, respectively) are consistent with the observations on THUMOS14. CLASNet enhances the existing WTAL frameworks under all IoU thresholds, with a particularly notable improvement on CO2-Net. For example, on the ActivityNet v1.2 dataset with an IoU threshold of 0.5, CO2-Net, DELU, and ACRNet improve the mAP by 1.2%, 0.9%, and 0.3%, respectively, and on the ActivityNet v1.3 dataset, ACRNet improves the mAP by 1.3%. This result mainly benefits from CLASNet improving the discriminability of action features through contrastive separation, effectively overcoming the ambiguity of action segments under weakly supervised information and achieving precise action localization.
To evaluate the improvement of CLASNet over the baseline models more comprehensively, we perform a visual analysis of the experimental results (shown in Figure 3). As depicted in Figure 3, CLASNet yields performance gains across models and datasets. The improvement on the THUMOS14 dataset exceeds that on the ActivityNet v1.2 dataset, which may be due to the longer videos and the more diverse, less distinguishable categories in ActivityNet v1.2. Overall, the experimental results show that embedding CLASNet into different WTAL methods effectively improves action localization and achieves more accurate localization performance, further proving the superiority of the proposed network.

Performance comparison of different models under different datasets.
Ablation Studies
To verify the effectiveness of each component, we choose ACRNet (Ren et al., 2023) as the baseline and conduct a series of ablation studies; ACRNet performs the best among the three baseline models owing to its flexibility and effectiveness. Following previous works (He et al., 2022; Ren et al., 2022; Tang et al., 2023), all ablation experiments are conducted on the THUMOS14 test set. The ablation study of the CLASNet loss function (shown in Table 4) explores the contribution of each component to the final loss function. In all settings, the baseline model loss is kept unchanged, since it is the objective function of the baseline model. The specific experimental settings are as follows: (A) the baseline model; (B) sparsifying the attention weights based on the baseline model; (C) adding only the feature contrastive separation module to the baseline model; (D) adding only the BRM to the baseline model; (E) adding the feature contrastive separation module based on (B); (F) adding the BRM based on (B); (G) adding the BRM based on (C); and (H) sparsifying the attention weights based on (G).
Ablation of Different Components of the Final Loss Function on the THUMOS14 Test Set.
Note. AVG = average.
By comparing (A) with (B), (C) with (E), and (G) with (H), we observe that sparsifying the attention scores leads to a slight improvement in mAP, which can be attributed to the elimination of non-critical attention through sparsification. Adding only the FCSM increases the mAP to 59.0%, indicating its effectiveness in segregating action features from background features via contrastive learning, thereby improving feature identifiability. Similarly, adding only the BRM increases the mAP to 58.9%, suggesting that expanding the similarity between the TCAS and the attention scores corrects the temporal boundaries and benefits performance. Combining the FCSM and BRM further improves the mAP to 60.1%, highlighting the significant gain in action localization accuracy achieved by effectively integrating these two modules. Setting (H), which integrates all modules of CLASNet, achieves the best performance, underscoring their collective contribution toward more comprehensive action localization.
Furthermore, we validate the effectiveness of the components in the FCSM. The module encompasses two parts: an attention weight generation mechanism and a contrast separation part. Multilayer dilated convolution and MHA are essential for generating the attention weights. First, we enlarge the receptive field using multilayer dilated convolution, which captures the temporal dependency among segments; then, MHA models the intermodality relationships to generate the attention weights. However, an excessively large receptive field may capture irrelevant segments, while increasing the number of attention heads reduces the feature dimension of each head and can impair its representation ability. We therefore conduct ablation studies by varying the numbers of dilated convolution layers and attention heads.
We compare the performance under different settings of the number of dilated convolution layers and the number of attention heads, with the results presented in the figure below.

Experimental results under different numbers of dilated convolution layers and attention heads.
It is worth mentioning that after embedding CLASNet into ACRNet, the number of parameters increases from 22.33 million to 33.24 million. In addition, we report the number of parameters and floating-point operations for the other two baseline models, CO2-Net and DELU (shown in Table 5). Although CLASNet increases model complexity, the overhead is acceptable considering its flexibility and the significant performance improvement it brings.
The Number of Parameters and Floating-Point Operations for the Baseline Models.
Note. AVG = average; FLOPs = floating-point operations. * indicates results obtained by retraining the official code on our own machine.
To visually demonstrate the efficacy of CLASNet in enhancing action localization, we present a qualitative visual analysis of the detected action segments in Figure 5. Figure 5 shows video clips along with the ground truth (GT) actions, the CAS and localization results produced by the baseline model ACRNet (Baseline), and the CAS and localization results obtained from CLASNet+ACRNet (Ours). Among these, GT, ACRNet (Baseline), and CLASNet+ACRNet (Ours) are depicted in blue, brown, and purple, respectively. In addition, intricate cases that ACRNet fails to detect but our proposed model successfully localizes are highlighted with red boxes.

Qualitative visualization of a video example on the THUMOS14 dataset.
The comparative visual analysis makes it evident that our proposed model markedly improves the localization of ambiguous segments. ACRNet initially misidentifies three regions, marked by red boxes, all of which are accurately corrected with the aid of CLASNet. For instance, the correct interpretation of the first box is that two teams sequentially take part in a volleyball game, yet ACRNet erroneously treats these two segments as one continuous action; with the CLASNet introduced in this study, the two segments are correctly identified and distinguished. In the region marked by the second red box, ACRNet incorrectly identifies a segment devoid of action as containing an action due to complex background interference, whereas CLASNet produces a prediction consistent with the GT. In the third red box, ACRNet erroneously partitions a continuous volleyball-hitting action into two segments, leading to inaccurate truncation, while CLASNet successfully identifies this continuous action despite the intricate background interference. Furthermore, across all localization results, it is clear that our proposed model can precisely locate easily recognizable segments and accurately discern ambiguous ones. This indicates the effective segregation of action features from non-action features and the refinement of localization boundaries, improving both classification accuracy and localization precision.
Conclusion
The weakly supervised temporal action localization technique, which relies solely on video-level annotations, holds considerable scientific and practical value for enhancing the efficiency and precision of video analysis. We propose CLASNet to address the localization of ambiguous segments by effectively distinguishing features through contrastive separation, thereby improving localization accuracy. First, CLASNet resolves the interference of non-action features on localization by employing a feature contrast separation module that effectively distinguishes action features from non-action ones. Second, it addresses the noise problem in contrast separation and enhances contrast accuracy by capturing precise fine-grained boundary information through a BRM. Extensive experiments on the THUMOS14 and ActivityNet datasets show that CLASNet outperforms both the baseline methods and state-of-the-art approaches on weakly supervised temporal action localization tasks. This plug-and-play module is applicable to WTAL tasks and shows promise for broader domains, including fully supervised temporal action localization and action recognition.
Acknowledgments
The authors would like to thank the anonymous reviewers for their helpful comments and suggestions.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
