Abstract
With the significant success of transformers in natural language processing, an increasing number of researchers are introducing them into computer vision, particularly for action recognition. As a crucial task in video understanding, action recognition has important applications in live broadcasting, autonomous driving, and medical diagnostics. The attention mechanisms in transformers mimic human visual attention allocation, enhancing the processing and comprehension of long video sequences. However, they often overlook the aggregation of multiscale detail features and the hierarchical representation of early visual information. In addition, attention networks are computationally intensive and parameter-heavy, which complicates training, extends inference time, and renders them unsuitable for real-time applications. To address these challenges, we propose a lightweight multiscale action recognition model based on a convolutional enhancement block (ConvEB) and a multiscale average pooling encoder. The ConvEB establishes long-range dependencies among multiscale local features in the early stages of the network, providing effective inductive biases for the attention network and compensating for the loss of detailed information. Moreover, we introduce a parallel pooling mixer to replace the original attention mixer, keeping the model lightweight while maintaining recognition accuracy. Finally, we deploy this model in a virtual panoramic live broadcasting system. Experimental results demonstrate that our action recognition algorithm achieves competitive performance, and the constructed panoramic system largely meets the needs of daily live broadcasting.
Introduction
The rapid growth of internet multimedia technologies and the explosive popularity of short video platforms have led to an unprecedented surge in video content. Analyzing and understanding such vast video streams has become a critical challenge in computer vision. Among these tasks, action recognition—the process of enabling machines to interpret human behaviors in videos through spatiotemporal feature extraction and classification—plays a foundational role. Its applications span intelligent surveillance, human–computer interaction, and immersive entertainment systems (Figure 1). However, action recognition faces inherent challenges: actions vary widely in duration, exhibit complex spatiotemporal dynamics, and often occur against cluttered backgrounds, demanding robust and efficient models for accurate analysis.

Action recognition technology application scenarios.
Early approaches relied on handcrafted features (e.g., Histogram of Oriented Gradients and Scale-Invariant Feature Transform) combined with traditional machine learning methods. While effective in constrained scenarios, these methods struggled with scalability and failed to capture the temporal evolution of actions in complex real-world videos. The advent of deep learning, particularly convolutional neural networks (CNNs), revolutionized the field by enabling automatic hierarchical feature extraction. CNNs excel at capturing local spatial patterns, yet their limited receptive fields hinder the modeling of long-range dependencies and multiscale contextual relationships—critical for distinguishing fine-grained actions (e.g., walking vs. jogging).
Transformers, renowned for their global self-attention mechanisms, emerged as a powerful alternative. By dynamically focusing on critical spatiotemporal regions, transformers achieve superior sequence modeling, as evidenced by their success in natural language processing and image recognition (Vaswani et al., 2017). However, their direct application to video action recognition faces two key limitations: (a) the quadratic computational complexity of self-attention (relative to input length) imposes prohibitive costs for long video sequences (Vaswani et al., 2017) and (b) excessive emphasis on global interactions often neglects subtle local details (e.g., limb movements), degrading performance in fine-grained recognition tasks (Yang et al., 2021). These issues are exacerbated in resource-constrained scenarios such as real-time live broadcasting, where latency and model size are critical.
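For intuition, a standard back-of-the-envelope estimate (ours, not a figure from the cited works) contrasts the per-layer cost of global self-attention over N tokens of dimension d with that of a simple k × k pooling-based token mixer of the kind adopted later in this paper:

```latex
% Approximate per-layer token-mixing cost (rough estimate; constants omitted)
\underbrace{\mathcal{O}\!\left(N^{2} d + N d^{2}\right)}_{\text{global self-attention}}
\qquad \text{vs.} \qquad
\underbrace{\mathcal{O}\!\left(N k^{2} d\right)}_{k \times k \ \text{average pooling mixer}}
```

Because k is a small constant, pooling-based mixing grows linearly rather than quadratically with sequence length, which is the efficiency gap this paper exploits.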
To address these challenges, we propose ConvEB-MAPFormer, a lightweight architecture that synergizes convolutional enhancement block (ConvEB) and multiscale average pooling encoder (MAPFormer). Our key innovations are:
(1) We propose a ConvEB that leverages the strong local spatial feature extraction of CNNs to supply effective local detail information to the subsequent attention network. (2) We improve the MAPFormer, reducing the parameter count and making the model lighter while mitigating the accuracy loss this reduction would otherwise cause. (3) We apply the complete action recognition model to a virtual panoramic live broadcasting system, demonstrating its feasibility and effectiveness in a real-world application.
In recent years, emerging action recognition methods can be broadly categorized into CNN-based and transformer-based approaches, depending on how they handle features.
CNN-Based Action Recognition Methods
Two-dimensional (2D) CNN
Early 2D CNN-based action recognition methods typically employed a two-stream architecture. Simonyan and Zisserman (2014) introduced a revolutionary approach with the two-stream network, which models spatial and temporal relationships by processing single-frame Red–Green–Blue images and multiframe dense optical flow fields separately. However, the late fusion strategy used in this method may discard critical information. Feichtenhofer et al. (2016) proposed various convolutional feature fusion strategies, shifting the fusion point from the SoftMax layer to the convolutional layers so that features at the same location across the two streams are matched, effectively reducing the parameter count. Subsequently, to deepen the network, Feichtenhofer et al. (2017) replaced the backbone of the two-stream network with residual networks (He et al., 2016) and trained the model end-to-end, achieving excellent results in short video recognition tasks. Liu et al. (2025) proposed active redundancy reduction for static images, but their method lacks temporal modeling for video.
Despite these advancements, 2D CNN models fail to effectively model temporal information and rely heavily on optical flow, which incurs significant computational costs, limiting their further development in action recognition.
Three-Dimensional (3D) CNN
Unlike 2D CNNs, 3D CNN-based action recognition methods add a depth dimension to the input, resulting in an additional dimension for the convolution kernels and output feature maps. Carreira and Zisserman (2017) inflated 2D convolution kernels into the temporal dimension, forming the "Inflated" 3D (I3D) structure, which demonstrated excellent performance across multiple action recognition datasets. X3D (Feichtenhofer, 2020) expanded a 2D architecture along several dimensions (temporal duration, frame rate, spatial resolution, width, etc.) to handle the different dimensions of 3D spatiotemporal data. In contrast, R(2+1)D (Tran et al., 2018) decomposed 3D convolutions into 2D convolutions for spatial information and one-dimensional convolutions for temporal information, aiming to reduce model complexity while retaining spatiotemporal feature capture. The separable 3D convolutional network (Xie et al., 2018) further reduced the parameter count and improved performance by learning spatial and temporal features separately with a similar factorization. Manakitsa et al. (2024) systematically reviewed deep learning approaches, identifying temporal occlusion handling as a critical gap despite advances in transformer architectures. While their survey highlights 3D CNNs for global temporal dependencies, these methods struggle with localized actor-specific occlusions.
Although 3D CNNs enhance recognition performance at the cost of increased computational resources and parameters, they pose a challenge to model generalization and fail to improve the capture of global contextual information.
Transformer-Based Action Recognition Methods
Compared to CNNs, transformer-based action recognition networks exhibit significant advantages in handling video sequences and capturing long-term dependencies. Girdhar et al. (2019) first introduced the action transformer model for video action recognition, which autonomously learns semantic contextual information from human behaviors. Subsequently, Dosovitskiy et al. (2020) proposed the vision transformer (ViT), which recast image classification as a sequence modeling problem by dividing each image into multiple fixed-size patches that are treated as tokens.
However, transformer networks contain a large number of parameters, necessitating substantial computational resources for training and inference. Recent studies (Yang et al., 2021) further indicate that while transformers excel at modeling global dependencies, their self-attention mechanisms tend to underemphasize fine-grained local features, such as subtle limb movements or transient spatial patterns, which are critical for distinguishing similar action categories (e.g., walking vs. jogging). This limitation stems from the dominance of global token interactions over localized spatial modeling. Nevertheless, transformer-based action recognition remains an active research area, and model efficiency and recognition performance are expected to keep improving through refinements of the transformer architecture and related techniques.
Action Recognition Algorithm
In this section, we present a lightweight multiscale action recognition model based on ConvEB and MAPFormer. The overall framework is illustrated in Figure 2. ConvEB is utilized early in the network to capture rich multiscale local features across channels, providing effective inductive biases and more expressive feature representations for the subsequent MAPFormer. The parallel pooling mixer with multiscale pooling receptive fields captures global contextual information at different spatial scales, aggregating multiscale spatial features from different channels. Experimental results show that this model achieves higher accuracy and fewer parameters compared to several baseline models. This paper presents two core innovations: (1) ConvEB module incorporating pyramid squeeze attention (PSA) to enhance multiscale local features and (2) lightweight MAPFormer to replace the attention mechanism with parallel pooling to significantly reduce the computational overhead.

Action recognition model framework based on ConvEB-MAPFormer. Note. ConvEB = convolutional enhancement block; MAPFormer = multiscale average pooling encoder.
The proposed ConvEB is primarily based on the mobile inverted bottleneck convolution (MBConv) structure introduced by Sandler et al. (2018). As shown in Figure 3(a), an MBConv block typically consists of a 1 × 1 expansion convolution, a depthwise convolution, a squeeze-and-excitation (SE) module, and a 1 × 1 projection convolution, connected by a residual shortcut.

Basic unit of convolutional enhancement block (ConvEB).
However, understanding and analyzing complex, subtle action categories is also crucial in action recognition tasks. For example, running and walking may appear similar but differ subtly in leg and arm movements and duration. Recognizing and classifying these actions typically involves integrating highly abstract and multidimensional feature information. The SE module in the Fused-MBConv structure uses global average pooling to squeeze spatial dimension information, which may result in the loss of spatial structural information crucial for recognition tasks. Additionally, the SE module focuses on single-channel feature reshaping, limiting its ability to learn complex spatial patterns. To address this, we replace the SE module in the Fused-MBConv with pyramid squeeze attention (Zhang et al., 2022), proposing the pyramid squeeze fused-MBConv (PSF-MBConv) structure as shown in Figure 3(c). PSA is a lightweight channel attention mechanism that dynamically adjusts the importance of features across channels and spatial positions, enhancing the model’s adaptability and generalization in handling action diversity and temporal scale variations. PSA can significantly improve the model’s ability to recognize and express key features. Its lightweight design effectively mitigates the challenge faced when applying attention mechanisms in resource-constrained environments, making it an optimal choice for integrating multiscale spatial information across channels.
PSA dynamically enhances critical spatial regions by adaptively weighting multiscale channel features. Specifically, it employs grouped convolutions with varying kernel sizes to extract multiscale local patterns (see Appendix A for implementation details). The channel-wise attention weights are then recalibrated via Softmax-based fusion, enabling the model to focus on discriminative regions while suppressing redundant information.
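To make the PSF-MBConv layout concrete, the following is a minimal PyTorch-style sketch of the block as we read Figure 3(c); the expansion ratio, activation, and the `PyramidSqueezeAttention` interface are illustrative assumptions rather than the authors' released implementation (a matching PSA sketch is given in Appendix A):

```python
import torch
import torch.nn as nn

class PSFMBConv(nn.Module):
    """Sketch of a PSF-MBConv block: a Fused-MBConv whose SE module is replaced
    by pyramid squeeze attention (PSA). Hyperparameters are illustrative."""

    def __init__(self, in_ch, out_ch, expansion=4, stride=1, psa_module=None):
        super().__init__()
        mid_ch = in_ch * expansion
        # Fused-MBConv: a single 3x3 convolution replaces the 1x1 expansion + depthwise pair
        self.fused_conv = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU(inplace=True),
        )
        # PSA replaces the usual squeeze-and-excitation block (see Appendix A)
        self.psa = psa_module if psa_module is not None else nn.Identity()
        # 1x1 projection back to the output width
        self.project = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.use_residual = (stride == 1 and in_ch == out_ch)

    def forward(self, x):
        out = self.fused_conv(x)
        out = self.psa(out)
        out = self.project(out)
        if self.use_residual:
            out = out + x  # residual shortcut, as in MBConv-style blocks
        return out
```

With `expansion=4`, the PSA module should be built for the expanded width (e.g., a PSA over 256 channels when `in_ch=64`); passing `psa_module=None` degenerates the block to a plain Fused-MBConv, which is convenient for ablations.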
In recent years, ViTs have shown impressive performance in video action recognition tasks, leading to the development of various ViT-based models for this field. A traditional transformer encoder comprises two components: the attention module, which mixes information across tokens, and the feedforward network, which includes a channel multilayer perceptron (MLP) for learning complex and abstract feature representations. Both components integrate residual connections and normalization. To determine which part of the transformer encoder contributes more to the model, researchers (Tolstikhin et al., 2021) have conducted extensive experiments. It was ultimately shown that ViT's strong performance is attributable to the general MetaFormer architecture (Yu et al., 2023), illustrated in Figure 4(a), rather than to any specific token mixer. Consequently, researchers (Liu et al., 2022; Yu et al., 2022) have begun replacing the traditional attention mixer with various token mixers. For instance, Guibas et al. (2021) used an adaptive Fourier neural operator as the mixer and achieved high recognition accuracy, and Lee-Thorp et al. (2021) replaced self-attention with unparameterized Fourier transforms. To reduce computational complexity and parameter count, thereby lightening the model and easing the memory burden during training, Yu et al. (2022) replaced the token mixer in MetaFormer with a parameter-free pooling operation, resulting in PoolFormer, as illustrated in Figure 4(b). Applying PoolFormer to video action recognition enhances computational efficiency and reduces parameters. However, related research (Tan et al., 2021) indicates that using a single max-pooling or average-pooling layer in the pooling mixer for dimensionality reduction can lose critical local spatial detail, blur spatial relationships, and weaken multiscale spatial feature representations. These properties are crucial for capturing dynamic action changes and modeling long-term dependencies in sequential features.

Basic composition of multiscale average pooling encoder (MAPFormer).
To address this issue, we propose replacing the attention mixer with a parallel pooling mixer, which substitutes the simple average pooling mixer in PoolFormer with a parallel combination of average pooling layers of different kernel sizes. Combined with the convolutional-batch normalization (Conv-BN) fully connected layer structure, we introduce a novel encoder structure, MAPFormer. As shown in Figure 4(c), MAPFormer consists of two parts: the parallel pooling mixer residual block and the Conv-BN residual block. The parallel pooling technique used in the parallel pooling mixer is a common feature processing method in image processing and action recognition tasks. By applying multiple pooling layers with different kernel sizes in parallel, the features of the input data at different scales are captured, thus aggregating different types of spatial feature information. This approach enables the model to comprehensively understand the input data from multiple perspectives, enhancing feature extraction quality and capturing richer fine-grained features.
Research (Doğan, 2023) indicates that average pooling layers generally outperform max-pooling layers in image classification tasks. Therefore, the proposed parallel pooling mixer primarily comprises average pooling layers of different kernel sizes, each of which pools the input sequence with its own pooling window. After concatenation and dimensionality reduction, this yields feature information at different levels. Specifically, for an input sequence
The structure of the parallel pooling mixer, shown in Figure 4(d), includes three parallel average pooling layers. For an input sequence
By merging the pooling results across multiple layers, the weak extraction of image information by any single pooling layer can be effectively compensated for, allowing the model to generalize across different data and scenes. This is especially important for understanding subtle actions and complex scenes. Additionally, a recent study (Li et al., 2022) suggests that layer normalization (LN) or group normalization (GN) followed by linear operations in MLPs requires computing statistics of the current data, increasing parameter count and latency. To further lighten the model, we replace the standard fully connected layer and LN (or GN) with a Conv-BN structure, that is, 1 × 1 convolutions followed by batch normalization, whose statistics can be folded into the convolution weights at inference time.
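For concreteness, a minimal sketch of the MAPFormer block as described above follows; the (3, 7, 11) kernel sizes are taken from the ablation study reported later, while the fusion by 1 × 1 convolution, the expansion ratio, and the use of batch normalization in place of LN are our assumptions about unstated details:

```python
import torch
import torch.nn as nn

class ParallelPoolingMixer(nn.Module):
    """Token mixer built from parallel average pooling branches of different kernel
    sizes; the pooled maps are concatenated and reduced back to the original channel
    width with a 1x1 convolution (the fusion choice is illustrative)."""

    def __init__(self, dim, pool_sizes=(3, 7, 11)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.AvgPool2d(k, stride=1, padding=k // 2, count_include_pad=False)
            for k in pool_sizes
        )
        self.fuse = nn.Conv2d(dim * len(pool_sizes), dim, kernel_size=1, bias=False)

    def forward(self, x):
        mixed = self.fuse(torch.cat([pool(x) for pool in self.pools], dim=1))
        return mixed - x  # PoolFormer-style: the mixer models only the residual context


class MAPFormerBlock(nn.Module):
    """MAPFormer block: a parallel pooling mixer residual branch followed by a
    Conv-BN channel MLP residual branch (replacing the LayerNorm + Linear MLP)."""

    def __init__(self, dim, mlp_ratio=4, pool_sizes=(3, 7, 11)):
        super().__init__()
        hidden = dim * mlp_ratio
        self.norm1 = nn.BatchNorm2d(dim)
        self.mixer = ParallelPoolingMixer(dim, pool_sizes)
        self.norm2 = nn.BatchNorm2d(dim)
        self.channel_mlp = nn.Sequential(          # Conv-BN "fully connected" layers
            nn.Conv2d(dim, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))        # parallel pooling mixer residual block
        x = x + self.channel_mlp(self.norm2(x))  # Conv-BN residual block
        return x
```

Subtracting the input inside the mixer follows the PoolFormer convention of letting the pooling branch contribute only contextual information on top of the identity path.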
To validate the performance of the action recognition model proposed in this paper, we applied it to a panoramic live broadcasting system. By recognizing human actions in panoramic videos across different scenarios, the system enhances the smoothness and intuitiveness of user interactions. Furthermore, we develop a virtual panoramic live broadcasting system based on action and pose recognition, utilizing Unity3D to synchronize human actions with virtual animation model poses in panoramic broadcasting. Finally, we conduct experiments to verify the reliability and stability of this system.
System Composition
Collector
The system employs eight iZugar MKX22 fisheye cameras (Blackmagic Micro Studio 8K) for panoramic video capture. Each camera features a 220-degree ultra-wide-angle lens with an f/2.5 aperture, ensuring clear image capture even in low-light conditions. The cameras are arranged in four groups of two, forming binocular camera groups that together achieve a 360-degree view. To achieve real-time panoramic video stitching and smooth streaming, a virtual reality (VR) live broadcasting tool reads the real-time footage from the binocular camera groups (see Figure 5(a)). Precise stitching is performed using PTGui to create a complete image template (see Figure 5(b)). Finally, remapping technology is used to output the panoramic video, providing a seamless immersive viewing experience.

Schematic diagram of panoramic video generation.

Virtual panoramic system based on action and gesture recognition.
The system uses the H.266/VVC encoding standard to compress the panoramic video and then streams it to a cloud server. Compared to H.265/HEVC, H.266/VVC offers a higher compression rate, reducing the transmitted data volume by roughly 50% and occupying less bandwidth and storage space, ensuring a smooth high-definition visual experience. During transmission, the system uses the real-time messaging protocol (RTMP) to package and push the video data over a 5G network to a Simple Realtime Server (SRS) streaming server for forwarding. RTMP offers high compatibility, stability, and low latency, supports various video encoding formats, and allows dynamic adjustment of video stream quality, ensuring an excellent viewing experience.
Receiver
The receiver uses PotPlayer and an HTC Vive headset to create an efficient immersive viewing environment. PotPlayer decodes the encoded video stream pulled from the cloud server and restores it to clear image frames, supporting 8K video decoding to preserve image clarity. After decoding, PotPlayer generates left-eye and right-eye images to simulate human binocular parallax (see Figure 5(c)), rendering panoramic images through depth estimation and spherical projection. The HTC Vive provides a high-resolution display and precise head tracking, so the viewpoint follows the user's gaze, achieving a 360-degree panoramic view. Additionally, the HTC Vive supports controllers and other input devices, enhancing immersion and interactivity.
Algorithm Deployment
To validate the practical application of the proposed action recognition model, this section builds a virtual panoramic live broadcasting system on top of the panoramic system described in Section 4.1. The framework is shown in Figure 6. Before the live broadcast, the system uses the Mixamo tool to generate virtual 3D animation models with various common action types and configures an Animator Controller to define the transition logic between animation states. During the live broadcast, a vision-based deep learning model and camera position sensors capture the broadcaster's action categories and body pose data and synchronize them with Unity3D. A C# script receives and parses these data, dynamically setting parameters of the Animator Controller to trigger the corresponding animations, achieving synchronization between the broadcaster's actions and the virtual animation model. After testing in Unity3D, the built-in virtual camera renders the virtual animation model in real time and outputs it to the live broadcasting interface. Users can experience the virtual panoramic live broadcast immersively through VR devices or players. During action and pose capture, the proposed action recognition model extracts the broadcaster's action features and determines the action commands against a preset action category library, achieving more accurate synchronization. The visual output is shown in Figure 7. Additionally, camera and pose recognition sensors acquire the broadcaster's pose and position data, which are synchronously mapped to the virtual character animation controllers in Unity3D for temporal and spatial calibration and adjustment.

Visualization of action recognition results in panoramic video.
To synchronize the broadcaster's actions with the virtual character animations, Unity3D is used. Precreated 3D character models with common action categories are imported into Unity3D. For each model, an Animator Controller is created, with a state for each action and an assigned animation clip. A C# script receives and parses the action categories and pose position data mapped into Unity3D, dynamically setting parameters (e.g., booleans and triggers) in the Animator Controller to trigger the corresponding animation states and drive the state transitions. The specific process is shown in Figure 8.

Unity realizes virtual character model action synchronization operation flow.
HMDB51 and UCF101 are classic benchmark datasets in the field of action recognition, covering diverse action categories across indoor and outdoor scenarios, and are suitable for validating the model's robustness to complex backgrounds and varying lighting conditions. Kinetics-400 contains large-scale long video sequences with rich spatiotemporal dynamics, which effectively evaluates the model's capability for multiscale feature modeling. This paper adheres strictly to the official training and test set splits of the HMDB51, UCF101, and Kinetics-400 datasets to verify the model's training effectiveness. We adopt a sparse sampling strategy to process video clips, evenly dividing the input video into a fixed number of equal-length segments and sampling one frame from each segment to form the network input.
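A sparse sampling routine along these lines can be sketched as follows; `num_segments` is a hypothetical parameter standing in for the paper's (unstated here) segment count, and the random-offset/centre-frame choice reflects the usual train/test convention rather than a detail taken from the paper:

```python
import numpy as np

def sparse_sample_indices(num_frames: int, num_segments: int, training: bool = True) -> np.ndarray:
    """Sparse sampling sketch: split the video into equal-length segments and take
    one frame per segment (random offset when training, segment centre at test time)."""
    seg_len = num_frames / num_segments
    if training:
        offsets = np.random.randint(0, max(int(seg_len), 1), size=num_segments)
    else:
        offsets = np.full(num_segments, int(seg_len) // 2)
    starts = (np.arange(num_segments) * seg_len).astype(int)
    return np.minimum(starts + offsets, num_frames - 1)

# Example: pick 8 frames from a 150-frame clip for evaluation
indices = sparse_sample_indices(150, 8, training=False)
```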
During the training phase, dropout is set to 0.9, the batch size is 64, and the frame size is
Action Recognition Performance
This section presents extensive experiments to evaluate the proposed method on relevant datasets, comparing it with several state-of-the-art methods. The results are shown in Tables 1 and 2. Our action recognition model achieves recognition accuracies of 74.9% on the HMDB51 dataset and 95.8% on the UCF101 dataset, which are improvements of 1.4% and 1.2%, respectively, over the PoolFormer-S24 using a single average pooling mixer. ConvEB enriches low-level details through multiscale local feature enhancement, while MAPFormer’s parallel pooling strategy efficiently aggregates global context. Their synergy achieves complementary local–global feature integration. Ablation studies (last two rows of Table 1) demonstrate that removing ConvEB alone leads to a 1% decrease in accuracy for HMDB51 and a 0.9% decrease for UCF101, thus validating the tight coupling between these two modules. Additionally, the proposed method achieves Top-1 and Top-5 accuracies of 83.8% and 95.9% on the Kinetics-400 dataset, respectively, surpassing CETNet-S, which uses convolutional enhancement and multihead attention network, by 0.4% and 0.3%, and multimodal video transformer (MM-ViT), which employs cross-attention mixer, by 0.3% and 0.8%. Although their recognition results are close to our method, their computational cost is 158% and 385% higher, and their parameter counts are 63% and 319% higher, respectively. The experimental results demonstrate that our model not only outperforms existing methods in terms of recognition accuracy, but also effectively reduces computational and parameter costs, shortening training times and improving model generalization.
Comparison of Accuracy on HMDB51 and UCF101.
Note. MM-ViT=multimodal video transformer; I3D = Inflated three-dimensional; MAPFormer = multiscale average pooling encoder; ConvEB = convolutional enhancement block; STA-CNN = spatial–temporal attention convolutional neural network.
Comparison of Accuracy, Computation and Parameter Count on Kinetics-400.
Note. FLOPs = floating point operations; Swin-T = Swin transformer; MM-ViT = multimodal video transformer; Param = parameter.
To systematically evaluate the contributions and optimal configurations of various modules in the proposed action recognition model based on ConvEB-MAPFormer, this section conducts ablation studies in several aspects. First, to explore the generalization of the proposed ConvEB when combined with other types of attention backbone networks for action recognition tasks, we combine the ConvEB with several advanced attention networks and conduct comparative experiments on the Kinetics-400 dataset under the same experimental settings. The results are shown in Table 3.
Comparison of Accuracy of ConvEB in Combination With Other Different Models.
Note. ConvEB = convolutional enhancement block; Param = parameter; Swin-T = Swin transformer.
The table shows that when combined with ConvEB to form hybrid models, operations such as patch embedding and positional embedding in the original models are replaced by convolutional embedding, resulting in a slight reduction in the number of model parameters. In terms of recognition accuracy, ConvEB effectively extracts multiscale local features and preserves spatial information. When combined with ConvEB, the accuracy of Swin-T increases by 0.9%, MViT-S by 0.8%, and PVT-S by 1.2%, verifying the effectiveness and generalizability of the ConvEB.
Furthermore, this section compares configurations of the MAPFormer. Several combinations of average pooling layers, varying in number and kernel size, are used to build the parallel pooling layer. The results are shown in Figure 9; the parallel pooling mixer used in this paper adopts the (3, 7, 11) kernel-size combination. The choice of pooling kernel sizes is determined empirically: smaller kernels (e.g., 3) preserve fine-grained local detail, while larger kernels (e.g., 11) aggregate broader contextual information, and the intermediate size balances the two.

Accuracy variation curves for different combinations of pooling sizes.
Performance Comparison of Different Perceptual Layer Structures of MAPFormer.
Note. MAPFormer = multiscale average pooling encoder; Param = parameter; MLP = multilayer perceptron; Conv-BN = convolutional-batch normalization.
To test the accuracy of the proposed action recognition model in the panoramic live broadcasting system, we record numerous panoramic videos in various scenarios, including a library interior, an indoor laboratory, and an outdoor campus environment. In each scenario, 10 different individuals perform 20 common action categories (e.g., waving, drinking, touching the nose, standing up, squatting, walking, and jumping). We verify the video quality and that each clip contains the corresponding action category; each video segment is 15 s long. For system performance testing, we use a sparse sampling strategy and apply the trained ConvEB-MAPFormer model to recognize actions in the panoramic videos. The recognition accuracies in different scenarios are shown in Table 5.
Recognition Accuracy of ConvEB-MAPFormer Model in Panoramic System.
Note. ConvEB = convolutional enhancement block; MAPFormer = multiscale average pooling encoder.
The table shows that in the relatively simple indoor laboratory scene, the recognition accuracy reaches 93%. In the library, also an indoor environment but with a more complex background and lighting, the accuracy drops to 91.5%. In the outdoor campus environment, affected by outdoor lighting, background, and meteorological conditions, the recognition rate is lower. However, the overall accuracy remains above 90%.
Additionally, this section conducts an action recognition experiment in virtual panoramic live broadcasting within a laboratory scene. According to Section 4.2, the receiver of the panoramic video achieves synchronization between live actions and virtual character animation model postures. As shown in Figure 10, a scene from the virtual panoramic live broadcasting captured from the PotPlayer window shows that when the broadcaster makes a hand-raising gesture, the 3D virtual character animation model also synchronizes with the hand-raising gesture. Finally, this section tests the system latency before and after applying the action recognition algorithm. During live broadcasting in the laboratory scene, the real-time performance of the system is obtained by calculating the time difference between the collector and receiver. Specifically, the latency test is conducted in five rounds, with 10 sets of latency data collected at different times in each round, and the average value of each round is obtained, as shown in Table 6.

Virtual panoramic live broadcasting screen.
Latency Test Results of the Live Broadcasting System.
The table indicates that the average latency of the system without the action recognition algorithm is 3.2 s, while the average latency with the ConvEB-MAPFormer model is 4.1 s. This demonstrates that applying the action recognition algorithm in panoramic live broadcasting introduces only a slight additional delay, which is generally acceptable for daily live broadcasting needs. The main sources of system latency are as follows: limited hardware resources in the laboratory, causing delays during video collection, panoramic video stitching, and action recognition; limited network bandwidth in the laboratory, resulting in slower video transmission at the cloud forwarder; and the time consumed by real-time rendering of the panoramic video.
While the proposed model achieves competitive performance, several limitations remain. First, the recognition accuracy in outdoor complex scenarios (e.g., dynamic lighting changes, occlusions, or crowded backgrounds) still has room for improvement. Second, the system latency is constrained by hardware computational power and network bandwidth. Future work could explore model quantization or knowledge distillation techniques to further lightweight the architecture while maintaining accuracy. Additionally, integrating multimodal signals (e.g., audio–visual joint modeling) may enhance action understanding in scenarios where visual cues are ambiguous.
Conclusion
In this paper, we proposed a lightweight multiscale action recognition model based on ConvEB and MAPFormer, and deployed it in practical application. The ConvEB incorporates PSA into the Fused-MBConv structure to form PSF-MBConv, which serves as the building block of the network. This module aims to establish long-range dependencies of multiscale features between channels in the early stages of the network, providing effective inductive biases for the attention network. The MAPFormer employs multiple average pooling layers with different pooling sizes in parallel to create a parallel pooling mixer, replacing the traditional attention mixer in the attention network and integrating it with Conv-BN structure. This design seeks to maintain recognition accuracy while further enhancing the model’s computational and inference efficiency.
Experimental results on various datasets demonstrated the effective improvement of the proposed action recognition algorithm in terms of accuracy, parameter count, and computational volume. Ablation studies indicated that the proposal and optimization of our two-part modules have achieved positive effects. Tests on panoramic systems have shown that our algorithm can indeed be applied effectively in real-world scenarios. In the future, we plan to explore methods to reduce hardware dependency and runtime, thereby minimizing system latency and improving real-time performance.
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Appendix A. Implementation Details of PSA Module
PSA primarily consists of a squeeze pyramid concat (SPC) module, SEWeight module, Softmax operation, and Concat fusion, as illustrated in Figure 11. The SPC module (Lin et al., 2017) divides the channels and extracts multiscale spatial information from the channel feature maps. As shown in Figure 12, for the feature map
For the different channel-fused feature maps obtained from the SPC module, the PSA employs the SEWeight module to derive channel attention for the multiscale feature maps. The SEWeight module consists of two parts: squeeze and excitation. The squeeze part applies global average pooling to the input feature maps, generating channel-wise descriptors and embedding spatial information into the channel description for global information encoding. The excitation part uses two fully connected layers and a mapping unit to combine the information across channels, facilitating the interaction of channel information in different dimensions. The calculation is as follows:
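The display equation is not preserved in this copy; the standard squeeze-and-excitation computation that the SEWeight description above corresponds to (our reconstruction, in our own notation) is:

```latex
% Squeeze: global average pooling over the spatial dimensions of channel c
z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)
% Excitation: two fully connected layers with ReLU (\delta) and sigmoid (\sigma)
\mathbf{w} = \sigma\!\left( W_2 \, \delta\!\left( W_1 \mathbf{z} \right) \right)
```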
Finally, PSA uses the Softmax operation to recalibrate the weights of the channel attention weight vectors obtained from the SEWeight module, resulting in the final attention weights. These attention weights are then multiplied with the corresponding scale feature maps and fused to produce the final output, as shown below:
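The corresponding formulas are likewise missing here; in the published PSA formulation (Zhang et al., 2022) the recalibration and fusion take approximately the following form, restated in our notation for S scale branches with per-branch feature maps F_i and SEWeight vectors Z_i:

```latex
% Softmax recalibration across the S scale branches, channel-wise reweighting,
% and concatenation of the reweighted multiscale feature maps
\mathrm{att}_i = \mathrm{Softmax}(Z_i) = \frac{\exp(Z_i)}{\sum_{j=0}^{S-1} \exp(Z_j)}, \qquad
Y_i = F_i \odot \mathrm{att}_i, \qquad
\mathrm{Out} = \mathrm{Concat}\!\left( Y_0, Y_1, \ldots, Y_{S-1} \right)
```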
Thus, the ConvEB effectively integrates multiscale spatial information across channels, enhancing the information flow between different feature levels and aiding subsequent attention layers in learning higher-level abstract features more effectively.
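Putting these pieces together, a compact PyTorch-style sketch of the PSA module follows; the kernel sizes, group counts, and reduction ratio are illustrative defaults drawn from the published PSA design rather than the configuration used in this paper, and this is the `PyramidSqueezeAttention` interface assumed by the PSF-MBConv sketch earlier:

```python
import torch
import torch.nn as nn

class SEWeight(nn.Module):
    """Squeeze-and-excitation weighting applied to each scale branch."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze: global average pooling
            nn.Conv2d(channels, channels // reduction, 1),  # excitation: two FC layers as 1x1 convs
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.fc(x)  # (B, C, 1, 1) channel attention weights


class PyramidSqueezeAttention(nn.Module):
    """PSA sketch: SPC-style multiscale grouped convolutions, per-branch SEWeight,
    Softmax recalibration across branches, and concatenation of the reweighted maps.
    The channel count per branch must be divisible by each group value."""

    def __init__(self, channels, kernel_sizes=(3, 5, 7, 9), groups=(1, 4, 8, 16)):
        super().__init__()
        assert channels % len(kernel_sizes) == 0
        self.split = channels // len(kernel_sizes)
        self.convs = nn.ModuleList(
            nn.Conv2d(self.split, self.split, k, padding=k // 2, groups=g, bias=False)
            for k, g in zip(kernel_sizes, groups)
        )
        self.se = SEWeight(self.split)

    def forward(self, x):
        chunks = torch.split(x, self.split, dim=1)
        # Multiscale grouped convolutions on the channel splits: (B, S, C/S, H, W)
        feats = torch.stack([conv(chunk) for conv, chunk in zip(self.convs, chunks)], dim=1)
        # Per-branch channel attention: (B, S, C/S, 1, 1)
        weights = torch.stack([self.se(f) for f in feats.unbind(dim=1)], dim=1)
        weights = torch.softmax(weights, dim=1)   # recalibrate across the S scale branches
        out = (feats * weights).flatten(1, 2)     # reweight and concatenate along channels
        return out

# Example: attention over a 256-channel feature map
psa = PyramidSqueezeAttention(256)
y = psa(torch.randn(2, 256, 14, 14))
```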
