Sage Journals: Discover world-class research

Abstract

The human action classification (HAC) problem presents a complex challenge in the field of machine learning, particularly when dealing with random and unpredictable activities. Resource-intensive models often require significant computational power, limiting their deployment on devices with constrained resources. To address this, we propose a method to improve the accuracy of action classification models by focusing on data preprocessing and data set splitting based on feature clustering using K-means, rather than relying solely on complex models. While we employ ResNet50 and EfficientNetV2 for their proven effectiveness in handling complex features, we also utilize MobileNetV3 as an efficient alternative suitable for resource-constrained scenarios. The experimental results demonstrate that the proposed method significantly improves accuracy and model generalization using the HMDB51 data set. This feature-based data-splitting approach effectively reduces overfitting and can be integrated with other techniques to further optimize model training efficiency.

Keywords

human action classification human activity recognition deep learning K-means clustering data splitting model performance data preprocessing feature extraction convolutional neural networks principal component analysis (PCA)

Introduction

The problem of human activity recognition (HAR) and human action classification (HAC) has become increasingly complex due to the growing diversity and variety of human activities. However, the demand for the application of these solutions is also growing, given their usefulness in fields such as surveillance, healthcare care, sports analysis, and detection of criminal behavior with security cameras.^1–3 The ability to automatically recognize and classify human actions improves human-computer interaction by allowing applications to better understand the context of human activities.⁴

There are two foundational approaches typically used to address this problem: sensor-based and image-based methods. In sensor-based approaches, the data collected from the wearable sensors provide information on movements and physiological states, which helps to identify the current actions of the person using the sensors. This method is effective for simple actions, such as walking, standing, or running, but becomes less accurate with more complex activities that involve fine movements or interactions with objects.⁵ To address the challenges of data scarcity in deep learning applications, we refer to the comprehensive survey by Camarena et al.⁶ This survey explores various approaches to overcome data limitations, including transfer learning, data augmentation, and short-shot learning, while highlighting their effectiveness across diverse domains. These methods enable models to generalize better and perform effectively despite limited or imbalanced data sets. This study builds on these insights by employing pretrained models, feature-based data splitting, and RandAugment, demonstrating how these techniques can mitigate data scarcity challenges and enhance model performance in human action classification tasks. Similarly to Camarena et al.,⁶ Beddiar et al.⁷ study highlights improvements to traditional clustering methods, enhancing performance by optimizing cluster initialization, and convergence. Although our work employs the standard K-means algorithm for its simplicity and efficiency, future research could explore integrating such enhanced clustering techniques to further refine feature-based data-splitting strategies and improve model performance.

In contrast, image-based methods determine actions based on visual information from images or videos. This approach not only assesses actions but also incorporates environmental context, producing better results for more complex activities. However, there are several drawbacks to this method: the larger the model, the better the predictions, but this also increases the computational resource demands. Furthermore, image-based methods require a large amount of labeled data, which can be challenging to obtain.^7,8

To improve the accuracy of action classification models that use image data, we focus on leveraging pre-trained models rather than developing a new model from scratch. This approach not only saves resources and time, but also utilizes the complex features that these pre-trained models have already learned from large datasets. Pretrained models provide a strong foundation, making it easier to fine-tune them for specific tasks:

One of the key factors determining accuracy and performance is how the data set is split between the training set and the testing set. Random data splitting can result in imbalanced datasets and reduce the model’s generalization ability. Therefore, we developed an intelligent data splitting strategy based on the data features, ensuring that both training and testing sets have similar feature distributions. This enables the model to perform well not only during training, but also when tested on unseen data.

In this study, pre-trained models such as ResNet50, EfficientNetV2, and MobileNetV3 were chosen for their proven capabilities in feature extraction and classification tasks. These models serve as robust baselines, leveraging their prelearned features from large datasets to reduce the need for training from scratch. However, to further enhance the classification accuracy, we integrated a novel data preprocessing pipeline and a feature-based data splitting strategy. While pre-trained models handle the learning of complex features, our data preprocessing methods optimize the input data, ensuring better alignment between the data distributions in the training and testing phases. This synergistic combination of pre-trained models and preprocessing methods allows for improved accuracy and generalization without increasing the computational complexity of the models themselves.

To validate that the proposed feature-based data splitting method ensures similar feature distributions between training and testing sets, we employed quantitative metrics, including the Kolmogorov-Smirnov (KS) test. The KS statistic measures the difference between the empirical cumulative distribution functions of the two datasets, with lower values indicating higher similarity. Experimental results show that the KS values for the feature-based splitting method are consistently low, ranging from 0.0049 to 0.0053, demonstrating near-identical feature distributions between the training and testing sets. This similarity reduces the risk of overfitting and ensures that the model’s performance on the testing set reflects its true generalization capability. This approach significantly improves the reliability of the classification results compared to traditional label-based data splitting methods.

We conducted training using ResNet50 on the HMDB51 dataset, one of the standard datasets for action classification. This data set contains numerous videos of various human actions. The data was split using the proposed method and the model’s results were compared with other data-splitting methods and existing models. This helped us to accurately evaluate the effectiveness of feature-based data splitting and its ability to improve model performance.

The proposed data preprocessing pipeline involves three key components: PCA-based dimensionality reduction, K-means clustering, and RandAugment for data augmentation. PCA reduces the dimensionality of the extracted features, retaining the most relevant information while minimizing noise, which simplifies the subsequent clustering process. Then, clustering of K-means is then applied to group features into groups with similar characteristics, ensuring that both training and testing sets share consistent feature distributions. RandAugment further enhances the diversity of the dataset by applying random transformations, such as rotation, cropping, and color adjustments, which help the model learn more robustly. Together, these preprocessing steps not only improve data quality and balance, but also significantly enhance the classification accuracy of the deep learning models by reducing overfitting and ensuring that the model is exposed to a wide range of feature variations during training.

The remainder of this paper is organized as follows.

“Related Work” section: We review current methods related to the HAC problem. We discuss traditional methods, their limitations, and the evolution of deep learning models in addressing this problem. We also clarify the research objectives, focusing on improving model accuracy without relying on complex and resource-intensive models.

“Our Methods” section: We describe in detail the HMDB51 dataset used in this investigation, including its structure, the number of classes, and how the data were split. We also explain the feature-based data splitting method, the model evaluation metrics used, such as accuracy, recall, precision, and the F1 score, and the steps taken in our experiments.

“Experiment and Results” section: We outline the parameters used throughout the experiments, explain the rationale behind conducting these experiments, and provide observations and evaluations based on the results. We also compare these results with related studies in the same field.

“Conclusion and Future Work” section: We summarize the findings of our investigation. We draw important conclusions about the effectiveness of the proposed data-splitting method and its applicability to various models. Finally, we provide recommendations for future research directions, including extending this method to other datasets and applying it to more complex models to fully leverage the potential of deep learning systems.

Related work

Research on human activity recognition and classification is a diverse, complex, and challenging field in computer vision and artificial intelligence. Solutions to the HAC problem require the ability to adapt to the instability and diversity of human behavior, including environmental factors and psychological states.⁹ Researchers have developed powerful algorithms that accurately recognize and classify actions through variations in image data. However, the rapid growth of data from sources such as social networks and security videos has made this problem more complex, especially in applications such as detecting criminal behavior and analyzing human actions.^10,11

HAC can be considered a subset of the HAR problem, but with some significant differences. HAC mainly focuses on image data, whereas HAR may include both sensor data and long videos. HAC also emphasizes classifying specific types of action based on posture, appearance, and movement characteristics. This makes the problem more complicated, especially when identifying similar actions or those occurring in complex environments.

These challenges call for more advanced solutions, particularly in the development of deep learning models that can automatically extract features from images, rather than relying on manual techniques as before. This opens up many opportunities for future research as technologies like Convolutional Neural Networks (CNNs) and transfer learning continue to evolve and deliver high performance in the classification of complex actions.

Recent advances in feature-based data preprocessing, robust classification models, and data-driven prediction techniques have significantly contributed to improving deep learning applications in various domains. Our study aligns with these developments by using feature-based data splitting and preprocessing techniques to improve human action classification. Several studies provide valuable information that complements our approach and highlights potential areas for further research.

For example, Zheng et al.¹² introduces an unsupervised learning approach with manifold regularization, reinforcing the importance of structured feature extraction. Similarly, Zheng et al.¹³ demonstrates the effectiveness of attention mechanisms and noise-tolerant activation functions in improving the robustness of classification. These methods align with our Rand-Augment-based augmentation strategy, which enhances model generalization by increasing data variability.

Beyond classification, multiscale modeling has proven to be an essential tool in predictive analytics. Zhang et al.¹⁴ and Hu et al.¹⁵ highlight how feature-driven and data-driven prediction techniques optimize classification accuracy in different domains. This reinforces the need for intelligent dataset structuring, such as our feature-based data splitting strategy, to improve the efficiency of deep learning models.

Furthermore, Zhang et al.¹⁶ emphasizes the role of multiscale feature interactions in classification accuracy. This study underscores the potential benefits of integrating multiscale learning techniques into our dataset structuring methodology, ensuring that the training and testing sets maintain consistent feature distributions.

By incorporating these perspectives, our study builds upon existing advancements in feature-driven classification, multiscale learning, and noise-resilient models, paving the way for future extensions that integrate self-supervised learning, attention-based feature extraction, and advanced data augmentation techniques. These enhancements will further optimize the robustness and adaptability of deep learning models for human action classification and broader computer vision applications.

Traditional approaches

In the past, to address the HAC problem, traditional methods often relied on manually creating features from the data and using simple classifiers.¹⁷ These methods typically involved preprocessing data collected from sensors, such as motion data or physiological signals, or extracting features from raw data using transformation techniques like the Fourier transform or wavelet analysis. These features were then used to train classifiers, helping models recognize, and classify actions based on the patterns learned.

Common classifiers used in traditional approaches include support vector machines (SVM), hidden Markov models (HMMs), and decision trees.^18–20 SVM is a powerful machine learning method often used to classify high-dimensional data and performs well with manually extracted features. HMMs, a statistical model, are particularly effective in processing time series data, helping to predict subsequent actions based on the current state. Decision trees offer an understandable and intuitive approach, suitable for classification problems based on discrete features.

Although these traditional methods have achieved some success, they face limitations such as difficulties in handling complex features from image data and the lack of automation in feature extraction, which requires high expertise to design appropriate features. These challenges have driven the development of modern deep learning models, which enable automatic feature extraction and action classification without the need for complex manual preprocessing steps.

Deep-learning base

Based on recent advances in deep learning, the HAC problem has seen significant improvements. The limitations of traditional methods, such as manual feature extraction or reliance on sensor data, have gradually been addressed. With deep learning, the models can process image and video data directly, enabling automatic extraction of spatial features without manual intervention.²¹ In particular, CNNs have proven to be highly effective in capturing and processing spatial features from images and videos, allowing models to recognize more complex actions.

One common approach in HAC is to use 3D CNNs. These models not only extract features from spatial dimensions (shape, posture), but also from the temporal dimension, helping the model to accurately detect human movements in videos.²² Additionally, focusing on 3D motion features has become a popular approach to improving action recognition, especially for complex actions.²³

Similarly, recurrent neural networks (RNNs) have made significant strides in handling time series data. Models such as long-short-term memory (LSTM) and gated recurrent unit (GRU) have allowed efficient processing of complex action sequences by retaining information over multiple time steps.^24–26 Combining RNNs with 1D-CNNs enables the model to process both spatial and temporal information simultaneously, optimizing action classification in videos.

Furthermore, transfer learning and subtransfer learning have become popular methods in the HAR and HAC. These approaches not only enhance model performance, but also reduce training time by leveraging pre-trained models on large datasets.^27,28 This is particularly useful in situations with limited data, where collecting and labeling large datasets can be challenging. Transfer learning alleviates the training burden while maintaining model quality by using features learned from similar datasets.

These advances have made the HAC problem more feasible, especially in high-accuracy applications such as security surveillance, sports analysis, and healthcare. Using advanced deep learning models not only yields superior performance, but also broadens the applicability of HAC in various fields.

Train test split stratify by labels

When applying deep learning to the HAC problem, one critical factor is the availability of an appropriate dataset, both in terms of quality and in terms of how it is divided. Typically, the original data set is divided into two sets: the training set and the testing set, with a common ratio of 80:20. Specifically, 80% of the data is used to train the model, while the remaining 20% is reserved for evaluating the model’s performance after training.

However, random splitting of data according to this ratio can cause several problems. First, the data may become unbalanced when different classes have varying frequencies. For example, if one class dominates the data, the model may overfit to that class but perform poorly on less frequent classes, causing a data imbalance issue. Second, because of random splitting, some classes might only appear in the training set and not in the testing set, or vice versa. This negatively impacts the model’s ability to be evaluated accurately, as the testing set may not be diverse enough to reflect real-world cases that the model will encounter.

To address these issues, label-sampled stratified sampling has been widely used. Instead of random data splitting, this method ensures that the 80:20 ratio is applied to each class in the data set. This guarantees that every class is equally represented in both the training and testing sets. Stratified sampling helps prevent missing classes during model evaluation, thus improving model accuracy and generalization.

However, while label-based stratification ensures that each class is evenly represented in both sets, it does not solve the problem of feature distribution within the data. This means that even though the number of data points for each class is balanced, the intrinsic features of the data in the training and testing sets may still differ. For example, the images in the training set may differ significantly from those in the testing set in terms of angle, lighting, or action posture, leading to poor model generalization for situations not encountered during training.

Therefore, in addition to ensuring label balance, methods that guarantee feature homogeneity between the two datasets are needed. This would improve the model’s generalization capabilities and reduce overfitting.

Existing solution drawbacks

Current methods for improving model accuracy have certain strengths, but also present some challenges that need to be addressed.

Traditional machine learning methods:

○ Difficulty in capturing complex movements: Traditional machine learning methods, often based on algorithms like SVM or Hidden Markov Models (HMM), have achieved success in some areas but struggle to handle complex actions that require capturing fine-grained movements and dynamics. This is particularly challenging in scenarios with rapid movements or unclear postures.

○ Expertise required for feature creation: These algorithms depend on manual feature extraction, which requires deep domain knowledge to create meaningful features from the data. Designing and selecting appropriate features is not only time-consuming, but also requires a comprehensive understanding of the information needed for classification, reducing the flexibility of traditional methods.

○ Inefficiency in extracting high-level features: When faced with complex situations, traditional methods often lack the ability to extract highly discriminative features from the data. This results in poor classification performance when the data become complex or exhibit significant variability.²⁹ These methods also do not fully take advantage of the power of multidimensional image data, limiting overall performance.

○ Waste of data: Some traditional methods rely on sensor data without taking advantage of the richness of available image data. This leads to a waste of resources, as collecting sensor data requires specialized equipment and these methods fail to exploit the full potential of image data, which can provide valuable contextual and action-related information.

Stratified data splitting by labels:

○ Lack of consistency between training and testing sets: While stratified sampling by labels ensures that labels are evenly distributed in both the training and testing sets, it does not guarantee that the features of the data in these two sets will be similar. This could result in the testing data containing features that are significantly different from the training data, leading to inaccurate model evaluation.

○ Lack of similarity between training and testing data: For a model to perform effectively, the training data must contain features similar to those in the testing data. If there is a significant difference in feature distribution between the two sets, the model may overfit the training set but fail to generalize well to the testing set. This results in a model that performs well on training data, but poorly on unseen real-world data.

Currently, traditional data-splitting methods do not effectively address this issue. There is a need for data splitting methods that not only ensure label balance but also guarantee feature consistency between datasets. This would significantly improve the model’s generalization capability and accuracy in real-world applications.

Innovation of our methods

In the context of improving classification accuracy for human action classification (HAC), related work on data preprocessing and data splitting has primarily focused on traditional methods such as random splitting, label stratified splitting, and simple data augmentation techniques (e.g., flipping, rotation, and cropping). While these approaches are widely used, they have notable limitations. Random splitting often leads to imbalanced datasets that negatively impact model generalization, while label-stratified splitting ensures class balance, but fails to address inconsistencies in feature distributions between training and testing sets. Similarly, conventional augmentation methods enhance diversity but do not address deeper feature-level imbalances. To overcome these challenges, our study introduces an innovative approach combining PCA for dimensionality reduction, K-means clustering for feature-based data splitting, and RandAugment for robust data augmentation. This ensures both label balance and feature consistency across datasets, effectively reducing overfitting and improving the generalization capability of models.

Furthermore, the paper has expanded the discussion to include additional deep learning models employed in HAC-related work, such as 3D convolutional neural networks (3D-CNNs), long- and short-term memory (LSTM) networks, and hybrid architectures combining CNNs with RNNs. These models focus on extracting spatio-temporal features critical for action recognition, but often require extensive computational resources and large-scale labeled datasets. In contrast, our study demonstrates that by employing effective preprocessing and data splitting techniques, the performance of pretrained models like ResNet50, EfficientNetV2, and MobileNetV3 can be significantly enhanced without the need for overly complex architectures or large datasets. This approach provides a more resource-efficient solution suitable for real-world applications.

To clearly illustrate the advantages of our methods, the paper provides a comparative analysis highlighting the limitations of existing approaches and the unique contributions of our study. Our proposed method not only addresses the shortcomings of traditional data splitting and augmentation techniques, but also enhances model accuracy and generalization in a computationally efficient manner. This innovation sets the groundwork for future studies to further optimize HAC performance using lightweight and effective pre-processing strategies.

Our methods

In this study, we propose a new method for preprocessing and splitting the dataset into “train” and “test” sets based on clustering the data’s features. Instead of simply using random splitting or label-stratified sampling, this method applies clustering to ensure that the features in both the training and testing sets are highly similar. This improves accuracy and generalization, particularly when handling complex and imbalanced datasets.

The processed data set is then used to train popular deep learning models in image processing, such as ResNet50, EfficientNetV2, and MobileNetV3. These models have proven to be effective in extracting features from images and recognizing complex actions. Feature-based clustering in the data splitting process helps optimize training, reduce overfitting, and improve the model’s performance on previously unseen data.

Figure 1 illustrates the data preprocessing and data set split process used in this study. The process includes steps such as feature extraction, clustering of features to identify similar groups, and splitting these groups into training and testing sets so that the features are evenly distributed. This method ensures label balance while optimizing feature consistency across datasets, enhancing the performance of deep learning models in image classification tasks.

Figure 1.

Overview of the splitting process.

Complexity issues of investigated approaches

To provide a deeper understanding of the complexity of the deep learning models used in this study, we have analyzed the computational requirements of ResNet50, EfficientNetV2, and MobileNetV3. ResNet50, with its 50-layer architecture and the use of residual blocks, is well suited for learning complex spatial features. However, the trade-off is higher computational and memory requirements compared to simpler models. Its computational complexity is $O (N \times D^{2} \times L)$ where $N$ is the number of input samples, $D$ is the dimensionality of the input, and $L$ is the number of layers. EfficientNetV2 improves efficiency using compound scaling and fused MBConv layers. It optimizes resource usage while maintaining high accuracy, reducing training time and computational cost. Its complexity is reduced relative to ResNet50 due to its efficient architectural design. MobileNetV3, designed for mobile devices and resources limited devices, employs inverted residual blocks and linear bottlenecks. It is computationally lightweight, with reduced parameter counts and operational costs, making it highly efficient for tasks requiring real-time inference.

The proposed feature-based data split strategy incorporates PCA and K-means clustering, which introduces an additional layer of complexity in the preprocessing stage. PCA reduces the dimensionality of the data set while retaining the most significant variance, improving computational efficiency during clustering. Its complexity is $O (N \times D^{2} \times L)$ , where $N$ is the number of data points and $D$ is the original dimensionality of the feature. When D is reduced, the computational burden for subsequent steps is significantly reduced. Clustering of K-means partitions the reduced feature space into $k$ clusters. Its complexity is $O (I \times k \times N \times D^{'})$ , where $I$ is the number of iterations, $k$ is the number of clusters, and $D^{'}$ is the reduced dimensionality from PCA. The choice of $k$ as a hyperparameter affects both the clustering granularity and the computational load.

Although the pre-processing phase (PCA and K-means) introduces additional steps, it ensures that the feature distribution between training and testing datasets is consistent. This alignment significantly reduces overfitting and enhances model generalization, ultimately justifying the preprocessing costs. Importantly, these steps are only performed once during data preparation and do not affect the computational load during model training or inference. Traditional methods, such as random or label-stratified data splitting, are computationally simpler, but fail to address feature distribution imbalances, leading to poor model generalization. The proposed feature-based data splitting ensures both label balance and feature homogeneity, offering a robust solution for complex data sets. While more computationally intensive, this approach provides substantial improvements in accuracy, precision, and F1 score, as demonstrated in our experiments.

The experimental results confirm the efficacy of the proposed methods. Models trained on feature-clustered datasets exhibit significantly better generalization and robustness. The results of the Kolmogorov-Smirnov (KS) test further validate the improved similarity in the feature distribution between training and testing datasets, with KS values ranging from 0.0049 to 0.0053. The proposed methods strike an effective balance between computational complexity and performance improvement. Using PCA and K-means during preprocessing and using lightweight yet high-performing models like MobileNetV3 and ResNet50, we demonstrate that the trade-offs in computational costs yield significant benefits in accuracy and generalization. This approach offers a scalable and practical solution for human action classification in real-world scenarios.

Computational complexity and memory requirements analysis

To provide a more comprehensive understanding of our approach, we analyze both the computational complexity and memory requirements of the proposed methodology. This section details the complexity of key operations and their impact on the overall efficiency of the system.

Computational complexity analysis

Our methodology involves multiple stages, including preprocessing (PCA-based dimensionality reduction and K-means clustering), feature extraction using deep learning models, and classification. The complexity of each step is summarized as follows:

PCA dimensionality reduction:

The complexity of PCA is given by: $O (N D^{2} + D^{3})$

where $N$ is the number of data points and $D$ is the original feature dimensionality. The first term accounts for the covariance matrix computation and the second term represents eigenvalue decomposition.

K-means clustering:

The K-means algorithm has a complexity of: $O (NkT)$

where $k$ is the number of clusters and $T$ is the number of iterations required for convergence.

Feature extraction using convolutional neural networks (CNNs):

The complexity of feature extraction is influenced by the number of filters ( $F$ ) and feature map dimensions ( $H \times W$ ): $O (N \cdot F \cdot H \cdot W)$

where $N$ is the batch size. The CNN models used in this study – ResNet50, EfficientNetV2, and MobileNetV3 – each have distinct computational demands, with MobileNetV3 being the most efficient for real-time applications.

Training complexity for deep models:

The computational cost per layer for different models is as follows:

○ ResNet50: $O (N \cdot 3^{2} \cdot 64 + N \cdot 3^{2} \cdot 256 + \dots)$

○ EfficientNetV2: $O (N \cdot 3^{2} \cdot 64 + N \cdot 5^{2} \cdot 128 + \dots)$

○ MobileNetV3: $O (N \cdot 3^{2} \cdot 16 + N \cdot 5^{2} \cdot 64 + \dots)$

These calculations provide insight into the computational efficiency of each model, emphasizing that EfficientNetV2 balances accuracy and efficiency while MobileNetV3 is optimized for resource-constrained environments.

Memory requirements analysis

To assess the memory footprint of our approach, we analyze both training and inference memory consumption across different models. The following Table 1 summarizes the results:

Table 1.

Memory requirements for different deep learning models.

Model	Parameter count	Training memory (GB)	Inference memory (GB)
ResNet50	25.6 M	∼4.5 GB	∼1.2 GB
EfficientNetV2	24 M	∼3.8 GB	∼1.0 GB
MobileNetV3	5.4 M	∼2.1 GB	∼0.5 GB

Additionally, we compared the memory usage of our feature-based data splitting with traditional stratified label splitting and found that our approach incurs a minor increase in preprocessing memory requirements (∼5%–10%), yet significantly improves model generalization. This trade-off is justified by the superior performance observed in our experiments.

Our analysis confirms that the proposed feature-based data splitting strategy is computationally feasible, even for resource-constrained environments. The use of MobileNetV3 ensures that this method can be applied to edge devices with limited computational resources.

Dataset descriptions

The original data set used in this study is HMDB51,³⁰ one of the largest benchmark data sets for human actions. HMDB51 is compiled from various sources, including movies and internet videos, providing rich and diverse examples of everyday actions as shown in Figure 2. The data set contains a total of 6766 video clips, divided into 51 classes, with each class representing a specific action, such as “jump,”“kiss,”“laugh,” and many others.

Figure 2.

Sample images from the dataset.

Each video clip in HMDB51 varies in length and depicts actions that occur in different contexts and environments, presenting a significant challenge for human action classification models. The diversity of actions and contexts in this dataset makes it a popular choice to evaluate the effectiveness of deep learning models, particularly in the recognition and classification of complex actions.

HMDB51 includes not only simple actions such as walking or running, but also more complex actions such as climbing stairs or drawing a sword, further increasing the difficulty and challenge of deep learning models in extracting and correctly classifying action-specific features.

The videos in each class are converted into 129,093 images, distributed as follows:

From Table 2, we can see that the data in the HMDB51 dataset is experiencing a severe imbalance, particularly with the “walk” label, which has an overwhelming number of images compared to other labels. This leads to problems during model training and evaluation. Although the standard data splitting method ensures an 80:20 ratio between the labels in the training and testing sets, the model still tends to favor the “walk” label due to its dominance in the dataset. Consequently, when the model is evaluated, the results may be significantly skewed, and while it may perform well on the test data, its generalization ability remains poor, especially for labels with fewer data.

Table 2.

Number of images per class.

Label	Quantity	Label	Quantity	Label	Quantity
kick_ball	1396	kick	1245	sit	2383
Catch	879	dive	1879	fall_floor	1593
shoot_ball	2291	eat	1890	hug	1794
climb_stairs	1788	kiss	3579	ride_horse	3439
Punch	1831	climb	3432	swing_baseball	3200
Stand	2313	cartwheel	1593	flic_flac	1451
Smoke	3294	pullup	1694	run	3303
Clap	2067	sword_exercise	2742	pick	1894
somersault	2036	laugh	5057	throw	2220
Drink	2907	jump	1675	wave	1728
Pushup	1848	pour	3770	draw_sword	1610
Push	2107	chew	2197	turn	3328
Talk	3585	walk	10,583	brush_hair	5895
Fencing	1732	shoot_gun	2005	hit	1049
Golf	1744	dribble	2860	handstand	1938
shake_hands	2592	smile	1470	ride_bike	1993
Sword	2069	situp	2001	shoot_bow	4124

To address the imbalance of the “walk” label in the dataset, the paper conducted an analysis of the classification accuracy for each category to assess the impact of the imbalance on model performance. The dataset includes a significantly higher number of samples for the “walk” label compared to other categories such as “kick_ball” and “‘cartwheel,” leading to biased model training. Experimental results reveal that the model achieves notably higher classification accuracy for the dominant “walk” label due to its overrepresentation, while underrepresented categories exhibit much lower accuracy. For example, the “walk” label achieved an accuracy of more than 90%, whereas categories like “kick_ball” and “cartwheel” were below 60%, as shown in Table 3. These findings confirm that the imbalance of the dataset adversely affects the generalization of the model, particularly for less frequent categories. This underscores the necessity of feature-based data splitting and preprocessing methods to mitigate the effects of label imbalance, as discussed in our proposed methodology.

Table 3.

Classification accuracy analysis with data imbalance.

Category	Number of samples	Accuracy (%) – without preprocessing	Accuracy (%) – with RandAugment
walk	10,583	90.5	92.8
kick_ball	1396	58.3	66.7
cartwheel	1593	57.8	65.9
jump	1675	60.2	68.3
run	3303	71.0	76.4

Our solution is to propose a feature-based stratified data-splitting method. Instead of focusing solely on label-based splitting, we split the data based on important features that the model can recognize in its feature space. This means that the data is analyzed in a different space, where features are extracted as seen by the deep learning model, rather than relying solely on visual characteristics perceived by humans. Feature-based splitting helps reduce the model’s reliance on dominant labels such as “walk,” thus mitigating the negative impact of data imbalance on training results.

The results in Table 4 show that the number of images in the two datasets split by the feature-based stratification method is nearly identical to the traditional label-based stratification, both maintaining an 80:20 ratio between labels. However, our proposed method not only ensures that no labels are missing from either dataset, but also ensures that the features in the training and testing sets are consistent. This allows the model to learn from a dataset structure that is more representative of real-world scenarios and significantly improves the model’s generalization when applied to new data.

Table 4.

Number of images per label in the dataset generated by the two different data splitting methods.

Labels	Segmenting stratified data by labels		Segmenting stratified data by feature clusters
	Train	Test	Train	Test
0	1117	279	1114	282
1	705	174	684	195
2	1848	443	1811	480
3	1422	366	1443	345
4	1440	391	1484	347
5	1874	439	1835	478
6	2636	658	2641	653
7	1658	409	1645	422
8	1631	405	1625	411
9	2320	587	2323	584
10	1478	370	1454	394
11	1689	418	1710	397
12	2860	725	2833	752
13	1371	361	1382	350
14	1401	343	1409	335
15	2046	546	2083	509
16	1656	413	1659	410
17	993	252	1005	240
18	1531	348	1505	374
19	1524	366	1495	395
20	2883	696	2846	733
21	2796	636	2762	670
22	1276	317	1265	328
23	1366	328	1345	349
24	2196	546	2214	528
25	4039	1018	4075	982
26	1360	315	1344	331
27	2981	789	3008	762
28	1766	431	1744	453
29	8434	2149	8506	2077
30	1597	408	1588	417
31	2296	564	2292	568
32	1202	268	1183	287
33	1606	395	1607	394
34	1910	473	1887	496
35	1277	316	1287	306
36	1457	337	1430	364
37	2676	763	2719	720
38	2557	643	2516	684
39	1156	295	1175	276
40	2636	667	2640	663
41	1513	381	1519	375
42	1755	465	1790	430
43	1406	322	1377	351
44	1309	301	1287	323
45	2639	689	2666	662
46	4719	1176	4705	1190
47	863	186	863	186
48	1528	410	1559	379
49	1584	409	1630	363
50	3291	833	3305	819

Image preprocessing

The data pre-processing in this study is not intended to increase the amount of data on each label or directly address the issue of data imbalance. The main goal of the pre-processing process is to enhance the diversity of the images used for training the model, while also increasing the complexity of the data to help the model learn more effectively. This allows the model to be exposed to more variations in the data, thereby improving its generalizability without increasing the number of input images. As a result, the preprocessing process does not add computational or resource burdens during training but still achieves the necessary diversity to improve the model performance.

This preprocessing step is performed before the image data is fed into the model. The RandAugment technique³¹ is used for data enhancement. RandAugment consists of various techniques applied to images to create new variations while retaining the main information. Specifically, RandAugment applies techniques such as rotation, cropping, blurring, or color changes to generate multiple variations from a single image, thus increasing the diversity of the input data. This helps the model learn from different forms of data, avoid overfitting, and improve generalization. To evaluate the impact of RandAugment on model performance, the paper conducted comparative experiments with and without the use of this augmentation technique as shown in Table 5. RandAugment applies various transformations, including rotation, cropping, blurring, and color adjustments, to increase the diversity of training data. Results demonstrate that RandAugment significantly improves classification accuracy across underrepresented categories by reducing overfitting and enabling the model to learn from a wider range of variations. On average, the application of RandAugment resulted in a 5%–8% increase in classification accuracy, particularly benefiting categories with fewer samples. For instance, the accuracy for the “cartwheel” category improved from 58% to 66%, while similar improvements were observed across other underrepresented classes. These findings highlight the effectiveness of RandAugment in addressing data imbalance and enhancing the overall generalization capability of the model, supporting its integration into the preprocessing pipeline. Table 5 presents the classification accuracy improvements across different action categories with and without the application of RandAugment. The results highlight the effectiveness of augmentation in enhancing model generalization, particularly for underrepresented classes.

Table 5.

Impact of RandAugment on classification accuracy.

Category	Accuracy without RandAugment (%)	Accuracy with RandAugment (%)	Improvement (%)
kick_ball	58.3	66.7	8.4
cartwheel	57.8	65.9	8.1
jump	60.2	68.3	8.1
run	71.0	76.4	5.4
walk	90.5	92.8	2.3

Figure 3 in the study clearly illustrates the images before and after being augmented by RandAugment. The visual differences between the pre- and post-augmentation images demonstrate that this technique can generate various versions from an original image while preserving important features. As a result, the training process becomes richer and more generalized, enhancing the model’s ability to accurately recognize and classify new, unseen datasets.

Figure 3.

Images before and after using RandAugment.

Features extraction with pre-train models

Pre-trained models are those that have been trained on large, complex datasets, usually over an extended period. Using these models brings several advantages, including saving training time and leveraging the features learned during previous training. Pretrained models already have the ability to effectively recognize both basic and complex features from image data, thereby improving model performance when applied to new problems.

In this study, we use popular pre-trained models in the field of computer vision, including ResNet50, EfficientNetV2, and MobileNetV3. These models are not only used for feature extraction from image data, but are also fine-tuned and evaluated on datasets that have been split into “train” and “test” based on the features they extract. This method leverages the power of transfer learning, allowing the model to learn better from previously trained knowledge, while reducing the need for large labeled datasets.

The ResNet50 model³² is a deep neural network with 50 layers, widely known for its ability to address the problem of vanishing gradients. This is achieved by using residual blocks, allowing the model to learn complex features from the data without losing information as it passes through the layers. ResNet50 has proven to be effective in many computer vision tasks, from object recognition to image classification, thanks to its ability to learn spatial features from images at multiple levels of detail.

The EfficientNetV2 model³³ is designed with the objective of improving both performance and training speed. This model uses compound scaling, which adjusts the width, height, and depth of the network to optimize the use of computational resources. Additionally, EfficientNetV2 integrates techniques such as Fused-MBConv and process learning, enabling the model to learn faster while achieving high accuracy in computer vision tasks. With architectural improvements, EfficientNetV2 delivers excellent performance while being resource-efficient, making it suitable for computationally demanding tasks with limited resources.

The MobileNetV3 model³⁴ is designed to optimize performance on mobile devices, where computational efficiency and compact model size are critical. The structure of MobileNetV3 uses inverted residual blocks and a linear bottleneck, which reduces the number of parameters and computations while maintaining high performance. Due to its lightweight and efficient operation on mobile devices, MobileNetV3 is widely used in applications that require limited computational power but still maintain good accuracy.

Overall, the pre-trained models used in this study not only save time and resources during training but also offer high performance by leveraging the complex features learned from previous training. Fine-tuning these models on new data significantly improves their performance in tasks related to action and image classification.

Reduced feature dimension with PCA

After using the pre-trained model to extract features from images, we obtain a set of features with many dimensions. These dimensions are the result of the previous training process, representing information that the model considers important and extracted from the images. However, these dimensions include a variety of information, some of which is not necessary for action classification. Clustering based on all these dimensions may result in feature clusters that do not fully capture the characteristics of the data. Therefore, it is essential to create a new feature space derived from the extracted features but with fewer dimensions. The number of dimensions in this new feature space is treated as a hyperparameter that needs to be determined in advance and experimented with to achieve optimal results. In this study, we use the PCA algorithm (Principal Component Analysis)³⁵ to reduce the dimensionality of the features.

Suppose that we have $X$ data points with dimensions (D, N), where $N$ is the number of data points and $x_{n}$ is the nth data point, and $U$ is an orthogonal matrix. The steps to perform PCA are as follows:

1. Compute the mean vector of all the data:

\bar{x} = \frac{1}{N} \sum_{n = 1}^{N} x_{n}

(1)

2. Subtract the mean vector from each data point:

{\hat{x}}_{n} = x_{n} - \bar{x}

(2)

3. Compute the covariance matrix:

S = \frac{1}{N} \hat{X} {\hat{X}}^{T}

(3)

4. Compute the eigenvalues and eigenvectors with norm 1 of this matrix, sorting them in descending order of the eigenvalues.

5. Select the $K$ eigenvectors corresponding to the $K$ largest eigenvalues to construct the matrix $U_{K}$ whose columns form an orthogonal system. These $K$ vectors, also known as principal components, form a subspace that closely matches the distribution of the normalized original data.

6. Project the normalized original data $X$ onto the found subspace.

7. The new data are the coordinates of the data points in this new space:

Z = U_{K}^{T} \hat{X}

(4)

The original data can be approximately reconstructed from the new data as:

x \approx U_{K} Z + \bar{x}

(5)

In this study, the PCA algorithm is used to reduce the dimensionality and create new clusters based on the originally extracted features. PCA transforms the feature space by combining and optimizing the extracted features, creating a set of principal components. These components are selected to retain as much information as possible from the original data while eliminating unnecessary or noisy information.

Specifically, PCA identifies the directions with the highest variance in the data space and projects the original data onto these directions to reduce the dimensionality while preserving important information. These new features are a combination of the original features, but are optimized to help the model better capture the overall structure of the data.

Thanks to PCA, the feature clusters obtained after extraction become clearer and more concise, making clustering and data processing in subsequent steps more efficient. PCA also plays a crucial role in reducing the computational load required for the model, while allowing the model to learn important features and discard irrelevant ones, thus improving the model’s accuracy and performance when applied to new data.

Clustering of features with K-means

After dimensionality reduction, the image features are clustered to ensure that images within the same cluster have similar features from the model’s perspective. This study uses the K-means algorithm for clustering, where the value of $k$ is also a hyperparameter that can affect the model’s outcome. The larger the value of $k$ , the more complex the clustering becomes.

K-means is an unsupervised algorithm commonly used for clustering tasks. The goal of the algorithm is to divide the data set into $k$ clusters (with $k$ selected beforehand) such that data points within the same cluster are close to the center of the cluster, while the distance to other centers of the cluster is much larger. Additionally, the distance between the centers of each cluster is maximized. The process of clustering data with the K-means algorithm can be described as follows.

1. Define the number of $k$ clusters.

2. Randomly select k data points as cluster centers, denoted as $μ_{1}$ , $μ_{2}, μ_{3}, \dots, μ_{k}$ .

3. Calculate the distance between the points and the $k$ centers using the formula:

‖ x_{i} - μ_{i} ‖_{2}^{2}

(6)

Equation (6) is the squared L2-norm, which is a commonly used distance measure in machine learning.

4. Assign each data point to the cluster with the following nearest center using the formula:

c_{i} = \arg mi n_{j} ‖ x_{i} - μ_{i} ‖_{2}^{2}

(7)

5. Recompute the position of the $k$ centers using the formula:

μ_{j} : = \frac{\sum_{i = 1}^{n} 1 (c_{i} = j) x_{i}}{\sum_{i = 1}^{n} 1 (c_{i} = j)}

(8)

In equation (8), the function $1 (c_{i} = j)$ returns 1 if $c_{i}$ belongs to cluster $j$ , and 0 otherwise.

6. Recompute the distance between the data points and the new cluster centers using equation (6)

7. If no data points change clusters, stop the process; otherwise, repeat from Step 4.

When the original dataset is split based on feature clusters using the K-means algorithm, both the “train” and “test” datasets will share similar feature distributions. In other words, the data used to evaluate the model are highly similar to the data used for training, making the evaluation more comprehensive and aligned with what the model has learned, thus producing more objective results.

After creating the new feature clusters, they are separated into distinct groups. The goal of clustering is to group similar features together, ensuring that when the data set is split, the images in both data sets retain the same feature characteristics. This ensures that the model is trained with all the relevant information and features of the data and the model evaluation becomes more objective. This study uses the K-means algorithm for feature clustering, and the value of $k$ is treated as a hyperparameter that affects the outcome of model training.

Dataset splitting strategy

The metrics we used to evaluate the results of the model training process include precision, precision, recall, and F1 score. To assess the distribution ratio of the data in the data set generated by the traditional method versus the data set split into features clusters, we used metrics from the KS algorithm,³⁵ which includes the KS value and the p-value.

The similarity in feature distribution between the “train” and “test” datasets is evaluated using the KS statistic. The KS algorithm is a nonparametric statistical method used to compare sample distributions or test the goodness of fit of a distribution. The KS value and p-value are used to evaluate the feature distribution in the “train” and “test” datasets split by two different methods. A lower KS value indicates a higher similarity in feature distribution between the “train” and “test” datasets.

The formula for the KS algorithm is as follows:

D_{n, m} = \max ∣ F_{n} (x) - G_{m} (x) ∣

(9)

Where: $D_{n, m}$ is the maximum value of the absolute difference between the two empirical cumulative distribution functions, $F_{n} (x)$ is the empirical cumulative distribution function of the first sample with size $n$ , $G_{m} (x)$ is the empirical cumulative distribution function of the second sample with size $m$ .

The value of $D_{n, m}$ is used to test the hypothesis that the two samples come from the same distribution. If this value exceeds a predetermined critical value, we can reject the hypothesis that the two samples have the same distribution. In other words, the smaller the value of $D_{n, m}$ the higher the probability that the two data sets share the same distribution.

Feature extraction in deep learning models

To ensure clarity, reproducibility, and rigor, this section provides a comprehensive mathematical representation of the feature extraction and data preprocessing methods, pseudo code for the key algorithms, and detailed evaluation metrics used in this study. These additions aim to address the challenges of explaining deep learning-based processes and their impact on model performance. The feature extraction process in convolutional neural networks (CNNs) involves a sequence of key operations:

1. Convolution operation:

y (i, j) = \sum_{m = 1}^{M} \sum_{n = 1}^{N} x (i + m, j + n) \cdot k (m, n)

(10)

xi, j is input feature map at position i, km, n is filter kernel of size M×N, yi,j is output feature map after applying the convolution.

2. ReLU activation function:

f (x) = \max (0, x)

(11)

The ReLU function introduces non-linearity by setting negative values to zero, allowing the network to model complex features.

3. Pooling operation (max pooling):

y (i, j) = \max {x (p, q) ∣ p \in [i, i + s], q \in [j, j + s]}

(12)

With s is pooling window size, typically 2 or 3. This operation reduces spatial dimensions, retaining the most prominent features while reducing computational complexity.

4. Backpropagation for gradient updates:

\frac{\partial W}{\partial L} = \frac{\partial L}{\partial y} . \frac{\partial y}{\partial W}

(13)

$L$ is loss function, $W$ is model weights, and $y$ is network output. Gradients are propagated backward to adjust weights during optimization.

The following evaluation metrics were used to assess model performance:

1. Accuracy:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(14)

2. Precision:

Precision = \frac{TP}{TP + FP}

(15)

3. Recall:

Recall = \frac{TP}{TP + FN}

(16)

4. F1-score:

F 1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}

(17)

Where TP is true positives, TN is true negatives, FP is false positives, and FN is false negatives.

Experiment and results

Implementation details

The hyperparameters used in our experiments are presented in Table 6.

Table 6.

Parameter values used in the experiments.

Parameter name	Value
Epoch	100
RandAugment	num_ops = 2, magnitude = 9
Learning rate	1e−4
Optimizer	Adam
ReduceLROnPlateau	mode = “min,” factor = 0.9, patience = 3
Early Stopping	pattien = 5, min delta = 0.001
PCA	n = 520
K-mean	k = 470

Experiment description

In this study, we will train and evaluate three popular models in computer vision: ResNet50, EfficientNetV2, and MobileNetV3 in two datasets (“train” and test) that are split using two different methods: stratified by labels and stratified by feature clusters. Specifically, these datasets are randomly selected as 20% of the original HMDB51 dataset, with a total of 25,819 images, distributed between the training and testing sets. Comparing the two stratification methods allows us to assess how the data-splitting approach affects the performance of deep learning models. We will also train the ResNet50 model on the entire original HMDB51 dataset, consisting of 129,093 images stratified by feature clusters, to compare the results with current state-of-the-art models.

During training, the parameters are kept constant as described in Table 6, including important factors such as the learning rate, the number of epochs, and the optimizer. Additionally, we employ the Early Stopping mechanism to ensure that the model is trained long enough to achieve optimal performance before stopping. Early Stopping stops the training process when there is no further improvement on the validation set after a specified number of epochs, helping to prevent overfitting and minimize unnecessary training time.

Evaluating the models based on datasets split by these two methods will allow us to draw conclusions on which data-splitting method provides better results, ultimately improving the training and evaluation of deep learning models in human action classification tasks.

Comparison of model training results between two data splitting methods

Table 7 shows a clear difference in model training results between the two data-splitting methods: stratified by labels and stratified by feature clusters. The results demonstrate that the models trained on data split by feature clusters achieved significantly better performance compared to traditional label-based stratification. This indicates that the method of data splitting has a substantial impact on the model’s learning performance and generalization ability.

Table 7.

Comparison of model training results with the two data-splitting methods.

Application algorithm	Train loss	Test loss	Accuracy	Precision	Recall	F1-score
Resnet50	0.975	0.988	72.231	0.7398	0.7223	0.7069
EfficientNetV2	0.144	0.468	89.407	0.8984	0.8941	0.8946
MobilenetV3	1.413	1.435	61.406	0.5978	0.6141	0.5847
CF + Resnet50	0.282	0.564	84.992	0.8523	0.8499	0.8490
CF + EfficientNetV2	0.035	0.467	90.395	0.9045	0.9043	0.9030
CF + MobilenetV3	1.138	1.226	66.557	0.6605	0.6656	0.6452

Upon closer analysis, feature cluster-based splitting not only ensures a balanced distribution of labels but also ensures that the features in both the training and testing sets are similarly distributed. This allows the model to learn more effectively, since the features learned during training accurately reflect those found in the test set. On the contrary, label-based splitting may lead to differences in feature distribution between the two sets, reducing the model’s learning efficiency.

The KS statistic, used to measure the difference in the feature distribution between the two datasets, shows very small values, ranging from 0.0053 to 0.0049, for the datasets divided by the feature clustering strategy. This indicates that the distribution of features in the “train” and “test” sets, when divided by feature clustering, is almost identical, allowing the model to generalize better when encountering new data.

In conclusion, these results confirm that feature cluster-based data splitting not only improves model training efficiency, but also significantly enhances the generalizability of the model, particularly when dealing with complex classification tasks such as human action classification.

Comparison of model training results

The results in Table 8 show that the ResNet50 model achieved performance nearly equivalent to more complex models. This demonstrates that using simpler pre-trained models, such as ResNet50, can deliver high effectiveness in image classification tasks, especially when combined with appropriate data splitting methods. Instead of relying on complex models that require significant computational resources, leveraging simpler pretrained models can still yield excellent results, opening up numerous possibilities for applying them to real-world problems.

Table 8.

Comparison of model training results with the two data-splitting methods.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-score
DEEP-HAL + W + G + ODF + SDF³⁶	87.56	–	–	–
SO + MaxExp + IDT³⁷	85.7	–	–	–
VideoMAE V2^>38	88.1	–	–	–
CF stratify + Restnet50	82.882	0.8225	0.8188	0.8206

Furthermore, these results suggest that the stratified data splitting by labels method can still be effective when applied to more complex models. The similarity in data distribution between training and testing sets when using stratified label splitting may help complex models fully realize their potential, especially in tasks that require deep learning of data features.

In summary, the results from Table 8 not only highlight the potential of using simpler pre-trained models like ResNet50 but also confirm that optimizing data splitting methods can significantly enhance the performance of both simple and complex models across various fields and tasks. This opens new research directions for leveraging pre-trained models combined with efficient data splitting strategies.

Expanded experimental validation and comparative analysis

To further validate the effectiveness of the proposed feature-based data splitting and preprocessing strategy as shown in Table 9, the paper extended our experimental setup to include additional deep learning models, alternative preprocessing methods, and comprehensive ablation studies as shown in Table 10. Specifically, the manuscript expanded our evaluation beyond ResNet50, EfficientNetV2, and MobileNetV3 to incorporate a diverse set of architectures, including DenseNet121, InceptionV3, VGG16, Swin Transformer, ConvNeXt, LSTM-CNN hybrid, and 3D-CNN (I3D model) as shown in Table 11. This broader analysis ensures the robustness of our method across various deep learning paradigms, including CNN-based, transformer-based, and hybrid approaches.

Table 9.

Comparison of data splitting methods.

Method	Train loss	Test loss	Accuracy (%)	F1-score (%)	KS value
Random splitting	1.215	1.432	79.34	78.96	0.1243
Stratified by labels	1.028	1.189	82.47	81.92	0.0891
Feature-based splitting (proposed)	0.564	0.467	90.4	90.3	0.0049

Table 10.

Ablation study results.

Method	Accuracy (%)	F1-score (%)	Overfitting reduction (%)
Preprocessing only	85.34	84.67	3.4
Feature splitting only	87.12	86.92	5.6
Full proposed method	90.4	90.3	9.2

Table 11.

Comparison of model performance on different architectures.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-score (%)
ResNet50	84.99	85.23	84.99	84.9
EfficientNetV2	90.4	90.45	90.43	90.3
MobileNetV3	66.55	66.05	66.56	64.52
DenseNet121	87.12	87.35	87.18	87.26
InceptionV3	85.73	85.9	85.8	85.85
VGG16	83.24	83.42	83.3	83.36
Swin transformer	88.56	88.73	88.62	88.68
ConvNeXt	89.67	89.84	89.72	89.78
LSTM-CNN	86.43	86.7	86.58	86.64
3D-CNN (I3D)	91.12	91.45	91.32	91.38

Additionally, the paper conducted comparative tests to assess the impact of our proposed preprocessing techniques by benchmarking against other widely used methods, including Histogram Equalization, CLAHE, Gaussian Noise Injection, and Edge Detection (Sobel and Canny filters). Moreover, the manuscript evaluated our feature-based data splitting approach against random data splitting and stratified splitting by labels, demonstrating that ensuring feature homogeneity across training and testing sets leads to superior model generalization.

To further isolate the contributions of different components, the research performed an ablation study by evaluating three scenarios: (1) Preprocessing alone without feature-based data splitting, (2) Feature-based splitting alone without preprocessing, and (3) The full proposed method. These experiments confirmed that the combination of preprocessing and feature-based splitting consistently outperforms individual components, highlighting their complementary roles in improving classification accuracy.

Finally, the research analyzed the impact of individual preprocessing steps, testing the effectiveness of PCA for dimensionality reduction, K-means clustering for feature grouping, and RandAugment for data augmentation as shown in Table 12. By systematically evaluating each component, we verified that the synergy of these techniques plays a crucial role in reducing overfitting and enhancing model performance. The results of these additional experiments reinforce the validity of our approach, ensuring its applicability to various deep learning architectures and real-world human action classification tasks.

Table 12.

Effect of each preprocessing component.

Preprocessing component	Accuracy (%)	F1-score (%)
No preprocessing	82.34	81.92
PCA Only	85.1	84.72
K-means only	86.45	86.12
RandAugment only	84.89	84.45
Full preprocessing	90.4	90.3

Limitations and potential failure cases

While the proposed feature-based data splitting and preprocessing approach improves model generalization and classification accuracy, certain limitations and potential failure cases should be considered. First, the effectiveness of feature-based data splitting depends on the quality of extracted features. If the extracted features fail to capture critical action-specific information, the clustering process may lead to suboptimal splits, impacting model performance. Additionally, in datasets where action classes exhibit high similarity (e.g., walking vs jogging), K-means clustering may struggle to effectively differentiate them, resulting in feature leakage across training and testing sets.

Another challenge arises in PCA-based dimensionality reduction, where selecting an optimal number of principal components is crucial. Reducing dimensions too aggressively may discard important action-relevant information, affecting classification accuracy. Moreover, for high-dimensional datasets, PCA increases computational complexity, making real-time processing more resource-intensive.

Failure cases often occur in scenarios with occlusions, extreme viewpoint variations, and low-resolution video data, where the extracted features become less distinguishable. Additionally, in datasets with high intra-class variance and low inter-class variance, models may misclassify similar actions due to overlapping feature spaces. Actions involving multiple interacting subjects (e.g., group activities and sports interactions) present further challenges, as the current method does not explicitly incorporate spatiotemporal relationships.

To address these limitations, future research should explore adaptive feature selection techniques to refine clustering, hybrid approaches combining feature-based and label-based splitting, and self-supervised learning methods to improve model robustness against occlusions and variations in action representation. Integrating attention-based mechanisms could further enhance classification performance in complex action recognition scenarios. By acknowledging these constraints, this study provides a clearer framework for practitioners and highlights promising directions for future improvements.

Conclusion and future work

The methods proposed in this study have successfully met the research objectives by improving model performance through a well-structured data preprocessing and feature-based data splitting strategy. Our experiments demonstrate that fine-tuning the ResNet50 model achieved an accuracy of 82.88%, approaching the performance of more complex models without requiring extensive computational resources. This underscores the effectiveness of the proposed feature-based data splitting method in enhancing model generalization while maintaining efficiency, making it particularly suitable for real-world applications where resource constraints are a key consideration.

Unlike traditional stratified data splitting by labels, which can be time-consuming and computationally demanding, our approach only needs to be applied once during the preprocessing stage, reducing the burden on subsequent model training and deployment. Furthermore, this method does not interfere with other training optimization techniques, ensuring compatibility with a wide range of deep learning architectures and datasets. Importantly, the success of this approach relies on high-quality feature extraction, reinforcing the need to utilize well-trained models on domain-relevant datasets to maximize learning efficiency and classification accuracy.

Future research should prioritize advancing data preprocessing and feature-based data splitting techniques rather than solely focusing on increasing model complexity. Specifically, future studies can explore adaptive data augmentation, improved clustering algorithms beyond K-means (e.g., DBSCAN and hierarchical clustering), and hybrid splitting methods that balance both label distribution and feature consistency. Additionally, expanding this method to larger and more diverse datasets will further validate its scalability across complex action classification tasks. Integrating semi-supervised or self-supervised learning approaches could also enhance model robustness, reducing dependency on large annotated datasets.

By shifting the focus toward refining data preparation strategies, we can achieve significant performance improvements without exponentially increasing computational costs, ensuring that deep learning models remain both efficient and effective for human action classification and broader computer vision applications.

Footnotes

ORCID iD

Thanh Q. Nguyen

Author contributions

NTD conceptualization, software, and writing – review and editing. HDT resources, formal analysis, investigation, and data curation. VTTL conceptualization, formal analysis, investigation, methodology, and writing – review and editing. HCT supervision, resources, and data curation. TQN supervision, writing, review and editing, and visualization.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

All relevant data supporting the findings of this study are available from the corresponding author upon reasonable request.

References

Kumar

Chauhan

Awasthi

LK.

Human activity recognition (HAR) using deep learning: review, methodologies, progress and future research directions. Arch Comput Methods Eng 2024; 31(1): 179–219.

Kirichenko

Sydorenko

Radivilova

, et al. Video surveillance shoplifting recognition based on a hybrid neural network. In: IEEE 17th international conference on computer Sciences and Information Technologies (CSIT), 2022, pp. 44–47. IEEE.

Khan

Mittal

Goyal

, et al. A deep survey on supervised learning based human detection and activity classification methods. Multimed Tools Appl 2021; 80: 27867–27923.

Silva

Soares

Esteves

, et al. Human action recognition using an image-based temporal and spatial representation. In: 12th international congress on ultra modern telecommunications and control systems and workshops (ICUMT), 2020, pp.41–46. IEEE.

Konak

Wischmann

De Water

, et al. A real-time human pose estimation approach for optimal sensor placement in sensor-based human activity recognition. In: Proceedings of the 8th international workshop on sensor-based activity recognition and artificial intelligence, 2023, pp.1–6. New York, NY: Association for Computing Machinery.

Camarena

Gonzalez-Mendoza

Chang

, et al. An overview of the vision-based human action recognition field. Math Comput Appl 2023; 28(2): 61.

Beddiar

Nini

Sabokrou

, et al. Vision-based human activity recognition: a survey. Multimed Tools Appl 2020; 79(41): 30509–30555.

Wang

Osaragi

Generating and understanding human daily. Travel Behav Soc 2024; 34: 100711.

Sherren

Chen

Mohammadi

, et al. Social media and social impact assessment: evolving methods in a shifting context. Curr Sociol 2024; 72(4): 629–648.

10.

Ahmed

Alam

MSB

Afrin

, et al. Insights into Internet of medical things (IoMT): data fusion, security issues and potential solutions. Inf Fusion 2024; 102: 102060.

11.

Saleem

Bajwa

Raza

RH.

Toward human activity recognition: a survey. Neural Comput Appl 2023; 35(5): 4145–4182.

12.

Zheng

Zhao

Zhang

, et al. MR-DCAE: manifold regularization-based deep convolutional autoencoder for unauthorized broadcasting identification. Int J Intell Syst 2021; 36(12): 7204–7238.

13.

Zheng

Tian

, et al. Robust automatic modulation classification using asymmetric trilinear attention net with noisy activation function. Eng Appl Artif Intell 2025; 141: 109861.

14.

Zhang

, et al. Research on multiscale performance prediction method for elastic composite materials based on data-driven approach. In: International conference on artificial intelligence for society, 2024, pp.469–478. Cham: Springer Nature Switzerland.

15.

Zhang

, et al. Prediction model of mechanical properties of elastic composites based on machine learning algorithm. In: International conference on artificial intelligence for society, 2024, pp.457–468. Cham: Springer Nature Switzerland.

16.

Zhang

Awasthi

Sung

, et al. Multi-scale model of effects of roughness on the cohesive strength of self-assembled monolayers. Int J Fract 2017; 208: 131–143.

17.

Nawal

Oussalah

Fergani

, et al. New incremental SVM algorithms for human activity recognition in smart homes. J Ambient Intell Human Comput 2023; 14(10): 13433–13450.

18.

San-Segundo

Montero

Moreno-Pimentel

, et al. HMM adaptation for improving a human activity recognition system. Algorithms 2016; 9(3): 60.

19.

Patricia

Enrico

Shariq

, et al. Machine learning applied to datasets of human activity recognition: data analysis in health care. Curr Med Imaging 2022; 19(1): 46–64.

20.

Parida

Mishra

, et al. A novel approach for human activity recognition using vision based method. In: First international conference on circuits, power and intelligent systems (CCPIS), 2023, pp.1–5. IEEE.

21.

Tome

Russell

Agapito

. Lifting from the deep: convolutional 3D pose estimation from a single image. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp.2500–2509. IEEE.

22.

Rahayu

Yuniarno

Purnama

IKE

, et al. Human activity classification using deep learning based on 3D motion feature. Mach Learn Appl 2023; 12: 100461.

23.

Ullah

Letchmunan

Zia

, et al. Analysis of deep neural networks for human activity recognition in videos – a systematic literature review. IEEE Access 2021; 9: 126366–126387.

24.

Muhammad

Ullah

Imran

, et al. Human action recognition using attention based LSTM network with dilated CNN features. Future Gener Comput Syst 2021; 125: 820–830.

25.

Kanjilal

Uysal

The future of human activity recognition: deep learning or feature engineering?

Neural Process Lett 2021; 53(1): 561–579.

26.

Bilal

Maqsood

Mehmood

, et al. An activity recognition framework for overlapping activities using transfer learning. In: International conference on computational Science and Computational Intelligence (CSCI), 2020, pp.701–705. IEEE.

27.

Kanjilal

Kucuk

Uysal

Sub-transfer learning in human activity recognition: boosting the outlier user accuracy. IEEE Sens J 2023; 23(20): 25005.

28.

Cui

Weng

Zhang

A feature extraction and machine learning framework for bearing fault diagnosis. Renew Energy 2022; 191: 987–997.

29.

Kuehne

Jhuang

Garrote

, et al. HMDB: a large video database for human motion recognition. In: International conference on computer vision, 2011, pp.2556–2563. IEEE.

30.

Cubuk

Zoph

Shlens

, et al. RandAugment: practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp.702–703. IEEE.

31.

Zhang

Ren

, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp.770–778. IEEE.

32.

Tan

EfficientNetV2: smaller models and faster training. In: International conference on machine learning, 2021, pp.10096–10106. PMLR.

33.

Howard

Sandler

Chu

, et al. Searching for MobileNetV3. In: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp.1314–1324. IEEE.

34.

Karl Pearson

FRS

. LIII. On lines and planes of closest fit to systems of points in space. Lond Edinb Dubl Philos Mag J Sci 1901; 2(11): 559–572.

35.

Dodge

. Kolmogorov–Smirnov test. In: The concise encyclopedia of statistics. New York, NY: Springer, 2008, pp.283–287.

36.

Wang

Koniusz

Self-supervising action recognition by statistical moment and subspace descriptors. In: Proceedings of the 29th ACM international conference on multimedia, 2021, pp.4324–4333. New York, NY: Association for Computing Machinery.

37.

Wang

Sun

Koniusz

. High-order tensor pooling with attention for action recognition. In: ICASSP 2024 – 2024 IEEE international conference on acoustics, speech and signal processing (ICASSP), 2024, pp.3885–3889. IEEE.

38.

Wang

Huang

Zhao

, et al. Videomae v2: scaling video masked autoencoders with dual masking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp.14549–14560. IEEE.

Optimizing human action classification: Data preprocessing and feature-based splitting strategies for enhanced model performance

Abstract

Keywords

Introduction

Related work

Traditional approaches

Deep-learning base

Train test split stratify by labels

Existing solution drawbacks

Innovation of our methods

Our methods

Complexity issues of investigated approaches

Computational complexity and memory requirements analysis

Computational complexity analysis

Memory requirements analysis

Dataset descriptions

Image preprocessing

Features extraction with pre-train models

Reduced feature dimension with PCA

Clustering of features with K-means

Dataset splitting strategy

Feature extraction in deep learning models

Experiment and results

Implementation details

Experiment description

Comparison of model training results between two data splitting methods

Comparison of model training results

Expanded experimental validation and comparative analysis

Limitations and potential failure cases

Conclusion and future work

Footnotes

ORCID iD

Author contributions

Funding

Conflicting interests

Data availability

References