Abstract
Objective
Gait analysis plays a critical role in healthcare, biomechanics, and sports science, particularly for estimating energy expenditure (EE). This study introduces a hybrid machine learning approach integrating convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and transfer learning (TL) to estimate volume of oxygen (VO2) and detect heel strikes (HS) using data from a single 9-axis inertial measurement unit (IMU).
Methods
A clinical-grade VO2 machine provided reference data for model training. The hybrid model was designed to combine spatial and temporal feature extraction capabilities from CNNs and LSTM networks while leveraging pre-trained weights through TL. The study compared the performance of the hybrid model with an LSTM-only approach to quantify improvements in VO2 prediction.
Results
The hybrid model significantly reduced the VO2 prediction error from 20% to 3% compared with the LSTM-only approach. Additionally, the model demonstrated high accuracy for HS detection, achieving 93.53% accuracy as indicated by training and validation results. The lightweight IMU-based system proved effective for VO2 estimation, offering a practical alternative to traditional VO2 measurement systems, which are often complex, bulky, and uncomfortable for subjects.
Conclusions
This study highlights the potential of a hybrid machine learning approach using IMU-based systems for accurate VO2 estimation and HS detection. While the results are promising, the evaluation is limited to datasets from 10 healthy subjects. Future work will require validation with more diverse datasets to enhance generalizability and robustness.
Introduction
Walking impairments resulting from stroke, musculoskeletal injuries, and neurological disorders significantly impact an individual's functional mobility and overall quality of life. To address these challenges, assistive devices have been developed, categorized into passive and active devices. Passive devices, such as orthotic splints and crutches, provide external support but are often limited by size, weight, and restricted degrees of freedom.1,2 On the other hand, active assistive technologies like exoskeletons and functional electrical stimulation (FES) have emerged as promising solutions for rehabilitation and mobility enhancement.3–5
A fundamental requirement for the optimal functioning of FES or exoskeleton systems is precise gait analysis, particularly the detection of critical gait events such as heel strikes (HS) and the estimation of EE.6,7 These parameters are essential for real-time control, adaptive interventions, and performance monitoring in both clinical and real-world environments.8,9
Conventional gait analysis methodologies primarily rely on optical motion capture systems and force plates, which, despite their high accuracy, suffer from limitations related to cost, complexity, and lack of portability. IMUs have emerged as a promising alternative, offering lightweight, wearable solutions for real-time gait event detection and EE estimation. However, existing research has predominantly addressed HS detection and EE estimation as separate tasks, often relying on large subject-specific datasets that limit generalizability. Furthermore, heuristic-based approaches and standard machine learning (ML) models have demonstrated variable accuracy due to data limitations and the complexity of human gait dynamics.
To provide a comprehensive overview of previous research in this domain, a comparative literature survey is presented in Table 1. This table summarizes key studies, their methodologies, key findings, and how they are compared to this work. To further strengthen the literature review, an additional dataset details column has been incorporated to highlight subject size and real-world applicability.
Summary of literature contributions.
This study presents a novel hybrid machine learning framework that integrates long short-term memory (LSTM) networks with TL to enhance the accuracy and robustness of HS detection and EE estimation using a single nine-axis IMU sensor. The proposed approach leverages TL to adapt pre-trained models to new subjects and diverse gait patterns, mitigating the challenge of limited datasets. By unifying HS detection and EE estimation into a single predictive model, this work provides a scalable solution with significant implications for real-time gait analysis in rehabilitation, sports science, and mobility assistance applications.
The key contributions of this paper are:
Unified model for HS detection and EE estimation: This study introduces an integrated ML framework that simultaneously predicts HS events and estimates EE, unlike prior approaches that treat these tasks independently.
Hybrid LSTM-transfer learning (TL) architecture: The incorporation of TL enhances the model's adaptability to new subjects and walking conditions with minimal additional training, addressing data scarcity challenges in clinical research.
Comparative performance analysis: The proposed model is benchmarked against existing heuristic and ML-based methods, demonstrating superior HS detection accuracy (93%) and a substantial reduction in volume of oxygen (VO2) prediction error (from 20% to 9%).
Rigorous statistical validation: A statistical evaluation using Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) supports the robustness and reliability of the proposed methodology.
Future directions for generalizability: While this study primarily evaluates healthy individuals, future research will extend the dataset to include individuals with gait impairments and various walking conditions to enhance real-world applicability.
The remainder of this paper is structured as follows. The Methods section details the underlying machine learning methodologies, sensor integration, and experimental protocols. The Results section presents the experimental results, followed by a critical discussion in the Results discussions section. Finally, the Conclusion section outlines conclusions and future research directions for enhancing the generalizability and clinical translation of the proposed approach.
Methods
This section recalls the concepts of TL.17,18 This technique can achieve reasonable accuracy with limited target data due to knowledge transfer from the source domain. Previous work 14 used a small dataset (collected from two adult male subjects) comprising data from inertial sensors (three-axis accelerations, three-axis angular velocities, and the roll, pitch, and yaw angular positions in a common reference frame) and data from a clinical-grade VO2 machine, which serves as the truth (reference) data for gait EE. Using LSTM, the authors of 14 estimated various gait cycle parameters such as HS, heel-off, and toe-off, and with these parameters as inputs (along with the IMU data) an LSTM network was trained to estimate gait EE (represented by the VO2 values). Figure 1 shows an example of the input data used for training the hybrid machine learning algorithm; it is used to estimate the HS parameter. Next, using these parameters, the horizontal and vertical walking speeds of the subjects were computed, which are finally used in the estimation of gait EE (represented by VO2 values).

Example of input data used for training TL algorithm.
Since the dataset was limited (only two subjects), no clear statistical conclusions could be drawn on the estimated EE (even though accuracy was close to 80% over most of the data). Obtaining subject data is difficult, especially in clinical settings. In this work, we build upon this initial model (obtained through LSTM) by improving it every time a new dataset is available, using TL. The combination of TL on top of an LSTM-generated model is a reasonable approach, as LSTM networks can be adapted to various gait patterns with enough labeled data. The use of TL can help mitigate the computational demands of LSTM networks, making them more suitable for real-time deployment as and when new data are available.
Transfer learning
The following steps are used to estimate the EE using a fusion of TL and LSTM techniques.
Step 1: Pre-training: Train the model on gait data collected from sensors, which provides information about step length, step frequency, joint angles, and other relevant features.
(a) Data representation: Let
(b) Model training: Train the LSTM model to minimize a loss function
Step 2: Feature extraction: Utilize the knowledge gained during pre-training phase to extract the most relevant features from new dataset. These extracted features are crucial for the accurate prediction and estimation of gait parameters, such as VO2 and HS. The LSTM model generates the weights W of the trained network.
Step 3: Transfer learning setup: Retain the weights W from the pre-trained model. This allows the model to leverage learned representations when encountering new data.
(a) Feature representation: Extract features from a new dataset
Here,
(a) Model adaptation: Fine-tuning adjusts the weights from the pre-trained model. The new weights
Step 5: Parameter estimation with newly available inertial sensor data: With the fine-tuned LSTM model, estimate the gait parameters, including VO2 and HS, from the new gait data.
Overall, the gait parameter prediction using transfer learning is expected to be more accurate. This is explained in the following steps:
a. Domain adaptation: Transfer learning allows knowledge from the source domain (the initial dataset) to inform learning in the target domain. The model retains the ability to recognize features relevant across different datasets.
b. Layer freezing: During fine-tuning, certain layers of the LSTM model can be frozen (not updated) to retain general features while updating others to adapt to the new task. This can be expressed as:
This comprehensive framework outlines the detailed mathematical steps involved in using IMU data with LSTM and transfer learning for predicting VO2 and HS.
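The pre-training, weight-retention, layer-freezing, and fine-tuning steps described in this section can be illustrated with a deliberately tiny example. This is a minimal sketch in plain Python, not the model used in this study: the two-parameter "network", the synthetic source and target data, and the names `w_feat`, `w_head`, `b_head`, and `train` are all hypothetical and chosen only to show the mechanics of freezing one part of a model while another part adapts.

```python
def predict(params, x):
    """Toy two-stage model: a 'feature' layer followed by an output 'head'."""
    feature = params["w_feat"] * x  # stands in for the pre-trained feature extractor
    return params["w_head"] * feature + params["b_head"]


def mse(params, data):
    """Mean squared error of the model over (x, y) pairs."""
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)


def train(params, data, lr, epochs, frozen=()):
    """Full-batch gradient descent on MSE; names listed in `frozen` are never updated."""
    params = dict(params)
    n = len(data)
    for _ in range(epochs):
        grads = {"w_feat": 0.0, "w_head": 0.0, "b_head": 0.0}
        for x, y in data:
            err = predict(params, x) - y
            grads["w_feat"] += 2 * err * params["w_head"] * x / n
            grads["w_head"] += 2 * err * params["w_feat"] * x / n
            grads["b_head"] += 2 * err / n
        for name, grad in grads.items():
            if name not in frozen:
                params[name] -= lr * grad
    return params


# Hypothetical data: a large source set and a small, shifted target set.
source = [(x / 10, 2.0 * x / 10 + 1.0) for x in range(-10, 11)]
target = [(x / 10, 2.6 * x / 10 + 0.7) for x in range(-10, 11, 5)]

init = {"w_feat": 0.5, "w_head": 0.5, "b_head": 0.0}
pretrained = train(init, source, lr=0.1, epochs=500)                # pre-training
finetuned = train(pretrained, target, lr=0.1, epochs=500,
                  frozen=("w_feat",))                               # freeze, then fine-tune

print("target MSE before fine-tuning:", mse(pretrained, target))
print("target MSE after fine-tuning: ", mse(finetuned, target))
```

In this sketch the frozen weight keeps the value learned on the source data, and only the head adapts to the five target samples, which mirrors the idea of reusing learned representations when little new data is available.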
Implementation methodology
The flowchart in Figure 2 shows the proposed hybrid machine learning framework for predicting VO2 using a wearable and efficient system. This methodology is built to address the limitations of traditional VO2 estimation systems by leveraging advancements in machine learning and sensor technology. The steps involved in the prediction of HS and VO2 are as follows:
Data collection: Data are acquired from three sources: a nine-axis IMU sensor that measures accelerations, angular velocities, and magnetic field intensities during movement; a clinical-grade VO2 machine, which provides ground-truth measurements of oxygen consumption for supervised learning; and a tap sensor, which captures specific gait-related events to complement the IMU data. This multi-modal data acquisition captures both spatial (postural or movement patterns) and temporal (sequential time-series) information relevant to gait analysis and oxygen consumption.
Data preprocessing: Preprocessing involves removing noise and synchronizing the data streams from IMU sensors, VO2 measurements, and HS events, so that all signals are temporally aligned. Features are normalized to mitigate scaling differences across modalities, ensuring compatibility with the machine learning algorithms.
Model training: The core of the framework is a hybrid machine learning model that combines convolutional neural networks (CNNs) and LSTM networks. CNNs extract spatial and temporal features from the sequential IMU data, focusing on local dependencies such as stride patterns or motion bursts. LSTM networks are incorporated to model long-term dependencies in the time-series data, capturing recurrent patterns like breathing rate changes or gait cycles. The hybrid architecture is designed to account for the hierarchical nature of movement data, where CNNs identify low-level movement characteristics and LSTMs model high-level, long-range dependencies.
Transfer learning: To generalize the model across different subjects, transfer learning is applied. Learned features from the CNN-LSTM network are fine-tuned using smaller datasets from new subjects or activity conditions. This reduces the dependency on extensive subject-specific training data, addressing variability and supporting scalability for broader applications.
VO2 prediction: After training, the model predicts VO2 values for new, unseen input data (e.g. data collected from a subject not included in the training dataset). The predicted VO2 outputs are compared with clinical ground-truth measurements to validate the system's accuracy.
Calibration, validation, and output visualization: Calibration involves fine-tuning the model's parameters to minimize prediction error, ensuring consistency with gold-standard VO2 machine outputs. The model is validated using statistical metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and correlation coefficients. This step supports reliability, especially in dynamic settings like sports or clinical trials. Predicted VO2 values, alongside other related metrics, are visualized in a user-friendly format so that end-users such as clinicians or sports professionals can interpret and use the results effectively. The outputs enable applications such as real-time fitness tracking, rehabilitation monitoring, and personalized athletic training.

Flow diagram for estimation of VO2 from IMU sensor data.
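The preprocessing stage (temporal alignment of the data streams and feature normalization) can be sketched as follows. This is an illustrative example only: the helper names `zscore` and `interpolate`, the 1 Hz reference rate, and the sample values are assumptions made for the sketch, and the study's actual synchronization and filtering procedure is not specified at this level of detail.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def zscore(values):
    """Scale one channel to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def interpolate(t_ref, y_ref, t_query):
    """Linearly interpolate a slow reference signal onto faster sensor timestamps."""
    out = []
    for t in t_query:
        if t <= t_ref[0]:
            out.append(y_ref[0])
        elif t >= t_ref[-1]:
            out.append(y_ref[-1])
        else:
            j = bisect_right(t_ref, t)
            t0, t1 = t_ref[j - 1], t_ref[j]
            w = (t - t0) / (t1 - t0)
            out.append(y_ref[j - 1] + w * (y_ref[j] - y_ref[j - 1]))
    return out


# Hypothetical streams: IMU sampled at 100 Hz for 2 s, reference VO2 at 1 Hz.
imu_t = [i / 100 for i in range(201)]
vo2_t = [0.0, 1.0, 2.0]
vo2 = [10.0, 12.0, 16.0]

vo2_on_imu_clock = interpolate(vo2_t, vo2, imu_t)   # one VO2 target per IMU sample
accel_z = zscore([9.6, 9.9, 10.4, 9.8, 9.7])        # one normalized IMU channel
```

After this step every IMU sample has a matching reference value, and each input channel has a comparable scale, which is the precondition for the supervised training described above.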
Model architecture
The proposed architecture is well-suited for sequential data tasks, such as physiological signal analysis where accurate representation of spatial–temporal dependencies is critical. The proposed model integrates CNNs and LSTM networks, capitalizing on their complementary strengths. The architecture is designed to effectively extract hierarchical spatial features while capturing temporal dependencies, making it ideal for tasks requiring high-dimensional feature analysis and temporal reasoning.
Model structure and design rationale
Figure 3 provides a detailed illustration of the architecture used in this paper, outlining its various components and their interactions. The model consists of an input layer, multiple hidden layers with activation functions, and an output layer, where each layer performs specific transformations on the data. Training is achieved through a loss function and optimization techniques like gradient descent, allowing the model to adjust its parameters for improved performance.

Training model architecture.
Feature extraction using convolutional layers
Conv1D layers: The initial layers utilize 1D convolutions with a progressively decreasing number of filters (512 and 256 filters, kernel size = 2) to extract spatial features from input sequences. These layers capture localized patterns, enabling hierarchical representation of features at different levels of granularity. The first Conv1D layer applies 512 filters (K = 2) to the input, followed by a second Conv1D layer with 256 filters, reducing feature dimensionality. The convolutional layers 19 aim to extract spatial features from the sequential input
X is the input,
W is the kernel/filter of size K,
MaxPooling layers: Following each convolutional operation, MaxPooling1D reduces dimensionality while retaining critical features, thereby improving computational efficiency and focusing on prominent patterns. The use of padding='same' ensures that spatial dimensions are preserved when necessary. MaxPooling1D is applied to reduce the temporal resolution. The pooling operation is defined as in equation (9):
where P is the pooling window size (here P = 2).
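The convolution and pooling operations can be made concrete with a single-channel example. The sketch below is an illustration in plain Python, assuming the usual deep learning convention (cross-correlation without kernel flipping) and assuming that 'same' padding for an even kernel places the extra zero on the right; the function names and the example signal are hypothetical.

```python
def conv1d_same(x, kernel, bias=0.0):
    """Single-channel 1D cross-correlation, zero-padded so output length equals input length."""
    k = len(kernel)
    pad_left = (k - 1) // 2
    pad_right = (k - 1) - pad_left
    padded = [0.0] * pad_left + list(x) + [0.0] * pad_right
    return [
        sum(kernel[j] * padded[t + j] for j in range(k)) + bias
        for t in range(len(x))
    ]


def max_pool1d(x, pool=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + pool]) for i in range(0, len(x) - pool + 1, pool)]


signal = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0]
kernel = [-1.0, 1.0]                     # K = 2, a first-difference filter

features = conv1d_same(signal, kernel)   # same length as the input
pooled = max_pool1d(features, pool=2)    # P = 2 halves the temporal resolution

print(features)  # [1.0, 2.0, -1.0, -2.0, -1.0, 1.0]
print(pooled)    # [2.0, -1.0, 1.0]
```

A real Conv1D layer applies many such kernels across all input channels at once (512 and 256 in the layers described above), but each output channel follows this same pattern.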
Batch normalization: Applied after each convolutional block, batch normalization ensures faster convergence and prevents the vanishing or exploding gradient problem by stabilizing activation distributions across layers. Batch normalization 20 normalizes the output as in equation (10):
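For reference, the standard training-time form of batch normalization can be sketched in a few lines. This is the textbook formulation, offered as an assumption about what the layer computes rather than a transcription of the paper's equation; `gamma`, `beta`, and `eps` are the usual learnable scale, learnable shift, and numerical-stability constant, and the default values here are illustrative.

```python
from math import sqrt


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalar activations to zero mean and unit variance,
    then apply a learnable scale (gamma) and shift (beta)."""
    mu = sum(batch) / len(batch)
    var = sum((v - mu) ** 2 for v in batch) / len(batch)
    return [gamma * (v - mu) / sqrt(var + eps) + beta for v in batch]


activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
```

At inference time, frameworks typically replace the per-batch mean and variance with running averages collected during training, so the output for a given sample no longer depends on the other samples in the batch.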
Temporal dependency modeling with LSTM
LSTM Layer: The model incorporates a single LSTM 21 layer with 128 units and ReLU activation to capture temporal dependencies inherent in sequential data. This enables the architecture to model long-term relationships within the data, which are often crucial for tasks like time-series forecasting or physiological signal processing.
The LSTM layer processes the output of the convolutional layers to model temporal dependencies. Given input sequences
Forget gate:
Input gate:
Cell update:
Output gate:
Hidden state:
Here, σ
Dropout regularization is applied to the LSTM outputs to reduce overfitting, defined as:
Activation and Non-Linearity: Additional ReLU activation layers ensure that the model captures complex non-linear patterns in the sequence data.
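The gate computations listed above (forget gate, input gate, cell update, output gate, hidden state) can be traced with a single-unit LSTM step. This is a sketch of the standard LSTM formulation with scalar weights, not the 128-unit layer used in the model; the parameter names in `p` are hypothetical. The `activation` argument defaults to tanh; passing a ReLU function mimics the ReLU setting mentioned above, under the assumption that ReLU replaces tanh for both the candidate value and the cell-state output.

```python
from math import exp, tanh


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def lstm_step(x, h_prev, c_prev, p, activation=tanh):
    """One time step of a single-unit LSTM with scalar input, hidden state, and cell state."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])            # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])            # input gate
    c_tilde = activation(p["wc"] * x + p["uc"] * h_prev + p["bc"])   # candidate cell value
    c = f * c_prev + i * c_tilde                                     # cell update
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])            # output gate
    h = o * activation(c)                                            # hidden state
    return h, c


def relu(z):
    return max(0.0, z)


# Illustrative weights only; a trained layer learns these values.
params = {"wf": 0.5, "uf": 0.1, "bf": 1.0,
          "wi": 0.8, "ui": 0.2, "bi": 0.0,
          "wc": 1.0, "uc": 0.5, "bc": 0.0,
          "wo": 0.3, "uo": 0.1, "bo": 0.0}

h, c = 0.0, 0.0
for x in [0.2, 0.9, -0.4, 0.1]:      # a short input sequence
    h, c = lstm_step(x, h, c, params, activation=relu)
```

The cell state `c` carries information across time steps, while the gates decide how much of the past to keep, how much new input to admit, and how much of the cell state to expose as the hidden state `h`.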
Dense layers for feature aggregation and decision making
Fully Connected Dense layers: The architecture includes a dense layer with 50 neurons to compress high-dimensional feature representations into a lower-dimensional space. This layer acts as a feature aggregator, synthesizing information from both convolutional and LSTM outputs.
Final output layer: The final dense layer outputs a single value, suitable for regression tasks, making the model applicable to problems like predicting continuous outputs from sequential data. Given input z, the dense layer output is in equation (17):
where
Hybrid extension using transfer learning
After initial training, a top-5-layer model is extracted to reuse the pretrained feature extraction capabilities, as shown in Figure 4. The top-5-layer model extracts features f from the pretrained model as in equation (18).

Hybrid CNN and LSTM model by leveraging transfer learning model architecture.
where g(⋅) represents the output of the frozen feature extractor.
This pre-trained model is then extended with an additional convolutional-LSTM pipeline comprising a Conv1D layer with 512 filters and kernel size 3 to refine the extracted features, an LSTM layer with 128 units to further model sequential dependencies, and additional dense layers, dropout regularization, and batch normalization to ensure robust training and prevent overfitting.
This hybrid extension leverages transfer learning by freezing the feature extraction layers (top-five-layer model), enabling efficient training on downstream tasks with limited computational resources or data.
The following points are the capabilities of the developed model:
Spatial and temporal feature learning: The model's combination of CNNs and LSTMs provides an optimal tradeoff between feature extraction and sequence modeling. CNNs effectively identify localized patterns, while LSTMs model temporal relationships across sequences.
Regularization and stability: Dropout and batch normalization layers improve generalization by addressing overfitting and stabilizing training dynamics, respectively.
Transfer learning efficiency: The two-stage training process (pretraining and hybrid extension) maximizes feature reuse, improving the model's performance on complex tasks while reducing computational overhead.
Experimental details
The deep learning model takes angular velocity, acceleration, and Euler angles as inputs. These inputs are obtained from healthy subjects walking at different speeds on a treadmill. The model predicts gait events such as HS events using IMU data. Table 2 shows the classification report for HS prediction using TL with LSTM. The model performs well, particularly for class 0, with high precision, recall, and F1 scores. While the performance for class 1 is slightly lower, the overall accuracy and weighted metrics suggest that the model is robust and effective for the given task. However, attention should be paid to the class imbalance, which may affect the model's performance on the minority class.
Classification Report for the prediction of HS from TL with LSTM.
Volume of oxygen computation
Gait EE computation involves measuring the EE of a person during an activity, typically assessed using a clinical-grade VO2 machine. This direct method requires wearing a mask to collect respiratory gases, determining the flow of inspired and expired O2 and CO2, and inferring metabolic rate from the obtained data. Although using a clinical-grade VO2 machine to measure the ground-truth VO2 is a clinically accepted method, it is not always feasible and can be expensive. To address this, an indirect EE computation system is proposed, utilizing concepts of deep learning (specifically transfer learning and LSTM) and measurements from an IMU sensor strapped to the subject.
The procedure for training the machine learning models and estimating EE involves:
Collecting VO2 data while the subject walks on a treadmill, simultaneously capturing data from an IMU fixed at the calf muscle. The collected VO2 data serve as ground truth, and the transfer learning model is trained to compute EE using IMU data, with VO2 as reference data.
Experimental setup
Figure 5 shows a subject wearing a mask and IMU sensors. In this work, only the IMU data from the sensor placed at the calf muscle are used, as the data from the other sensor did not correlate well with gait parameters (possibly due to improper placement). Data from a VO2 machine are taken as ground truth. The Bruce protocol is adopted during the experiment, as shown in Table 3.

Subject wearing VO2 mask (middle), a medical expert is adjusting the treadmill using Bruce protocol (left side of image) and doctor who is operating manual push button (right side of image).
Data collection stages as per the Bruce protocol.
The Bruce protocol is a widely used exercise protocol for measuring and collecting data on the maximal oxygen consumption (VO2 max) of individuals. It involves a series of progressively increasing workload stages that aim to bring the participant to their maximum exertion level. Here is a brief explanation of the data collection stages as per the Bruce protocol:
The treadmill protocol involves five stages: starting with a 1-min walk at 2.575 km/h on a flat surface, then increasing to 4.345 km/h with a 10% incline for 3 min, followed by a 12% incline at the same speed for another 3 min. It then advances to 6.437 km/h with a 14% incline for 3 min and concludes with a cool down at 0 km/h and 0% incline.
During each stage, various physiological parameters are measured and recorded, including heart rate, blood pressure, respiratory rate, and possibly expired gases. Additionally, the participant's perceived exertion level may be assessed using a standardized scale, such as the Borg Rating of Perceived Exertion. By following this modified Bruce protocol and systematically progressing through the stages, researchers can collect valuable data on cardiovascular responses and estimate an individual's maximal oxygen consumption (VO2 max). During the experiment, measurements are also made with the IMU sensor, which is set to a 100 Hz sampling rate. This results in a dataset of close to 100,000 samples per subject, which is sufficient for training the model using TL.
The new datasets are collected from eight additional healthy subjects, with details shown in Table 4. The data from the two subjects used to train the original LSTM model in 14 are retained in this work, and data from three subjects are used for training the transfer learning model. The remaining subject data are used for testing.
Subject's details.
The theoretical way of finding VO2 using IMU data is suggested by the ACSM approach. 23 Determining EE based on the ACSM method involves computing the oxygen consumption (VO2) using equation (19):
where,
RMR = Resting Metabolic Rate calculated using equation (20),
VM = Vertical Movement calculated using equation (21)
HM = Horizontal Movement calculated using equation (22).
The resting metabolic rate, which contributes in addition to HM and VM, is calculated using,
Specifically, for walking activity, 23
Both HM and VM are dependent on a person's weight, height, and speed of movement. VM computation also considers movement uphill. The speed can be directly estimated using the stride lengths through the equation (23):
where k is a constant (0.415 for males and 0.413 for females) and Savg is the average number of strides per second. 23
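To give a sense of the magnitudes involved, the widely used textbook form of the ACSM walking equation can be evaluated for the treadmill stages described earlier. This sketch is an assumption for illustration: it uses the common form VO2 = 3.5 + 0.1·S + 1.8·S·G (mL/kg/min, with speed S in m/min and grade G as a fraction), which may differ in form from equations (19) to (22) of this paper, and the function name is hypothetical.

```python
def acsm_walking_vo2(speed_km_h, grade_percent):
    """Textbook ACSM walking equation; returns VO2 in mL/kg/min."""
    speed_m_min = speed_km_h * 1000.0 / 60.0      # km/h -> m/min
    grade = grade_percent / 100.0                 # percent -> fraction
    resting = 3.5                                 # resting component
    horizontal = 0.1 * speed_m_min                # horizontal movement component
    vertical = 1.8 * speed_m_min * grade          # vertical movement component
    return resting + horizontal + vertical


# Active treadmill stages from the protocol description: (speed in km/h, incline in %).
stages = [(2.575, 0.0), (4.345, 10.0), (4.345, 12.0), (6.437, 14.0)]
theoretical = [acsm_walking_vo2(speed, incline) for speed, incline in stages]

for (speed, incline), vo2 in zip(stages, theoretical):
    print(f"{speed} km/h at {incline}% incline: {vo2:.2f} mL/kg/min")
```

In practice the treadmill speed is not read from the protocol but estimated from the IMU-derived stride information, so any error in the speed estimate propagates directly into the horizontal and vertical terms.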
This section presents an approach to estimating EE during gait by combining TL with LSTM networks. It builds on prior work in which an LSTM model estimated gait parameters such as HS using data from IMUs and a clinical-grade VO2 machine. The main challenge addressed is the limited availability of subject data, especially in clinical settings, which is mitigated by applying TL. This allows the model to be continuously improved as new data become available, reducing the need for large training datasets. The approach consists of several steps: pre-training an LSTM model on initial data, extracting features, retaining learned weights from the pre-trained model, and fine-tuning the model on new datasets. By leveraging TL, the model adapts to new subjects and gait patterns, making it more efficient for real-time deployment. The hybrid architecture combines CNNs and LSTMs, which enables both spatial and temporal feature extraction. Additionally, the proposed methodology supports scalability and robustness by using multi-modal data from IMU sensors, VO2 machines, and gait sensors. Through this framework, the model can predict VO2 values and assess EE during walking activities, offering a viable solution for real-time and personalized health monitoring.
Results
This study evaluates the performance of a hybrid CNN-LSTM model in predicting HS events and estimating VO2 using single IMU sensor data. The results demonstrate the model's ability to accurately capture gait dynamics and physiological parameters, validated through statistical metrics and comparative analysis. Comparative analysis with an LSTM-only model demonstrates the superiority of the hybrid approach, leveraging CNN's spatial feature extraction combined with LSTM's temporal learning capabilities. The following sections provide a detailed breakdown of the model's performance for HS detection and VO2 estimation.
Heel strike prediction performance
The performance of the proposed hybrid CNN-LSTM model for HS detection was evaluated using key classification metrics. The model achieved an accuracy of 93.53%, demonstrating its robustness in correctly identifying HSs. The precision, recall, and F1-score were 85.87%, 85.59%, and 85.73%, respectively, indicating a balanced predictive capability with minimal bias towards false positives or false negatives, as shown in Figure 6(a) and (b).

Hybrid CNN-LSTM model performance metric for HS prediction.
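The relationship between these classification metrics can be checked with a short calculation. The sketch below uses the standard definitions; the confusion-matrix counts are hypothetical and chosen only to illustrate an imbalanced two-class problem, while the final line uses the precision and recall reported above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: heel strikes (positive class) are rare compared with non-events.
example = classification_metrics(tp=80, fp=20, fn=10, tn=890)

# Consistency check on the reported values (in percent).
reported_f1 = f1_from_precision_recall(85.87, 85.59)
print(round(reported_f1, 2))  # 85.73
```

The reported F1-score matches the harmonic mean of the reported precision and recall. The hypothetical counts also show why accuracy can sit well above F1 when the negative class dominates, which is relevant to the class imbalance noted for the HS labels.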
Statistical analysis comparing the Hybrid CNN-LSTM model with the LSTM-only approach yielded a t-statistic of 0.3355 and a p-value of 0.7372 for HS prediction, which indicates no statistically significant difference in overall performance variability. The improvements in precision and recall nevertheless suggest better robustness in real-world gait conditions (Table 5).
Performance metrics for HS prediction and VO2 estimation using hybrid CNN and LSTM model.
Table 6 compares HS detection accuracy across different studies, showing that the proposed Hybrid CNN-LSTM model achieved 93.53% accuracy using a single IMU, outperforming previous IMU-based models.
Comparison of HS detection accuracy estimation.
VO2 estimation performance
For VO2 estimation, the Hybrid CNN-LSTM model demonstrated superior accuracy and robustness compared to the LSTM-only approach, as detailed in Table 5.
The Hybrid CNN-LSTM model significantly reduced the error rates (MSE, RMSE, MAE) while achieving a higher R² value (0.9934), indicating a better fit for VO2 estimation. Statistical analysis yielded a t-statistic of 2.1258 and a p-value of 0.0336, indicating a statistically significant improvement over the LSTM-only approach.
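The regression metrics named here follow their standard definitions, which the sketch below computes in plain Python. The reference and predicted values are hypothetical numbers chosen for easy hand-checking and are not data from this study.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and coefficient of determination (R^2)."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)     # total sum of squares
    return {"mse": mse, "rmse": sqrt(mse), "mae": mae, "r2": 1.0 - ss_res / ss_tot}


# Hypothetical reference and predicted VO2 values (mL/kg/min).
vo2_reference = [10.0, 12.0, 14.0, 16.0]
vo2_predicted = [11.0, 12.0, 13.0, 16.0]

metrics = regression_metrics(vo2_reference, vo2_predicted)
print(metrics)  # mse 0.5, rmse about 0.707, mae 0.5, r2 0.9
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a sense of whether the residual error is spread evenly or concentrated in a few samples.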
To evaluate the effectiveness of the proposed method in estimating VO2, a comparative analysis was conducted against existing VO2 prediction models and summarized in Table 7.
Comparison of VO2 estimation with existing prediction models.
Figure 7 further compares the VO2 predictions using different techniques. Figure 7(a) shows theoretical VO2 values calculated using established equations (see reference 14 ), while Figure 7(b) depicts VO2 values predicted using the basic LSTM model. The results reveal notable deviations and errors in the LSTM predictions. However, when transfer learning was applied, as shown in Figure 7(c), the predictions closely follow the ground truth, demonstrating the effectiveness of the deep learning approach.

Prediction VO2 using theoretical calculation, LSTM, and transfer learning with LSTM.
Error analysis of VO2 estimation
The percentage error is determined using the equation:
The percentage error for VO2 prediction is illustrated in Figure 8(a) and (b). The basic LSTM model produced a maximum error of 11%, with average errors around 3%–4%. This reduces the percentage error from the 20% reported previously (see reference 14) to 11%, demonstrating the importance of training on a larger dataset for improving predictive accuracy. The error graph of the Hybrid CNN-LSTM model demonstrates significant advantages over the LSTM-only approach. The proposed model is particularly effective at minimizing prediction errors and improving generalization, and it provides enhanced stability and robustness across different VO2 estimation scenarios.

Percentage error plots for the predicted value of VO2.
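The maximum and average percentage errors quoted in this section can be obtained from per-sample errors as sketched below. This assumes the common absolute form, |predicted − reference| / reference × 100; whether the paper's equation uses the absolute or signed difference is not recoverable from the text, and the sample values are hypothetical.

```python
def percentage_error(predicted, reference):
    """Absolute percentage error of one prediction relative to its reference value."""
    return abs(predicted - reference) / abs(reference) * 100.0


# Hypothetical reference and predicted VO2 values (mL/kg/min).
reference = [20.0, 25.0, 40.0]
predicted = [19.0, 25.5, 41.2]

errors = [percentage_error(p, r) for p, r in zip(predicted, reference)]
max_error = max(errors)
mean_error = sum(errors) / len(errors)

print([round(e, 2) for e in errors])              # [5.0, 2.0, 3.0]
print(round(max_error, 2), round(mean_error, 2))  # 5.0 3.33
```

Because the reference value appears in the denominator, the same absolute deviation produces a larger percentage error at low VO2 (for example during the initial flat walk) than at high VO2.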
Results discussions
A statistical comparison between the two models was conducted using t-tests to determine whether the observed differences were statistically significant.
Heel Strike Prediction: The t-statistic (0.3355, p = 0.7372) suggests no significant difference in overall variability, though improvements in precision and recall indicate better performance in detecting true positive HS events.
VO2 Estimation: The Hybrid CNN-LSTM model exhibited a statistically significant improvement in VO2 prediction, with a t-statistic of 2.1258 and a p-value of 0.0336, indicating enhanced model reliability.
The results of this study emphasize the advantages of using transfer learning with LSTM for predicting VO2 values and detecting HS, particularly when dealing with limited datasets. While the LSTM model demonstrated potential for VO2 prediction, it exhibited significant errors due to the complexity of the task and the limited data available. LSTM networks are powerful tools for analyzing sequential data, but they often require large amounts of training data to generalize effectively. When data are insufficient, as in this study, LSTM models may struggle to capture the intricate patterns in the dataset, resulting in higher prediction errors. However, the reduction in percentage error from 20% to 11% shows that training on a larger dataset improves the accuracy of the LSTM model.
The application of transfer learning proved to be more effective, as it allowed the model to generalize better despite the small dataset size. By utilizing pre-trained weights from the top layers of the base LSTM model, transfer learning improved prediction accuracy and reduced errors. This is particularly advantageous when working with datasets containing limited subject variability, as transfer learning leverages prior knowledge to enhance model performance on new, unseen data.
The reduction in percentage error from 20% in previous studies 14 to 11% in this study, achieved using LSTM, and further to a maximum error of 3% from the integration of customized transfer learning techniques with LSTM is a key contribution of this work. This customized method is driving significant improvements in accuracy and performance. Furthermore, the increase in dataset size played a critical role by enabling the model to learn from a broader range of samples, thereby capturing the underlying data patterns more effectively.
In summary, the Hybrid CNN-LSTM model outperforms the LSTM-only approach in both HS prediction and VO2 estimation, offering enhanced accuracy, precision, and generalizability. These findings underscore the efficacy of integrating deep learning models for real-time gait analysis and metabolic estimation, supporting their application in wearable and rehabilitation technologies.
Limitations
Despite these promising results, the study has limitations that must be acknowledged. The dataset used consists of only ten subjects
Conclusion
This study demonstrates the potential of a hybrid machine learning model combining CNNs, LSTM networks, and transfer learning for estimating HS and EE (VO2) using data from a single nine-axis IMU. By leveraging transfer learning, the model reduced the maximum VO2 prediction error to 3%, highlighting its ability to address the challenges of accuracy and practical application in clinical and sports settings. Despite these advancements, the study's limited dataset size and subject-specific evaluation constrain its generalizability to broader populations, warranting further validation with larger, more diverse datasets.
Future work should focus on integrating advanced techniques such as Generative Adversarial Networks (GANs) for synthetic data generation. Their fusion with LSTM networks offers exciting possibilities to overcome data scarcity and enhance prediction accuracy.
Footnotes
Acknowledgements
The authors would like to thank Dr Karthikbabu S (KMCH, Coimbatore) for his valuable insights into the experimental set up and Manipal Hospital (Bangalore) for providing the facility for data collection.
Ethical considerations
This study was conducted using data collected from healthy participants, and no patient data were involved. Informed consent was obtained from all participants prior to data collection. As the study did not involve invasive procedures, sensitive personal information, or clinical interventions, formal ethical approval from an institutional ethics committee was not sought. The research adhered to the ethical principles outlined in the Declaration of Helsinki to ensure that participant rights and welfare were respected throughout.
Author contributions/CRediT
All authors have made substantial contributions to the conception and design of the study, data acquisition, analysis, and interpretation of data, and have drafted the manuscript or revised it critically for important intellectual content. All authors have read and approved the final version of the manuscript, and agree to be accountable for all aspects of the work.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
