Abstract
Objective
Accurate assessment of physiotherapy exercises is critical for effective rehabilitation, particularly for elderly and mobility-impaired individuals. While telerehabilitation offers a viable alternative to in-clinic supervision, existing approaches often rely on single-modality sensors, limiting robustness and adaptability. This study aims to develop a multimodal, markerless framework for reliable home-based physiotherapy exercise recognition.
Methods
A deep learning–based multimodal framework is proposed that integrates synchronized RGB and depth streams. From RGB data, two-dimensional keypoints, semantic body-part labels, and contour-based visual descriptors are extracted. Depth silhouettes are used to estimate three-dimensional joint positions and reconstruct full-body meshes using the Skinned Multi-Person Linear model, along with global shape descriptors such as Zernike moments. Multimodal features are fused and refined using Kernel Fisher Discriminant Analysis, followed by classification using a Graph Convolutional Network to capture spatial and temporal relationships.
Results
The proposed framework was evaluated on three publicly available rehabilitation datasets: KIMORE, mRI, and UTKinect-Action3D. The system achieved classification accuracies of 95.30%, 92.70%, and 95.59%, respectively, demonstrating consistent performance across diverse rehabilitation-oriented benchmarks.
Conclusions
The results suggest that integrating complementary RGB and depth-based representations can enhance robustness and accuracy in physiotherapy exercise recognition under home-based settings. The proposed framework shows potential for supporting accessible telerehabilitation, while future work will focus on broader validation and practical deployment considerations.
Keywords
Introduction
Rehabilitation plays a vital role in restoring motor function and physical independence in elderly and disabled individuals. 1 Traditional physiotherapy models rely heavily on repeated, supervised clinical sessions, which are often inaccessible to patients living in remote, rural, or resource-constrained environments. 2 The lack of real-time monitoring and personalized feedback further increases the risk of incorrect exercise execution, potentially delaying recovery or causing secondary injuries. In light of the increasing demand for accessible and intelligent healthcare, there is a growing need for automated, vision-based rehabilitation systems that can operate reliably outside clinical settings.3,4
Telerehabilitation has emerged as a promising alternative, offering remote exercise supervision through camera-based sensing technologies.5,6 However, most existing systems are constrained by single-modality designs. RGB cameras, while visually rich, are highly sensitive to lighting variations, background clutter, and raise privacy concerns in home environments.7,8 Depth cameras, by contrast, offer geometric clarity and preserve privacy through silhouette abstraction, yet they often lack fine-grained visual detail and are susceptible to occlusion and sensor noise. These modality-specific limitations reduce the reliability and scalability of standalone systems.9,10
Recent research highlights the benefits of integrating RGB and depth modalities to capture both appearance-based and structural cues.11–13 Nevertheless, many existing solutions remain narrowly focused on joint angle estimation or skeletal tracking and fail to encode holistic body configuration or motion dynamics. 14 Moreover, global pose attributes such as limb symmetry, silhouette shape, and spatial coherence, critical for evaluating exercise correctness in real-world scenarios, are often overlooked. This gap underscores the need for a comprehensive multimodal framework capable of modeling both localized joint behavior and global body posture over time.
This study proposes a deep learning–based multimodal framework for physiotherapy exercise recognition using synchronized RGB and depth streams. RGB images are processed to extract 2D keypoints, semantic body part labels, and contour-based visual descriptors. In parallel, depth silhouettes are used to compute 3D joint positions, reconstruct parametric human meshes via the Skinned Multi-Person Linear (SMPL) model, and derive global shape descriptors using Zernike moments. These complementary features are fused into a unified representation that encodes both anatomical structure and dynamic motion characteristics. To enhance class separability, Kernel Fisher Discriminant Analysis (KFDA) is applied for feature refinement. The resulting discriminative features are classified using a Graph Convolutional Network (GCN) to support rehabilitation assessment in sensor-free, home-based settings.
The key contributions of this work include:
Multimodal RGB–Depth Framework for Markerless Motion Capture: A unified system that fuses RGB imagery and depth silhouettes to enable accurate, sensor-free physiotherapy exercise recognition.
Adaptive 3D Anatomical Keypoint Localization (PoseKP-L/R): A novel silhouette-aware lightweight method for extracting 3D joint landmarks from depth maps, dynamically adjusting to upper-limb variations for precise anatomical mapping.
Depth-to-Mesh Reconstruction for Structural Pose Modeling: Generation of full-body 3D meshes from depth silhouettes using the novel lightweight PoseKP-L/R modules and SMPL, providing rich spatial and postural context without requiring physical markers or suits.
Multilevel Feature Fusion and Discriminative Learning for Robust Classification: Integration of 2D keypoints (via FAST and Harris–Laplace), 3D mesh geometry, and handcrafted descriptors including Zernike moments and Gabor filters, followed by feature refinement using KFDA. The resulting discriminative features are classified using a GCN, enabling accurate, rotation-invariant, and temporally coherent exercise recognition.
The paper is structured as follows: “Literature review” section reviews related work, “System methodology” section details the system design, “Results and performance evaluation” section presents results, and “Conclusion and future direction” section concludes.
Literature review
Human Action Recognition (HAR) is a foundational task in telerehabilitation, enabling automated monitoring of patient movements for remote physiotherapy assessment. Traditional approaches have largely relied on wearable sensors, such as Inertial Measurement Units (IMUs), for capturing motion data with high temporal precision. García-de-Villa et al. 15 used multiple IMUs to recognize rehabilitation exercises with 88.4% accuracy, while Zhang et al. 16 demonstrated that a single waist-mounted IMU paired with a hybrid CNN–RNN architecture could monitor elderly activity with reasonable accuracy. Despite their effectiveness, these methods require users to wear devices, limiting practicality and comfort in nonclinical settings.
To address these limitations, vision-based HAR systems have gained traction, particularly those utilizing RGB or depth cameras. Kinect-based systems have shown early promise in exercise monitoring and fall detection but suffer from environmental dependencies such as poor lighting and occlusion sensitivity.17–19 As a result, RGB-based deep learning models have emerged, utilizing CNNs, LSTMs, and more recently, Transformers. Gupta et al. 20 and Li et al. 21 demonstrated the viability of RGB-only action recognition pipelines, though both approaches were affected by background noise and camera placement. Wang et al. 22 and Qiu et al. 23 attempted to mitigate these challenges using OpenPose and silhouette-guided dual-stream CNNs, but segmentation dependency and frame drop sensitivity remained persistent issues.
Recent studies have explored multimodal and hybrid HAR models, combining RGB, depth, and skeletal features to enhance robustness. Miao et al. 24 introduced a transformer-based HAR model using RGB and joint data, improving classification accuracy but remaining sensitive to keypoint extraction noise. Zanfir et al. 25 and Kanazawa et al. 26 advanced the field by incorporating 3D human mesh reconstruction using SMPL, enabling richer pose and shape modeling from monocular RGB input. Their work laid the foundation for markerless physiotherapy assessment by encoding both kinematic and structural information. Yi et al. 27 and Kocabas et al. 28 further demonstrated that temporally coherent SMPL predictions improved activity recognition accuracy under occlusion and motion blur.
Parallel research in depth-based and silhouette-driven HAR systems has shown strong potential for telerehabilitation. A study on depth-based HAR achieved 91% exercise classification accuracy and 82% correctness assessment by fusing skeletal and silhouette features. 29 Similarly, monocular pose estimation frameworks such as BlazePose and others have also demonstrated comparable performance to Kinect in controlled settings,30,31 although challenges remain in occlusion handling and depth ambiguity.
Several multi-sensor fusion techniques have also been proposed, combining RGB, depth, and inertial data through advanced feature integration methods such as DLRF, Genetic Algorithms, and spatiotemporal descriptors.32,33 While these approaches improve viewpoint robustness and classification accuracy, they often suffer from high computational overhead and limited applicability in home-based telerehabilitation. Tele-EvalNet 34 and other real-time AI-driven feedback systems have addressed aspects of movement quality prediction and exercise execution scoring, yet these systems remain limited by degraded skeleton tracking performance under occlusion and nonfrontal orientations.
In summary, existing HAR methods have made significant strides through deep learning and sensor fusion, yet face persistent challenges in real-world telerehabilitation scenarios. Most methods either lack integration of structural and appearance features or depend on sensor-heavy configurations unsuitable for home deployment. To address these limitations, this study proposes a multimodal HAR framework that fuses complementary appearance-based and structural features from RGB and depth streams. By integrating Zernike moment-based silhouette descriptors, SMPL-based 3D body modeling, and KFDA coupled GCN-based classification, the proposed system delivers accurate, occlusion-robust, and scalable performance for real-world telerehabilitation scenarios.
System methodology
This work presents a unified telerehabilitation framework that integrates both depth and RGB video analysis for accurate exercise monitoring in individuals with disabilities. The depth stream is processed using the Depth-based Keypoint Extraction and Mesh (D-KEM) pipeline for silhouette enhancement, 3D keypoint extraction, SMPL mesh reconstruction, and Zernike-based shape encoding, providing stable geometric structure and explicit spatial cues that are largely invariant to viewpoint changes and lighting conditions. Simultaneously, the RGB stream is analyzed through the Silhouette-based Tracking, Regional Analysis, and Parsing using RGB imagery (STRAP-RGB) pipeline, which performs silhouette segmentation, keypoint detection using FAST and Harris–Laplace, and feature extraction via Laws’ Texture Energy (LTE), Local Ternary Pattern (LTP), and Gabor filters, followed by semantic body-part labeling to capture fine-grained motion dynamics and appearance information. The integration of traditional and deep-learning modules has been carefully designed to exploit the complementary strengths of both modalities, combining the interpretability and texture-level robustness of handcrafted descriptors, particularly effective under occlusion or illumination variations, with the high-level semantic understanding and spatiotemporal abstraction of deep architectures. This complementary fusion allows depth cues to compensate when RGB information degrades, and vice versa, resulting in a balanced and cohesive system with enhanced robustness. A system overview is shown in Figure 1.

Unified RGB-depth telerehabilitation pipeline integrating Depth-based Keypoint Extraction and Mesh (D-KEM) and Silhouette-based Tracking, Regional Analysis, and Parsing using RGB imagery (STRAP-RGB) illustrating the integration of complementary visual and depth information for multimodal physiotherapy exercise recognition.
Data preprocessing and cross-modal synchronization
Before initiating the D-KEM and STRAP-RGB pipelines, all input recordings underwent a unified preprocessing stage to harmonize temporal and spatial parameters across modalities and datasets. Each RGB–Depth sequence was temporally aligned using sequential timestamps, ensuring that every RGB frame corresponded precisely to its depth counterpart captured at the same instant. To achieve temporal uniformity across datasets, the aligned streams were resampled to a common frame rate of 15 frames per second, and all sequences were standardized to a maximum duration of 260 frames for computational consistency.
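For concreteness, the alignment and resampling step can be sketched in Python as follows; the nearest-neighbour pairing strategy, function names, and array layout are illustrative assumptions rather than the exact implementation:

```python
import numpy as np

def align_and_resample(rgb_ts, depth_ts, target_fps=15, max_frames=260):
    """Pair each RGB frame with the nearest-in-time depth frame, then
    resample the paired stream to a uniform target frame rate."""
    # Nearest-neighbour pairing of depth timestamps to RGB timestamps.
    idx = np.searchsorted(depth_ts, rgb_ts)
    idx = np.clip(idx, 1, len(depth_ts) - 1)
    prev_closer = (rgb_ts - depth_ts[idx - 1]) < (depth_ts[idx] - rgb_ts)
    pairs = np.where(prev_closer, idx - 1, idx)

    # Uniform resampling grid at the target rate, capped at max_frames.
    duration = rgb_ts[-1] - rgb_ts[0]
    n_out = min(int(duration * target_fps) + 1, max_frames)
    grid = rgb_ts[0] + np.arange(n_out) / target_fps
    rgb_sel = np.searchsorted(rgb_ts, grid).clip(0, len(rgb_ts) - 1)
    return rgb_sel, pairs[rgb_sel]  # indices into the RGB and depth streams
```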
Subsequently, all frames were spatially scaled to 256 × 256 pixels, establishing a consistent spatial resolution across datasets of varying native sizes. For intensity normalization, depth maps were converted to 16-bit precision and normalized using a Min–Max transformation defined in (1), ensuring consistent depth scaling and preservation of distance granularity:

D̂(x, y) = (D(x, y) − D_min) / (D_max − D_min)
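A minimal NumPy sketch of this normalization, assuming zero-valued pixels mark invalid depth readings, is:

```python
import numpy as np

def normalize_depth(depth, out_dtype=np.uint16):
    """Min-Max normalize a raw depth map to the full 16-bit range,
    ignoring zero (invalid) pixels so they stay at zero."""
    d = depth.astype(np.float64)
    valid = d > 0
    d_min, d_max = d[valid].min(), d[valid].max()
    norm = np.zeros_like(d)
    norm[valid] = (d[valid] - d_min) / (d_max - d_min)  # scale to [0, 1]
    return (norm * np.iinfo(out_dtype).max).astype(out_dtype)
```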
Depth-based Keypoint Extraction and Mesh
We propose D-KEM modeling, a modular pipeline for 3D human body reconstruction from depth images. The process begins with preprocessing to enhance silhouette quality by removing noise, eliminating the floor, and improving contrast. Keypoints are then extracted using two pose-aware modules, PoseKP-L (Pose-aware KeyPoint extractor – Lowered Arms posture) and PoseKP-R (Pose-aware KeyPoint extractor – Raised Arms posture), designed for arms-down and arms-raised configurations, respectively. These modules rely on classical image processing techniques such as contour analysis and curvature detection to localize anatomical landmarks in the segmented depth silhouettes.
Preprocessing
This study utilizes the RANSAC algorithm to eliminate the floor from depth images by fitting a planar model to the identified floor points. Initially, pixels corresponding to the floor are detected based on their depth values, and a binary mask is employed to ensure that only foreground elements are considered. From these, a set of 3D coordinates representing the floor is constructed, as defined in (3):

P_floor = {(x_i, y_i, z_i) | M(x_i, y_i) = 1}

Here, z denotes the depth value, and the binary mask M ensures separation of foreground and background. RANSAC is then used to estimate a planar surface that best approximates the floor points, represented by (4):

ax + by + cz + d = 0

Once the floor plane is estimated, any pixel whose residual with respect to this plane is less than a predefined threshold ε is classified as part of the floor and is consequently removed, as given in (5):

|ax_i + by_i + cz_i + d| < ε
This approach effectively removes the floor while preserving the integrity of the remaining depth values, as shown in Figure 2. The resulting floor-removed depth image is then passed to the subsequent enhancement stages.
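The RANSAC floor-removal step can be sketched as follows; the iteration count, residual threshold, and sampling details are illustrative:

```python
import numpy as np

def remove_floor(points, n_iters=200, eps=0.02, rng=np.random.default_rng(0)):
    """Fit a plane a*x + b*y + c*z + d = 0 to candidate floor points with
    RANSAC and return a boolean mask of inliers (pixels to discard).
    `points` is an (N, 3) array of 3D floor-candidate coordinates."""
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        # 1. Sample three points and derive the candidate plane normal.
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(normal) < 1e-9:   # degenerate (collinear) sample
            continue
        normal /= np.linalg.norm(normal)
        # 2. Residual = perpendicular distance of every point to the plane.
        residuals = np.abs((points - p0) @ normal)
        inliers = residuals < eps
        # 3. Keep the plane supported by the most inliers.
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```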

Depth image preprocessing results showing (a) original depth frames; (b) corresponding frames after floor removal.

Depth image enhancement stages including (a) floor-removed frames, (b) normalized depth maps, (c) contrast enhancement using Contrast Limited Adaptive Histogram Equalization (CLAHE), and (d) final outputs after bilateral filtering.
To further improve local contrast while avoiding overamplification of noise, CLAHE (Contrast Limited Adaptive Histogram Equalization) is applied, as shown in Figure 3(c). The transformation remaps each local tile through a clipped cumulative distribution function, which limits contrast amplification in homogeneous regions.
To enable compatibility with color-based processing techniques, the enhanced grayscale image G is replicated across three channels, resulting in a pseudo-color image as shown in (8):

I(x, y) = [G(x, y), G(x, y), G(x, y)]
Finally, bilateral filtering is applied to suppress noise while preserving important edges, as illustrated in Figure 3(d). This process is governed by (9):

I_f(x) = (1/W_x) Σ_{x_i ∈ Ω} I(x_i) · G_{σ_d}(‖x − x_i‖) · G_{σ_r}(|I(x) − I(x_i)|)

In this equation, d = ‖x − x_i‖ represents spatial distance, r = |I(x) − I(x_i)| indicates intensity difference, and σ_d and σ_r control the spatial and range smoothing strengths, respectively, while W_x is the normalizing weight.
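Using OpenCV, the CLAHE, channel-replication, and bilateral-filtering stages might be composed as below; the clip limit, tile size, and filter parameters are illustrative values, not the tuned settings of the system:

```python
import cv2
import numpy as np

def enhance_depth(depth16):
    """CLAHE contrast enhancement followed by edge-preserving
    bilateral filtering, mirroring Figure 3(c)-(d)."""
    # CLAHE operates on 8-bit images, so rescale the 16-bit map first.
    gray = cv2.convertScaleAbs(depth16, alpha=255.0 / 65535.0)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)
    # Replicate to three channels for color-based downstream modules.
    pseudo = cv2.merge([enhanced] * 3)
    # Bilateral filter: d = neighborhood diameter; sigmaColor / sigmaSpace
    # control the range and spatial Gaussian weights, respectively.
    return cv2.bilateralFilter(pseudo, d=9, sigmaColor=75, sigmaSpace=75)
```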
Depth-guided silhouette segmentation
Human localization in the preprocessed depth image is first performed using the Histogram of Oriented Gradients (HOG) descriptor. Gradient-based features extracted from the depth image are classified through a pretrained Support Vector Machine (SVM). Among all detections, the candidate with the maximum bounding box area is selected and enlarged using an expansion factor α, as depicted in Figure 4(a).

Depth-based human silhouette extraction stages: (a) Histogram of Oriented Gradient (HOG)-detected human bounding box, (b) initial silhouette obtained using GrabCut segmentation, (c) noise removal through connected component analysis, and (d) final refined human silhouette.
Once the human region is identified, GrabCut is employed to obtain an accurate silhouette. GrabCut treats foreground–background separation as an energy minimization problem by combining appearance modeling through a Gaussian Mixture Model (GMM) with spatial smoothness constraints, as illustrated in Figure 4(b). The overall energy function is expressed in (10):

E(α, θ, z) = U(α, θ, z) + V(α, z)

Here, α denotes the per-pixel foreground/background assignment, θ the GMM parameters, and z the observed pixel values. In this formulation, the unary term U measures how well each pixel fits the foreground or background color model, while the pairwise term V penalizes label discontinuities between neighboring pixels of similar appearance, encouraging smooth, coherent silhouettes.
After segmentation, connected component analysis is applied to remove small or spurious regions. Each component is evaluated based on its area, and components falling below an area threshold are discarded as noise, as shown in Figure 4(c).
Finally, the refined mask M_r is applied to the floor-removed depth image D to extract the human silhouette, as shown in Figure 4(d). The resulting silhouette is saved for subsequent analysis using (16):

S(x, y) = D(x, y) · M_r(x, y)
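A compact OpenCV sketch of this detection, segmentation, and cleanup chain is given below; the expansion factor, iteration count, and area threshold are illustrative:

```python
import cv2
import numpy as np

def extract_silhouette(img, expand=0.15, min_area=500):
    """HOG person detection -> GrabCut segmentation -> connected-component
    cleanup, following the stages of Figure 4."""
    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
    boxes, _ = hog.detectMultiScale(img, winStride=(8, 8))
    if len(boxes) == 0:
        return np.zeros(img.shape[:2], np.uint8)
    x, y, w, h = max(boxes, key=lambda b: b[2] * b[3])  # largest detection
    # Enlarge the box by the expansion factor and clamp to image bounds.
    dx, dy = int(w * expand), int(h * expand)
    x0, y0 = max(x - dx, 0), max(y - dy, 0)
    rect = (x0, y0, min(w + 2 * dx, img.shape[1] - x0),
            min(h + 2 * dy, img.shape[0] - y0))

    # GrabCut: pixels outside `rect` are background; GMMs model appearance.
    mask = np.zeros(img.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD),
                  1, 0).astype(np.uint8)

    # Connected component analysis: drop small spurious regions.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(fg)
    keep = [i for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] >= min_area]
    return np.isin(labels, keep).astype(np.uint8)
```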
PoseKP-L and PoseKP-R: Pose-Aware Keypoint Extraction Framework
The PoseKP-L (Pose-aware KeyPoint extractor – Lowered Arms posture) and PoseKP-R (Pose-aware KeyPoint extractor – Raised Arms posture) modules form a dynamic and adaptive framework designed to extract 24 anatomical keypoints from depth-based human silhouettes as shown in Figure 5. Unlike traditional static approaches, this system intelligently adapts to varying postures, body alignments, and arm positions, making it highly suitable for motion analysis, rehabilitation tracking, and activity recognition.

Visualization of detected anatomical keypoints for different human postures using the PoseKP-L and PoseKP-R modules.
The architecture incorporates two specialized pipelines for accurate 3D keypoint (x, y, z) extraction. PoseKP-L is tailored for scenarios where the arms are in a lowered position. It employs a contour-based strategy for robust keypoint estimation, as detailed in Algorithm 1:
PoseKP-R is optimized for postures with arms raised above shoulder level. This variation adjusts the localization process to maintain precise tracking of the shoulders, wrists, and hands, as described in Algorithm 2:
3D Human mesh reconstruction with SMPL fitting
In this study, we present a complete pipeline for reconstructing 3D human meshes and fitting the SMPL model, 47 leveraging multiple motion capture datasets. To ensure interoperability across different skeletal annotation formats, we implement a joint mapping mechanism that translates the extracted 24-point keyset from our PoseKP module into the SMPL-compatible structure. These mapped joints enable accurate 3D pose estimation and body mesh generation. The SMPL model provides a parametric human mesh with 6890 vertices and 13,776 triangular faces, offering a high-fidelity and computationally efficient solution for motion analysis and animation. The model is governed by two sets of parameters: pose parameters, θ ∈ ℝ⁷², representing 3D axis-angle rotations across 24 joints, and shape parameters, β ∈ ℝ¹⁰, representing identity-specific body shape variations learned from real body scans. The final mesh is computed using the linear blend skinning (LBS) framework defined by (17):

M(β, θ) = W(T_P(β, θ), J(β), θ, 𝒲), with T_P(β, θ) = T̄ + B_S(β) + B_P(θ)

where T̄ is the mean template mesh, B_S and B_P are the shape and pose blend-shape functions, J(β) regresses the joint locations, and W(·) applies linear blend skinning with weights 𝒲.
The SMPL model outputs three critical components: the vertex matrix V ∈ ℝ⁶⁸⁹⁰ˣ³, the 3D joint locations J ∈ ℝ²⁴ˣ³, and the triangular face connectivity that defines the mesh topology.
To optimize the SMPL parameters, a series of loss functions are formulated. First, the joint loss L_joint minimizes the Euclidean distance between the predicted joints and the SMPL-derived joints, as defined by (18):

L_joint = Σ_j ‖J_j^pred − J_j^SMPL‖₂

The shape regularization loss penalizes extreme body shapes and prevents unrealistic morphologies using the L2 norm of the shape parameters, as shown in (20):

L_shape = ‖β‖₂²

To ensure temporal coherence in multiframe sequences, a smoothness constraint is imposed by minimizing interframe joint displacement, reducing jitter and abrupt transitions. This is expressed in (21):

L_smooth = Σ_t ‖J_t − J_{t−1}‖₂²
Optimization is carried out using the Adam optimizer, which updates the pose and shape parameters to produce accurate, smooth, and anatomically consistent 3D human meshes. Representative visualizations of the pipeline's output, comparing input depth frames with the reconstructed meshes, are shown in Figures 6 and 7 using sequences from the UTKinect-Action3D and KIMORE datasets.
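A PyTorch sketch of the composite objective minimized by Adam is shown below; the loss weights and tensor shapes are illustrative assumptions:

```python
import torch

def fitting_loss(pred_joints, target_joints, betas,
                 w_shape=1e-3, w_smooth=1.0):
    """Composite SMPL fitting objective: per-frame joint alignment,
    L2 shape regularization, and inter-frame smoothness.
    Shapes: pred/target joints (T, 24, 3), betas (10,)."""
    # (18) Joint loss: mean Euclidean distance between joint sets.
    l_joint = ((pred_joints - target_joints) ** 2).sum(dim=-1).sqrt().mean()
    # (20) Shape regularizer: penalize extreme body shapes.
    l_shape = (betas ** 2).sum()
    # (21) Temporal smoothness: penalize inter-frame joint displacement.
    l_smooth = ((pred_joints[1:] - pred_joints[:-1]) ** 2).sum(dim=-1).mean()
    return l_joint + w_shape * l_shape + w_smooth * l_smooth

# Adam then updates the pose/shape parameters, e.g.:
# opt = torch.optim.Adam([theta, beta], lr=0.01)
```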

3D reconstruction of a subject from the UTKinect-Action3D dataset illustrating (a) baseline stance, (b) asymmetric arm tilt, (c) dynamic crouch, and (d) recovery phase.

3D reconstruction of a subject from the KIMORE dataset illustrating (a) dynamic asymmetric arm gesture, (b) slight hand tilt with partial flexion, (c) dynamic asymmetric arm gesture, and (d) exaggerated torso lean with pronounced wrist angles.
Zernike moment-based silhouette shape analysis
To capture the global structure and symmetry of human silhouettes, we incorporate Zernike moments into our depth-based pipeline as robust shape descriptors. 48 These moments provide a compact and rotation-invariant representation of silhouette geometry, enabling effective comparison across different poses. Computed over a normalized binary silhouette, Zernike moments decompose shape information into orthogonal basis functions defined within the unit disk, allowing for consistent quantification of spatial distribution patterns in the silhouette. The Zernike moment of order n and repetition m, denoted as Z_nm, is obtained by projecting the silhouette function onto the complex conjugate of the corresponding Zernike polynomial over the unit disk.
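In practice, such moments can be computed with the mahotas library; this is an assumed dependency and an illustrative sketch, not necessarily the original implementation:

```python
import numpy as np
import mahotas

def zernike_descriptor(silhouette, degree=9):
    """Rotation-invariant Zernike moment magnitudes of a binary
    silhouette, computed inside its minimal enclosing disk."""
    ys, xs = np.nonzero(silhouette)
    cy, cx = ys.mean(), xs.mean()                 # silhouette centroid
    radius = np.hypot(ys - cy, xs - cx).max()     # unit-disk radius
    # mahotas returns |Z_nm| for all valid (n, m) pairs up to `degree`.
    return mahotas.features.zernike_moments(
        silhouette.astype(np.uint8), radius, degree=degree, cm=(cy, cx))
```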
Table 1 illustrates a few representative Zernike moments that exemplify how different moment orders correspond to specific pose-relevant structural attributes. These examples help interpret the geometric significance of Zernike patterns, highlighting features such as body alignment, limb extension, compactness, and asymmetry across human silhouettes as shown in Figure 8. For instance, Z (2,0) captures vertical gradients and reflects overall body alignment, while Z (6,2) and Z (7,3) highlight diagonal bends and asymmetrical contours, typically observed in twisted or lifted limb postures. Higher-order terms such as Z (8,0), Z (8,6), and Z (9,3) encode complex patterns such as multilimb folding, radiating spreads, and crouched positions.

Zernike moment projections computed from pose-encoded silhouettes for different moment orders, including (a) Z (2,0), (b) Z (6,2), (c) Z (7,3), (d) Z (8,0), (e) Z (6,4), (f) Z (8,4), (g) Z (8,6), and (h) Z (9,3).
Representative Zernike moments and their pose-relevant interpretations.
To assess the effectiveness of Zernike moments in distinguishing human postures, we conducted a comparative analysis of moment magnitudes across silhouette pairs under both same-pose and different-pose conditions. As illustrated in Figures 9 and 10, silhouettes representing the same exercise or posture yield closely aligned Zernike magnitude distributions, demonstrating strong structural consistency and feature stability. In contrast, silhouette pairs from different exercises exhibit pronounced variations across multiple moment orders, particularly in mid-to-high frequency components. This divergence highlights the sensitivity of Zernike moments to global pose changes, such as limb extensions, curvature, and compactness. These findings confirm that Zernike-based descriptors exhibit strong robustness against noise and intraclass variability while significantly enhancing the discriminative representation of exercise categories. Their integration within the hybrid framework enables a complementary fusion with deep feature representations, effectively preserving global shape integrity. This interaction ensures that the resulting feature space remains geometrically coherent and semantically expressive, thereby strengthening the reliability of the overall classification system.

Comparative analysis of Zernike moment magnitudes for silhouette-based pose differentiation within the same posture, illustrating the consistency of moment distributions across similar pose instances.

Comparative analysis of Zernike moment magnitudes for silhouette-based pose differentiation across different postures, highlighting the discriminative variations observed between distinct pose configurations.
Silhouette-based Tracking, Regional Analysis, and Parsing using RGB
We propose STRAP-RGB, a modular pipeline for human pose understanding from RGB video frames. The pipeline begins with frame-wise preprocessing, semantic segmentation, and silhouette extraction to isolate the human subject. Keypoints are detected using classical methods such as FAST and Harris–Laplace, which capture distinctive anatomical landmarks. LTE, LTP, and Gabor filter features are extracted from the segmented regions to characterize texture and appearance. Additionally, semantic body part labeling (BPL) is performed using deep graph-based parsing models, enabling localized analysis of limbs, torso, and extremities. STRAP-RGB combines geometric precision with semantic richness, making it well-suited for human activity recognition.
Silhouette extraction
The silhouette extraction module plays a foundational role in the STRAP-RGB pipeline, serving as the initial step toward isolating the human figure from the surrounding environment. This module integrates a sequence of unsupervised segmentation and refinement stages to extract a clean silhouette of the subject using only RGB input. The proposed approach combines HOG-based localization, superpixel-level K-Means clustering, and GrabCut refinement to achieve high-quality segmentation without the need for deep-learning models or manual intervention.
Initially, a HOG detector identifies the region of interest (ROI) corresponding to the person in each input frame. Within the localized ROI, Simple Linear Iterative Clustering (SLIC) is applied to partition the image into perceptually homogeneous superpixels, which are then grouped using K-Means clustering to form an initial foreground estimate. The numbers of superpixels and clusters were set empirically, and the resulting initial mask is refined using GrabCut to produce the final cleaned silhouette, as illustrated in Figure 11.
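A sketch of this unsupervised chain, assuming scikit-image SLIC and scikit-learn K-Means with illustrative parameter values and a simple center-pixel foreground heuristic, is:

```python
import numpy as np
import cv2
from skimage.segmentation import slic
from sklearn.cluster import KMeans

def rgb_silhouette(roi, n_segments=400, n_clusters=2):
    """Silhouette estimate inside a HOG-detected ROI: SLIC superpixels ->
    K-Means on mean superpixel color -> GrabCut refinement."""
    segments = slic(roi, n_segments=n_segments, start_label=0)
    # Mean RGB color per superpixel serves as the clustering feature.
    feats = np.array([roi[segments == s].mean(axis=0)
                      for s in range(segments.max() + 1)])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    # Heuristic: the cluster covering the ROI center is the person.
    center = segments[roi.shape[0] // 2, roi.shape[1] // 2]
    init_fg = np.isin(segments, np.where(labels == labels[center])[0])

    # Seed GrabCut with probable foreground/background and refine.
    mask = np.where(init_fg, cv2.GC_PR_FGD, cv2.GC_PR_BGD).astype(np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(roi, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
    return ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.uint8)
```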

Silhouette extraction pipeline: (a) original RGB image, (b) detected region of interest using Histogram of Oriented Gradient (HOG), (c) superpixel-based K-means segmentation, (d) binary mask obtained after GrabCut refinement, and (e) final cleaned silhouette.
Keypoints extraction
To extract meaningful structural keypoints from grayscale silhouette images, we employed a pair of classical keypoint detection algorithms that are grounded in image geometry and spatial intensity analysis. Each method emphasizes a different aspect of the silhouette structure; FAST focuses on localized contrast-based corners, while Harris–Laplace extends this analysis to capture scale-consistent anatomical landmarks. Both detectors are computationally efficient and remain stable under partial occlusion or lighting variations, making them well-suited for real-world telerehabilitation scenarios where sensor noise and environmental conditions can vary significantly.
Features from Accelerated Segment Test: high-speed intensity-based keypoint detector
The first method utilized is the FAST (Features from Accelerated Segment Test) detector, 49 which is designed for rapid keypoint identification based on local intensity contrast, as shown in Figure 12(a). Given a candidate pixel p with intensity I_p, FAST examines a circle of 16 surrounding pixels and declares p a corner if a contiguous arc of these pixels is uniformly brighter or darker than I_p by more than a threshold t.

Keypoint detection results on silhouette images illustrating (a) FAST keypoints capturing prominent corner-like structures and (b) Harris–Laplace keypoints highlighting scale-invariant anatomical landmarks.
In this framework, FAST serves as a lightweight geometric feature extractor that enhances local boundary precision and contour-based shape representation. Its ability to detect consistent corner points under varying illumination and minor occlusion makes it well suited for dynamic human silhouettes. These geometric cues provide complementary low-level information that strengthens spatial consistency and supports the subsequent multimodal fusion and classification stages.
Harris-Laplace: multi-scale geometric keypoint detector
To complement this local approach, we employed the Harris–Laplace detector, 50 which extends classical Harris corner detection with a robust scale selection mechanism, as shown in Figure 12(b). This handcrafted technique was chosen for its reliability in identifying stable geometric features across varying scales and imaging conditions, offering consistent performance without the need for extensive training or high-texture input. Initially, candidate keypoints are identified using the Harris corner response, derived from the second-order image gradients. The response function R is given by (29):

R = det(M) − k · (trace(M))²

Here, M is the second-moment matrix of image gradients and k is an empirical sensitivity constant. Candidate corners are then evaluated across a Laplacian-of-Gaussian scale space, and the scale at which the normalized Laplacian response attains a local maximum is selected as the characteristic scale of each keypoint.
Together, these two methods provide a complementary framework for keypoint extraction. While FAST efficiently captures high-frequency local structures within the silhouette, Harris-Laplace contributes scale-consistent landmark points that are robust to variations in subject size or camera distance. This dual-detection strategy enhances both spatial precision and scale stability, ensuring that geometric cues from the RGB silhouettes effectively complement high-level modules in the framework, thereby improving robustness and consistency in keypoint representation.
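Both detectors are available in OpenCV; Harris–Laplace requires the contrib xfeatures2d module. A minimal sketch with illustrative thresholds:

```python
import cv2

def detect_keypoints(gray):
    """Complementary keypoint sets on a grayscale silhouette image:
    FAST for high-contrast corners, Harris-Laplace for scale-selected
    landmarks (the latter lives in opencv-contrib's xfeatures2d)."""
    fast = cv2.FastFeatureDetector_create(threshold=20,
                                          nonmaxSuppression=True)
    fast_kps = fast.detect(gray, None)
    hl = cv2.xfeatures2d.HarrisLaplaceFeatureDetector_create()
    hl_kps = hl.detect(gray, None)
    return fast_kps, hl_kps
```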
Feature detection
Following keypoint extraction, we computed a suite of classical feature descriptors designed to encode both the local texture and structural orientation of human silhouettes. To ensure comprehensive spatial representation, we employed three complementary techniques: LTE, LTP, and Gabor filter responses. Each of these methods captures a unique characteristic of the silhouette, ranging from edge directionality to microtextural patterns and spatial frequency signatures.
Laws’ texture energy
The LTE is a handcrafted feature extraction method that captures spatial texture patterns by convolving the input image with a set of specially designed filters. 51 Unlike gradient-based descriptors, LTE focuses on localized intensity variations, making it particularly robust for silhouette-based pose analysis where directional edge information may be limited or noisy as shown in Figure 13.

Laws’ texture energy (LTE) maps computed using different filter combinations (L5E5, E5L5, E5E5, S5S5, W5W5, R5R5, E5S5, and L5S5), highlighting diverse spatial patterns such as edges, spots, ripples, and composite textures within silhouette images.
The process begins by defining a set of 1D convolution masks that encode primitive patterns such as level (L), edge (E), spot (S), wave (W), and ripple (R). These masks are combined pairwise using the outer product to generate a total of 25 unique 2D filters. Each resulting filter is sensitive to a distinct texture characteristic, such as horizontal edges, corner-like structures, or fine ripples, depending on the combination of row and column vectors.
Each 2D kernel is convolved with the silhouette image to produce a response map that highlights the corresponding texture pattern. To measure the strength of each pattern, the local texture energy is computed by squaring and aggregating the response values over a local window, as given in (33):

E_k(x, y) = Σ_{(i,j) ∈ Ω(x,y)} R_k(i, j)²
These energy maps emphasize regions of the silhouette that strongly exhibit the corresponding pattern, such as arm contours (edge filters), body ripples (ripple filters), or flat torso regions (level filters). From each energy map, a statistical descriptor (typically the mean or standard deviation) is computed over the silhouette mask. The concatenation of these 25 values forms the final LTE feature vector, which encodes a multipattern texture signature of the human pose.
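A NumPy/OpenCV sketch of the 25-filter construction and energy pooling, with an illustrative window size and pooled statistic, follows:

```python
import numpy as np
import cv2

# 1D Laws masks: Level, Edge, Spot, Wave, Ripple.
MASKS = {"L5": [1, 4, 6, 4, 1], "E5": [-1, -2, 0, 2, 1],
         "S5": [-1, 0, 2, 0, -1], "W5": [-1, 2, 0, -2, 1],
         "R5": [1, -4, 6, -4, 1]}

def laws_features(gray, win=15):
    """25-dimensional Laws' texture energy descriptor: outer products of
    the 1D masks give 25 filters; locally pooled squared responses form
    each energy value."""
    feats = []
    for a in MASKS.values():
        for b in MASKS.values():
            kernel = np.outer(a, b).astype(np.float32)       # 5x5 2D filter
            resp = cv2.filter2D(gray.astype(np.float32), -1, kernel)
            energy = cv2.boxFilter(resp ** 2, -1, (win, win))  # local energy
            feats.append(energy.mean())                      # pooled statistic
    return np.array(feats)
```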
Local ternary patterns
Local Ternary Patterns 52 extend the Local Binary Pattern (LBP) operator by introducing a three-valued encoding scheme that enhances robustness against local intensity fluctuations and image noise, as shown in Figure 14(a). Rather than assigning binary outcomes based solely on the sign of intensity differences, LTP introduces a threshold τ to define a tolerance zone around the center pixel intensity I_c: neighbors within ±τ of I_c are encoded as 0, those brighter than I_c + τ as +1, and those darker than I_c − τ as −1.

Texture feature extraction results showing (a) local ternary pattern (LTP) representations and (b) Gabor filter responses at 45° and 90° orientations, capturing local intensity variations and directional texture information.
This formulation creates a ternary string per pixel, where each bit can take values from {−1, 0, 1}. To facilitate efficient histogram representation and distance computation, the LTP descriptor is typically decomposed into two separate binary patterns: the upper LBP (encoding 1s and 0s) and the lower LBP (encoding −1s and 0s). Histograms of both components are then concatenated to form the final LTP feature vector. By suppressing minor intensity variations around the central pixel, LTP provides improved discrimination in scenarios where silhouettes may exhibit subtle noise or blurring.
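A direct NumPy sketch of the ternary encoding and the upper/lower histogram split, using an illustrative threshold τ = 5, is:

```python
import numpy as np

def ltp_histograms(gray, tau=5):
    """Local Ternary Pattern split into upper/lower LBP histograms,
    computed over the 8-neighborhood of each interior pixel."""
    c = gray[1:-1, 1:-1].astype(np.int16)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    upper = np.zeros_like(c, dtype=np.uint8)
    lower = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        n = gray[1 + dy:gray.shape[0] - 1 + dy,
                 1 + dx:gray.shape[1] - 1 + dx].astype(np.int16)
        upper |= ((n > c + tau).astype(np.uint8) << bit)  # ternary +1 -> 1
        lower |= ((n < c - tau).astype(np.uint8) << bit)  # ternary -1 -> 1
    h_up = np.bincount(upper.ravel(), minlength=256)
    h_lo = np.bincount(lower.ravel(), minlength=256)
    return np.concatenate([h_up, h_lo])  # final concatenated LTP vector
```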
Gabor filters
To further enrich the spatial-frequency representation, we applied a bank of Gabor filters 53 at multiple orientations, shown in Figure 14(b). Gabor filters act as bandpass filters that are sensitive to specific frequencies and directions, effectively mimicking the response of human visual cortex cells. A 2D Gabor filter is formed by modulating a sinusoidal carrier with a Gaussian envelope and is parameterized by its wavelength, orientation, phase offset, and spatial bandwidth.
Collectively, the use of LTE, LTP, and Gabor filters enables a rich multiperspective description of the human silhouette. The LTE captures macrotextural patterns and spatial energy distributions, LTP captures microtextural variation, and Gabor encapsulates orientation-specific frequency response.
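An OpenCV sketch of such a filter bank, with illustrative kernel parameters and a simple pooled-magnitude statistic, is:

```python
import cv2
import numpy as np

def gabor_responses(gray, orientations=(0, 45, 90, 135)):
    """Bank of Gabor bandpass filters at several orientations; the mean
    response magnitude summarizes directional texture energy."""
    feats = []
    for theta in orientations:
        kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0,
                                    theta=np.deg2rad(theta), lambd=10.0,
                                    gamma=0.5, psi=0)
        resp = cv2.filter2D(gray.astype(np.float32), -1, kernel)
        feats.append(np.abs(resp).mean())
    return np.array(feats)
```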
Body part labeling
To derive a semantically structured understanding of human silhouettes, we employed a BPL strategy based on the Graphonomy model 54 proposed by Gong et al. Graphonomy introduces a universal human parsing framework that leverages graph-based transfer learning to perform fine-grained pixel-wise classification of human body regions. Unlike traditional encoder–decoder models, Graphonomy integrates hierarchical graph reasoning over both the feature and label spaces, enabling consistent performance across diverse domains, poses, and clothing variations.
The model was applied to the RGB silhouette images to generate segmentation maps, where each pixel is classified into one of N predefined anatomical categories. Formally, for each pixel location, the network predicts a posterior probability over the N body-part classes. The final body part label map is obtained by assigning every pixel the class with the maximum posterior probability, yielding the pixel-wise anatomical segmentation shown in Figure 15.

Semantic body part labeling of human silhouettes obtained using the Graphonomy model, illustrating pixel-wise anatomical segmentation of different body regions.
To enhance geometric interpretation of the parsed silhouette, contour points and centroid positions were extracted for each anatomically segmented region. Following generation of the segmentation map, the boundary of each labeled region was traced to obtain its contour points, and the region centroid was computed from the enclosed pixels, as visualized in Figure 16.

Contour points (red) and centroid locations (yellow) extracted from segmented body parts using Graphonomy-based parsing, enabling region-wise geometric characterization.
Feature optimization
We adopted an early feature-level fusion approach, where the multimodal features obtained from both modalities were combined into a single unified vector before the discriminative learning stage, as shown in (38):

F_fused = [F_D-KEM ; F_STRAP-RGB]

where [· ; ·] denotes concatenation of the depth-derived and RGB-derived feature vectors.

Feature optimization using Kernel Fisher Discriminant Analysis (KFDA), illustrating the projection of fused features into a discriminative subspace.
KFDA seeks projection directions that maximize the Fisher criterion, the ratio of between-class to within-class scatter, in a kernel-induced feature space, with each projection expressed as an expansion over the kernel-mapped training samples. This reformulation leads to a generalized eigenvalue problem involving kernel matrices, from which the optimal discriminant projections can be derived. The resulting lower-dimensional representation preserves nonlinear class boundaries, making it well-suited for tasks requiring robust class separation prior to classification algorithms such as GCNs.
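A compact sketch of this formulation, assuming an RBF kernel and a small ridge regularizer (both illustrative choices), is:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.metrics.pairwise import rbf_kernel

def kfda_fit(X, y, n_components=10, gamma=1e-3, reg=1e-4):
    """Kernel Fisher Discriminant Analysis sketch: solve the generalized
    eigenproblem M a = lambda N a in the kernel-induced space and return
    the projected training features."""
    K = rbf_kernel(X, X, gamma=gamma)
    n = len(y)
    m_all = K.mean(axis=1, keepdims=True)
    M = np.zeros((n, n))
    N = reg * np.eye(n)                      # ridge keeps N positive-definite
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        Kc = K[:, idx]                       # kernel columns for class c
        m_c = Kc.mean(axis=1, keepdims=True)
        M += len(idx) * (m_c - m_all) @ (m_c - m_all).T   # between-class
        H = np.eye(len(idx)) - 1.0 / len(idx)             # centering matrix
        N += Kc @ H @ Kc.T                                # within-class
    vals, vecs = eigh(M, N)                  # generalized eigenvectors
    A = vecs[:, ::-1][:, :n_components]      # top discriminant directions
    return K @ A                             # projected training features
```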
Graph convolutional network
After employing KFDA for nonlinear feature optimization, the refined features are passed to a GCN to exploit relational structure among samples. The KFDA-transformed feature matrix serves as the node feature input to the GCN, whose graph convolution layers propagate and aggregate information across connected nodes before the final classification stage.
By integrating KFDA with GCN, the model benefits from both enhanced class discriminability in the input space and relational learning over the graph structure, leading to improved classification performance. The architecture of GCN is shown in Figure 18.
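A minimal PyTorch sketch of this propagation-then-classification design, with illustrative layer sizes, is:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Single graph convolution: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0), device=A.device)  # add self-loops
        d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
        A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
        return torch.relu(self.linear(A_norm @ H))

class GCNClassifier(nn.Module):
    """Two graph convolution layers followed by a linear classifier,
    mirroring the propagation-then-classification design of Figure 18."""
    def __init__(self, in_dim, hidden, n_classes):
        super().__init__()
        self.g1 = GCNLayer(in_dim, hidden)
        self.g2 = GCNLayer(hidden, hidden)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, H, A):
        return self.head(self.g2(self.g1(H, A), A))
```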

Proposed architecture of the Graph Convolutional Network (GCN), illustrating feature propagation through graph convolution layers followed by the final classification stage.
Algorithm 3 presents the complete algorithmic representation of the proposed multimodal framework, detailing each sequential stage from data preprocessing to final classification. It outlines the integrated operation of the D-KEM and STRAP-RGB pipelines, the early feature fusion process, KFDA-based optimization, and GCN-driven classification.
Results and performance evaluation
Datasets
The KIMORE dataset served as the foundation for this study, providing a clinically relevant benchmark for rehabilitation analysis. It includes recordings from 78 participants (44 healthy and 34 with lower back pain) performing five guided rehabilitation exercises, such as arm raises, trunk bends, lateral flexions, and squats. The dataset offers synchronized RGB–Depth videos, 25-joint skeletal data, and clinical assessment scores for each repetition. Its inclusion of participants with motor impairments makes it particularly valuable for developing and validating systems designed to support individuals with limited mobility or musculoskeletal disorders. 35
Building upon this foundation, the mRI dataset was integrated to extend the framework's adaptability to home-based rehabilitation and mobility monitoring. It contains over five million multimodal frames collected from 20 participants using RGB-D cameras, mmWave radar, and IMU sensors. The dataset captures repetitive, full-body movements, including bending, stretching, and reaching, that mirror functional activities practiced in rehabilitation and motor recovery programs. Its multimodal nature enables robust modeling of motion patterns across different sensor modalities, enhancing the framework's applicability in diverse rehabilitation environments. 36
To further evaluate the performance of the proposed framework, the UTKinect-Action3D dataset was employed. It features 10 subjects performing 10 everyday actions, including walking, sitting, standing up, bending, lifting, and side movements, captured through synchronized RGB, depth, and skeletal modalities. These actions closely align with the functional motor activities retrained during physical therapy, focusing on aspects such as trunk stability, balance, and coordination. Incorporating this dataset enables the framework to recognize natural daily movements that serve as indicators of rehabilitation progress and ensures consistent performance across varied movement patterns and physical conditions. 37
Experimental setup
Experiments were conducted on a Google Colab virtual machine equipped with an NVIDIA Tesla T4 GPU (16 GB GDDR6 VRAM, 2560 CUDA cores, and 320 Tensor cores) running Ubuntu 18.04.6 LTS. The development environment utilized Python 3.10.13, PyTorch 2.1.0 + cu118, and cuDNN 8.9.1, providing GPU-accelerated tensor computation and parallel convolution support for both the D-KEM and STRAP-RGB pipelines. Image preprocessing and segmentation were performed using OpenCV 4.9.0 and scikit-image 0.22.0, while NumPy 1.25.0 and Pandas 2.1.1 were employed for array-based data handling and performance logging. Visualization and quantitative analysis were carried out using Matplotlib 3.8.0.
Time-cost analysis
To assess the computational efficiency of the proposed framework, a time-cost analysis was conducted across all major components of the depth and RGB pipelines. Table 2 summarizes the average per-frame execution time and computational complexity (MFLOPs) for each module, measured under GPU execution in the same experimental environment described above. The analysis shows that the end-to-end processing remains computationally tractable, with the primary latency arising from 3D mesh reconstruction and semantic body-part labeling.
Time cost analysis.
Confusion matrices
Table 3 presents the confusion matrix for the five exercise classes (E1–E5) of the KIMORE dataset. The diagonal values represent the true positive rates for each class, indicating how accurately the model classified instances within each category. Overall, the model demonstrates high classification performance, particularly for classes E3 (100%), E5 (95%), and E2 (89%), suggesting strong model precision in these areas.
Confusion matrix for KIMORE dataset.
Class E1 shows a true positive rate of 83%, with some misclassification into E2 (6%) and E5 (11%). Class E4 has a slightly lower accuracy (91%), with minor misclassification into E1, E3, and E5. These off-diagonal entries suggest that E1 and E4 might share overlapping features with adjacent classes. Despite this, the matrix indicates minimal confusion among most classes, supporting the model's overall reliability.
Table 4 presents the confusion matrix for the mRI dataset, showing a classification accuracy of 92.70%. Most classes demonstrate high performance, with E1 (92%), E3 (90%), E5 (90%), E9 (95%), E11 (98%), and E12 (98%) being classified accurately with minimal confusion. However, notable misclassifications occur in a few classes. E2 shows the lowest accuracy at 64%, with frequent confusion with E5 (14%), E6 (7%), E9 (6%), and E12 (5%). Similarly, E6 is correctly classified 73% of the time but is misidentified as E2 (9%) and E10 (14%). E10 also shows some confusion, mainly with E4 (12%) and E8 (3%), reducing its accuracy to 85%. These misclassifications, clearly indicated in the table, may result from overlapping feature characteristics or insufficient model discrimination between similar patterns. Enhancing feature representation or refining the model architecture could help reduce such classification errors.
Confusion matrix for mRI dataset.
The confusion matrix in Table 5 shows the performance of the human activity recognition model across ten activity classes, with an overall accuracy of 95.59%. Most actions such as sit down (98%), pick up (97%), carry (96%), push (98%), and clap hands (99%) are recognized with high accuracy. However, some misclassifications are evident. For example, stand up is often misclassified as walk, pick up, or throw, with only 81% of samples correctly identified. The most significant confusion occurs with the throw class, which has just 50% accuracy and is frequently mistaken for walk, sit down, stand up, pull, and wave hands. These errors, visible in the table, likely stem from the similarity in motion patterns between these dynamic activities. Enhancing temporal modeling or incorporating additional contextual cues could help reduce these misclassifications.
Confusion matrix for UTKinect-Action3D dataset.
Performance evaluation
As shown in Table 6, the classification performance over the KIMORE dataset reveals that the model performs exceptionally well on most classes, with particularly high precision and recall for Classes E3 (Precision: 0.96, Recall: 1.00, F1-score: 0.98), E4 (0.98, 0.91, 0.94), and E5 (0.99, 0.95, 0.97), indicating strong accuracy and reliability. Class E2 also shows balanced and solid results (0.91, 0.89, 0.90), suggesting consistent detection with minimal misclassifications. However, Class E1 exhibits the weakest performance with a relatively low precision of 0.60 despite a high recall of 0.83, resulting in an F1-score of 0.70. This indicates that while the model is good at identifying actual E1 instances, it also tends to incorrectly classify other classes as E1, leading to a high false positive rate. Overall, the model demonstrates consistent performance across most classes but requires improvement in reducing misclassification for the lower-performing class (E1).
Precision, recall, and F1-score results over KIMORE.
As shown in Table 7, classes E11 and E12 demonstrate the best overall performance, each achieving a precision of 0.97–0.99, recall of 0.98, and F1 score of 0.98. This indicates the model can accurately and consistently identify instances from these classes. Several other classes, including E1, E3, E5, E8, and E9, also perform well with F1 scores around 0.90 or higher, reflecting balanced and reliable classification. In contrast, class E6 shows the weakest performance, with the lowest precision (0.22) and F1 score (0.31), as seen in the table, suggesting the model struggles significantly with this class. Overall, the table highlights strong model performance on most classes, with only a few outliers needing improvement.
Precision, recall, and F1-score results over mRI.
As reported in Table 8, the model's precision, recall, and F1-score metrics across the ten activity classes from the UTKinect-Action3D dataset further confirm its strong performance. Most classes demonstrate consistently high scores, particularly “Clap hands” and “Push,” with near-perfect values (F1-score = 0.99). Other well-recognized actions include “Pick up” (F1 = 0.97), “Sit down” (F1 = 0.97), and “Carry” (F1 = 0.96), showing the model's ability to accurately distinguish structured, isolated movements. In contrast, “Throw” exhibits the lowest recall (0.50) and F1-score (0.62), reflecting confusion observed earlier in the confusion matrix (Table 5), likely due to overlapping motion patterns with other upper-body gestures. The “Stand up” action also shows a slightly lower F1-score of 0.83, primarily due to a reduced recall of 0.81, suggesting that transition-based activities are more challenging for the model. Overall, Table 8 indicates consistent performance across most activity classes, with some limitations in distinguishing actions involving subtle or overlapping gestures.
Precision, recall, and F1-score results over UTKinect-Action3D.
The ROC curves illustrated in Figure 19 depict the classification performance of the proposed model across the five exercise classes (E1–E5) of the KIMORE dataset. The curves remain consistently above the random guessing line, confirming the model's strong discriminative capability. Among all, E3 achieves the highest performance with an AUC of 0.99, indicating near-perfect class separation. E4 and E5 also demonstrate excellent outcomes, with AUCs of 0.96 and 0.97, respectively, reflecting high reliability in prediction. Meanwhile, E2 and E1 yield slightly lower yet commendable results, with AUCs of 0.94 and 0.90, respectively, still indicating effective class differentiation. Overall, the mean AUC of 0.95 reflects the robustness and stability of the model across the exercise classes of the KIMORE dataset.

Results in terms of ROC curves for the KIMORE dataset, illustrating the classification performance across different exercise classes.
The ROC curves in Figure 20 depict the classification performance of the proposed model across the twelve exercise classes (E1–E12) of the mRI dataset. The results show a mean AUC of 0.92, reflecting strong overall class separability and reliable predictive capability. Notably, E11 and E12 achieve exceptional performance with AUCs of 0.99, while E9 closely follows with an AUC of 0.98, confirming excellent model discrimination for these classes. E1, E5, and E7 also demonstrate high accuracy, with AUCs ranging between 0.95 and 0.96. However, E6 (AUC = 0.74) and E2 (AUC = 0.82) indicate relatively weaker separation, suggesting that these categories may involve more complex or overlapping feature distributions. Overall, the ROC curves lie well above the random guessing line, indicating reliable discrimination across multiple test conditions on the mRI dataset.

Results in terms of ROC curves for the mRI dataset, illustrating the classification performance across different exercise classes.
The ROC curves in Figure 21 illustrate the classification performance of the proposed model across various action classes in the UTKinect-Action3D dataset. The model achieves a mean AUC of 0.95, demonstrating strong discriminative capability and reliable performance across a wide range of movement variations. Among all actions, “Pick up,” “Push,” and “Clap hands” exhibit exceptional results, each attaining an AUC of 0.99, demonstrating nearly perfect separation between true and false positives. Similarly, “Sit down” and “Carry” perform excellently with AUCs of 0.98, further validating the model's robustness. Conversely, “Throw” (AUC = 0.75) shows comparatively weaker performance, suggesting that its dynamic motion characteristics may introduce classification ambiguity. Overall, the ROC visualization confirms that the model maintains high reliability across most activities, with all curves significantly above the random guessing line, signifying superior recognition accuracy on the UTKinect-Action3D dataset.

Results in terms of ROC curves for the UTKinect-Action3D dataset, illustrating the classification performance across different exercise classes.
Discussion
Ablation study
As detailed in Table 9, a comprehensive ablation study was conducted to quantify the contribution of individual feature-extraction components across the KIMORE, mRI, and UTKinect-Action3D datasets. The full configuration—integrating preprocessing, 3D pose and mesh features, 2D keypoints, and BPL-based contour descriptors—achieved the highest accuracies of 95.30%, 92.70%, and 95.59%, respectively. A clear hierarchy of contribution emerges from the ablation results. The 3D pose and mesh features act as the primary drivers of recognition accuracy; their removal leads to a substantial degradation (72.20%, 69.93%, and 71.82%), as these features encode explicit spatial structure, joint relationships, and volumetric motion cues that are central to exercise characterization. In contrast, excluding 2D keypoints results in moderate yet consistent accuracy reductions (82.39%, 78.00%, and 79.80%), indicating that while 2D motion cues are informative, the system retains a degree of robustness due to the continued presence of 3D mesh geometry and depth-derived structural information. Similarly, removing individual handcrafted descriptors such as FAST, Harris–Laplace, Gabor, LTE, or LTP produces comparatively smaller performance drops, as these components primarily refine local appearance and texture representation rather than defining the core motion semantics. Shape-based descriptors further strengthen discrimination; omitting Zernike moments (93.20%, 87.90%, and 89.20%) and BPL-based contour points (88.36%, 81.70%, and 86.11%) degrades accuracy but does not collapse performance, since global pose and mesh cues remain intact. Finally, excluding the KFDA optimization stage causes a pronounced decline (79.70%, 80.23%, and 81.20%), highlighting the importance of nonlinear feature optimization for enhancing class separability within the fused feature space. Overall, these findings confirm that reliable rehabilitation exercise recognition is driven primarily by 3D pose and mesh representations, while complementary 2D, shape, and contour features provide supportive gains that collectively strengthen robustness rather than acting as standalone determinants.
Ablation study on model configurations and their impact on exercise recognition accuracy.
All ablation experiments were conducted using identical 80–20 train/test splits across datasets. Each configuration was repeated five times with randomized initialization. Reported values represent mean ± standard deviation, and 95% confidence intervals were estimated using nonparametric bootstrapping (n = 1000). The low variance demonstrates the framework's robustness and statistical reliability.
As summarized in Table 10, the impact of temporal feature modeling was examined by comparing the proposed GCN with representative sequence-based baselines, including RNN, LSTM, and 3D CNN. All baseline models were trained using the same subject-independent splits, input representations, and training protocol as the proposed framework. While RNN and LSTM achieved moderate performance and 3D CNN improved temporal representation, the proposed GCN consistently yielded higher accuracies (95.30%, 92.70%, and 95.59%) across the three datasets. These results indicate that graph-based temporal modeling, when coupled with the proposed multimodal representation, more effectively captures interjoint dependencies and motion continuity in rehabilitation exercises.
Comparison with representative temporal and skeleton-based baselines under identical protocol on KIMORE, mRI, and UTKinect-Action3D datasets.
Finally, Table 11 compares different fusion strategies: feature concatenation, attention-enabled fusion, and decision-level fusion. Feature concatenation provided the best overall results (95.30%, 92.70%, and 95.59%), while attention-based and decision-level fusion achieved lower accuracies. This comparison suggests that feature concatenation offers the best balance between accuracy and computational efficiency for rehabilitation applications.
Comparative evaluation of feature fusion techniques on KIMORE, mRI, and UTKinect-Action3D datasets.
Comparison with state-of-the-art methods
Table 12 provides a contextual comparison of our proposed model with representative state-of-the-art methods reported in the literature for rehabilitation exercise and HAR. Jleli et al. 38 applied a YOLO V5–ShuffleNet V2 model on the KIMORE dataset, achieving 87% accuracy. Zaher et al. 39 improved upon this using a CNN model enhanced through hyperparameter tuning, reaching 93.08%, while Zaher et al. 40 employed a feature-ranking strategy combining the Fast Correlation-Based Filter with an Extra Trees classifier, obtaining 81.85% accuracy. Among KIMORE-based end-to-end studies, Karlov et al. 41 integrated an ST-GCN with supervised contrastive learning, achieving 89% accuracy, and Abedi et al. 42 implemented a cross-modal RGB-to-skeleton augmentation network using RNN/LSTM, reporting 87% accuracy. For UTKinect-Action3D, Keceli et al. 43 utilized HOG-Deep features to achieve 93.4%, Ding et al. 44 adopted a Rotation Matrix Representation-Based 3D model with SVD and HMM, attaining 91.5%, and Kumar et al. 45 introduced a Time-Series Graph Matching approach that reached 93.5% accuracy. Meanwhile, Ashraf et al. 46 reported a deep multimodal biomechanical framework for lower back pain rehabilitation on the mRI dataset, achieving 91.00% accuracy.
Comparison with state-of-the-art methods.
While these recent end-to-end learning models demonstrate promising results, they often require substantial computational power, extensive training data, and high-end GPU configurations, making them less feasible for real-time or resource-limited rehabilitation settings. Additionally, such frameworks process the entire image, capturing unnecessary background information that may reduce focus on the actual rehabilitation movement. In contrast, the proposed hybrid multimodal framework emphasizes silhouette-based representations and interpretable structural features, allowing the system to isolate and analyze only the clinically relevant body regions performing the exercise. This focused approach enhances robustness under occlusion, illumination changes, and limited training data, while significantly reducing computational complexity. As summarized in Table 12, the proposed model achieves accuracies of 95.30% on KIMORE, 92.70% on mRI, and 95.59% on UTKinect-Action3D, indicating competitive performance across diverse rehabilitation-oriented benchmarks. These results suggest that the proposed framework provides a favorable balance between accuracy, interpretability, and robustness under the evaluated experimental settings.
Limitations and generalizability
The datasets utilized in this study (KIMORE, mRI, and UTKinect-Action3D) encompass diverse rehabilitation and activity scenarios but still have limited sample sizes and subject diversity, which may influence generalization. To evaluate the robustness of the proposed model, the UTKinect-Action3D dataset was included, featuring everyday actions that closely resemble functional movements retrained during physiotherapy. The incorporation of KFDA optimization and multimodal RGB–Depth feature fusion enhances discriminative learning and reduces overfitting, enabling one modality to compensate when another underperforms. These design elements collectively support adaptability across varied users and environments. As reflected in the baseline comparisons in Table 10, this complementary behavior enables the proposed framework to maintain stable performance across datasets where simpler temporal or joint-only models exhibit performance degradation. While the proposed framework demonstrates consistent performance across the evaluated benchmarks, future work will focus on extending validation to larger and more diverse participant cohorts and exploring additional real-world rehabilitation settings to further assess its practical applicability.
Clinical applicability and integration
Although this study primarily focuses on technical validation across public datasets, the proposed framework aligns closely with the clinical objectives of physiotherapy, such as accurate motion tracking, posture correction, and progress monitoring. By providing objective, quantitative feedback on movement quality, the system can complement physiotherapists’ visual assessments and reduce interobserver variability. In practical scenarios, the model could assist clinicians in remote patient supervision, enabling timely correction of exercise performance during telerehabilitation sessions. Future work will involve user studies in collaboration with rehabilitation specialists to validate the interpretability and reliability of the model outputs in real-world sessions.
Conclusion and future direction
In conclusion, this study introduced a deep learning–based multimodal framework for physiotherapy exercise recognition that integrates synchronized RGB and depth streams to enable accurate, markerless assessment in home environments. The system effectively combines two-dimensional keypoints, semantic body-part labels, and visual descriptors from RGB images with three-dimensional joint positions and full-body mesh reconstructions from depth silhouettes using the SMPL model. Feature fusion and refinement using KFDA, followed by classification through a GCN, resulted in a highly discriminative representation of human motion. The framework achieved strong performance across three publicly available rehabilitation datasets, with accuracies of 95.30% on KIMORE, 92.70% on mRI, and 95.59% on UTKinect-Action3D, demonstrating that the integration of complementary RGB and depth-based representations provides both accuracy and robustness beyond what is typically achieved by streamlined temporal or joint-only models, though further validation is needed for deployment in real-world rehabilitation settings. Future work will focus on enhancing temporal modeling through attention mechanisms, adapting the framework for deployment on edge devices, and introducing user-specific personalization to improve adaptability. Additional improvements may include integrating multi-view camera setups to handle occlusion, incorporating physiological signals such as electromyography or heart rate to enrich feedback, and developing real-time clinician interfaces for remote monitoring and adaptive therapy planning. These directions aim to transform the system into a comprehensive and intelligent rehabilitation assistant that bridges clinical precision with home-based accessibility.
Footnotes
Contributorship
AK and YW (equal contribution) conceptualized the study, designed the AI framework, and conducted the experiments and analysis. SN contributed to model optimization and performance evaluation. NAA assisted with data preparation and experimental protocols. AJ provided methodological guidance and technical oversight. HL supervised the research and guided the theoretical direction. AK and YW drafted the manuscript. All authors reviewed and approved the final manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by NUIST Talent Start-up Fund (No. 1513142501062) and Jiangsu Distinguished Fund (No. R2025T07). The publication was also supported by the Open Access Initiative of the University of Bremen and the DFG via SuUB Bremen. This research is supported and funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R410), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
All datasets utilized in this research (KIMORE, mRI, and UTKinect-Action3D) are publicly available and were obtained from open-access repositories under their respective usage licenses. Each dataset adheres to ethical research standards and provides anonymized participant data for noncommercial academic use. No additional restrictions apply to their reuse beyond those specified by the original authors. KIMORE: https://vrai.dii.univpm.it/content/kimore-dataset; mRI: https://sizhean.github.io/mri; UTKinect-Action3D: available from the original dataset repository.
