
Abstract

Low-back musculoskeletal disorders (MSDs) are the primary work-related injuries among manual material handling (MMH) workers, who are frequently exposed to repetitive lifting. To help prevent low-back MSDs in the workplace, we present a video-based lifting action recognition method using rank-altered kinematic feature pairs, called top-scoring pairs (TSPs). We derive TSPs from a video dataset containing lifting and other activities commonly seen in MMH. These TSPs collectively classify each frame as lifting or non-lifting. We validate the method by comparing its classification performance against baseline classifiers. The proposed method minimizes computational and memory requirements while achieving performance comparable to more complex methods with greater computational demands. This makes it suitable for systems with limited hardware resources and therefore feasible across a variety of MMH environments to improve workplace safety.

Keywords

top-scoring pair, lifting counting, computer vision

Introduction

Between 2021 and 2022, 502,280 cases of work-related musculoskeletal disorders (MSDs) resulting in days away from work were reported in the U.S. (U.S. Bureau of Labor Statistics, 2023). Most of these cases occur among manual material handling (MMH) workers frequently exposed to repetitive lifting tasks. Epidemiological studies link repetitive lifting to an increased risk of low-back MSDs (Maher et al., 2017). From a biomechanical standpoint, repetitive lifting fatigues the back muscles and increases spinal loading, particularly on the L5/S1 intervertebral disc, leading to compression and shear forces (Marras et al., 2006). The L5/S1 disc also experiences significant bending moments during repetitive lifting due to the lumbar spine’s structure (Dolan & Adams, 1998). These stresses are key risk factors for chronic low-back pain and disc impairments, such as herniation (Desmoulin et al., 2020). Therefore, managing lifting frequency in the workplace is critical to prevent work-related low-back injuries.

In current safety evaluations, practitioners manually observe workers to identify lifting actions and determine frequencies, either through field surveys or video analysis. This process is time-consuming and labor-intensive, requiring real-time monitoring throughout shifts. Thus, developing an automated system for monitoring lifting tasks is essential. To this end, the first step is acquiring motion data from workers. With the widespread availability of cameras, computer vision-based neural networks offer a promising solution for collecting body motion data. However, technical challenges remain for workplace implementation. Current approaches often stack multiple convolutional layers to capture features from local to global scales (Meng et al., 2022). Despite their power, these networks are computationally and memory intensive, limiting their use on hardware-constrained systems (Han et al., 2015).

To address these limitations, we propose a video-based lifting action recognition method that reduces computational and memory demands while maintaining high classification accuracy. The method uses BlazePose (Bazarevsky et al., 2020), a lightweight CNN architecture, to detect 18 key body joints. Kinematic features are extracted from these joints, and top-scoring pairs (TSPs) form an ensemble classifier that identifies lifting actions by classifying each video frame as lifting or non-lifting. We validated the method against baseline classifiers using a video dataset of seven common MMH activities.

Method

To identify lifting actions in videos, we adopt a four-stage process: pose estimation, pre-processing, feature extraction, and classification. Figure 1 illustrates the overall workflow of the proposed method.

Figure 1.

Overall workflow of the proposed method.

Pose Estimation

Recently, Google researchers introduced BlazePose, a lightweight CNN architecture designed for real-time human pose estimation on mobile devices (Bazarevsky et al., 2020). In this study, we employed BlazePose for motion data collection due to its low latency and high accuracy. From the 33 joints detected by BlazePose, we selected 18 key joints to capture full-body movements for lifting action recognition (Figure 2).

Figure 2.

Selected key joints of the human body. These include shoulders, elbows, wrists, pinkies, index fingers, thumbs, hips, knees, and ankles.

Pre-processing

The first pre-processing step smooths each joint’s trajectory by filtering out high-frequency noise. We use a fifth-order low-pass Butterworth filter with a cutoff frequency of 3 Hz, which lies above typical walking frequencies of 1.2 to 2.2 Hz (Luo et al., 2021). Filtering reduces frame-to-frame jitter in the estimated joint trajectories. Next, we normalize the joint positions by centering and scaling: each joint is shifted by subtracting the mid-hip point and then divided by the average trunk length over a short time window.
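The centering and scaling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: joints are 2D `(x, y)` tuples that have already been filtered, trunk length is taken as the distance from the mid-shoulder point to the mid-hip point (the text does not define it), and the joint names and function names are hypothetical.

```python
import math

def midpoint(a, b):
    """Midpoint of two 2D points."""
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)

def trunk_length(frame):
    """Mid-shoulder to mid-hip distance (assumed definition of trunk length)."""
    mid_shoulder = midpoint(frame["l_shoulder"], frame["r_shoulder"])
    mid_hip = midpoint(frame["l_hip"], frame["r_hip"])
    return math.dist(mid_shoulder, mid_hip)

def normalize_window(frames):
    """Center every joint on its frame's mid-hip point, then divide by the
    trunk length averaged over the window of frames."""
    scale = sum(trunk_length(f) for f in frames) / len(frames)
    normalized = []
    for f in frames:
        hx, hy = midpoint(f["l_hip"], f["r_hip"])
        normalized.append(
            {name: ((x - hx) / scale, (y - hy) / scale) for name, (x, y) in f.items()}
        )
    return normalized
```

Averaging the trunk length over a window, rather than using the per-frame value, keeps the scale stable when a single frame has a poor pose estimate.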

Feature Extraction

From the pre-processed joints $\{x_{1}, y_{1}, x_{2}, y_{2}, \dots, x_{18}, y_{18}\}$, we extract kinematic feature vectors $x = \{a, p, d, v\}^{T}$ using a sliding window $w$. Each vector consists of joint angle $a$, joint position $p$, inter-joint distance $d$, and joint velocity $v$.
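As a rough illustration of three of these feature types, the sketch below computes a joint angle, an inter-joint distance, and a window-averaged velocity from 2D joint coordinates. The text does not specify which joints, angle conventions, or velocity estimator were used, so the formulas and function names here are assumptions chosen for simplicity.

```python
import math

def joint_angle(a, b, c):
    """Unsigned angle in degrees at joint b between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return abs(math.degrees(math.atan2(cross, dot)))

def inter_joint_distance(a, b):
    """Euclidean distance between two joints."""
    return math.dist(a, b)

def joint_velocity(track, fps):
    """Mean velocity over a window: displacement between the first and last
    samples divided by the elapsed time (assumed estimator)."""
    (x0, y0), (x1, y1) = track[0], track[-1]
    dt = (len(track) - 1) / fps
    return ((x1 - x0) / dt, (y1 - y0) / dt)
```

For example, `joint_angle(hip, knee, ankle)` would give a knee flexion angle, and `joint_velocity` applied to the wrist track over the window $w$ would give a hand velocity.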

Classification

Consider a data matrix $X = \{x_{ln}\} \in \mathbb{R}^{L \times N}$, where each element $x_{ln}$ denotes the $n$-th $(1 \leq n \leq N)$ observation of a random variable $X_{l}$ for feature $l$ $(1 \leq l \leq L)$. Each observation $x_{n} = [x_{1n}\ x_{2n}\ \dots\ x_{Ln}]^{T}$ has a class label $y_{n} \in \{1, 2\}$. The TSP algorithm seeks a feature pair $(i, j)$ $(1 \leq i, j \leq L;\ i \neq j)$ for which the probability of $X_{i} < X_{j}$ differs significantly between classes 1 and 2. For a feature pair $(i, j)$, the conditional ordering probability $p_{ij}(c)$ is defined as $p_{ij}(c) := P(X_{i} < X_{j} \mid Y = c)$ for $c \in \{1, 2\}$. That is, $p_{ij}(1)$ is the probability that feature $j$ is greater than feature $i$ when $c = 1$, and $p_{ij}(2)$ is the same probability when $c = 2$. For the data matrix $X$, these two probabilities are estimated by the relative frequencies of $X_{i} < X_{j}$ within class $c$, that is,

$$\hat{p}_{ij}(c) = \frac{\sum_{n=1}^{N} I(x_{in} < x_{jn})\, I(y_{n} = c)}{\sum_{n=1}^{N} I(y_{n} = c)}, \tag{1}$$

where $I(\cdot)$ is an indicator function that equals one when the statement inside is true and zero otherwise. The score of a feature pair $(i, j)$, which measures the discriminative power of the pair, is defined as $\Delta_{ij} := |p_{ij}(1) - p_{ij}(2)|$. To find the best feature pair $(i^{*}, j^{*})$, called the TSP, the TSP algorithm estimates the scores $\Delta_{ij}$ for every distinct pair and selects the pair with the highest score.
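The pair search can be sketched in a few lines. This is an illustrative sketch, not the authors' code: it stores the data matrix as a list of $L$ feature rows with $N$ values each, uses 0-based feature indices, and scans only unordered pairs, which is sufficient when no two features tie because the score is then symmetric in $i$ and $j$. The function names and toy data are hypothetical.

```python
from itertools import combinations

def ordering_prob(X, y, i, j, c):
    """Relative frequency of x_i < x_j among observations of class c (Equation 1)."""
    in_class = [n for n, label in enumerate(y) if label == c]
    hits = sum(1 for n in in_class if X[i][n] < X[j][n])
    return hits / len(in_class)

def top_scoring_pair(X, y):
    """Return the feature pair with the largest score and that score."""
    best_pair, best_score = None, -1.0
    for i, j in combinations(range(len(X)), 2):
        score = abs(ordering_prob(X, y, i, j, 1) - ordering_prob(X, y, i, j, 2))
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score

# Toy data: 3 features, 6 observations, classes 1 and 2.
X = [
    [1, 2, 1, 5, 6, 5],  # feature 0
    [3, 4, 3, 2, 1, 2],  # feature 1
    [2, 5, 0, 4, 7, 1],  # feature 2
]
y = [1, 1, 1, 2, 2, 2]
print(top_scoring_pair(X, y))  # ((0, 1), 1.0)
```

In the toy data, feature 0 is always below feature 1 in class 1 and always above it in class 2, so the pair reaches the maximum score of 1. The search only compares values within an observation, so it depends on feature ranks rather than magnitudes.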

Suppose we derive $k$ feature pairs with scores comparable to the TSP from the data matrix $X$. The $k$-TSP classifier $f_{k\text{-TSP}}$ produces a prediction for a new observation $x_{\text{new}} = [x_{1,\text{new}}\ x_{2,\text{new}}\ \dots\ x_{L,\text{new}}]^{T}$ as follows:

$$\hat{y}_{\text{new}} = f_{k\text{-TSP}}(x_{\text{new}}) = \begin{cases} 1, & \text{if } \sum_{u=1}^{k} I\left(f_{\text{TSP}}^{u}(x_{\text{new}}) = 2\right) < c_{k}, \\ 2, & \text{if } \sum_{u=1}^{k} I\left(f_{\text{TSP}}^{u}(x_{\text{new}}) = 2\right) \geq c_{k}, \end{cases} \tag{3}$$

where $c_{k}$ is the number of votes required to classify the new observation as class 2. We determined the hyper-parameters $k$ and $c_{k}$ to be 9 and 3, respectively.
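The voting rule can be sketched as below. The single-pair rule is not shown in this text, so the sketch assumes the standard TSP decision: each pair votes for the class in which the observed ordering of its two features is more likely. The tuple layout `(i, j, class_if_less)`, the function names, and the toy values are hypothetical.

```python
def tsp_vote(x, pair):
    """Vote of one pair. `pair` is (i, j, class_if_less): the class predicted
    when x[i] < x[j]; the other class is predicted otherwise (assumed rule)."""
    i, j, class_if_less = pair
    if x[i] < x[j]:
        return class_if_less
    return 2 if class_if_less == 1 else 1

def k_tsp_predict(x, pairs, c_k):
    """Predict class 2 when at least c_k of the pairs vote for class 2."""
    votes_for_2 = sum(1 for p in pairs if tsp_vote(x, p) == 2)
    return 2 if votes_for_2 >= c_k else 1

# Toy example with k = 3 pairs over a 4-feature observation.
x_new = [0.2, 0.5, 0.9, 0.1]
pairs = [(0, 1, 2), (2, 3, 1), (1, 2, 1)]
print(k_tsp_predict(x_new, pairs, c_k=2))  # 2 (two of three pairs vote for class 2)
print(k_tsp_predict(x_new, pairs, c_k=3))  # 1
```

With the reported settings of $k = 9$ and $c_{k} = 3$, a prediction needs only nine comparisons and a count. Setting $c_{k}$ below a simple majority biases the ensemble toward class 2.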

Results

For validation, we created a benchmark video dataset for lifting action recognition in MMH by conducting experiments involving 25 healthy participants. Each participant completed four sessions, during which they performed seven common MMH activities: pushing, pulling, sitting, standing, walking, and lifting.

Classification Performance

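We applied the $k$-TSP classifier to the MMH dataset above. The baseline classifiers, namely random forest (RF), support vector machine (SVM), multilayer perceptron (MLP), and long short-term memory (LSTM) network, were built on the same kinematic features used to identify the TSPs. Table 1 shows the performance results of each classifier.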

Table 1.

Performance Comparison. The Latency is the Run Time It Takes for a Classifier to Produce a Single Prediction.

Classifier	Accuracy	Precision	Recall	F1-Score	Latency (ms)
$k$-TSP	0.89	0.89	0.93	0.91	0.062
RF	0.91	0.91	0.95	0.93	5.057
SVM	0.91	0.98	0.88	0.93	0.774
MLP	0.93	0.97	0.92	0.94	2.380
LSTM	0.97	0.98	0.98	0.98	311.188

Discussions

All classifiers scored 0.88 or higher on every metric. The $k$-TSP classifier has the lowest latency at 0.062 ms per prediction, roughly 12 times faster than the next-fastest baseline (SVM, 0.774 ms) and about 5,000 times faster than the LSTM (311.188 ms), while its accuracy is 0.02 to 0.08 lower than that of the baselines. It is fast because each prediction compares only a small set of feature pairs. This efficiency is essential for real-time monitoring of lifting tasks on platforms with limited hardware resources, such as mobile devices and embedded systems. As a result, the proposed method offers a practical and cost-effective ergonomic solution for various MMH environments.

This study has several limitations. First, the use of 2D pose estimation may lead to inaccuracies in feature extraction due to the projection of joints onto a 2D plane. Second, the window length should be adjusted based on the frame rate of the videos being processed to maintain consistent temporal resolution for feature extraction. Third, a lift counting algorithm based on class predictions needs to be developed to count lifts and assess the risks associated with lifting frequency.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This manuscript is based upon work supported by the National Science Foundation under Grant # 2013451.

ORCID iDs

SeHee Jung

Bingyi Su

Liwei Qing

References

Bazarevsky, Grishchenko, Raveendran, Zhu, Zhang, & Grundmann (2020). BlazePose: On-device real-time body pose tracking. https://arxiv.org/abs/2006.10204v1

Desmoulin, G. T., Pradhan, & Milner, T. E. (2020). Mechanical aspects of intervertebral disc injury and implications on biomechanics. Spine, 45(8), E457–E464. https://doi.org/10.1097/BRS.0000000000003291

Dolan, & Adams, M. A. (1998). Repetitive lifting tasks fatigue the back muscles and increase the bending moment acting on the lumbar spine. Journal of Biomechanics, 31(8), 713–721. https://doi.org/10.1016/S0021-9290(98)00086-4

Han, Mao, & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. 4th International Conference on Learning Representations, ICLR 2016, Conference Track Proceedings. https://arxiv.org/abs/1510.00149v5

Luo, Wang, Weng, Hsu, L. T., & Chen (2021). Integration of GNSS and BLE technology with inertial sensors for real-time positioning in urban environments. IEEE Access, 9, 15744–15763. https://doi.org/10.1109/ACCESS.2021.3052733

Maher, Underwood, & Buchbinder (2017). Non-specific low back pain. The Lancet, 389(10070), 736–747. https://doi.org/10.1016/S0140-6736(16)30970-9

Marras, W. S., Parakkat, Chany, A. M., Yang, Burr, & Lavender, S. A. (2006). Spine loading as a function of lift frequency, exposure duration, and work experience. Clinical Biomechanics, 21(4), 345–352. https://doi.org/10.1016/J.CLINBIOMECH.2005.10.004

Meng