Abstract
Frequent or prolonged manual material handling (MMH) is a major risk factor for work-related musculoskeletal disorders, which cause considerable health and economic burdens. Assessing physical exposures is essential for identifying high-risk tasks and implementing targeted ergonomic interventions. However, variability in MMH task performance across individuals and work settings complicates physical exposure assessments. Further, conventional tools often suffer from limitations such as bias, discomfort, behavioral interference, and high costs. Non-contact (ambient) methods and automated data collection and analysis present promising alternatives for assessing physical exposure. We investigated the use of vision transformers and recurrent neural networks for non-contact MMH task classification from RGB video of eight simulated MMH tasks. Spatial features were extracted using the Contrastive Language-Image Pre-training (CLIP) vision transformer, then classified by a Bidirectional Long Short-Term Memory (BiLSTM) model to capture temporal dependencies between video frames. Our model achieved a mean accuracy of 88% in classifying MMH tasks, demonstrating performance comparable to methods using depth cameras or wearable sensors, while potentially offering better scalability and feasibility in real work environments. Future work includes improving temporal modeling, integrating task-adapted feature extraction, and validating across more diverse workers and occupational environments.
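The two-stage architecture described above (per-frame spatial features fed to a bidirectional recurrent classifier) can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' implementation: the feature dimension (512), hidden size, and sequence length are assumptions, and a random tensor stands in for the CLIP per-frame embeddings that the paper extracts from RGB video.

```python
import torch
import torch.nn as nn

class BiLSTMTaskClassifier(nn.Module):
    """Bidirectional LSTM over per-frame embeddings -> MMH task logits."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256, n_tasks: int = 8):
        super().__init__()
        # Bidirectional LSTM captures temporal dependencies across frames
        # in both directions; output at each step is 2 * hidden wide.
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_tasks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim) -- one embedding per video frame
        out, _ = self.lstm(x)           # (batch, frames, 2 * hidden)
        return self.head(out[:, -1])    # classify from the final time step

# Stand-in for CLIP features: 4 clips x 30 frames x 512-dim embeddings.
frames = torch.randn(4, 30, 512)
logits = BiLSTMTaskClassifier()(frames)
print(logits.shape)  # one score per task for each clip: (4, 8)
```

In practice the 512-dimensional inputs would come from a frozen CLIP vision transformer applied frame by frame, and the logits would be trained with cross-entropy against the eight task labels.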
