Abstract
This study developed a robotic arm grasping system based on a vision-language-action (VLA) model comprising three parts: language, vision, and action. GPT-3.5-Turbo served as the language model, and YOLOv7 served as the vision model for identifying object positions. The system simulates factory object-grasping tasks, accepting natural-language inputs with nonfixed word order and translating them into system commands through the language model. In addition, Google speech recognition was integrated with a graphical user interface to support diversified input methods. The system uses the vision model to check whether the corresponding objects exist in the gripping area and converts the detected image points into six parameters for the robotic arm, directing it to the expected position to complete the task. After fine-tuning the GPT language model with Chinese and English data, the average command-matching rate across all six tested languages was 97.18%. In the experimental validation of the VLA model, the YOLOv7 vision model achieved an average object-detection accuracy of 95.02%. With both voice and text inputs in Chinese, the GPT model achieved a recognition success rate of 91.67%, and the overall system achieved an object pick-up rate of 94%, corresponding to 94 successful grasps out of 100 intended targets. Owing to its adaptability to multilingual and diverse inputs, the proposed system can be applied to object-grasping tasks in environments such as factories. The novelty of this work lies in its practical, system-level integration of a multilingual natural-language interface, real-time vision-based object detection, and robotic grasp execution within a modular VLA framework for industrial object manipulation.
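To illustrate the vision-to-action step described above (converting a detected image point into the six parameters sent to the robotic arm), the following is a minimal sketch, not the authors' code: the camera intrinsics, hand-eye transform, function name, and fixed grasp orientation are illustrative assumptions rather than values reported in the paper.

```python
# Minimal sketch (assumed values, not from the paper): back-project a YOLOv7 detection
# centre into the robot base frame and attach a fixed grasp orientation, yielding the
# six arm parameters (x, y, z, roll, pitch, yaw) mentioned in the abstract.
import numpy as np

# Assumed pinhole camera intrinsics from a prior calibration (fx, fy, cx, cy).
K = np.array([[615.0,   0.0, 320.0],
              [  0.0, 615.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Assumed 4x4 homogeneous transform from the camera frame to the robot base frame
# (hand-eye calibration result); identity rotation and a placeholder translation.
T_base_cam = np.eye(4)
T_base_cam[:3, 3] = [0.40, 0.00, 0.55]  # metres, purely illustrative

def pixel_to_arm_pose(u, v, depth_m, grasp_rpy=(np.pi, 0.0, 0.0)):
    """Convert a detection centre (u, v) at the measured depth into a 6-parameter pose."""
    # Pixel -> 3D point in the camera frame via the inverse intrinsics.
    p_cam = depth_m * np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Camera frame -> robot base frame via the hand-eye transform.
    p_base = T_base_cam @ np.append(p_cam, 1.0)
    x, y, z = p_base[:3]
    roll, pitch, yaw = grasp_rpy  # fixed top-down grasp orientation (assumption)
    return (x, y, z, roll, pitch, yaw)

# Example: centre of a detected bounding box at pixel (352, 268), 0.62 m from the camera.
print(pixel_to_arm_pose(352, 268, 0.62))
```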
