Abstract
This study developed a robotic arm grasping system based on a vision-language-action (VLA) model comprising three parts: language, vision, and action. GPT-3.5-Turbo served as the language model, and YOLOv7 served as the vision model for identifying object positions. The system simulates factory object-grasping tasks, accepting natural-language inputs with nonfixed word order and translating them into system commands through the language model. In addition, Google speech recognition was integrated with a graphical user interface to support diversified input methods. The system uses the vision model to check whether the corresponding objects exist in the gripping area and converts the detected image points into six parameters for the robotic arm, directing it to the expected position to complete the task. After fine-tuning the GPT language model with Chinese and English data, the average command-matching rate across all six tested languages was 97.18%. In the experimental validation of the VLA model, the YOLOv7 vision model achieved an average object-detection accuracy of 95.02%. With both voice and text inputs in Chinese, the GPT model achieved a recognition success rate of 91.67%, and the overall system achieved an object pick-up rate of 94%, corresponding to 94 successful grasps out of 100 intended targets. Owing to its adaptability to multilingual and diverse inputs, the proposed system can be applied to object-grasping tasks in environments such as factories. The novelty of this work lies in its practical, system-level integration of a multilingual natural-language interface, real-time vision-based object detection, and robotic grasp execution within a modular VLA framework for industrial object manipulation.
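To illustrate the vision-to-action step described above (converting a detected image point into the six parameters sent to the robotic arm), the following is a minimal sketch, not the authors' code: the camera intrinsics, hand-eye transform, function name, and fixed grasp orientation are illustrative assumptions rather than values reported in the paper.

```python
# Minimal sketch (assumed values, not from the paper): back-project a YOLOv7 detection
# centre into the robot base frame and attach a fixed grasp orientation, yielding the
# six arm parameters (x, y, z, roll, pitch, yaw) mentioned in the abstract.
import numpy as np

# Assumed pinhole camera intrinsics from a prior calibration (fx, fy, cx, cy).
K = np.array([[615.0,   0.0, 320.0],
              [  0.0, 615.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Assumed 4x4 homogeneous transform from the camera frame to the robot base frame
# (hand-eye calibration result); identity rotation and a placeholder translation.
T_base_cam = np.eye(4)
T_base_cam[:3, 3] = [0.40, 0.00, 0.55]  # metres, purely illustrative

def pixel_to_arm_pose(u, v, depth_m, grasp_rpy=(np.pi, 0.0, 0.0)):
    """Convert a detection centre (u, v) at the measured depth into a 6-parameter pose."""
    # Pixel -> 3D point in the camera frame via the inverse intrinsics.
    p_cam = depth_m * np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Camera frame -> robot base frame via the hand-eye transform.
    p_base = T_base_cam @ np.append(p_cam, 1.0)
    x, y, z = p_base[:3]
    roll, pitch, yaw = grasp_rpy  # fixed top-down grasp orientation (assumption)
    return (x, y, z, roll, pitch, yaw)

# Example: centre of a detected bounding box at pixel (352, 268), 0.62 m from the camera.
print(pixel_to_arm_pose(352, 268, 0.62))
```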
