Sage Journals: Discover world-class research

Abstract

Large language models (LLMs)-driven robotic systems have become a focal point. However, current methodologies predominantly rely on auxiliary components such as pre-defined motion primitives or pre-trained skills to facilitate physical environment actions, which constitutes the primary limitation for such approaches. This study presents the SeIn framework that breaks through this limitation by leveraging prompt-conditioned sequential generation capability of LLMs to directly output control actions for dual-arm nursing robot—specifically, sequences of 14-DoF target joint positions. Firstly, the LLMs were grounded in a MuJoCo physics environment, enabling the dual-arm nursing robot to generate control actions while receiving corresponding environmental observations. Subsequently, task-specific textual prompts were designed to guide the LLMs in task reasoning, observation processing, and generation of target joint position sequences. Finally, the framework was evaluated through comprehensive MuJoCo simulations, including: ablation experiments on prompt design; expressive action experiments across seven action types spanning three operational categories; performance evaluation experiments querying four distinct LLMs at 4- and 60-s execution durations; and a functional action experiment involving Reach-to-Object. Experimental results demonstrated that SeIn framework achieved 100% success rate in generating joint sequences for actions using task-specific text prompts and pre-trained LLMs (specifically, GPT-4o and DeepSeek-V3.1), without requiring auxiliary components.

Keywords

large language models dual-arm nursing robot low-level control actions target joint positions

Introduction

Dual-arm nursing robots have garnered extensive research interest.^1–4 They can enhance the quality of life for disabled/semi-disabled individuals with mobility impairments while addressing caregiver shortages. Current implementations predominantly utilize teleoperation, requiring continuous operator involvement. Recent advances in large language models (LLMs) have demonstrated significant capabilities in high-level semantic comprehension and logical reasoning.^5,6 Building on these capabilities, researchers have extended LLMs to robotics by driving robots with pre-trained LLMs for high-level task interpretation, task planning, and behavior selection.^7–13 Nevertheless, low-level action implementation remains dependent on auxiliary components including pre-defined motion primitives and pre-trained behavioral skills. Existing methodologies, constrained by the limited repertoire of available primitives and skills, prove inadequate for fulfilling the operational demands of nursing robots requiring diverse action execution.

Recent studies explore extending LLMs to the low-level control of robots. For example, VoxPoser¹⁴ employs LLMs for end-effector trajectory planning, yet necessitates an external trajectory optimizer for motion calculation. Similarly, Prompt2Walk¹⁵ applies LLMs to learn a walking feedback policy for a quadruped robotic dog, but it relies on a pre-trained RL policy to provide contextual history examples. Despite related research advances, insufficient robotic data of LLMs impedes direct low-level robot control. Prevailing academic consensus holds that pre-trained LLMs cannot directly generate robot low-level actions without auxiliary components.¹⁶

Current research inadequately verifies this hypothesis. In this work, the potential of leveraging pre-trained LLMs was examined to provide low-level control solutions for robots. A novel SeIn framework was facilitated to guide pre-trained LLMs to directly generate dense sequences of target joint positions for a dual-arm nursing robot through task-specific textual prompts. The operational principle of the SeIn framework is illustrated in Figure 1, with its core workflow as follows. Firstly, to make the LLMs useful for low-level robot control, the models are grounded in the MuJoCo physical environment. The textual prompts are considered key for controlling the nursing robot via LLMs. Task-specific textual prompts are designed, comprising four components: Task Description, Prior Knowledge of Action, Observation Information, and System Configuration. Through these operations, the SeIn framework invokes pre-trained LLMs to generate dense sequences of target joint positions, enabling physical environment action and environmental observations acquisition. Simulation experiments in MuJoCo test the execution effect of the SeIn framework across seven expressive actions in three categories and one functional action.

Figure 1.

LLMs generate low-level control actions. The SeIn framework grounds LLMs in the MuJoCo physical environment so that observations can be obtained and actions can be sent. Task-specific textual prompts guide the LLMs to reason the task of generate the normalized target joint position sequences. Subsequently, the target joint position sequence is tracked by the PD controller to update the robot’s state.

As a summary, the main contributions of this work are threefold:

(1) SeIn framework guides pre-trained LLM using only textual prompts to output dense sequences of target joint positions (normalized) at each inference step. These sequences undergo inverse normalization and are tracked by PD controller, enabling low-level action execution from text to motion without auxiliary components.

(2) Task-specific textual prompts are designed to guide LLMs in understanding the task, acquiring observations, and generating low-level control actions for the robot. Notably, the ablation experiments further revealed which settings in the prior knowledge and observation information facilitated the emergence of SeIn framework capabilities.

(3) A 14-degree-of-freedom (DoF) dual-arm nursing robot is simulated in a MuJoCo environment, demonstrating that the SeIn framework can be generalized to the execution of multiple actions.

Related work

Dual-arm nursing robot

Dual-arm nursing robots adopt humanoid arms connected to a torso and equipped with a mobile chassis. Representative systems include Twendy one,¹ RIBA-II,² RoNA,³ RescueBot,⁴ etc. These robots can perform a variety of tasks, including expressive actions (such as waving, hugging, etc.) and functional actions (such as cleaning the table, etc.).¹⁷ Nursing robotic systems primarily use teleoperation, and the operator comprehends and plans the tasks through the “human-in-the-loop” strategy. As for action execution, researchers have utilized latent embeddings of language commands as multi-task input context^18,19 and trained by behavior cloning,^20–22 offline reinforcement learning,²³ and goal-conditioned reinforcement learning.²⁴ However, these methods demand significant professional knowledge and lack task understanding and planning at the task level.^25–28

LLMs-based robots

LLMs trained on web-scale data demonstrate strengths in high-level semantic understanding and logical reasoning.^29–35 Researchers have applied pre-trained LLMs to robotic systems enabling robots to understand tasks, plan tasks, and choose behaviors. For example, Inner Monologue⁹ and Say-Can¹⁰ utilize LLMs for behavior selection to solve high-level planning problems. In ChatGPT for Robotics,¹¹ Code as Policies,⁷ ProgPrompt,³⁶ and Demo2Code,³⁷ researchers have utilized the code generation capabilities of LLMs to map high-level instructions into the code format for robot policies. Works including Text2Motion³⁸ and AutoTAMP³⁹ combine LLMs with traditional Task And Motion Planning (TAMP). However, these studies based on pre-trained LLMs typically rely on pre-defined motion primitives or pre-trained skills to perform low-level actions.^20,21,40

As shown in Figure 2, researchers have begun to explore the intersection between LLMs and low-level control of robots. VoxPoser¹⁴ and Language to Rewards⁴¹ explore the use of LLMs to generate high-reward regions for robots trajectory planning. However, they require an external trajectory optimizer to generate trajectories. Kwon et al.⁴² utilized LLMs to generate Python code that is executed by an interpreter and outputs a sequence of end-effector poses. The code generated by LLMs is prone to syntactic and semantic errors. Mirchandani et al.⁴³ employed LLMs as a general pattern machine and provided a controller for CartPole through sequence improvement techniques.⁴⁴ Subsequently, Prompt2Walk¹⁵ collected observations and actions from the RL controller to initialize prompts, employing LLMs as the walking feedback policy for the quadruped robot dog. However, it relies on a pre-trained RL policy to provide contextual history examples, and its performance is constrained by this RL policy.

Figure 2.

Comparison of the SeIn framework with existing robot research based on pre-trained LLMs. Existing studies primarily relies on auxiliary components, such as (a) pre-defined motion primitives or pre-trained skills, (b) trajectory optimizer, and (c) contextual history examples, to execute low-level actions. This work explores whether LLMs can directly generate target joint position sequences using textual prompts for robot low-level control, as shown in (d).

Despite the progress in related research, existing research still relies on auxiliary components other than LLMs for low-level control of the robot, and has not verified the potential for LLMs to directly generate low-level controls. This study investigates the application of pre-trained LLMs to generate dense sequences of target joint positions, enabling low-level control implementation in dual-arm nursing robot. Our research eliminates the need to fine-tune LLMs using specialized robot data, while getting rid of reliance on auxiliary components such as pre-trained skills, pre-defined motion primitives, trajectory optimizer, and contextual history examples.

Proposed method

This section describes the SeIn framework in detail with its pipeline overview shown in Figure 3. The SeIn framework grounds the LLMs in the MuJoCo, and guides the pre-trained LLMs through the task-specific textual prompts to generate dense sequences of target joint positions, which are tracked by a PD controller to execute of the dual-arm nursing robot.

Figure 3.

Overview of the pipeline. The task-specific textual prompts direct the pre-trained LLMs to understand the task, obtain the observations, and generate the normalized target joint positions. Subsequently, the denormalized target joint positions are followed by the PD controller. After each LLMs inference loop, the textual prompts are updated based on historical observations and actions.

Problem formulation

First, this section explains the assumptions and constraints of this study. Then, the robot control hierarchy corresponding for the low-level action execution is presented. Finally, this section explains how to perform normalization operations on observation and action. Building upon this foundation, subsequent sections will detail the grounding of LLMs and the design of task-specific textual prompts.

Assumptions and constraints

The SeIn framework is designed to investigate the ability of LLMs, to perform low-level robot control, with the following assumptions:

(1) No pre-defined motion primitives, pre-trained skills, RL policies, or trajectory optimizer. The use of textual prompts is investigated to invoke pre-trained LLMs to output a dense sequence of target joint positions necessary for the action execution.

(2) No contextual history examples. The ability of LLMs is investigated to invoke their internal knowledge for task reasoning and textual sequence generation.

(3) No need for robot data to train or fine-tune LLMs. This study focuses on invoking pre-trained LLMs. Consequently, the framework can be deployed and run even if local computational resources are limited.

(4) LLMs interact with the physical environment. Through grounding the LLMs in the MuJoCo physical environment, the SeIn framework can obtain historical observations and actions from MuJoCo, and subsequently, the LLMs are required to derive the desired action sequences.

Motion level planning

It is worth noting that the low-level control interface mentioned in studies related to LLMs-based robots is quite different from that in traditional robot control. In order to clarify the specific reference level of low-level control, Firoozi et al.⁴⁵ classifies the control level of robots into four levels: (1) Task Level: task goals, for example, tasks such as hugging, waving, etc.; (2) Skill Level: pre-trained skills such as pick [obj] up, put [obj] down, etc.; (3) Motion Level: predefined motion primitives that generate control commands, including target joint positions, etc.; (4) Servo Level: The three-loop control of position, velocity, and current operating within the servo drive.

In this work, the term “low-level control” refers to motion-level planning. It entails the direct generation of a dense sequence of target joint positions by the LLMs, which serves as the reference trajectory for the robot’s actuators. Specifically, in the configuration of dual-arm nursing robot, the robot consists of a mobile chassis, torso, left and right arms, etc., with a total of 14 degrees of freedom (DoF). Therefore, LLMs are required to output a 14-dimensional vector representing the target positions for all joints at each control step, for example: [50, 50, 50, 90, 90, 90, 50, 90, 50, 50, 80, 80, 80, 50, 80, 50]. They refer to the joint data corresponding to the torso (2-DoF), left arm (6-DoF), and right arm (6-DoF), respectively. This sequence of vectors constitutes the low-level control actions in SeIn framework.

Observation and action normalization

When processing numerical data, LLMs decompose the numerical input into a series of tokens, linguistic units that cannot directly correspond to the mathematical meaning of the numerical value. For example, the floating-point number 3.14 may be decomposed into several tokens, thus affecting the LLMs understanding of the numerical value. Meanwhile, the control data of the robot, such as joint position and joint velocity, are mainly floating-point numerical data, resulting in LLMs are not sensitive enough to the floating-point values of the robot. The core objective of normalization is to transform a robot’s continuous state space into a structured discrete representation. This converts complex regression problems—predicting continuous floating-point values—into discrete decision problems that are more manageable for LLMs. In the work of Prompt2Walk,^15,43 the researchers discretize the output of LLMs into non-negative integers from 0 to 200, and experiments show that this approach is effective because these integers are represented by individual tokens in the tokenizer of LLMs.

Building upon Prompt2Walk, the normalization operation is employed in this work to map all possible numerical values of the observations and actions of robot to non-negative integers ranging from 0 to 200. The corresponding mathematical expression for the normalization operation is:

y_{i} = ⌊ (round (x_{i}, 1) + 5) \times 10 ⌋

(1)

Where $x_{i}$ denotes the $i$ element in the observation and action, $round (x_{i}, 1)$ denotes rounding $x_{i}$ to one decimal place, and ⌊·⌋ denotes rounding operation. Subsequently, the normalized robot observations and actions are inputted to the LLMs, which output the normalized action sequence after inference.

Grounding LLMs

LLMs have difficulty comprehending and capturing causal relationships in the physical environment, fundamentally limiting their capability in low-level motion control and other functional tasks.^46–48 This is due to the lack of interaction data-based learning in pre-trained LLMs, where the training process involves LLMs learning by predicting the next token rather than through interaction-driven learning with environments. Their training goal makes LLMs limited in understanding and processing concepts from the real world or physical environment. Consequently, they lack grounded understanding of real-world dynamics, preventing direct environment interaction.

To enable LLMs for low-level robotic motion control, the Functional Grounding methodology⁴⁹ is employed to build the SeIn framework to ground the LLMs in the physical environment. The LLMs can obtain observations from MuJoCo, including the robot’s joint positions and joint velocities.

The SeIn framework enables the symbol processing inside the LLMs to interface with the external physical environment. So that SeIn is capable of simulating, predicting, and controlling external physical processes. The implementation comprises through three key steps. Firstly, based on the current observations, the LLMs directly generate Actions, which are sequences of target joint positions (normalized) required for the execution of action by the dual-arm nursing robot. The Action data is then tracked by a set of joint-level Proportional Differential (PD) controllers operating at a higher frequency to update the state of the dual-arm nursing robot in MuJoCo. Subsequently, the LLMs generate the next round of Actions based on historical observations and actions. The above process is repeated until the requirements of the target action are reached.

In this case, the joint-level PD controller is executed with the following equation:

τ_{sim} = ({target}_{q} - q) \times k_{p} + ({target}_{dq} - dq) \times k_{d}

(2)

${target}_{q}$ is the target joint position of the dual-arm nursing robot. ${target}_{dq}$ is the target joint velocity, set to $0$ in this implementation. $q$ and $dq$ are the observations, that is, the current joint position and joint velocity. $k_{p}$ and $k_{d}$ are the proportional and differential gains, respectively.

Prompt engineering

Unlike traditional learning-based and model-based controllers, textual prompts serve as the key to utilizing LLMs to drive the low-level control of a dual-arm nursing robot. The task-specific textual prompts are designed, which consist of four components: Task Description, Prior Knowledge of Action, Observation Information, and System Configuration, as illustrated in Figure 4.

(1) Task Description. Describe the basic information required for the task, including the role of LLMs in the task and the target action. LLM Role. Defining the role of the LLMs in the task, enabling the LLMs to understand its responsibilities and to generate task-relevant responses. Additionally, the control frequency of the LLMs policy is defined to ensure that the LLMs adjust its actions according to that frequency. Action Name. Specifying the task to be performed by the dual-armed care robot, in this case the hugging action. Providing an action name provides the LLMs with clear objectives to adapt its output to achieve this particular action.

(2) Prior Knowledge of Action. Defining prior knowledge about the target action, comprising two components: Action Execution Process and Target Joint Position. Action Execution Process. Describing the Action Execution Process required for the dual-arm nursing robot to accomplish the target action. By decomposing the target action into intuitive sub-steps, it enables the LLMs to understand the details of the action and directs the LLMs to generate an action-consistent output. Target Joint Position. Specifying the target position of a particular joint during action execution. By providing an accurate target joint position to the LLMs, it ensures that the LLMs outputs a joint angle that corresponds the target pose, thereby achieving the correct hugging configuration.

(3) Observation Information. Describing the observation information provided to the LLMs, comprising joint positions and joint velocities. The SeIn framework requires observation information to provide feedback to the robot during the execution of actions. Input Space. Defining the state representation of the robot, including joint positions (q) and joint velocities (dq). Input space provides the LLMs with the necessary observations required for understanding the current state of the robot. Output Space. Specifying the output structure for LLMs needs to generate, namely a 14-dimensional vector describing the target joint positions. Explicit output specification ensures LLMs-generated a specific data format, ensuring that the output is consistent with the requirements of the low-level controller.

(4) System Configuration. Defining the system configuration of the dual-arm nursing robot, including Joint Order, Control Pipeline, and Reasoning Constraints. Joint Order. Listing the specific sequential arrangement of robotic joints guiding the LLMs in understanding the mechanical configuration of robot. Requiring the LLMs needs to strictly follow these joint orders when generating outputs, so that the data generated by LLMs can correspond to the correct joints. Control Pipeline. Providing an overview of the whole control flow, specifically describing the use of the PD controller to track outputs. The control pipeline provides the LLMs with how the various components are handled and interconnected. Reasoning Constraints. Establishing guidelines for LLMs output generation, including smooth motion transitions between states, strict adherence to specified output formats. It is important to note that all values processed by the LLMs are raw data; these values have been normalized. These constraints ensure LLMs to produce physically plausible, controller-compatible outputs while maintaining consistency and predictability.

Figure 4.

Task-specific textual prompts. It consists of four components: Task description, prior knowledge of Action, observation information, and system configuration. These components enable LLMs to reason about the task, receive observations and generate the desired output. The final output comprises target joint positions.

Algorithmic details

The pseudo-code of the algorithm is shown in Algorithm 1, where $o_{t}$ denotes the observation information of the current round, including the joint position $q$ and joint velocity $dq$ . LLMs output is the normalized Target Joint Position ${action}_{t}$ , and $prompt$ denotes the task-specific textual prompts.

Algorithm 1. SeIn Framework.
Require: Joint position $q$ , joint velocity $dq$ , PD parameters $k_{p}, k_{d}$ . Require: Maximum number of iterations $K$ , frequency of LLMs $d$ . Require: History buffer $H$ . 1: Initialize $t \leftarrow 1$ . 2: $H . empty ()$ 3: $o_{t} \leftarrow Env . reset ()$ . 4: while $t < K$ do 5: $o_{t} \leftarrow Normalization (Env . get ())$ . 6: if $(t \mod d) = = 0$ then 7: $H . update ({action}_{t - 1}, o_{t})$ . 8: ${action}_{t} \leftarrow LLMs (prompt, H)$ . 9: end if 10: ${target}_{q} \leftarrow Inverse_Normalization ({action}_{t})$ . 11: $q, dq \leftarrow Inverse_Normalization (o_{t})$ . 12: $Env . step (PD_Control (k_{p}, k_{d}, {target}_{q}, q, dq))$ . 13: $t \leftarrow t + 1$ . 14: end while

Algorithm 1. SeIn Framework.

Require: Joint position

q

, joint velocity

dq

, PD parameters

k_{p}, k_{d}

.
Require: Maximum number of iterations

K

, frequency of LLMs

d

.
Require: History buffer

H

.
1: Initialize

t \leftarrow 1

.
2:

H . empty ()

o_{t} \leftarrow Env . reset ()

.
4: while

t < K

do
5:

o_{t} \leftarrow Normalization (Env . get ())

.
6: if

(t \mod d) = = 0

then
7:

H . update ({action}_{t - 1}, o_{t})

.
8:

{action}_{t} \leftarrow LLMs (prompt, H)

.
9: end if
10:

{target}_{q} \leftarrow Inverse_Normalization ({action}_{t})

.
11:

q, dq \leftarrow Inverse_Normalization (o_{t})

.
12:

Env . step (PD_Control (k_{p}, k_{d}, {target}_{q}, q, dq))

.
13:

t \leftarrow t + 1

.
14: end while

In each loop, SeIn framework first retrieves the current state of the robot from the MuJoCo environment, which includes 14-dimensional joint positions ( $q$ ) and 14-dimensional joint velocities ( $dq$ ). 28-dimensional floating-point vector is then processed by the normalization function to convert it into a sequence of non-negative integers, making it more suitable for the LLM’s token-based processing. This normalized observation ( $o_{t}$ ) is appended to a history buffer ( $H$ ), which maintains a record of past observations and actions. The complete, updated textual prompt, including the static task-specific information and the dynamic history buffer, is then sent to the pre-trained LLM. The LLM, operating at a inference frequency of 5 Hz, generates a sequence of 14 integers, representing the target joint positions for the next step. This integer sequence is then passed through an inverse normalization function to convert it back into a floating-point target joint position vector ( $targe t_{q}$ ). Finally, the vector is used as the setpoint for the high-frequency PD controller, which runs at 100 Hz and calculates the required torques to drive the robot’s joints accordingly.

Experiments

In this paper, the simulation model of a dual-arm nursing robot is developed in SolidWorks.⁵⁰ Simulation experiments were conducted in the MuJoCo⁵¹ physical environment, and three categories (totaling seven actions) were tested. Through extensive experiments, we aim to answer the following questions:

Question 1: How does the prior knowledge in the task-specific textual prompts affect the execution of actions in the SeIn framework?

Question 2: How should observation information be configured to achieve optimal performance in the SeIn framework?

Question 3: Can the SeIn framework be generalized to multiple action executions?

Question 4: What is the optimal normalization range for Observation and Action?

Question 5: How stable is the SeIn framework in the long-term when querying different LLMs to perform expressive actions?

Action design

ELEGNT¹⁷ categorizes robotic actions into functional actions and expressive actions. Expressive actions (e.g. waving and hugging) play a supporting role in the caregiving process. These types of actions can convey comfort, encouragement, and provide emotional support to the care recipient. Critically, this type of action does not require interaction with objects, the difficulty of executing the action is relatively low, and it is insensitive to the speed requirement. Based on these considerations, this study focuses on expressive actions for experimentation, covering a total of seven distinct actions, as illustrated in Figure 5.

Figure 5.

Seven expressive actions. Three categories are involved: single-arm execution of actions, dual-arm cooperative execution of actions, and dual-arm and torso cooperative execution of actions. The SeIn framework is validated for its execution performance across multiple movements.

Based on the motion patterns of the left and right arms, expressive actions are categorized into three types, as shown in Table 1 from left to right:

Single-arm execution of actions. Wave arm: Either the left or right arm performs the waving action separately while the opposite arm maintains its initial posture.

Dual-arm cooperative execution of actions. Hugging, Turn Left and Turn Right actions: The left and right arms need to execute different actions when executing. Arm Exercises: the left and right arms execute the same actions.

Dual-arm and torso cooperative execution of actions. Arms Up and Bend Down to Welcome: requiring both arms to perform the same movement, that is, the torso finishes bending after simultaneously raising the right and left arms.

Table 1.

Categorization of expressive actions.

Single-arm execution	Dual-arm cooperative execution	Dual-arm & torso cooperative execution
Wave right arm	Hugging	Arms up and bend down to welcome
Wave left arm	Turn left
	Turn right
	Arm exercises

Experimental set-up and evaluation metrics

Based on the structure of the prototype of the dual-arm nursing robot, the simulation model of the robot is developed in SolidWorks. The simulation model consists of mobile chassis, torso, left and right arms, etc., with a total of 14-DoF. The left and right arms adopt a symmetric structure, and each arm incorporates shoulder, elbow and wrist joints, and each joint has 2-DoF. The torso is composed of lumbar and hip joints, with 1-DoF for each joint.

Experiments were conducted in the MuJoCo physical environment. The PD control parameters for the controller are specified in Table 2. First, we conducted ablation experiments using GPT-4o-2024-08-06 for prompt design, along with expressive action experiments covering seven action types across three categories, to preliminarily evaluate the performance of the SeIn framework. The success rate was set as an evaluation metric to evaluate the performance of the SeIn framework. Five trials were performed for each action, and each trial lasted 4 s. Within 4 s, the experiment was deemed successful if the dual-arm nursing robot executed the set target action.

Table 2.

PD control parameters.

	Torso		Left arm						Right arm
Param	1	2	1–6						1–6
Kp	1000	500	500	500	250	250	250	250	500	500	250	250	250	250
Kd	25	12.5	12.5	12.5	12.5	6.25	6.25	6.25	12.5	12.5	6.25	6.25	6.25	6.25

Note. Units are Nm/rad for Kp and Nms/rad for Kd.

Subsequently, experiments were conducted to determine the optimal normalization range, and the framework’s performance was evaluated when querying different LLMs under execution durations of 4 and 60 s. Finally, experiments were conducted on functional action involving Reach-to-Object. The specific experimental settings for each of these will be detailed within their respective experiments.

Analysis of experimental results

First, we try to answer Question 1: How does the prior knowledge in the task-specific textual prompts affect the execution of actions in the SeIn framework?

The Action Execution Process and The Target Joint Position in the prior knowledge were selected for the experiment, and two configurations were set up: (1) Contains only the Action Execution Process. (2) Contains both the Action Execution Process and the Target Joint Position. In the experiment, the Hugging action was selected for testing, and the observation information included joint position and joint velocity. The results are shown in Table 3.

Table 3.

The effect of prior knowledge on action execution.

Prior knowledge	Success rate (%)
(1) Action execution process	0
(2) Action execution process + Target joint position	100

Experiments incorporating only Action Execution Process in the prior knowledge achieved 0% success rate (all experiments failed). In contrast, trials combining both Action Execution Process and Target Joint Position in the prior knowledge, achieved 100% success rate. Comparative analysis between failed experiments and the successful experiments to visualize how prior knowledge affects the execution of embracing movements.

As shown in Figure 6, the visualized Action Execution Process indicates that when only relying on (1) the Action Execution Process as the prior knowledge of LLMs, although the robot attempts the hugging action, but collisions occurred between the arms during the movement, and joint positions cannot be attained the Target Joint Position needed for the hugging action. However, with the addition of (2) the Target Joint Position, the arms did not collide during the movement, and the hugging action was successfully completed with a success rate of 100%.

Figure 6.

Comparison of the visualized execution process of the hugging action. In the first and second rows of pictures, the robot’s left and right arms collide during the lifting process. In contrast, in the third row, the robot’s left and right arms do not collide during the lifting process. Finally, both arms reach the target positions and successfully completed the hugging action.

Question 1 is answered through the ablation experiments. The synergistic collaboration between the Action Execution Process and the Target Joint Position is the key to achieving the low-level motion (action execution) using LLMs to drive the dual-arm nursing robot. Specifically, the Action Execution Process helps LLMs understand the sub-steps required for the action, while the Target Joint Position ensures that the LLMs output the joint position corresponding to the desired target pose.

Subsequently, we attempted to answer question 2: How should observation information be configured to achieve optimal performance in the SeIn framework?

We evaluated the following four types of observation information: (1) no observation, (2) joint position, (3) joint velocity, (4) joint position + joint velocity. Similarly, the hugging action was selected for the experiment. Based on the previous experimental results, the prior knowledge contains both the Action Execution Process and the Target Joint Position.

Experimental results are summarized in Table 4. Among them, in the experiments with no observation information, all five experiments failed. In the experiments where the observation information was either joint position or joint velocity, both achieved a success rate of 40%. Finally, the SeIn framework achieved the best performance with 100% success rate in the case where joint position and joint velocity were observed simultaneously.

Table 4.

Experimental results for different observation information.

Observation information	Success rate (%)
No observation	0
Joint position	40
Joint velocity	40
Joint position + joint velocity	100

Question 2 is answered through observational ablation studies. The configuration of observation information is crucial. Although the incorporation of prior knowledge provides LLMs with essential information such as sub-steps and target joint positions required for the action, the SeIn framework requires both joint positions and joint velocities as simultaneous observation information to provide feedback to the robot during the execution of the action. And one cannot be used without the other.

Then, we try to answer question 3: Can the SeIn framework be generalized to multiple action executions?

To investigate this, seven expressive actions (e.g. Hugging and Wave Arm) were designed, as shown in Figure 5. Consistent with empirical protocols, the prior knowledge integrates both the Action Execution Process and the Target Joint Position, and the observation information contains both the joint position and the joint velocity. Quantitative results are shown in Table 5. The SeIn framework can be applied to a wide range of actions.

Table 5.

Results of the seven actions.

Action type	Action name	DoF	Success rate (%)
Single-arm execution	Wave right arm	6	100
Single-arm execution	Wave left arm	6	100
Dual-arm execution	Turn left	7	100
	Turn right	7	100
	Hugging	12	100
	Arm exercises	12	100
Dual-arm and torso execution	Arms up and bend down to welcome	14	100

As shown in Table 5, the SeIn framework executes seven distinct actions, covering three categories: single-arm execution, dual-arm-cooperative execution, and dual-arm-torso cooperative execution, achieving 100% success rate across all categories. This demonstrates that the SeIn framework is able to accurately generate the required joint position sequences for target actions, ranging from single-arm actions with 6-DOF to full-body motions involving 14-DOF.

The execution process of the seven actions is shown in Figure 7. Three critical outcomes are observed: (1) the target joint successfully reached their target position at the completion of the action. (2) There is no collision of the left and right arms during the execution process. (3) There is no confusion between the left and right arms during the execution process.

Figure 7.

Execution process of the seven actions.

Based on the above experiments and analysis, we can now answer question 3. The SeIn framework extends to the execution of seven distinct actions, and the experimental results demonstrate the full kinematic feasibility of the SeIn framework. At the same time, the task-specific textual prompts can be rapidly reconfigurable to different action requirements, demonstrating the potential of the SeIn framework to extend to a number of different actions.

Next, we attempted to answer question 4: What is the optimal normalization range for Observation and Action?

To determine the optimal normalization range, we systematically compared four ranges: [0–100], [0–200], [0–500], and [0–1000]. Experiments were conducted with GPT-4o-2024-08-06 and Qwen2.5-72B-Instruct using identical prompts. The experimental setup is as follows: LLM inference frequency (5 Hz), evaluation duration (4 s), all seven expressive actions, repetition count (5 repetitions per action). Decoding parameters are uniformly set to temperature = 0.0. Outputs employed structured constraints (JSON + regular expression validation) to minimize failures caused by format mismatches. Success is determined by the robot achieving the target pose without collision and without confusion between left-right arms.

As shown in Figure 8. When handling Single-Arm Execution, GPT-4o and Qwen2.5 achieved higher success rates across all four normalization range. Notably, When performing Arm Exercises in Dual-Arm Execution, both models encountered challenges: Qwen2.5 achieved 60% success in the [0–200] range and 0% in [0–500]; GPT-4o performed best at 100% success in [0–200]. Finally, when processing Arms Up and Bend Down to Welcome: Qwen2.5 achieves a 60% success rate in [0–200], [0–500], and [0–1000], while its success rate drops to 0% in the [0–100]; GPT-4o achieved 100% success rate in [0–200] range, demonstrating optimal performance. Overall, the [0–200] normalization range demonstrated superior performance.

Figure 8.

Normalization range evaluation results in seven expressive actions. The experiment systematically compared four ranges: [0–100], [0–200], [0–500], and [0–1000].

Finally, we try to answer question 5: How stable is the SeIn framework in the long-term when querying different LLMs to perform expressive actions?

To investigate this, we extended the evaluation duration from the 4 to 60 s. We selected the following models for comparative experiments: (1) Closed-source models: GPT-4o-2024-08-06; (2) Open-source models: LLaMA-3.3-70B-Instruct, Qwen2.5-72B-Instruct, DeepSeek-V3.1. The experimental setup is as follows: identical prompts, LLM inference frequency (5 Hz), all seven expressive actions, repetition count (5 repetitions per action), with the normalization scale set to [0–200].

As shown in Table 6, where 4-SR and 60-SR represent the success rates of the SeIn framework during evaluation durations of 4 and 60 s, respectively. Both GPT-4o and DeepSeek-V3.1 models achieved 100% success rates across all 7 actions and 35 trials, maintaining stable performance in both short-term (4 s) and long-term (60 s) tests. Qwen2.5-72B-Instruct demonstrated strong performance in low-DoF tasks (DoF ≤ 7), with its 60-SR performance matching or slightly fluctuating around its 4-SR performance. However, its performance declined in high-DoF tasks, indicating bottlenecks in long-term temporal reasoning and error correction for complex actions. LLaMA-3.3-70B-Instruct’s performance is closely tied to task complexity. For high-DoF tasks (DoF ≥ 12)—namely Hugging, Arm Exercises, and Arms Up and Bend Down to Welcome—the model completely fails (achieving 0% success in both 4-SR and 60-SR), unable even to execute short-term control for these high-DoF actions.

Table 6.

Evaluation results for querying various LLMs under 4- and 60-s durations.

Model	Action type	Action name	DoF	4-SR (%)	60-SR (%)
GPT-4o-2024-08-06	Single-arm execution	Wave right arm	6	100	100
		Wave left arm	6	100	100
	Dual-arm execution	Turn left	7	100	100
		Turn right	7	100	100
		Hugging	12	100	100
		Arm exercises	12	100	100
	Dual-arm and torso execution	Arms up and bend down to welcome	14	100	100
DeepSeek-V3.1	Single-arm execution	Wave right arm	6	100	100
		Wave left arm	6	100	100
	Dual-arm execution	Turn left	7	100	100
		Turn right	7	100	100
		Hugging	12	100	100
		Arm exercises	12	100	100
	Dual-arm and torso execution	Arms up and bend down to welcome	14	100	100
Qwen2.5-72B-Instruct	Single-arm execution	Wave right arm	6	100	100
		Wave left arm	6	100	80
	Dual-arm execution	Turn left	7	100	100
		Turn right	7	100	100
		Hugging	12	100	40
		Arm exercises	12	60	20
	Dual-arm and torso execution	Arms up and bend down to welcome	14	60	60
LLaMA-3.3-70B-Instruct	Single-arm execution	Wave right arm	6	100	60
		Wave left arm	6	100	100
	Dual-arm execution	Turn left	7	100	100
		Turn right	7	60	60
		Hugging	12	0	0
		Arm exercises	12	0	0
	Dual-arm and torso execution	Arms up and bend down to welcome	14	0	0

Through the above experiments and analysis, we can answer Question 5. While the long-term stability of the SeIn framework correlates with the inherent capabilities of the selected LLMs. It is crucial to emphasize that for top-tier LLMs, the SeIn framework actively suppresses error accumulation through its observation information feedback, demonstrating its potential for achieving long-term stability.

Reach-to-object

To further test the potential of the SeIn framework in handling more complex, functional actions requiring interaction with the environment, we introduced the Reach-to-Object task. The task execution flow is as follows: The robot must first obtain the world coordinates of the red block in the MuJoCo in real-time, generating a target point with a fixed safety margin (+2 cm along the X-axis). Subsequently, it dynamically selects the arm closest to the target to execute the task. Then, it solves for the full target joint position in the robot base coordinate system. Finally, the arm selection decision, red block coordinates, discretized target q, and observation segments are provided to the LLM, which outputs a smooth intermediate joint sequence for the approach.

Experimental setup: GPT-4o-2024-08-06 and DeepSeek-V3.1 were selected for the experiment. Each arm performed 5 trials. Within 10 s, the robot arm end must reach within the target offset area without collisions or confusion between the left and right arms.

The execution process is illustrated in Figure 9. The robot’s left/right arm starts from a stationary position and gradually extends toward the red block placed on its left/right side. The entire execution process exhibits smooth and natural motion, with the robotic arm’s endpoint ultimately positioning within the target object area. No collisions or confusion between the left and right arms occurred during the execution. Performance metrics are presented in Table 7. When SeIn framework queried GPT-4o-2024-08-06 and Deepseek-V3.1, both achieved a 100% success rate. This shows that when handling more complex, environment-interactive functional actions (Reach-to-Object), the SeIn framework successfully generates the required target joint position sequences to guide the arm to the target location area.

Figure 9.

Execution process of the reach-to-object. The first and second rows of images show the motion of the left and right arms as they approach the red object area, respectively.

Table 7.

Success rate comparison for reach-to-object task.

Action type	DoF	Model	Arm name	Success rate (%)
Reach-to-object	14	GPT-4o-2024-08-06	Right arm	100
		GPT-4o-2024-08-06	Left arm	100
		Deepseek-V3.1	Right arm	100
		Deepseek-V3.1	Left arm	100

Experimental Summary: The SeIn framework generates target joint-position sequences by invoking pre-trained LLMs through task-specific textual prompts to accomplish multiple action executions. Specifically, the prior knowledge embedded in the task-specific textual prompts helps the LLMs precisely decode action kinematics, resulting in generating more accurate target joint positions. The observation information consisting of joint positions and joint velocities guarantees stability during execution. Such task-specific textual prompts are rapidly reconfigurable for heterogeneous action specifications to different action requirements. Ultimately, the SeIn framework accomplished the execution of seven expressive actions and demonstrated potential for handling one functional action in a simulated physical environment.

Limitations and future work

The inference speed of the SeIn framework is limited

The SeIn framework exhibits constrained inference speeds due to LLMs response latency (5 Hz inference rate) and PD controller tracking frequency (100 Hz). Simulation experiments indicate that the SeIn framework is currently best suited for expressive actions and task with lower real-time requirements. Although demonstrated in simulation, enhanced LLMs inference rates could enable accelerated execution in physical systems.

The textual prompts are still relatively fragile

Although the SeIn framework can perform seven actions, these actions are relatively simple. Empirical evidence confirms that the design of prior knowledge and observation information can greatly influence the final performance of the SeIn framework. Therefore, crafting more refined and optimized textual prompts may further exploit the potential of the SeIn framework to execute more complex actions (e.g. actions involving interaction with environmental objects or requiring multi-step coordination).

Conduct research on optimization and convergence

Subsequent studies will delve into the convergence properties of generated sequences and explore optimization frameworks to enhance their accuracy. For instance, optimization methods such as co-evolutionary neural dynamics (incorporating dual gradient accumulation)⁵² can be introduced.

This study was conducted solely in simulation

The real-world experiments have not yet been carried out. Future deployments will focus on the SeIn framework on a dual-arm nursing robot for real-world experiments to test its performance in real prototypes and scenarios.

Conclusion

The SeIn framework is proposed to integrate LLMs with MuJoCo simulations, leveraging for dense target joint-position sequences generation. Task-specific textual prompts are engineered with four components: Task Description, Prior Knowledge of Action, Observation Information, and System Configuration. Subsequently, ablation experiments systematically quantify prompts component impacts of task-specific textual on the SeIn framework. Through experiments with four distinct LLMs at execution durations of 4 and 60 s, the framework’s long-term stability was validated.

This work empirically validates that properly architected frameworks and optimized prompts enable LLMs to demonstrate task comprehension and low-level robotic control capability. It holds promise for simplifying the complex “high-level planning + low-level execution” separation architecture currently employed in LLM-driven robotic systems.

Footnotes

Handling Editor: Fernanda Coutinho

ORCID iDs

Zhendong Zhao

Jiexin Xie

Consent to participate

Informed consent was obtained from all individual participants included in the study.

Consent for publication

The authors confirm.

Author contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Zhendong Zhao, Jiexin Xie, Yang Li and Shijie Guo. The first draft of the manuscript was written by Zhendong Zhao and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by National Natural Science Foundation of China under Grant 62303154.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Code availability

The authors confirm.

References

Iwata

Sugano

. Design of human symbiotic robot TWENDY-ONE. In: 2009 IEEE international conference on robotics and automation, Kobe, Japan, 12–17 May 2009, pp.580–586. New York: IEEE.

Guo

Shiraoka

Inada

, et al. A two-ply polymer-based flexible tactile sensor sheet using electric capacitance. Sensors 2014; 14(2): 2225–2238.

Ding

Lim

Y-J

Solano

, et al. Giving patients a lift-the robotic nursing assistant (RoNA). In: 2014 IEEE international conference on technologies for practical robot applications (TePRA), Woburn, MA, USA, 14–15 April 2014, pp.1–5. New York: IEEE.

Chung

Yang

Cui

, et al. Design of a rescue robot with a wearable suit augmenting high payloads rescue missions. In: 2017 2nd international conference on advanced robotics and mechatronics (ICARM), Hefei and Taian, China, 27–31 August 2017, pp.704–711. New York: IEEE.

Achiam

Adler

Agarwal

, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Touvron

Martin

Stone

, et al. Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Liang

Huang

Xia

, et al. Code as policies: language model programs for embodied control. In: 2023 IEEE international conference on robotics and automation (ICRA), London, UK, 29 May–2 June 2023, pp.9493–9500. New York: IEEE.

Zeng

Attarian

Ichter

, et al. Socratic models: composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022.

Huang

Xia

Xiao

, et al. Inner monologue: embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022.

10.

Ahn

Brohan

Brown

, et al. Do as I can, not as I say: grounding language in robotic affordances. Proc Mach Learn Res 2023; 205: 287–318.

11.

Vemprala

Bonatti

Bucker

, et al. ChatGPT for robotics: design principles and model abilities. IEEE Access 2024; 12: 55682–55696.

12.

Snell

Yang

, et al. Context-aware language modeling for goal-oriented dialogue systems. In: Findings of the association for computational linguistics: NAACL 2022, Seattle, WA, USA, 10–15 July 2022, pp.2351–2366. Kerrville, TX: Association for Computational Linguistics.

13.

Zhao

Yue

Xie

, et al. A dual-agent collaboration framework based on LLMs for nursing robots to perform bimanual coordination tasks. IEEE Robot Autom Lett 2025; 10: 2942–2949.

14.

Huang

Wang

Zhang

, et al. VoxPoser: composable 3D value maps for robotic manipulation with language models. In: Towards generalist robots: learning paradigms for scalable skill acquisition CoRL 2023, Atlanta, GA, USA, 6–9 November 2023, pp.251–266. New York: PMLR.

15.

Wang

Zhang

Chen

, et al. Prompt a robot to walk with large language models. In: 2024 IEEE 63rd conference on decision and control (CDC), Milan, Italy, 16–19 December 2024, pp.1531–1538. New York: IEEE.

16.

Wang

Shi

, et al. Large language models for robotics: opportunities, challenges, and perspectives. J Autom Intell 2024; 2: 30–41.

17.

Huang

Sivapurapu

, et al. ELEGNT: expressive and functional movement design for non-anthropomorphic robot. arXiv preprint arXiv:2501.12493, 2025.

18.

Kress-Gazit

Fainekos

Pappas

GJ.

Translating structured English to robot controllers. Adv Robot 2008; 22(12): 1343–1359.

19.

Matuszek

Herbst

Zettlemoyer

, et al. Learning to parse natural language commands to a robot control system. In: Desai

Dudek

Khatib

, et al. (eds) Experimental robotics: the 13th international symposium on experimental robotics. Springer, 2013, pp.403–415.

20.

Jang

Irpan

Khansari

, et al. BC-Z: zero-shot task generalization with robotic imitation learning. Proc Mach Learn Res 2021; 164: 991–1002.

21.

Lynch

Wahid

Tompson

, et al. Interactive language: talking to robots in real time. IEEE Robot Autom Lett 2023; 5: 1–8.

22.

Mees

Borja-Diaz

Burgard

Grounding language with visual affordances over unstructured data. In: 2023 IEEE international conference on robotics and automation (ICRA), London, UK, 29 May–2 June 2023, pp.11576–11582. New York: IEEE.

23.

Ebert

Yang

Schmeckpeper

, et al. Bridge data: boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021.

24.

Korattikara

Levine

, et al. From language to goals: inverse reinforcement learning for vision-based instruction following. arXiv preprint arXiv:1902.07742, 2019.

25.

Lee

Hwangbo

Hutter

Robust recovery controller for a quadrupedal robot using deep reinforcement learning. arXiv preprint arXiv:1901.07517, 2019.

26.

Siekmann

Godse

Fern

, et al. Sim-to-real learning of all common bipedal gaits via periodic reward composition. In: 2021 IEEE international conference on robotics and automation (ICRA), Xi’an, China, 30 May–5 June 2021, pp.7309–7315. New York: IEEE.

27.

Xia

Martín-Martín

, et al. ReLMoGen: leveraging motion generation in reinforcement learning for mobile manipulation. arXiv preprint arXiv:2008.07792, 2020.

28.

Chiang

HTL

Faust

Fiser

, et al. Learning navigation behaviors end-to-end with AutoRL. IEEE Robot Autom Lett 2019; 4(2): 2007–2014.

29.

Brown

Mann

Ryder

, et al. Language models are few-shot learners. Adv Neural Inf Process Syst 2020; 33: 1877–1901.

30.

Radford

Child

, et al. Language models are unsupervised multitask learners. OpenAI Blog 2019; 1: 9.

31.

Radford

Narasimhan

Salimans

, et al. Improving language understanding by generative pre-training. OpenAI Blog 2018; 1: 3.

32.

Devlin

Chang

M-W

Lee

, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), Minneapolis, MN, USA, June 2–7 2019, pp.4171–4186. Kerrville, TX: Association for Computational Linguistics.

33.

Vaswani

Shazeer

Parmar

, et al. Attention is all you need. Adv Neural Inf Process Syst 2017; 30: 88–98.

34.

Pang

Liu

TransDFL: identification of disordered flexible linkers in proteins by transfer learning. Genomics Proteomics Bioinformatics 2023; 21(2): 359–369.

35.

Pang

Liu

DisoFLAG: accurate prediction of protein intrinsic disorder and its functions using graph-based interaction protein language model. BMC Biol 2024; 22(1): 3–13.

36.

Singh

Blukis

Mousavian

, et al. ProgPrompt: generating situated robot task plans using large language models. In: 2023 IEEE international conference on robotics and automation (ICRA), London, UK, 29 May–2 June 2023, pp.11523–11530. New York: IEEE.

37.

Wang

Gonzalez-Pumariega

Sharma

, et al. Demo2Code: from summarizing demonstrations to synthesizing code via extended chain-of-thought. Adv Neural Inf Process Syst 2023; 36: 14848–14956.

38.

Lin

Agia

Migimatsu

, et al. Text2Motion: from natural language instructions to feasible plans. Auton Robot 2023; 47(8): 1345–1365.

39.

Chen

Arkin

Dawson

, et al. AutoTAMP: autoregressive task and motion planning with LLMs as translators and checkers. In: 2024 IEEE international conference on robotics and automation (ICRA), Yokohama, Japan, 13–17 May 2024, pp.6695–6702. New York: IEEE.

40.

Brohan

Brown

Carbajal

, et al. RT-1: robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.

41.

Gileadi

, et al. Language to rewards for robotic skill synthesis. Proc Mach Learn Res 2023; 229: 374–404.

42.

Kwon

Di Palo

Johns

Language models as zero-shot trajectory generators. IEEE Robot Autom Lett 2024; 7: 6728–6735.

43.

Mirchandani

Xia

Florence

, et al. Large language models as general pattern machines. In: 7th conference on robot learning (CoRL 2023), Atlanta, GA, USA, 6–9 November 2023, pp.2498–2518. New York: PMLR.

44.

Zhou

Schärli

Hou

, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.

45.

Firoozi

Tucker

Tian

, et al. Foundation models in robotics: applications, challenges, and the future. Int J Robot Res 2025; 44(5): 701–739.

46.

Mahowald

Ivanova

Blank

, et al. Dissociating language and thought in large language models. Trends Cogn Sci 2024; 28(6): 517–540.

47.

Bender

Koller

. Climbing towards NLU: on meaning, form, and understanding in the age of data. In: Proceedings of the 58th annual meeting of the association for computational linguistics, Online, 5–10 July 2020, pp.5185–5198. Kerrville, TX: Association for Computational Linguistics.

48.

Bisk

Holtzman

Thomason

, et al. Experience grounds language. arXiv preprint arXiv:2004.10151, 2020.

49.

Carta

Romac

Wolf

, et al. Grounding large language models in interactive environments with online reinforcement learning. Proc Mach Learn Res 2023; 202: 3676–3713.

50.

Matsson

JE.

An introduction to SOLIDWORKS flow simulation 2023. SDC Publications, 2023.

51.

Todorov

Erez

Tassa

. MuJoCo: a physics engine for model-based control. In: 2012 IEEE/RSJ international conference on intelligent robots and systems, Vilamoura-Algarve, Portugal, 7–12 October 2012, pp.5026–5033. New York: IEEE.

52.

Fan

Jin

, et al. Coevolutionary neural dynamics considering multiple strategies for nonconvex optimization. Tsinghua Sci Technol 2025; 2025: 9010120.

Generating joint sequences with pre-trained LLMs to drive dual-arm nursing robot to perform actions

Abstract

Keywords

Introduction

Related work

Dual-arm nursing robot

LLMs-based robots

Proposed method

Problem formulation

Assumptions and constraints

Motion level planning

Observation and action normalization

Grounding LLMs

Prompt engineering

Algorithmic details

Experiments

Action design

Experimental set-up and evaluation metrics

Analysis of experimental results

Reach-to-object

Limitations and future work

The inference speed of the SeIn framework is limited

The textual prompts are still relatively fragile

Conduct research on optimization and convergence

This study was conducted solely in simulation

Conclusion

Footnotes

ORCID iDs

Consent to participate

Consent for publication

Author contributions

Funding

Declaration of conflicting interests

Code availability

References