Sage Journals: Discover world-class research

Abstract

To achieve efficient and stable autonomous operation during excavator robotic trimming plane operations, this study proposes an online trajectory planning method based on deep reinforcement learning (RL) using the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. This method involves the construction of a simulation environment to generate training data, where the joint angles of the boom, arm, and bucket of the excavator robotic working device serve as state observation variables, and the angle changes of each joint constitute the action information. The interaction between the simulation environment and the autonomous learning algorithm is facilitated by these state observations, and the policy network is trained using a reward function. Under identical experimental conditions, the proposed algorithm exhibits higher training time compared with other RL algorithms designed for continuous action spaces. Specifically, the training time of the proposed algorithm is reduced by 24.81%, 40.29%, and 34.51% compared with those of the DDPG, traditional TD3, and TRPO algorithms, respectively. In addition, the time of the proposed algorithm compared with the DDPG algorithm, TRPO algorithm, and traditional TD3 algorithm, the time required to complete a given task is reduced by 1.807, 3.703, and 5.011 s, respectively. These results indicate that the proposed optimization algorithm offers improved efficiency and faster convergence than the DDPG, traditional TD3, and TRPO algorithms, ultimately generating an efficient task trajectory. Moreover, the method effectively minimizes the large impacts on each joint, ensuring that the excavator robotic system operates with high efficiency and stability.

Keywords

excavator robotic TD3 algorithm online trajectory planning time optimal reinforcement learning

Introduction

Excavators, as widely used mechanical equipment, play a crucial role in industries such as transportation, mineral excavation, forestry, mining, and construction.¹ However, the dangerous and harsh working environments of excavators, coupled with long-term high-intensity operations, pose notable challenges. Traditional manual operation not only strains workers but also increases fatigue risk, leading to low efficiency, unnecessary energy loss, and equipment failure. To address these issues and ensure personnel safety while improving equipment productivity, the intelligent transformation of excavators has become a focal point for researchers worldwide. Autonomous excavators offer considerable economic and social benefits by eliminating safety hazards in dangerous environments, making them suitable for applications in nuclear radiation, space exploration, and underwater operations.² As a result, the automation and intelligence of excavators have become an inevitable development trend.^3,4

Currently, research on intelligent excavators, both domestically and internationally, has made notable progress. At Lancaster University^5–7 in the United Kingdom, advancements include the integration of GPS positioning and laser scanning for location information acquisition and environmental detection, along with hydraulic system modifications controlled by a PC104 computer. Similarly, the Australian Robotics Center^8,9 and Seoul National University in South Korea¹⁰ have developed autonomous excavation systems. Their research focuses on trajectory planning, position control, and trajectory tracking, which satisfy the requirements of intelligent operations.

Shanhe Group, a domestic company, has developed an excavation robot test platform that allows excavators to transition gradually from manual to autonomous operation.¹¹ Zhejiang University successfully created a WY-3.5 experimental excavation robot capable of hierarchical planning and local autonomous control, providing a theoretical foundation for excavators operating under real-world conditions.¹² Based on the above, among the various intelligent technologies for excavators, operation task trajectory planning is a high-level planning technology that plays a pivotal role. This planning technology not only directly influences the working efficiency and energy consumption of the excavator but also ensures the repeatable accuracy of the planned trajectory-an essential factor for maintaining the stability and reliability of equipment during operation. Thus, as a core technology for autonomous excavator operation, effective trajectory planning holds great importance.

In recent years, considerable progress has been made in trajectory planning research. Park et al.¹³ developed a minimum energy-consumption excavation trajectory optimization model in South Korea, considering both the operating range and the excavation environment during the optimization process. This approach resulted in an effective low-energy consumption trajectory. Kim et al.¹⁴ employed B-spline interpolation curves and proposed a recursive geometric algorithm to achieve a minimum time and torque operation trajectory with high reliability and robustness. Yoshida et al.¹⁵ focused on minimizing the energy consumption by creating an excavation material model using the discrete element method, ultimately planning an optimal energy-efficient trajectory while accounting for excavation resistance. Similarly, Bi et al.¹⁶ optimized the time and energy consumption through a staged approach using genetic algorithms to derive a high-efficiency, low-energy trajectory. Wang et al.¹⁷ also used genetic algorithms to refine the polynomial interpolation, achieving an optimal energy-consumption trajectory for excavation tasks. Zhang et al.¹⁸ developed a trajectory planning program that automates the speed and acceleration planning of each hydraulic cylinder by setting the motion trajectory of the bucket teeth, thereby ensuring smoother excavator operation. Zhao et al.¹⁹ proposed a novel trajectory generation method for autonomous excavation teaching. This method converts inefficient and equipment-damaging human-operation trajectories into fast, smooth trajectories and integrates this framework into a complete autonomous excavation platform. On-site environmental validation confirmed the effectiveness of their approach. Fan et al.²⁰ introduced a cubic polynomial S-curve interpolation method for planning multi-objective trajectories and used a multi-objective simplification algorithm based on the decomposition of mixed constraints to optimize constrained multi-objective problems. Zou et al.²¹ optimized the trajectory planning for horizontal and slope excavation operations by using numerical optimization iterations to derive trajectories that meet the constraints and ensure high efficiency. Inner and Kucuk²² employed a particle swarm optimization algorithm to optimize the dexterous workspace of 10 GSP (General Stewart-Gough Platforms) configurations, maximizing the mechanism movement flexibility. Ege and Kucuk²³ proposed an energy-optimization method based on actuator power consumption for a new three-axis robotic knee prosthesis to minimize battery power usage. Kucuk²⁴ applied a particle swarm optimization algorithm to obtain a high-efficiency operational trajectory by optimizing cubic spline interpolation for a full-plane parallel mechanism using time as the optimization target. Simon and Isik²⁵ proposed a trigonometric function parameterization-based robot trajectory generation method to overcome the smoothness and vibration suppression limitations of traditional polynomial trajectories. In summary, all of the above trajectory planning methods employ interpolation functions in principle and use numerical optimization or intelligent algorithms to obtain optimal trajectories. However, these methods primarily rely on offline optimization and are generally designed for static environments. They do not fully account for real-time environmental changes, limiting their flexibility in complex construction scenarios and making it difficult to meet real-time replanning demands in dynamic environments.

With the continuous innovation and rising maturity of artificial intelligence, self-learning algorithms based on reinforcement learning (RL) and deep learning have been recently applied to the research of excavator task trajectory planning in the recent years. Egli et al.^26–28 employed the Trust Region Policy Optimization (TRPO) algorithm to perform high-precision tracking of the target trajectory at the end of the bucket tooth tip of a data-driven excavator robotic. Kurinov et al.²⁹ used the “covariance matrix adaptive” Proximal Policy Optimization (PPO) algorithm to time-optimize the unloading trajectory. Hodel³⁰ employed the TRPO algorithm to optimize a smooth trimming plane operation trajectory. A newly proposed learning function proposed by Yang et al.³¹ introduce an effective method for applying adaptive agent models in reliability assessment. However, the PPO algorithm relies on a complex clipping mechanism, while the TRPO algorithm can lead to overestimation due to its use of a single Q-network, which negatively influences policy stability.

Regarding the problems existing by the above studies, this study focuses on the PC1012 excavator robotic and employs an improved version of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm to determine the time-optimal trajectory for the excavator’s working device, as shown in Figure 1. The input of the framework is the state information of the boom, arm and bucket joints (i.e. the value of each joint angle), while the output corresponds to the action information (i.e. the change in each joint angle). Once the current state information of each joint is input, the data is normalized and passed to the deep RL module. During autonomous neural network training, the environment constructed by the excavator robotic and the target point provides the necessary training data. Specifically, the joint angles and the target angles of the boom, arm, and bucket serve as inputs to the neural network, which then calculates and outputs the corresponding changes in joint angles based on these state variables. The output is evaluated using a reward function that accounts for joint angle constraints, total motion time, and relative distance to the target. The neural network parameters are updated based on these evaluation metrics. Through iterative training and learning, the system ultimately generates an optimal strategy with a high reward value.

Figure 1.

Online trajectory planning framework for excavator robotics based on reinforcement learning.

The main contributions of this study are summarized as follows:

(1) A deep RL algorithm is employed to achieve efficient and stable trajectory planning for autonomous, online trimming plane operations of excavator robotics.

(2) A multiagent system is developed for the boom, arm, and bucket joints, incorporating an adaptive weight sampling mechanism and a centralized training-distributed execution method. The TD3 algorithm is applied to optimize the time-optimal motion trajectory.

(3) A reward function is designed that considers whether the joint angle exceeds its permissible limit as well as the total movement time and the relative distance to the target. The network output is evaluated based on the reward value, which allows parameter correction of the system’s training network. The optimal movement strategy with a high reward value is generated through continuous training and learning.

The remainder of this paper is organized as follows: Section “Multiagent, autonomous-learning trajectory planning” introduces multiagent, autonomous-learning trajectory planning. Section “Excavator working device system modeling” discusses the kinematic modeling of the boom, arm, and bucket joints. Section “Experiment and result analysis” analyzes the experimental results, and Section “Conclusions” concludes the study.

Multiagent, autonomous-learning trajectory planning

For autonomous excavator robotics, planning the movement of each joint of the working device to achieve high efficiency and smooth operation while assuming that the drive and joint spaces are interchangeable is a key challenge. This study focuses on end-to-end trajectory planning task of the excavator robotics, optimizing working time. The boom, arm, and bucket joints are treated as independent decision-making intelligent entities. A deep neural network is used to approximate the trajectory planning strategy for the target task, and the TD3 algorithm derives dynamic decision behaviors that maximize the reward value for these joints. In other words, the time-optimal trajectory formed by the end of the bucket tooth tip is a combined decision sequence of three joints.

TD3 algorithm

It is not easy to accurately design the Q-value function used for evaluation because the boom, arm, and bucket joint operations are continuous actions. Hence, the policy gradient algorithm is employed to solve this problem. TD3 is a deterministic policy gradient algorithm based on the Actor–Critic framework, which directly optimizes the strategy via maximization of expected cumulative reward. And the TD3 algorithm structure framework is shown in Figure 2. The TD3 algorithm contains Actor and Target Actor as actuators, and Critic_0, Critic_1, Target Critic_0, and Target Critic_1 as evaluators, which are referred to as decision networks and estimation networks, respectively.

Figure 2.

Structural framework of the TD3 algorithm.

During the training process, each action of the agent generates experience information $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ , namely current state, action, reward value, and next state, respectively, which is saved as tuples in the experience replay buffer. The Actor network and the Critic_0 and Critic_1 networks update their parameters via random extraction of Minibatch data from the buffer. Of these, the Critic_0 and Critic_1 networks train the network via minimization of the loss function $L_{i}$ :

{\begin{matrix} L_{i} = \frac{1}{T} \sum_{i}^{T} {(y_{i} - Q_{i} (s_{i}, a_{i} | θ^{Q}))}^{2} \\ y_{i} = r_{i} + γ min_{j = 1, 2} {Q^{'}}_{j} (s_{i + 1}, \tilde{a} | θ^{Q_{j}^{'}}) \\ \tilde{a} = π^{'} (s_{i + 1} | θ^{π^{'}}) + ε \end{matrix}

(1)

In equation (1), $L_{i}$ is the loss function of the i-th Critic network, $T$ is the number of sampled data points, $a_{i}$ denotes the action generated by $i$ -th agent actuator as per state $s_{i}$ , $θ^{Q}$ is a parameter that the evaluation network needs to learn, $y_{i}$ cumulative reward of current iteration, $Q_{i}$ is the Critic network (with parameter $θ^{Q}$ ) predicts the Q value of the current state $s_{i}$ and action $a_{i}$ , $r_{i}$ is the immediate reward of current state $s_{i}$ and action $a_{i}$ , $γ$ is a discount factor used to balance the importance of current and future rewards, $Q_{j}^{'}$ is the Critic network (with parameter $Q_{j}^{'}$ ) predicts the Q value of the next state $s_{i + 1}$ and target policy action $\tilde{a}$ , $π'$ is the policy network, $ε$ is the added Gaussian white noise, and $\tilde{a}$ action space after addition of Gaussian white noise (so that the strategy can explore the environment and avoid falling into the local optimum).

The regularization technique of target policy smoothing is employed in the Bellman update of the Actor network to reduce the phenomenon that the deterministic policy method produces high variance target values when updating the Actor. Actor network update gradients:

\nabla_{θ^{π}} J ~ \frac{1}{T} \sum \nabla_{a} Q (s, a | θ_{a}^{Q}) |_{s = s_{i,} a = a_{i}}, \nabla_{θ^{π}} π (s | θ^{π}) |_{s = s_{i}}

(2)

Deep neural networks with parameters $θ^{π}, θ_{a}^{Q}$ , and $θ_{b}^{Q}$ are used to represent the Actor, Critic_0, and Critic_1 networks, respectively, and the target network update mechanism is adopted for each network so as to ensure the stability of the updates of each network. Simultaneously, during the training process, the network adopts a delayed soft update method to update the parameters of the Actor network to those of the Target Actor network and the parameters of the Critic_0 and Critic_1 networks to those of the Target Critic_0 and Target Critic_1 networks, respectively.

Time-optimal trajectory based on TD3 algorithm

In line with the theoretical knowledge provided in Section “TD3 algorithm,” aiming at enhancing the optimization performance of the TD3 algorithm, its sampling mechanism and training method are improved to obtain an efficient autonomous operation trajectory.

Priority sampling mechanism

The sequence data of the boom, arm, and bucket joints that can reach the target point are stored in the experience pool as a four-tuple: $(s_{i}, a_{i}, s_{i + 1}, r_{i + 1})$ , where $s_{i}$ denotes the state information, $a_{i}$ is the agent’s action information, $s_{i + 1}$ is the state information after the agent performs a new action, and $r_{i + 1}$ is the reward value from the action. To address the issue that random sampling does not effectively use high-quality samples and may lead the model into a local optima, an adaptive weighted priority sampling mechanism is adopted. This mechanism ensures convergence of training results as the sample size changes continuously.

First, the temporal difference (TD) error is calculated using a dual Critic network. A priority experience playback sampling probability distribution is then constructed based on the absolute value of the TD error. The adaptive weights increase the influence of the loss function value on the sampling weights. The loss function value for each sample is calculated and sorted in absolute value from large to small. And high error samples are preferentially extracted for training. Secondly, to ensure that the total sampling probabilities sum to 1, the adaptive weight calculation formula is given as follows:

P_{i} = \frac{\frac{1}{i}}{\sum_{i = 1}^{M} \frac{1}{i}}

(3)

In equation (3), $M$ denotes total number of samples for an agent, $i$ is theserial number of the sample, and $P_{i}$ represents the sampling probability of sample $i$ . The equation (3) increases the probability of samples with larger loss function values, allowing boundary reward value samples to influence policy network training. It also balances exploration and utilization by adjusting the probabilities across different samples.

Finally, in the multiagent system, this mechanism forms a closed-loop optimization cycle of “high error sample priority training-weight focus on stubborn samples-update priority distribution.” To facilitate coordinated learning among agents, each agent’s sampling weight is calculated individually. Each agent then selects samplesaccoridng to its own weights to update the policy network parameters.

Centralized training-distributed execution mechanism

In the multiagent system comprising boom, arm, and bucket joints, the state transitions depend on the actions of all agents, and each agent’s reward is influenced by other agents. In other words, changing one agent’s strategy directly influences the optimal decisions and value function estimations of the other agents. Therefore, this study adopts a centralized training-distributed execution architecture for simulation in order to ensure convergence of the multiagent system algorithm, as shown in Figure 3.

Figure 3.

Centralized training-distributed execution architecture.

Since joint actions determine the state transitions and reward value functions of the entire system, and the decisions among agents are coupled, centralized training is implemented. During training, the state $s$ transition of the multiagent environment is jointly influenced by the agent’s actions $a_{1} ~ a_{n}$ . Consequently, a single agent’s strategy change directly infuences the optimal decisions of the other agents and the accuracy of value function estimation. Therefore, the input to the target Critic network during training includes the state information of all agents and their joint actions, and the output value function incorporates guidance information for multiagent cooperation. This approach prevents strategy changes in one agent from negatively influencing the decision of others.

During the execution phase, each agent (boom, arm, and bucket) relies solely on its own observed state (i.e. its joint angle values) and action information, without access to the actions or states of the other agents. To address this limitation, each agent adopts a distributed execution mechanism, in which the information available to a single agent is used as the input to its Actor network, and the output is the corresponding action. This method enables independent decision-making without real-time communication among agents. Once a sufficient number of training iterations have been completed, coordinated behavior is achieved through the trained policies alone, eliminating the need for an additional coordination mechanism. This method effectively compensates for the model’s limited exploration capabilities.

Time-optimal trajectory planning

The core task of this study is to establish a self-learning system that enables the excavator’s working device to autonomously plan the time-optimal trajectory. For the boom, arm, and bucket joint agents, the actuator’s estimation network (Actor) is used for policy iteration and updates. The actuator decision network (Target Actor) interacts with the experience pool for sampling, and its network parameters are regularly updated by the Actor. The evaluators, Critic_0 and Critic_1, iteratively update the value function and calculate the Q-value for the current Actor’s behavior. The evaluator decision networks, Target Critic_0 and Target Critic_1, calculate the global reward, and their parameters are regularly updated from Critic_0 and Critic_1. To ensure efficient operation of the excavator’s working device within the permissible range, the agent’s reward function is defined as follows:

{\begin{matrix} r_{11} = - 10 * | (θ_{2} - (- 53.83)) * (θ_{2} < (- 53.83)) | \\ r_{12} = - 10 * | (θ_{2} - 54.62) * (θ_{2} > 54.62) | \\ r_{21} = - 10 * | (θ_{3} - (- 156.61)) * (θ_{3} < (- 156.61)) | \\ r_{22} = - 10 * | (θ_{3} - (- 32.2)) * (θ_{3} > (- 32.2)) | \\ r_{31} = - 10 * | (θ_{4} - (- 165.4)) * (θ_{4} < (- 165.4)) | \\ r_{32} = - 10 * | (θ_{4} - 14.6) * (θ_{4} > 14.6) | \\ r_{t} = - t - min (D_{t}) \end{matrix}

(4)

r = r_{11} + r_{12} + r_{21} + r_{22} + r_{31} + r_{32} + r_{t}

(5)

In these equations, $θ_{2}, θ_{3}, θ_{4}$ represent angle values of the boom, arm, and bucket joints, respectively; $r_{11}$ and $r_{12}$ represent rewards obtained when the boom joint motion exceeds its permissible motion range. $θ_{2} < - 53.83$ and $θ_{2} > 54.62$ are Boolean expressions, where the result is 0 if the boom joint angle is within it, and 1 if it exceeds the permissible range. Similarly, $r_{21}$ and $r_{22}$ , and $r_{31}$ and $r_{32}$ denote rewards for arm and bucket joint motions exceeding their permissible ranges, and $θ_{3} < (- 156.61)$ and $θ_{3} > (- 32.2)$ , $θ_{4} < (- 165.4)$ and $θ_{4} > 14.6$ are their corresponding Boolean expressions. $D_{t}$ denotes the distance from the current position of the bucket tooth tip to the target endpoint, and $t$ is the total working time.

Because the TD3 algorithm struggles to learn effectively in environments with sparse reward signals due to low exploration efficiency, a dense reward function that considers whether the joint angle exceeds the limit is added to the sparse reward function that only considers the total movement time and the relative distance to the target, thereby speeding up the training of the agent. From equations (4) and (5), we can deduce that the reward decreases when any joint motion exceeds its permissible range. Similarly, a longer motion time and a greater distance between the bucket tooth tip and target point also reduces the reward. Because each joint is an independent agent, each agent’s reward value is the same during interaction with the environment, and the shared evaluation network is influenced by the actions of all agents.

The time-optimal trajectory planning process based on the TD3 algorithm is outlined in Algorithm 1.

Algorithm 1. Optimal trajectory planning for excavator robotics based on TD3 algorithm.
Initialize Critic networks $θ_{a}^{Q}, θ_{b}^{Q}$ and Actor network $θ^{π}$ with random parameters $a, b, π$ Initialize target networks $a^{'} \leftarrow a$ , $b' \leftarrow b$ , $π' \leftarrow π$ Initialize replay buffer $Φ$ for t=1to T do Select action with exploration noise $a ~ θ^{π} (s) + ε$ , $ε ~ N (0, σ)$ and observe reward $r$ and new state $s'$ Store transition tuple $(s, a, r, s')$ in $Φ$ Sample minibatch of $N$ transitions $(s, a, r, s')$ from $Φ$ $\tilde{a} \leftarrow θ^{π'} (s') + ε, ε ~ clip (N (0, σ), - c, c)$ $y \leftarrow r + γ min_{i = 1, 2} Q_{j}^{'} (s', \tilde{a})$ Update Critics $(a, b) \leftarrow \arg min_{(a, b)} N^{- 1} \sum {(y - Q_{(a, b)} (s, a))}^{2}$ if t mod d then Update $π$ via the deterministic policy gradient $\nabla_{π} J = N^{- 1} \nabla_{a} Q (s, a \| θ_{a}^{Q}) \|_{a = θ^{π} (s)} \nabla_{π} θ_{π} (s)$ Update target networks: $\begin{matrix} a' \leftarrow τ a + (1 - τ) a' \\ b' \leftarrow τ b + (1 - τ) b' \\ π' \leftarrow τ π + (1 - τ) π' \end{matrix}$ end if end for

Algorithm 1. Optimal trajectory planning for excavator robotics based on TD3 algorithm.

Initialize Critic networks

θ_{a}^{Q}, θ_{b}^{Q}

and Actor network

θ^{π}

with random parameters

a, b, π

Initialize target networks

a^{'} \leftarrow a

b' \leftarrow b

π' \leftarrow π

Initialize replay buffer

Φ

for t=1to T do
Select action with exploration noise

a ~ θ^{π} (s) + ε

ε ~ N (0, σ)

and observe reward

r

and new state

s'

Store transition tuple

(s, a, r, s')

Φ

Sample minibatch of

N

transitions

(s, a, r, s')

from

Φ

\tilde{a} \leftarrow θ^{π'} (s') + ε, ε ~ clip (N (0, σ), - c, c)

y \leftarrow r + γ min_{i = 1, 2} Q_{j}^{'} (s', \tilde{a})

Update Critics

(a, b) \leftarrow \arg min_{(a, b)} N^{- 1} \sum {(y - Q_{(a, b)} (s, a))}^{2}

if t mod d then
Update

π

via the deterministic policy gradient

\nabla_{π} J = N^{- 1} \nabla_{a} Q (s, a | θ_{a}^{Q}) |_{a = θ^{π} (s)} \nabla_{π} θ_{π} (s)

Update target networks:

\begin{matrix} a' \leftarrow τ a + (1 - τ) a' \\ b' \leftarrow τ b + (1 - τ) b' \\ π' \leftarrow τ π + (1 - τ) π' \end{matrix}

end if
end for

Excavator working device system modeling

Although most robotic studies^32,33 employ the Denavit–Hartenberg method, this study uses the product of exponentials (PoE) formula from screw theory to establish the kinematic model of the excavator. The PoE method was chosen for two distinct advantages: (1) It provides a unified representation of both revolute and prismatic joints through twist coordinates, eliminating the need for assigning separate coordinate frames; (2) Its compact matrix exponential form offers a clear geometric interpretation for multi-degree-of-freedom systems such as excavator arms. As shown in Figure 4, where 1 denotes the slewing platform, 2 the boom joint, 3 the arm joint, and 4 the bucket joint. As shown in the Figure 4, the motion of the boom, arm, and bucket joints is mainly described based on the basic coordinate system $U$ and the end-tool coordinate system $T$ . $ξ_{i}$ $(i = 1, 2, 3, 4)$ represents the motion screw coordinate of $i - th$ joint, $θ_{i}$ angle value of $i - th$ joint, $a_{j}$ $(j = 0, 1, 2, 3, 4)$ rod length of the mechanical arm, and $a_{0} = 0.69$ ; the specific values of the parameters are presented in Table 1.

Figure 4.

Structure diagram of the excavator based on the product of exponentials formula (PoE) method.

Table 1.

Structural parameters of the PC1012 excavator robotic.

Joint ( $/ i$ )	Rod length ( $a_{i} / m$ )	Joint angle range ( $θ_{i} /^{,}$ )
Rotation	0.53	−180, 180
Boom	1.475	−53.83, 54.62
Arm	0.797	−156.61, −32.2
Bucket	0.425	−165.4, 14.6

First, when each joint is at zero position, that is, $θ_{i} = 0 (i = 1, 2, 3, 4)$ , the homogeneous transformation matrix of the coordinate system $T$ relative to the coordinate system $U$ is established as follows:

g_{ST} (0) = [\begin{matrix} 1 & 0 & 0 & a_{1} + a_{2} + a_{3} + a_{4} \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & a_{0} \\ 0 & 0 & 0 & 1 \end{matrix}]

(6)

The rotation coordinate expression for each joint is expressed as follows:

ξ_{i} = [\begin{matrix} ω_{i} \\ ν_{i} \end{matrix}] i = 1, 2, 3, 4

(7)

where $ω_{i}, ν_{i}$ denote the motion spinor, $ω_{i} \in R^{3}$ the unit vector on the rotation axis of $i - th$ joint; $ω = (ω_{x}, ω_{y}, ω_{z})^{T} \in R^{3}$ the unit vector in the moving direction of $i - th$ joint; and $q_{i} \in R^{3}$ the unit vector on the rotation axis of $i - th$ joint. Notably, $ν_{i} = ω_{i} \times q_{i}$ .

Second, the position description of the coordinate system $T$ relative to the coordinate system $U$ can be obtained by combining the rotation motions of the boom, arm, and bucket joints:

g_{ST} (θ) = e^{\overset{\land}{ξ_{1}} θ_{1}} e^{\overset{\land}{ξ_{2}} θ_{2}} e^{\overset{\land}{ξ_{3}} θ_{3}} e^{\overset{\land}{ξ_{4}} θ_{4}} g_{ST} (0)

(8)

where $θ_{i}$ denotes rotation of $i - th$ joint; ${\hat{ξ}}_{i}$ kinematic spin of $i - th$ joint; ${\hat{ξ}}_{i} = [\begin{matrix} \overset{\land}{ω} & ν \\ 0 & 0 \end{matrix}] \in se (3)$ the Lie algebra se(3) corresponding to the special Euclidean group SE(3), where $\overset{\land}{ω} = [\begin{matrix} 0 & - ω_{z} & ω_{y} \\ ω_{z} & 0 & - ω_{x} \\ - ω_{y} & ω_{x} & 0 \end{matrix}] \in so (3)$ represents the Lie algebra so(3) corresponding to the three-dimensional (3D) special orthogonal group SO(3).

Finally, using equations (6)–(8) and Chasles’ theorem, the forward kinematic equation of the excavator’s working device is calculated as follows:

g_{ST} (θ) = [\begin{matrix} R (θ) & P (θ) \\ 0 & 1 \end{matrix}]

(9)

where $R (θ) = [\begin{matrix} c_{1} c_{234} & - c_{1} s_{234} & s_{1} \\ s_{1} s_{234} & - s_{1} c_{234} & - c_{1} \\ s_{234} & c_{234} & 0 \end{matrix}]$ denotes the posture matrix of the working device and $P (θ) = [\begin{matrix} c_{1} (a_{4} c_{234} + a_{3} c_{23} + a_{2} c_{2} + a_{1}) \\ s_{1} (a_{4} c_{234} + a_{3} c_{23} + a_{2} c_{2} + a_{1}) \\ a_{4} s_{234} + a_{3} s_{23} + a_{2} s_{2} + a_{0} \end{matrix}]$ the $x, y, z$ coordinate values of the end of the bucket tooth tip in the coordinate system $U$ . Here, $c_{1} = \cos θ_{1}$ , $c_{2} = \cos θ_{2}$ , $s_{23} = \sin (θ_{2} + θ_{3})$ , $s_{1} = \sin θ_{1}$ , $s_{2} = \sin θ_{2}$ , $c_{234} = \cos (θ_{2} + θ_{3} + θ_{4})$ , $s_{234} = \sin (θ_{2} + θ_{3} + θ_{4})$ , and $c_{23} = \cos (θ_{2} + θ_{3})$ .

Inverse kinematics

For the trajectory planning task of autonomous operation of excavator robotics, under the premise of a given target point, motion curves of the boom, arm, and bucket joints that meet the optimization objectives and constraints are designed in the joint space, such that the end of the bucket tooth tip forms a corresponding operation trajectory. The numerical analysis method³⁴ is used for kinematic inverse solution in order to realize transformation from the target point at the end of the bucket tooth tip to the angle values of each joint of the working device in the early stage of task planning:

\tan θ_{1} = \frac{y}{x} \Rightarrow θ_{1} = \arctan 2 (\frac{y}{x}) r

(10)

\begin{matrix} θ_{2} = \arctan 2 \\ [\frac{(a_{2} + a_{3} c_{3}) (z - a_{0} - a_{4} s_{234}) - a_{3} s_{3} (\frac{x}{c_{1}} - a_{1} - a_{4} c_{234})}{(a_{2} + a_{3} c_{3}) (\frac{x}{c_{1}} - a_{1} - a_{4} c_{234}) + a_{3} s_{3} (z - a_{0} - a_{4} s_{234})}] \end{matrix}

(11)

\cos θ_{3} = \frac{{(\frac{x}{c_{1}} - a_{1} - a_{4} c_{234})}^{2} + {(z - a_{0} - a_{4} s_{234})}^{2} - a_{2}^{2} - a_{3}^{2}}{2 a_{2} a_{3}}

(12)

\sin θ_{3} = \pm \sqrt{1 - \cos^{2} θ_{3}}

(13)

θ_{3} = \arctan 2 (\sin θ_{3}, \cos θ_{3})

(14)

θ_{4} = ζ - θ_{2} - θ_{3}

(15)

where $(x, y, z)$ denotes coordinates of the end of the bucket tooth tip; $ζ$ attitude angle of the end of the bucket tooth tip; and $θ_{1}, θ_{2}, θ_{3}, θ_{4}$ the angle values of the slewing platform, boom, arm, and bucket joints, respectively.

Experiment and result analysis

An autonomous ground trimming plane operations by the PC1012 excavator robotic was considered as the optimization target. This study then selected the target point of the operation and the corresponding kinematic inverse solution results (see Table 2). The operating system environment was Windows10 x64, and the software toolkit version used was TensorFlow 2.1.0. The hardware information is as follows: Intel i5-9600K, GTX1060, DDR4 16 GB, 240 GB SSD. Simulation verification and data processing were implemented in the MATLAB 2022b environment.

Table 2.

Transformation of target points from pose space to joint space.

Target point $(/ m)$	Boom joint $(/^{o})$	Arm joint	Bucket joint $(/^{o})$
(2.55, 0, 0)	14.7737	−73.4659	−6.3078
(1.05, 0, 0)	2.6254	−147.6339	10.0085

Model parameter configuration

Prior to using the TD3 algorithm to train the agent, the essential elements of the model were defined.

State $s$ : Joint angle change value, and target angle value of the action information outputted by the strategy network; boom, arm, and bucket joint angles that can reach the target pointare defined as observation information. This information is used as the input of the strategy network and normalized.

Action $a$ : Output of the policy network is defined as the value of change in each joint angle.

Reward function $r$ : The reward comprises two parts:

(a) the first part considers whether each joint angle movement exceeds its allowed range of motion;

(b) the second part takes into consideration total time to complete the task and the distance between current bucket tooth tip end point and the given target point.

Network design: The Actor and Critic network structures are essentially the same, sampling the fully connected network with a double hidden layer structure; the hidden layer contains 512 neurons, and the rectified linear unit function is used as the activation function. Of these, the Actor network receives normalized state observation information, and after passing the fully connected layer, the Softmax function is set as the last layer of the neural network to convert the output result into the change value of each joint angle, as shown in Figure 5; the Critic network outputs a 1D state value function, as shown in Figure 6.

Hyperparameter setting: Sample Adam network optimizer with a learning rate of 0.00015, a batch size of 256, an experience library capacity set to 5000, and an initial number of training samples of 2000.

Figure 5.

Structure diagram of the Actor neural network.

Figure 6.

Structure diagram of the Critic neural network.

Analysis of experimental results

Since there exists no unified standard for evaluation of quality of RL algorithms, this study performed evaluations from two aspects:

(1) Trend of the reward value curve: the larger the reward value, the better the algorithm; the faster the curve converges, the better the convergence of the algorithm.

(2) Final optimization result of the algorithm, that is, time required for each joint to complete the task; the higher the efficiency of completing the task, the better the algorithm’s performance.

Therefore, this study used the deep deterministic policy gradient (DDPG) algorithm, TRPO algorithm, traditional TD3 algorithm, and improved TD3 algorithm suitable for solving continuous action space in RL to achieve time-optimal trajectory for the excavator.

For a comprehensive evaluation, two key indicators were used: (1) training efficiency, measured by the number of iterations required for convergence; and (2) task completion time, measuring how quickly the algorithm guides the system to complete the task. Under identical environmental conditions, including the same state and action space dimensions and hyperparameter configurations, the four algorithms-DDPG, TRPO, traditional TD3, and the improved TD3-were each applied to obtain the time-optimal trajectory of the excavator robotic. The training times for each are shown in Table 3. As illustrated, the improved TD3 algorithm achieved a training time that was 24.81% shorter than DDPG, 40.29% shorter than traditional TD3, and 34.51% shorter than TRPO. These results demonstrate that the improved TD3 algorithm developed in this study offers significantly shorter training time and higher optimization efficiency.

Table 3.

Training time of various algorithms.

Algorithm	Training time $/ h$
DDPG algorithm	27
TRPO algorithm	31
Traditional TD3 algorithm	34
Improved TD3 algorithm	20.3

DDPG: deep deterministic policy gradient; TD3: twin delayed deep deterministic policy gradient algorithm; TRPO: trust region policy optimization.

Furthermore, based on the average reward value curves presented in Figure 7, the improved TD3 algorithm achieved convergence by the 600th iteration, while DDPG, TRPO, and traditional TD3 converged at the 986th, 1500th, and 1495th iterations, respectively. As training progressed, the reward values of all four algorithms gradually increased and eventually stabilized. This trend confirms that, through continuous interaction between the policy network and the environment, the reward function effectively guided parameter updates, enabling all networks to converge toward an optimal strategy. Notably, the average reward value achieved by the improved TD3 algorithm not only stabilized earlier but also reached a higher average reward value compared with the other three algorithms. Thus, the improved TD3 algorithm demonstrated not only shorter training time and greater optimization efficiency, but also lower reward fluctuation, higher reward values, and an overall more efficient training process.

Figure 7.

Comparison of algorithms with respect to reward value curves.

Finally, the DDPG algorithm, TRPO algorithm, traditional TD3 algorithm, and the improved TD3 algorithm, with its optimal strategy as obtained in this study, were selected to solve the angle change curve of autonomous operation of each joint, as shown in Figures 8 to 11.

Figure 8.

Joint angles optimized via the deep deterministic policy gradient (DDPG) algorithm.

Figure 9.

Joint angles optimized by the TRPO algorithm.

Figure 10.

Joint angles optimized by the traditional TD3 algorithm.

Figure 11.

Joint angles optimized by the improved TD3 algorithm.

Figures 8 to 11 show that the planning results of the four algorithms exhibit jitter in the early stage. This is likely due to the low probability of the incorrect decision actions being output by the agent during the initial stage, leading to the jitter effect. However, as the training progresses, the curves gradually become smoother due to the balance between exploration and utilization. In comparison, the joint angle curve produced by the improved TD3 algorithm, trained in this study, is smoother, with each joint making a small change to the target point, which helped protect the hydraulic drive device. In addition, the improved TD3 algorithm completed the ground leveling task in 6.925 s, while the DDPG algorithm took 8.732 s, the TRPO algorithm took 10.628 s, and the traditional TD3 algorithm took 11.936 s, demonstrating that the trajectory planned by the proposed method is more efficient.

In light of the training time, reward value, and joint angle motion curve of each algorithm, the DDPG algorithm with shorter training time, larger reward value, and smoother joint angle is selected, as well as the optimal strategy of the improved TD3 algorithm obtained in this paper. Under the condition of known operation time, the joint angle values generated by the optimal strategy were substituted into the excavator robotics model, with the trajectory generated by the end of the bucket tooth tip shown in Figure 12. The figure shows that the time-optimal trajectory obtained upon training the DDPG algorithm is relatively rough, potentially causing large impacts on the boom, dipper, and bucket joints. In contrast, the operation trajectory generated by the improved TD3 algorithm was not only efficient but also continuous and smooth. The experimental results exceeded those of traditional reinforcement learning algorithms and traditional planning algorithms, with good performance and stable results, verifying the effectiveness of the algorithm framework.

Figure 12.

Trimming plane operation comparison chart.

Conclusions

Based on the establishment of kinematic equations for the excavator robotic working device using screw theory, this study proposes a trajectory planning method for autonomous learning in a multiagent system. The TD3 algorithm-an adaptive weight sampling mechanism and a centralized training-distributed execution framework-as used to train the neural network. By comparing optimization results in continuous action space among the DDPG, TRPO, traditional TD3, and the improved TD3 algorithms, the following findings were obtained. First, the training time of the improved TD3 algorithm was 24.81% shorter than that of DDPG, 40.29% shorter than that of the traditional TD3, and 34.51% shorter than that of TRPO. In addition, it achieved the highest average reward value. Second, the time of the proposed algorithm compared with the DDPG algorithm, TRPO algorithm, and traditional TD3 algorithm, the time required to complete a given task is reduced by 1.807, 3.703, and 5.011 s, respectively. These results demonstrate that the improved TD3 algorithm developed in this study converges more quickly and efficiently to a time-optimal trajectory, enabling efficient and stable autonomous operation of the excavator robotic.

This study primarily addresses optimal trajectory planning based on kinematic modeling, without considering the system’s dynamic model or corresponding energy consumption optimization. Future work can incorporate dynamic modeling to optimize trajectories that account for time and energy consumption, ultimately enabling multiobjective optimal trajectory planning for excavator robotics.

Footnotes

Handling Editor: Divyam Semwal

ORCID iD

Yunyue Zhang

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the National Natural Science Foundation of China (Grant No. 61905172), General Project of Shanxi Provincial Basic Research Program (Grant No. 202303021211169), Key Research and Development Plan of Shanxi Province (Grant No. 202202150401007), Shanxi Province Science and Technology Cooperation and Exchange Project (Grant No. 202304041101001).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Dang

Dinh

, et al. Developments in energy regeneration technologies for hydraulic excavators: a review. Renew Sustain Energy Rev 2021; 145(3): 111076.

Rubenstone

. Komatsu bringing intelligent excavators to North America. Eng News Rec 2014; 273: 21.

Fan

Yang

, et al. Research status and development trend of intelligent excavators. J Mech Eng 2020; 56: 14.

Yoon

. A review on mechanical and hydraulic system modeling of excavator manipulator system. J Constr Eng 2016; 2016: 1–11.

David

Derek

. The development, control and operation of an autonomous robotic excavator. J Intell Robot Syst 1998; 21: 73–75.

Seward

Margrave

Sommerville

, et al. LUCIE the robotic excavator-design for system safety. In: Proceedings of IEEE international conference on robotics and automation, Minneapolis, MN, USA, 22–28 April 1996, pp.963–968. New York: IEEE.

Shaban

Taylor

Chotai

. State dependent parameter proportional-integral-plus (SDP-PIP) control of a nonlinear robot digger arm. In: UKACC international conference on control, Bath, Avon, England, September 2004, pp.458. New York: IEEE.

Nguyen

Rye

, et al. Impedance control of a hydraulically actuated robotic excavator. Autom Constr 2000; 9: 421–435.

Nguyen

Rye

, et al. Force/position tracking for electrohydraulic systems of a robotic excavator. In: Proceedings of the 39th IEEE conference on decision and control (CAT. No. 00CH37187), Sydney, NSW, Australia, 12–15 December 2000, pp.5224–5229. New York: IEEE.

10.

Yoo

Kim

. Development of a 3D local terrain modeling system of intelligent excavation robot. KSCE J Civil Eng 2017; 21: 214–221.

11.

Gao

Feng

Peng

, et al. Phased power matching control technology for hydraulic excavator. J Harbin Eng Univ 2017; 38: 1461–1469.

12.

Pan

Tong

. Excavator trajectory planning method based on control stability. J Zhejiang Univ (Eng) 2006; 40: 1311–1314.

13.

Park

Bae

Hong

. A path planning for autonomous excavation based on energy function minimization. J Korean Soc Precis Eng 2010; 27: 76–83.

14.

Kim

Kang

, et al. Dynamically optimal trajectories for earthmoving excavators. Autom Constr 2013; 35: 568–578.

15.

Yoshida

Koizumi

Tsujiuchi

, et al. Digging trajectory optimization by soil models and dynamics models of excavator. SAE Int J Commer Veh 2013; 6: 429–440.

16.

Wang

, et al. Digging trajectory optimization for cable shovel robotic excavation based on a multi-objective genetic algorithm. Energies 2020; 13: 3118.

17.

Wang

Liu

Sun

, et al. Multidisciplinary and multi-fidelity design optimization of electric vehicle battery thermal management system. J Mech Des 2018; 140: 094501.

18.

Zhang

Wang

Liu

, et al. Research on trajectory planning and autodig of hydraulic excavator. Math Probl Eng 2017; 2017: 1–10.

19.

Zhao

Liu

, et al. Spline-based optimal trajectory generation for autonomous excavator. Machines 2022; 10: 538.

20.

Fan

Yang

. Multi-objective trajectory optimization of intelligent electro-hydraulic shovel. Front Mech Eng 2023; 17: 50.

21.

Zou

Chen

Pang

. Task space-based dynamic trajectory planning for digging process of a hydraulic excavator with the integration of soil–bucket interaction. Proc IMechE, Part K: J Multi-body Dynamics 2019; 233: 598–616.

22.

Inner

Kucuk

. A dexterous workspace optimization for ten different types of General Stewart-Gough platforms. In: Küçük

(ed.) Recent advances in robot manipulators. Intech Open, 2022.

23.

Ege

Kucuk

. Energy minimization of new robotic-type above-knee prosthesis for higher battery lifetime. Appl Sci 2023; 13: 3868.

24.

Kucuk

. Maximal dexterous trajectory generation and cubic spline optimization for fully planar parallel manipulators. Comput Electr Eng 2016; 56: 634–647.

25.

Simon

Isik

. Optimal trigonometric robot trajectories. Robotica 1991; 9: 379–386.

26.

Egli

Hutter

. Towards RL-based hydraulic excavator automation. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021, pp.2692–2697. New York: IEEE.

27.

Egli

Hutter

. A general approach for the automation of hydraulic excavator arms using reinforcement learning. IEEE Robot Autom Lett 2022; 7: 5679–5686.

28.

Egli

Gaschen

Kerscher

, et al. Soil-adaptive excavation using reinforcement learning. IEEE Robot Autom Lett 2022; 7: 9778–9785.

29.

Kurinov

Orzechowski

Hamalainen

, et al. Automated excavator based on reinforcement learning and multibody system dynamics. IEEE Access 2020; 8: 213998–214006.

30.

Hodel

. Learning to operate an excavator via policy optimization. Procedia Comput Sci 2018; 140: 376–382.

31.

Yang

Meng

Wang

, et al. A novel learning function for adaptive surrogate-model-based reliability evaluation. Philos Trans R Soc A 2024; 382: 20220395.

32.

Kucuk

. Kinematics, singularity and dexterity analysis of planar parallel manipulators based on DH method. In: Lazinica

Kawai

(eds) Robot manipulators new achievements. Intech Open, 2010, pp.387–400.

33.

Ayiz

Kucuk

. The kinematics of industrial robot manipulators based on the exponential rotational matrices. In: IEEE international symposium on industrial electronics, Seoul, Korea (South), 5–8 July 2009, pp.977–982. New York: IEEE.

34.

Küçük

. The inverse kinematics of a new planar hybrid robot manipulator. In: Küçük

(ed.) Exploring the world of robot manipulators. Intech Open, 2024, pp.53–64.

Time-optimal online trajectory planning for excavator robotics based on reinforcement learning

Abstract

Keywords

Introduction

Multiagent, autonomous-learning trajectory planning

TD3 algorithm

Time-optimal trajectory based on TD3 algorithm

Priority sampling mechanism

Centralized training-distributed execution mechanism

Time-optimal trajectory planning

Excavator working device system modeling

Inverse kinematics

Experiment and result analysis

Model parameter configuration

Analysis of experimental results

Conclusions

Footnotes

ORCID iD

Funding

Declaration of conflicting interests

References