Abstract
As the key to the movement of the automated guided vehicle (AGV), the design of the control algorithm directly determines whether the AGV can follow the preset path. Aiming at the difficulty of AGV control, an AGV path tracking control method based on global vision and reinforcement learning is proposed. Firstly, the global view is obtained by a vision sensor, and the position information of obstacles and the AGV is extracted by a target detection algorithm. Secondly, a path planning algorithm produces the driving path information, which is used to establish a virtual environment. Thirdly, the position and pose of the physical AGV are introduced into the virtual environment through the vision sensor, and the virtual AGV is reset accordingly. Finally, the image obtained by the virtual vehicle camera is input into the reinforcement learning model, and the output action is sent to the physical AGV for execution. In the experiments, this method not only plans driving paths in different environments but also controls the AGV to drive along the specified path, which shows that the method has strong robustness and feasibility.
Introduction
Automated guided vehicle (AGV) is an intelligent industrial transport device characterized by high autonomy, flexibility, programmable control, and broad applicability, and it is considered one of the most promising technologies. 1,2 It has become an important part of intelligent and automated industrial production and a key piece of equipment in flexible manufacturing systems and automated logistics systems. 3 However, the AGV is a strongly coupled, time-varying nonlinear complex system that is easily disturbed by uncertain factors during operation. This uncertainty can cause the AGV to deviate from the preset path. Therefore, the quality of its control algorithm plays a decisive role in whether the AGV can complete the transportation task.
Current traditional control algorithms can be divided into model-based and model-free algorithms. Model-free algorithms mainly include the classical Proportional-Integral-Derivative (PID) control algorithm and the fuzzy control algorithm. 4 In PID control, Zhang et al. 5 used a PID algorithm to control the real-time speed of the AGV drive motor, improving the path tracking accuracy of the AGV. Wang et al. 6 used a PID control algorithm to construct an AGV steering control system to reduce the center deviation during AGV steering. He et al. 7 used a PID algorithm to adjust the speed of a direct-current (DC) motor to correct the pose of the AGV in real time. However, the learning and adaptive ability of PID control is relatively weak, and the controller may become unstable under external disturbances. Combining PID with other techniques to compensate for its weak adaptability, such as genetic algorithms 8 or particle swarm optimization, 9 may introduce large overshoot or cause the optimization to fall into local optima. In fuzzy control, Wang et al. 10 applied a fuzzy control algorithm to a four-wheel independent steering robot to achieve high-precision tracking control. Awad et al. 11 proposed a multi-input multi-output linear prediction model based on fuzzy control to optimize the steering angle and steering speed. Ren et al. 12 used a fuzzy control algorithm to adjust the AGV forward speed and rotational angular velocity in real time. However, the quality of fuzzy control depends on fuzzy membership rules formulated by experts, the robustness of the resulting model is poor, and many parameters must be re-tuned when the environment changes.
Model-based control mainly includes inversion (backstepping) control and sliding mode control. In backstepping control, Mahgoub et al. 13 designed a robot tracking controller based on the backstepping idea and realized curvature continuity at the path joints. Pan et al. 14 combined a backstepping control algorithm with a nonlinear exponential observer to obtain the motion law and achieve high-precision trajectory tracking. However, backstepping control is applicable only when the system model is known; otherwise there are many limitations. Moreover, its Lyapunov function must be redesigned whenever the system changes, which increases the difficulty of controller design. In sliding mode control, Sun et al. 15 introduced a new reaching-law sliding mode control algorithm to improve the control performance of an inspection robot. Zhang et al. 16 linearized the system through a linear transformation and designed a sliding mode controller to realize reverse path tracking. However, in practice the sliding mode state rarely slides smoothly along the sliding surface toward the equilibrium point; instead, it oscillates on both sides of the surface, causing chattering that affects actual use.
With the development of deep learning technology, reinforcement learning has gradually been applied to various fields. In the AGV field, reinforcement learning is mainly applied to path planning, scheduling, and parameter tuning. Liu et al. 17 proposed an improved Dyna-Q algorithm for AGV path planning in large, complex dynamic environments; the algorithm effectively reduces the path search space and improves the efficiency of obtaining the optimal path. Sagar et al. 18 proposed a real-time AGV scheduling strategy based on reinforcement learning, which uses a novel deep reinforcement double Q-learning method to schedule different AGVs in different states. Sierra-Garcia et al. 19 proposed an intelligent hybrid control scheme that combines reinforcement learning-based control with conventional Proportional-Integral (PI) regulators. In this scheme, PID is used to control the speed of the two AGV wheels, and the PID parameters are adjusted by the reinforcement learning algorithm, which gives the path tracking strong robustness.
In general, few reinforcement learning algorithms directly control the two wheels of an AGV for path tracking. Aiming at the problems of traditional path tracking algorithms, this paper proposes an AGV path tracking control method based on global vision and reinforcement learning for undisturbed indoor environments with varied layouts. The method has strong adaptability and portability and does not require redesigning the control function when the mobile environment or the mobile device is replaced. The experimental results verify that the proposed control method has strong robustness.
The paper is organized as follows: In the section “Path tracking process and reinforcement learning algorithm”, the control method flow and reinforcement learning algorithm are introduced. In the section “Virtual environment building and model training”, the training environment required by reinforcement learning is set up and the reinforcement learning model trained in the virtual environment is retrained in the real environment via transfer learning. In the section “Physical model validation”, the reinforcement learning model is transplanted to the physical AGV to verify the feasibility of the proposed method. Finally, the fifth section concludes this article.
Path tracking process and reinforcement learning algorithm
Path tracking process design
The path tracking process based on global vision and reinforcement learning is shown in Figure 1.

Path tracking flow chart based on global vision and reinforcement learning.
The first step is to obtain the position information of the AGV and obstacles by applying the target detection algorithm to the image captured by the vision sensor.
The second step is to generate a binary map from the AGV and obstacle information obtained in the first step and to use this binary map as the input of the path planning algorithm, which plans the walking path of the AGV.
The third step is to establish a virtual environment based on the planned path information and load the reinforcement learning model to complete the training.
The fourth step is to get the image by the global visual sensor and obtain the position and pose information of AGV by identifying the vehicle marks.
The fifth step is to input the information obtained in the fourth step into the virtual environment. The current virtual vehicle camera image is obtained by resetting the AGV pose and position in the virtual environment. Then the image is input into the reinforcement learning model and the action in the current state is obtained.
The sixth step is to execute the action output in the fifth step on the actual AGV and judge whether the destination has been reached. If so, the program terminates; otherwise, it jumps back to the fourth step. A minimal code sketch of steps four to six is given after this list.
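As a minimal illustration, the loop below sketches steps four to six in Python. The helper functions get_agv_pose and send_wheel_speeds are hypothetical stand-ins for the AprilTag-based pose detection and the wireless link to the physical AGV, and the virtual environment API is likewise assumed, not taken from an existing library.

```python
import numpy as np

def track_path(virtual_env, policy, goal, tol=0.05):
    """Repeat steps four to six until the physical AGV reaches the goal."""
    while True:
        # Step 4: pose = np.array([x, y, theta]) from the global camera
        pose = get_agv_pose()                    # hypothetical helper
        # Step 5: synchronize the virtual AGV and query the policy
        virtual_env.reset_agv(pose)              # hypothetical environment API
        image = virtual_env.render_vehicle_camera()
        action, _ = policy.predict(image, deterministic=True)
        # Step 6: execute on the physical AGV and check termination
        send_wheel_speeds(action)                # hypothetical helper
        if np.linalg.norm(pose[:2] - goal) < tol:
            break                                # destination reached
```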
Selection and design of reinforcement learning algorithm
Theoretical basis of reinforcement learning
Reinforcement learning is a machine learning method that has become popular in recent years. Its basic idea is to obtain reward values by interacting with the environment and to learn from these values. Reinforcement learning mainly involves the agent, environment, action, state, and reward; their relationship is shown in Figure 2. 20

Reinforcement learning diagram.
In any time step, the agent first observes the state $s_t$ of the environment and selects an action $a_t$ according to its policy. The environment executes the action, feeds back a reward $r_t$, and transitions to the next state $s_{t+1}$, which the agent observes at the next time step.
The DeepMind team formally proposed deep reinforcement learning in 2013, implementing end-to-end learning by combining deep learning with reinforcement learning. 22 The rapid development of deep reinforcement learning has produced many excellent algorithms, which are mainly divided into value-based methods and policy-based methods. 23
Value-based reinforcement learning algorithms learn an optimal value function through training and, in each decision step, select the action corresponding to the maximum value. Representative algorithms include Q-learning and State-Action-Reward-State-Action (SARSA). They are characterized by high sample utilization, small variance in the value function estimate, and resistance to local optima, but they are prone to overfitting and are only suitable for discrete action spaces. 24
Policy-based reinforcement learning algorithms train both a value function and a policy function during training and rely only on the policy function afterward. Representative algorithms include the proximal policy optimization (PPO) algorithm, the twin delayed deep deterministic policy gradient (TD3) algorithm, and the deep deterministic policy gradient (DDPG) algorithm. The PPO algorithm has become OpenAI's default algorithm owing to its stable training, simple tuning, and strong robustness. Therefore, it is selected as the AGV path tracking control algorithm in this paper.
Theoretical basis of PPO algorithm
Policy gradient (PG) 25 algorithm is a policy-based method. Its principle is to update the policy parameters $\theta$ by estimating the policy gradient and applying stochastic gradient ascent. The most widely used gradient estimator is shown in Equation (1):

$$\hat{g} = \hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right] \tag{1}$$

where $\pi_\theta$ is the stochastic policy and $\hat{A}_t$ is the estimated advantage at time step $t$.
However, the traditional PG algorithm struggles to obtain good results in practical problems, because the policy update is very sensitive to the choice of step size. The PPO 26 algorithm addresses this problem by replacing the logarithmic probability term of the original PG formula with the ratio of the action probability under the current policy to that under the previous policy. This ratio is shown in Equation (2), and the objective function of the PG is updated to the clipped surrogate objective in Equation (3):

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \tag{2}$$

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right] \tag{3}$$

where $\epsilon$ is the clipping coefficient that limits how far the new policy may move from the old one.
In general, the PPO algorithm first collects the current states, actions, and rewards, then computes the advantage function and the objective function, and finally updates the parameter $\theta$ by stochastic gradient ascent on Equation (3).
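A minimal sketch of the clipped surrogate loss of Equation (3) in Python (PyTorch), written as a loss to be minimized:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Clipped surrogate objective of Equation (3), negated for gradient descent."""
    ratio = torch.exp(log_probs_new - log_probs_old)   # r_t(theta), Equation (2)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```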
PPO network structure construction
The structure of PPO network in this paper is shown in Figures 3 and 4. Figure 3 is the feature extraction network structure layer of PPO algorithm, and Figure 4 is the structure layer of PPO algorithm policy network and value network.

Feature extraction network structure layer by PPO algorithm.

PPO algorithm policy network structure layer (top) and value network structure layer (bottom).
As shown in Figure 3, the input of the feature extraction layer of the PPO algorithm used in this paper is an 84 × 84 × 3 image. After multiple convolution operations and a linear transformation, the network extracts 64 feature values. Figure 4 shows that the PPO policy network and value network take these 64 feature values as input and, after several linear transformations, output respectively the action controlling the AGV movement and the value evaluating the current policy.
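The exact layer sizes are not listed in the figures reproduced here; the sketch below assumes a standard Atari-style convolution stack that maps an 84 × 84 × 3 image to 64 feature values, matching the description above:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """84x84x3 image -> 64 feature values; kernel sizes are assumptions."""
    def __init__(self, features_dim=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened size from a dummy input
            n_flat = self.cnn(torch.zeros(1, 3, 84, 84)).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flat, features_dim), nn.ReLU())

    def forward(self, x):
        return self.linear(self.cnn(x / 255.0))   # normalize pixel values
```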
Virtual environment building and model training
Building virtual environment
Training the reinforcement learning network requires building a virtual environment in which the algorithm runs; in this paper it is mainly used to train the PPO control model. In addition, when the AGV moves, the binary map must be converted into a virtual map and the rotational speed values of the AGV's two wheels must be obtained.
Pybullet is a Python module based on the well-known open source physics engine Bullet, mainly used for the physical simulation of robots, games, visual effects, and machine learning. 27 Pybullet is selected as the virtual environment for reinforcement learning in this paper because it is easy to use and its simulation results carry over well to practice.
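A minimal Pybullet setup along these lines might look as follows; "agv.urdf" is a placeholder for the two-wheel differential model described in the next subsection:

```python
import pybullet as p
import pybullet_data

client = p.connect(p.GUI)          # use p.DIRECT for headless training
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.8)
p.setTimeStep(1.0 / 240.0)
plane = p.loadURDF("plane.urdf")   # ground plane shipped with pybullet_data
agv = p.loadURDF("agv.urdf", basePosition=[0, 0, 0.05])  # placeholder URDF
for _ in range(240):               # step the simulation for one second
    p.stepSimulation()
```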
The construction of virtual environment mainly includes the establishment of agent model, the setting of virtual path, the input and output of reinforcement learning algorithm, and the setting of reward function of training reinforcement learning model. The components are shown in Figure 5.

Composition of virtual environment.
The main function of the virtual AGV is to synchronize the pose information of the physical AGV into the virtual environment through the established model. The main purpose of the virtual path is to synchronize the planned path information into the virtual environment by constructing virtual points and lines. After the virtual AGV and virtual path are set, a reasonable observation must be chosen as the input of the reinforcement learning algorithm, and the output values control the movement of the virtual AGV. Finally, to make the virtual AGV travel along the planned path, a reward function constraining the AGV movement must be set to guide the virtual AGV along the path during training.
Agent modeling
In order to verify the effectiveness of the proposed algorithm, a two-wheel differential car, which moves flexibly but is difficult to control, is selected as the control object in the experiment. Its kinematic model is shown in Figure 6. 28,29

Kinematic simplified model of two-wheel differential AGV.
AGV is driven by two motors independently, and the rotation center is the midpoint between the two wheels. Assuming that the coordinate of the mobile robot in the two-dimensional coordinate system is (x, y), the heading angle with respect to the x-axis is $\theta$, the left and right wheel linear velocities are $v_l$ and $v_r$, and the distance between the two wheels is $L$, the linear velocity $v$ and angular velocity $\omega$ of the AGV are

$$v = \frac{v_r + v_l}{2}, \qquad \omega = \frac{v_r - v_l}{L}$$

and the current position of the AGV is estimated from the motion model

$$\dot{x} = v\cos\theta, \qquad \dot{y} = v\sin\theta, \qquad \dot{\theta} = \omega$$
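Under this model, the pose can be propagated numerically. A minimal odometry sketch, assuming a small time step dt:

```python
import math

def update_pose(x, y, theta, v_l, v_r, track_width, dt):
    """One discrete-time update of the differential-drive model above."""
    v = (v_r + v_l) / 2.0               # linear velocity of the wheel midpoint
    omega = (v_r - v_l) / track_width   # angular velocity about the center
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += omega * dt
    return x, y, theta
```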
According to the above principle, the three-dimensional model is established as shown in Figure 7. The model has three coordinate systems, namely, the base coordinate system, the left-wheel coordinate system, and the right-wheel coordinate system. The establishment of the left-wheel coordinate system and the right-wheel coordinate system is mainly used to control AGV moving in the virtual environment. The establishment of the base coordinate system is used to determine the position of the car and determine whether the car walks according to the specified path.

3D model diagram of AGV.
Virtual path setting
In this paper, after the position information of the AGV and obstacles is obtained, a path planning algorithm is needed to generate the AGV travel path, which is then introduced into the virtual environment. Among common path planning algorithms, the A-Star (A*) 30 algorithm is the most effective direct search method for finding the shortest path in a static road network and is widely used in path planning tasks because of its good performance and accuracy. Therefore, this paper selects it as the planning algorithm.
In path planning, the A* algorithm searches node by node, so the final path is a sequence of connections between nodes. Exploiting this feature, only the coordinates of the path turning points are transmitted to the virtual environment; the effect is shown in Figures 8 and 9, and a sketch of the search and turning-point extraction is given below.
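The following is a generic A* implementation on a 4-connected binary grid (0 = free, 1 = obstacle) with turning-point extraction, not the exact code used in the paper:

```python
import heapq

def astar(grid, start, goal):
    """A* shortest path on a binary grid with a Manhattan heuristic."""
    h = lambda n: abs(n[0] - goal[0]) + abs(n[1] - goal[1])
    open_set, came_from, g = [(h(start), start)], {}, {start: 0}
    while open_set:
        _, cur = heapq.heappop(open_set)
        if cur == goal:                       # reconstruct the node path
            path = [cur]
            while cur in came_from:
                cur = came_from[cur]
                path.append(cur)
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0
                    and g[cur] + 1 < g.get(nxt, float("inf"))):
                g[nxt] = g[cur] + 1
                came_from[nxt] = cur
                heapq.heappush(open_set, (g[nxt] + h(nxt), nxt))
    return None                               # no path exists

def turning_points(path):
    """Keep only the nodes where the path changes direction."""
    keep = [path[0]]
    for a, b, c in zip(path, path[1:], path[2:]):
        if (b[0] - a[0], b[1] - a[1]) != (c[0] - b[0], c[1] - b[1]):
            keep.append(b)
    keep.append(path[-1])
    return keep
```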

Binary graph of path planning.

Import paths to the environment.
Figure 8 is the binary map after A* path planning; the turning point information and the driving route are annotated on the map. Figure 9 shows the virtual environment after loading the path turning points. The thin black line represents the complete path the AGV needs to walk, the thick red line represents the path of the current stage, and the yellow point represents the target point of the current stage. When the AGV reaches the current-stage target point, it checks whether a next-stage target point exists. If so, the next-stage path and target point are activated and path tracking continues, as shown in Figure 10; otherwise, the AGV has reached the final target point and the tracking task is complete.

Reach current phase target point and open next phase target point.
Observation value and output action setting
As the input of the reinforcement learning algorithm, different observations will train different network policies. In this paper, the reinforcement learning model must perform a path tracking task, so the observation should include the path information and target point information. Accordingly, a virtual vehicle camera is used to obtain this information; its structure is shown in Figure 11.

Structure diagram of vehicle vision sensor.
As can be seen from Figure 11, the field of view of the virtual vehicle vision sensor is 60°. To obtain a wider view of the ground ahead, the angle between the optical axis of the vision sensor and the vertical direction is set to 60°. Applying these parameters in the virtual environment gives the effect shown in Figure 12.

Virtual camera shooting mode diagram.
It can be seen from Figure 12 that the set virtual camera can well obtain the path and the target point information of the current stage, which proves the rationality of the set parameters.
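A sketch of how such a camera can be rendered in Pybullet, using the 60° field of view and 60° tilt from vertical described above (the mounting height is an assumption):

```python
import math
import pybullet as p

def vehicle_camera_image(agv_id, width=84, height=84):
    """Render the virtual vehicle camera tilted 60 deg from vertical."""
    pos, orn = p.getBasePositionAndOrientation(agv_id)
    yaw = p.getEulerFromQuaternion(orn)[2]
    eye = [pos[0], pos[1], pos[2] + 0.2]      # camera height: assumed value
    tilt = math.radians(60.0)                 # optical axis angle from vertical
    target = [eye[0] + math.sin(tilt) * math.cos(yaw),
              eye[1] + math.sin(tilt) * math.sin(yaw),
              eye[2] - math.cos(tilt)]
    view = p.computeViewMatrix(eye, target, cameraUpVector=[0, 0, 1])
    proj = p.computeProjectionMatrixFOV(fov=60, aspect=1.0,
                                        nearVal=0.01, farVal=5.0)
    _, _, rgb, _, _ = p.getCameraImage(width, height, view, proj)
    return rgb
```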
The image captured by this virtual camera is taken as the input of the reinforcement learning algorithm, and the output is set as the rotational speed values of the left and right wheels of the AGV. Each value lies in [−1, 1], where the sign represents the direction of wheel rotation and the magnitude represents the wheel speed. The two-wheel speed of the physical AGV is matched to that of the virtual environment by proportional scaling, as sketched below.
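In the virtual environment this mapping can be realized with velocity-controlled wheel joints; the maximum speed and joint indices below are placeholders that depend on the URDF:

```python
import pybullet as p

MAX_WHEEL_SPEED = 10.0  # rad/s; the scaling factor is an assumption

def apply_action(agv_id, action, left_joint=0, right_joint=1):
    """Map a policy output in [-1, 1] to wheel velocity commands."""
    for joint, a in ((left_joint, action[0]), (right_joint, action[1])):
        p.setJointMotorControl2(agv_id, joint, p.VELOCITY_CONTROL,
                                targetVelocity=a * MAX_WHEEL_SPEED)
```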
Training reward setting
The setting of the reinforcement learning reward function mainly specifies and digitizes the target task, so that the agent can obtain the reward value of a selected action and determine the direction of the next policy update. Its design directly determines whether the training model can converge quickly and whether the agent can acquire the correct policy. In the path tracking task of this paper, the reward function integrates the planned path information and the AGV pose information. It is composed of three terms, illustrated in the following diagrams: the distance between the AGV and the current target point, the angle between the AGV axis and the path direction, and the distance between the AGV base point and the path.

AGV distance target point diagram.
When the AGV moves toward the target point, this reward term is positive; when it moves away, the term is negative. This setting effectively ensures that the AGV keeps moving toward the target point.

AGV axis and path angle diagram.
When the angle between the AGV direction and the path direction becomes smaller, this reward term is positive; otherwise it is negative. This setting enables the AGV to travel along the planned path direction during training without large angle changes during movement.

AGV base point and path distance diagram.
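This third term penalizes the distance from the AGV base point to the planned path, keeping the vehicle close to the path while it advances. A minimal sketch of how the three terms might combine is given below; the weights are placeholders, not the values used in the paper:

```python
W_DIST, W_ANGLE, W_PATH = 1.0, 0.5, 0.5   # placeholder weights

def reward(d_prev, d_now, angle_prev, angle_now, path_offset):
    """Three-term reward: approach the target, align with the path, stay on it."""
    r_d = W_DIST * (d_prev - d_now)           # positive when approaching the target
    r_a = W_ANGLE * (angle_prev - angle_now)  # positive when the heading error shrinks
    r_p = -W_PATH * path_offset               # penalize base-point-to-path distance
    return r_d + r_a + r_p
```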
Model training and training results
During the training process, the virtual environment randomly selects a position and posture at the beginning of each round to reset the virtual AGV. Seven target points are then randomly selected as driving nodes in non-overlapping areas, and the polyline connecting these seven points is used as the driving path of the AGV; the effect is shown in Figure 16. The center of the picture is the global visual image, and the left side shows the images captured by the virtual vehicle camera.

Virtual environment information reset diagram.
After simple debugging, the final training parameters adopted by PPO algorithm are shown in Table 1.
Table 1. PPO algorithm training parameters.
Table 2. Basic parameters of the AGV.
The results of training the reinforcement learning model are shown in Figure 17, where the abscissa is the training time step and the ordinate is the loss value. The model is trained for a total of 8 million time steps. In the first 1 million steps, the loss function of the PPO network rises and then falls with oscillation. Between 1 million and 5 million steps, the loss declines slowly. In the final 3 million steps, the loss flattens out, which indicates that the model has converged after 8 million training steps.

Loss curve of PPO algorithm.
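Assuming the environment is wrapped in a gym-style interface, a training run of this length could be launched with Stable-Baselines3 as follows; the library choice and the hyperparameter values are assumptions, not the settings of Table 1:

```python
from stable_baselines3 import PPO

# env: the Pybullet path tracking environment (gym-style wrapper)
model = PPO("CnnPolicy", env,
            n_steps=2048, batch_size=64,
            learning_rate=3e-4, clip_range=0.2,
            verbose=1)
model.learn(total_timesteps=8_000_000)   # 8 million steps, as in Figure 17
model.save("ppo_agv_tracking")
```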
The trained model is then tested in the virtual environment. The globally planned path is drawn as a black line and the actual walking path of the AGV as a blue line; the effect is shown in Figure 18.

Virtual environment test diagram.
Figure 18 shows that the finally trained reinforcement learning model is able to rotate in place, correct deviations, and walk in a straight line. When the vehicle vision sensor sees neither the yellow target point nor the red path, the AGV spins in place to search for them. Once the sensor captures the target point and path information, the AGV begins to move and continuously adjusts its two wheel speeds according to the positions of the target and the path in the image. Finally, the AGV runs along the preset path and arrives at the set point.
Transfer learning
The control policy obtained from the virtual environment model cannot be applied directly to the actual environment, because there are differences between the virtual and actual environments, as well as between the virtual and actual AGV; the actual environment and AGV are more complex and time-varying. Therefore, to ensure the movement accuracy of the actual AGV, this paper transfers the control model trained in the virtual environment to the actual environment for further training.
The transfer learning approach used here is model transfer: the control model trained in the virtual environment continues to be trained in the actual environment. The training method is shown in Figure 19.

Model transfer training diagram.
Firstly, the AGV is randomly placed in the driving area, and a point A and a point B are selected to define its driving path. The AGV is then controlled to drive along this path. After it reaches point B, a new pair of points is selected as the next driving path, so the AGV moves repeatedly while reinforcement learning continuously adjusts the control policy. The final loss value of reinforcement learning in the actual environment is shown in Figure 20.

Model transfer loss curve.
PPO_1 training is carried out first. Its loss value, the orange line in Figure 20, shows a downward trend but does not flatten out. Therefore, a second round of training is continued on the basis of the PPO_1 model, yielding the PPO_2 model. Its loss value, the blue line in the figure, flattens out, indicating that the PPO_2 model has converged.
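With the same assumed Stable-Baselines3 setup, model transfer amounts to reloading the virtual-environment model with the real environment attached and continuing training; the step counts below are illustrative, not the paper's values:

```python
from stable_baselines3 import PPO

# real_env: a gym-style wrapper around steps four to six in the real environment
model = PPO.load("ppo_agv_tracking", env=real_env)
model.learn(total_timesteps=200_000)     # PPO_1 round
model.save("ppo_agv_real_1")
model.learn(total_timesteps=200_000)     # PPO_2 round, continuing from PPO_1
model.save("ppo_agv_real_2")
```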
Physical model validation
In order to verify the reliability and feasibility of the proposed method, a physical AGV is built using an Arduino as the control board, stepper motors as the drives, and an AprilTag marker as the identification board. The AGV is shown in Figure 21.

Physical AGV.
Its basic parameters are shown in Table 2.
Experimental process
Firstly, the global vision sensor is used to obtain the image of AGV driving area, as shown in Figure 22.

Image taken by vision sensor.
Secondly, the obstacle information and vehicle information in Figure 22 are obtained through the target detection algorithm, and a binary map is established from the obstacle information. The binary map is then used as the input map of the path planning task. The starting point is the midpoint between the two wheels of the AGV, and the end point is selected manually. The planning result is shown in Figure 23: Figure 23(a) is the binary graph after path planning, and Figure 23(b) is the physical image with the planned path drawn on it.

Path planning.
The planned path information and AGV pose information are input into the virtual environment, so that the virtual environment can establish virtual path, virtual AGV, and virtual target points similar to the real environment, as shown in Figure 24.

Virtual environment establishment.
Through the global vision sensor, the actual AGV pose and position information is continuously fed into the virtual environment, and the virtual AGV is reset accordingly. The virtual vehicle camera image under this pose is input into the trained reinforcement learning model, and the action output by the model is sent to the actual AGV for execution. By continuously repeating this process, the AGV is controlled to reach the designated position. The effect is shown in Figure 25.

AGV moving diagram.
In Figure 25(a), the blue line is the moving path of the virtual AGV and the red line is the planned path. In Figure 25(b), the black points form the actual moving path of the AGV and the black line is the planned path. Figure 25(b) shows that the AGV reaches the target point along the planned path, which effectively demonstrates the feasibility of the method proposed in this paper.
Different layout experiment
In real life, different application scenarios have different layouts. To verify the robustness of the proposed method, different layouts are simulated in the experiments by varying the positions and number of obstacles.
Maps with one, two, three, and five obstacles are used as the driving environments of the AGV. As the number of obstacles increases, the path the AGV must travel becomes longer and contains more turning points. This setup provides a good test of the robustness and tracking ability of the path tracking algorithm, as shown in Figure 26.

AGV mobility diagram under different layout environments.
From the actual driving path of the AGV in Figure 26, it can be seen that the trained reinforcement learning algorithm has a strong ability to correct deviations and to decelerate at inflection points. The correction ability ensures that when the AGV deviates from the planned path due to external factors, it can adjust its wheel speeds in time and return to the preset path. The deceleration at inflection points ensures the smooth running of the AGV at turning points and its stopping accuracy.
In order to check the movement accuracy of the proposed algorithm, 60 movement experiments are carried out in a single-obstacle environment, and the maximum path displacement deviation and the stop deviation of each experiment are recorded. The results are shown in Figures 27 and 28.

Maximum path displacement deviation.

Stop deviation.
Finally, the average maximum path displacement deviation is 28.6 pixels, and the average stop deviation is 13.9 pixels. The actual deviation distance depends on the resolution of the vision sensor: with the sensor used in this paper, measurement shows that a length of 180 mm occupies 200 pixels, that is, 0.9 mm per pixel. Converting at this 9:10 ratio, the average maximum path displacement deviation is 25.74 mm and the average stop deviation is 12.51 mm. If a vision sensor with higher resolution is used, so that each pixel covers a shorter length, the actual error of the final movement will be smaller. In addition, the AGV movement accuracy is also affected by the reinforcement learning training settings: if the proportion of Rd in the reward is increased, the trained model will show a smaller movement displacement deviation, but this setting also makes model convergence more difficult.
Conclusions
This paper proposes an AGV path tracking control method based on global vision and reinforcement learning. The method first performs target detection on the global image to obtain the obstacle information of the map and the position and orientation of the AGV, then establishes a binary map and carries out path planning. The planned path information is mapped into the virtual environment, and the two-wheel speed values of the AGV are obtained by continuously resetting the virtual AGV. Finally, the speed values are sent to the actual AGV for execution until the AGV reaches the destination.
The method is verified experimentally in the section "Physical model validation." The results show that the proposed method not only completes path planning tasks under different layout environments, but its reinforcement learning algorithm also controls the AGV to travel along the specified path and reach the destination, which demonstrates the method's adaptability and feasibility. Compared with algorithms that require many sensors and routes planned entirely in advance, the proposed method has the advantages of low cost and strong flexibility.
The proposed path tracking technology was evaluated in an undisturbed indoor environment with variable layout. In future work, this technology can serve as a basis for extensions such as adding obstacle detection sensors to achieve mobile obstacle avoidance, or establishing a virtual environment that matches the real environment to achieve automated driving.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Key R&D Project of Zhejiang Province (2022C01242), Longgang Institute of Zhejiang Sci-Tech University (LGYJY2021004), and Zhejiang Provincial General Scientific Research Projects Fund of China (Y202148127).
