Abstract
Deep reinforcement learning methods have been applied to mobile robot navigation to find the optimal path to a target. Rewards are usually given only when the task is completed, which can drive training into local optima and seriously affects the training efficiency and navigation performance of the mobile robot. To this end, this paper proposes an intrinsic reward mechanism, consisting of an intrinsic curiosity module and a randomness enhanced module, combined with the TD3 (twin-delayed deep deterministic policy gradient) reinforcement learning algorithm for mobile robot navigation. The mechanism effectively resolves the slow convergence caused by sparse rewards in continuous action spaces, encourages the mobile robot to explore unknown areas, and reduces the occurrence of local optima. Experimental results show that the proposed navigation method significantly improves training efficiency: out of 1000 test episodes, only 3 exceeded the maximum step limit, markedly reducing the occurrence of local optima, and the success rate rises to 83.5%, outperforming existing navigation methods.
Introduction
Recent advancements in robotics technology 1 have led to a surge of robot applications across various domains, with mobile robots emerging as essential tools in both daily life and industrial operations. 2 Navigation is one of the critical technologies for intelligent mobile robots. Traditional navigation methods for robots struggle to adapt to new environments, and their performance typically degrades as the complexity of the tasks increases. 3 In recent years, researchers have turned to deep reinforcement learning (DRL) 4 algorithms to improve the performance of mobile robot navigation. By leveraging the inherent trial-and-error mechanism in DRL, robots can learn to perceive and navigate in unknown environments. 5 Through iterative refinement guided by reward functions, the optimal path can be obtained to reach the target. 6
However, DRL-based navigation 7 for mobile robots is challenging when operating in unexplored regions with limited rewards, and the robot frequently becomes trapped in local optima. 8 This issue stems from the scarcity of consistent rewards during the process of completing the task. 9 Such reward sparsity severely hinders the robot's ability to obtain meaningful feedback during exploration. Consequently, it is crucial to design a mechanism that provides more consistent rewards and thereby improves the performance of mobile robot navigation in environments with infrequent rewards.
To address these challenges, this paper proposes an intrinsic reward mechanism consisting of an intrinsic curiosity module (ICM) and a randomness enhanced module (REM), which enables robots to autonomously navigate to their destination in unknown environments. The TD3 (twin-delayed deep deterministic policy gradient) algorithm, an extension of the deep deterministic policy gradient (DDPG) algorithm, serves as the foundational navigation framework and handles continuous action spaces. The ICM computes the difference between the predicted next state and the actual next state to provide an intrinsic reward signal that guides the robot's exploration of unknown regions. However, because environmental noise inevitably disturbs the robot's prediction of the next state, the intrinsic rewards are not always accurate. This imprecision limits the intrinsic reward mechanism and occasionally leads the robot into local optima. To address this, this paper proposes the REM, which estimates the randomness of unknown regions of the environment and provides an additional reward signal for exploration. This reduces the emergence of local optima traps and enhances the exploration efficiency of the algorithm.
Additionally, a novel reward function is designed to encourage the mobile robot to develop better navigation strategies within its environment. During training, the mobile robot obtains external rewards from its state within the environment and internal rewards generated by the intrinsic reward mechanism. Integrating the two increases the reward received during the training phase, promoting sustained exploration and learning in subsequent trials. The combined use of the intrinsic reward mechanism and the new reward function significantly enhances the efficiency and robustness of the DRL algorithm, ensuring that the mobile robot actively explores unknown regions and thereby strengthening its capability to navigate in complex environments.
The main contributions of this paper are summarized as follows:
1. By integrating the TD3 reinforcement learning algorithm with the ICM for mobile robot navigation, the intrinsic reward mechanism effectively addresses the slow convergence caused by sparse rewards in continuous action spaces and motivates the mobile robot to explore unknown regions.
2. The REM is designed on the foundation of the ICM. By estimating the randomness of unknown regions of the environment, it amplifies the influence of intrinsic rewards on the navigation strategy and reduces the chance of the mobile robot becoming trapped in local optima.
3. A novel reward function that combines external and internal rewards is designed to enhance the mobile robot's navigation strategy. This integrated reward function increases the rewards received during training, motivating continued exploration and learning in subsequent trials.
Related work
Algorithms for mobile robot navigation are categorized into global and local strategies based on the acquired environmental data. 10 Global path planning chooses a path using a known static environmental map, where methods such as A-star 11 and rapidly exploring random trees 12 are less effective in unknown environments. Local path planning methods such as the dynamic window approach 13 are commonly used to navigate in dynamic environments. However, traditional navigation algorithms are sensitive to sensor noise and demand high sensing accuracy, 14 which limits robots in complex environments.
Following the success of DRL in the “Man vs. Machine” Go match, researchers overcame the limitations of conventional obstacle avoidance methods by employing DRL techniques for autonomous navigation tasks. 15 This approach expanded navigation capabilities to complex tasks, offering advantages such as map-free navigation, strong learning capability, and reduced reliance on sensor accuracy. 16 Navigation techniques based on DRL algorithms frame the navigation process as a Markov decision process. Neural networks are employed to process sensor data, utilizing sensor observations as the state representation. To maximize the expected reward of actions, optimal strategies are generated through interactions with the environment that guide the robot to its target location.
DRL networks can generate action policies for mobile robots in continuous action spaces and can be applied to local obstacle avoidance scenarios. For example, Feng et al. 17 developed a local path planning approach based on the double deep Q-network (DDQN) reinforcement learning algorithm, combining DDQN with topology-based global planning. Wang et al. 18 proposed an improved method based on DDPG. These DRL techniques have demonstrated remarkable performance in local obstacle avoidance scenarios. However, because the available environmental information is local, these methods are susceptible to local optima, especially in unknown regions, resulting in longer training times. To address the lack of global information, Pokle et al. 19 proposed a hierarchical motion planning approach that divides navigation into local planning, global planning, and a velocity control module. Recent advances in reinforcement learning have further addressed exploration challenges. Savinov et al. 20 used episodic memory to create novelty bonuses by comparing current observations with past ones based on the number of steps needed to reach them; this technique incorporates environment dynamics and mitigates issues where agents exploit actions leading to unpredictable consequences. Hafez et al. 21 proposed a behavior self-organization approach that supports task inference for continual robot learning by performing unsupervised learning of behavior embeddings. Burda et al. 22 and Wu et al. 23 introduced an exploration bonus based on the error of a neural network predicting features of observations produced by a fixed, randomly initialized network, which enhances exploration in complex tasks. Zhao et al. 24 used sound as a modality to guide exploration and improve representation learning in unsupervised reinforcement learning. Sekar et al. 25 leveraged self-supervised world models to plan for and seek out expected future novelty, improving both exploration and fast adaptation to new tasks. Despite the integration of local and global planning, the computed path does not account for the robot's exploration and perception of its environment. In summary, although DRL-based methods show potential for improving robot navigation skills and applicability, 26 challenges remain in overcoming local optima, achieving comprehensive exploration, and navigating in unknown environments.
Method
Inspired by human curiosity, this paper proposes an intrinsic reward mechanism and integrates it into the autonomous navigation of mobile robots. This integration aims to improve the robot's capability to explore its environment and to address the challenge of limited rewards in reinforcement learning. Unlike traditional exploration methods, the intrinsic reward mechanism closely mirrors human cognitive processes, producing a more deliberate exploration strategy and improving learning effectiveness. The paper begins with an analysis of the intrinsic reward mechanism, followed by its integration with the TD3 DRL algorithm. It then analyzes the curiosity module and the REM. Finally, the design of a reward function combining external and internal rewards is presented. This comprehensive approach ultimately leads to a significant improvement in navigation efficiency.
Analysis of intrinsic reward mechanism
The intrinsic reward mechanism involves an internally generated reward signal within the mobile robot that evaluates the quality of its current behavior. Traditional DRL-based navigation methods usually offer rewards only when the robot takes correct actions and penalize it otherwise. Most of these methods focus solely on the outcome of the task, overlooking details such as speed, direction, and exploration of the environment during movement. This reduces learning efficiency and frequently traps the robot in local optima. In contrast, the intrinsic reward mechanism guides the robot's learning process even when external rewards are absent, thereby enhancing both learning efficiency and navigation performance.
During the training phase, the intrinsic reward mechanism is commonly designed based on state-value rewards. State-value rewards assign a numerical value to each state, allowing the mobile robot to assess the current state and the expected outcomes of different actions. These state-value rewards act as intrinsic signals, effectively guiding the robot's learning process and improving learning efficiency. By motivating the robot to explore unknown regions, the intrinsic reward mechanism promotes the accumulation of additional environmental knowledge, ultimately boosting the anticipated long-term rewards.
Figure 1 depicts the intrinsic reward module. Here, S signifies the environmental state, a denotes the mobile robot's action, and R stands for the intrinsic reward. The intrinsic reward is determined by assessing the difference between the predicted subsequent state and the actual subsequent state.

The intrinsic reward module.
Mobile robot navigation based on intrinsic reward mechanism
This section introduces a mobile robot navigation approach that combines the ICM and REM with the TD3 DRL algorithm. The method uses a laser rangefinder covering a 180° frontal range, together with the robot's polar coordinates, as the input state. The sensor focuses on the forward region, and the linear velocity is limited to non-negative values, excluding backward movement. In new environments, the robot employs the laser sensor to gather vital information and stores it for training purposes. By optimizing its motion strategy, the robot seeks to maximize the reward for task completion. Rewards consist of both external environmental rewards and internal rewards from the intrinsic reward mechanism. These reward components shape the robot's navigation strategy, which is refined over iterative training, enhancing task efficiency.
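As a rough illustration of how such an input state and action constraint might be assembled, the following sketch builds a state vector from a frontal laser scan and the target's polar coordinates and clips the commanded velocities. The helper names, beam count, and velocity bounds are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def build_state(laser_scan, robot_pose, goal_xy, n_beams=20):
    """Assemble a navigation state from raw sensor data (illustrative sketch).

    laser_scan : 1-D array of range readings over the 180-degree frontal arc
    robot_pose : (x, y, yaw) of the robot in the world frame
    goal_xy    : (x, y) of the target point
    """
    # Down-sample the frontal scan to a fixed number of beams.
    idx = np.linspace(0, len(laser_scan) - 1, n_beams).astype(int)
    ranges = np.asarray(laser_scan)[idx]

    # Polar coordinates of the goal relative to the robot.
    dx, dy = goal_xy[0] - robot_pose[0], goal_xy[1] - robot_pose[1]
    distance = np.hypot(dx, dy)
    heading = np.arctan2(dy, dx) - robot_pose[2]

    return np.concatenate([ranges, [distance, heading]]).astype(np.float32)

def clip_action(raw_action, v_max=0.5, w_max=1.0):
    """Keep the linear velocity non-negative (no backward motion) and bound both velocities."""
    v = np.clip(raw_action[0], 0.0, v_max)      # forward velocity only
    w = np.clip(raw_action[1], -w_max, w_max)   # angular velocity
    return np.array([v, w], dtype=np.float32)
```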
In terms of network design, as shown in Figure 2, the mobile robot obtains state information such as the laser range readings and its polar coordinates, which together form the input state.

The framework for mobile robot navigation based on intrinsic reward mechanism.
Both critic networks use a similar three-layer fully connected structure and operate concurrently, allowing their parameter values to differ. The target critic networks take the state-action pair (s, a) as input. The state s is fed into a fully connected layer followed by a ReLU activation; the resulting features are then combined with the action a in the subsequent layers to produce the Q-value estimate.
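A minimal PyTorch sketch of such a twin-critic layout is shown below. The hidden layer widths and the exact point at which the action is concatenated are assumptions, since the paper only specifies a three-layer fully connected design with two concurrent critics.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """One critic head: the state is embedded first, then combined with the action
    (layer sizes are assumptions, not values from the paper)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.state_fc = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden + action_dim, hidden)
        self.q_out = nn.Linear(hidden, 1)

    def forward(self, state, action):
        h = torch.relu(self.state_fc(state))                     # state embedding
        h = torch.relu(self.fc2(torch.cat([h, action], dim=-1)))  # fuse with action
        return self.q_out(h)                                      # scalar Q-value

class TwinCritic(nn.Module):
    """TD3 keeps two critics with identical architecture but separate parameters."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.q1 = Critic(state_dim, action_dim)
        self.q2 = Critic(state_dim, action_dim)

    def forward(self, state, action):
        return self.q1(state, action), self.q2(state, action)
```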
ICM
The ICM consists of a feature extraction network, a forward model, and an inverse model, as shown in Figure 3.

The intrinsic curiosity module.
In the forward model, the features of the current state and the action are used to predict the features of the next state; the error between the predicted and actual next-state features serves as the intrinsic curiosity reward.
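The sketch below outlines a generic ICM of this kind in PyTorch, with the curiosity reward taken as the forward-model prediction error as described above. The network sizes, the mean-squared-error losses, and the handling of continuous actions are assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Minimal intrinsic curiosity module sketch (dimensions are assumptions)."""
    def __init__(self, state_dim, action_dim, feat_dim=64):
        super().__init__()
        # Feature extraction: encode raw states into a feature space.
        self.encoder = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))
        # Forward model: predict next-state features from current features and action.
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + action_dim, 128), nn.ReLU(),
                                           nn.Linear(128, feat_dim))
        # Inverse model: predict the action from consecutive state features.
        self.inverse_model = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                                           nn.Linear(128, action_dim))

    def forward(self, state, next_state, action):
        phi, phi_next = self.encoder(state), self.encoder(next_state)
        phi_next_pred = self.forward_model(torch.cat([phi, action], dim=-1))
        action_pred = self.inverse_model(torch.cat([phi, phi_next], dim=-1))

        # Curiosity reward: error between predicted and actual next-state features.
        r_icm = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).mean(dim=-1)
        forward_loss = r_icm.mean()
        inverse_loss = F.mse_loss(action_pred, action)   # continuous actions assumed
        return r_icm.detach(), forward_loss, inverse_loss
```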
REM
The REM designed in this paper presents a novel network architecture. The central idea is to use the outputs of both the prediction module and the target module to estimate the randomness of unknown areas in the environment and then provide an additional reward signal for exploration. This supplementary reward signal amplifies the state randomness experienced by the mobile robot, encouraging the navigation algorithm to explore unknown regions. It reduces the occurrence of local optima traps without altering the intrinsic reward influence factor, thereby increasing the algorithm's exploration efficiency. For instance, when the mobile robot revisits previously traveled paths, it learns that the reward at its current position is significantly smaller than the reward for moving to unexplored regions. This inclines the mobile robot to depart from its current location and venture toward unknown regions. The structure of the REM is illustrated in Figure 4.

The randomness enhanced module.
Initially, the prediction module forecasts the output of the target module for the current input; the discrepancy between the two outputs is used to estimate the randomness of the current region and is provided as an additional exploration reward.
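A minimal sketch of such a prediction/target pair, in the spirit of random network distillation, is given below. The network sizes and the choice of the next state as input are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class REM(nn.Module):
    """Randomness enhanced module sketch: a trainable prediction network tries to match
    a fixed, randomly initialized target network (sizes are assumptions)."""
    def __init__(self, state_dim, out_dim=64):
        super().__init__()
        self.predictor = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                       nn.Linear(128, out_dim))
        self.target = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                    nn.Linear(128, out_dim))
        for p in self.target.parameters():          # target stays fixed
            p.requires_grad = False

    def forward(self, next_state):
        pred = self.predictor(next_state)
        with torch.no_grad():
            target = self.target(next_state)
        # Large error => the state is novel => larger exploration bonus.
        r_rem = 0.5 * (pred - target).pow(2).mean(dim=-1)
        return r_rem.detach(), r_rem.mean()          # (reward signal, predictor MSE loss)
```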
Design of the reward function
In this paper, all intrinsic rewards stop accumulating at the end of a training episode, as illustrated in equation (9), where H represents the moment when the episode terminates. The workflow of the intrinsic reward mechanism consists of the following steps:

Data input: The REM takes the current state as input.

Feature extraction: The current state is mapped to a feature representation.

Prediction module network update: The prediction module is trained to forecast the output of the target module from a large amount of input data; it shares the same input data with the target network. During each training iteration, the mean squared error between the outputs of the two networks is used as the loss function to update the network parameters.

Target module output: The target module produces a predicted value for the next state. This output is compared with the output of the prediction module to compute the error. A larger error indicates a more novel environment and therefore a larger reward value.

Computing the intrinsic reward: The intrinsic reward mechanism generates three reward signals. The first is derived from the error between the next state's features and the output of the prediction module. The second is obtained from the discrepancy between the prediction module and the target module. The final reward signal is determined based on the previous two rewards.
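To make the combination concrete, the following sketch shows one plausible way to merge the external reward with the two intrinsic signals at each step. The weighting coefficients and the cutoff at episode termination are illustrative assumptions; the paper's equation (9) and its influence factors are not reproduced here.

```python
def combined_reward(r_external, r_icm, r_rem, done, beta=0.2, eta=0.1):
    """Per-step training reward sketch: external reward plus weighted intrinsic terms.

    beta, eta : illustrative weights for the ICM and REM reward signals (assumptions).
    done      : True on the terminal step of the episode (time H), after which
                intrinsic rewards stop accumulating.
    """
    if done:
        return r_external                      # intrinsic rewards no longer accumulate
    return r_external + beta * r_icm + eta * r_rem
```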
Experimental design and evaluation methodology
In this paper, a simulation environment was constructed within the Gazebo simulator, and the Pioneer P3DX mobile robot platform was employed to train the navigation model and conduct experimental testing. Gazebo offers a physics-based simulation environment, enabling the simulation of interactions between mobile robots and their environments. Users can control the behavior of the mobile robot and use sensors to gather environmental data.
The method proposed in this paper is implemented with the PyTorch framework and runs on a GeForce RTX 2080Ti GPU. During network training, batch processing is employed, with each batch comprising 40 sets of data randomly sampled from the experience pool. In the training phase, the mobile robot moves at most 300 steps per episode, terminating earlier if it collides with an obstacle or reaches the target point. The maximum training movement is set to 3 ×
To validate the performance of the navigation method based on the intrinsic reward mechanism, this paper uses data from the training episodes as a reference and conducts a statistical analysis of the results of 1000 test episodes for the different DRL navigation methods, including the ICM-based method. Particular emphasis is placed on analyzing episodes in which the navigation method based on the intrinsic reward mechanism exceeds the maximum step limit, thereby verifying the effectiveness of the intrinsic reward mechanism.
This paper employs multiple metrics to assess the mobile robot's performance, including average success steps (ASS), success rate (SR), collision rate (CR), and collision steps (CS). Success steps are the steps the robot takes to reach the target without collision. CR is the ratio of collision episodes to the total number of test episodes. CS is the number of steps taken in episodes that end in a collision. SR is the ratio of episodes in which the mobile robot successfully reaches the target point to the total number of test episodes; it is commonly used to assess the exploration strategy of the mobile robot, as depicted in equation (12).
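The following sketch shows how these metrics could be computed from logged test episodes. The record format and the averaging of CS over collision episodes are assumptions made for illustration.

```python
def evaluate_metrics(episodes):
    """Compute SR, CR, ASS, and CS from a list of test-episode records.

    Each record is assumed to be a dict of the form
    {"steps": int, "outcome": "success" | "collision" | "timeout"}.
    """
    n = len(episodes)
    successes = [e for e in episodes if e["outcome"] == "success"]
    collisions = [e for e in episodes if e["outcome"] == "collision"]

    sr = len(successes) / n                                              # success rate
    cr = len(collisions) / n                                             # collision rate
    ass = sum(e["steps"] for e in successes) / max(len(successes), 1)    # average success steps
    cs = sum(e["steps"] for e in collisions) / max(len(collisions), 1)   # average collision steps
    return {"SR": sr, "CR": cr, "ASS": ass, "CS": cs}
```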
Experimentation and analysis
As illustrated in Figure 5, the chart depicts the range of steps taken by various methods to reach the target point.

Steps counted for reaching destination. (a) TD3. (b) ICM + TD3. (c) IRM + TD3.
From the chart, it is evident that the mobile robot navigation method based on the intrinsic reward mechanism has a distinct advantage in terms of the overall step range. The steps required for the mobile robot to reach the target point are primarily distributed within the 20–60 step range, whereas the other two methods are more concentrated in the 30–80 step range. Moreover, within the 0–80 range, the navigation method based on the intrinsic reward mechanism accounts for 721 episodes, significantly surpassing the 569 episodes of navigation using only the TD3 algorithm and the 577 episodes of navigation based on the ICM.
The mobile robot navigation method based on the intrinsic reward mechanism exhibits a more compact step range, indicating more stable movement during training. The successful steps are primarily concentrated in the 20–60 range, while navigation using only the TD3 algorithm and navigation based on the ICM are more clustered in the 30–80 range. This demonstrates that the mobile robot navigation method based on the intrinsic reward mechanism not only enhances training efficiency but also improves the robustness of navigation to the target point.
Figure 6 presents a bar chart comparing the number of episodes exceeding the maximum step limit for the mobile robot navigation method based on the intrinsic reward mechanism, the navigation method based on the ICM, and navigation using only the TD3 DRL algorithm. The maximum step limit is the maximum number of steps allowed for the mobile robot to reach the target point; if this limit is exceeded, the robot is considered unable to reach the target point in that episode. From the results of 1000 test episodes, the navigation method based on the intrinsic reward mechanism has the fewest episodes exceeding the maximum step limit, with only three. This outcome is superior to the navigation method based on the ICM and indicates that the intrinsic reward mechanism further reduces the likelihood of the robot becoming trapped in local optima while exploring the environment. This improvement is attributed to the intrinsic reward mechanism's use of two types of state errors as intrinsic motivation, which continuously drives the mobile robot to explore unknown environments, demonstrating the effectiveness of the intrinsic reward mechanism.

Comparison of episodes exceeding the maximum step limit.
In conclusion, the statistics of the 1000 test episodes, including ASS, SR, CR, and CS for the mobile robot navigation method based on the intrinsic reward mechanism, the navigation method based on the ICM, and the other DRL navigation methods, show that the method based on the intrinsic reward mechanism achieves a significant improvement over the other methods. A comparison between the proposed method and the other reinforcement learning navigation methods is shown in Table 1.
Performance comparison using different deep reinforcement learning methods.
From the perspective of the SR, the method proposed in this section has an SR of 83.5%, which is 5.6% higher than the navigation method based on the ICM. It significantly surpasses navigation methods solely based on DDPG, A3C, and TD3. The CR is also the lowest among the compared methods, at only 16.2%. The CS provides a more direct indication of the improvement in navigation efficiency brought about by the method proposed in this section. This demonstrates that the combination of the intrinsic reward mechanism with reinforcement learning navigation algorithms not only enhances the exploration efficiency of the mobile robot during navigation, further reducing the occurrence of local optima traps, but also improves the overall navigation performance of the mobile robot.
Conclusion
In this paper, a DRL-based navigation method for mobile robots replaced the localization and map-building modules as well as the local path planning module of traditional navigation frameworks, allowing the robot to move toward the target point while avoiding obstacles. By integrating the intrinsic reward mechanism with the DRL navigation algorithm, the final intrinsic reward value is obtained by calculating two types of state errors. This effectively reduces the likelihood of the mobile robot becoming trapped in local optima during exploration and enhances both training efficiency and navigation performance. Experimental results demonstrate that the navigation algorithm based on the intrinsic reward mechanism significantly reduces instances in which the mobile robot falls into local optima: only three of the 1000 test episodes encountered local optima, and the SR increased to 83.5%. However, the following limitations exist in our method, and we aim to address them in future work:
First, this study conducted training and testing in a static obstacle environment. In the future, dynamic obstacles can be incorporated to improve the obstacle avoidance performance of mobile robots and make the simulation more applicable to real-world scenarios. Second, the intrinsic reward mechanism is based on state prediction errors, which can lead to inaccuracies in the generated intrinsic rewards and affect the mobile robot's ability to judge the novelty of the environment.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China, Beijing Nova Program (grant numbers 62272322, 62272323, 20230484409).
Data availability statement
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
