Abstract
The article presents a random neural Q-learning strategy for the obstacle avoidance problem of an autonomous mobile robot in unknown environments. In the proposed strategy, two independent modules, namely, obstacle avoidance without considering the target and goal-seeking without considering obstacles, are first trained using the proposed random neural Q-learning algorithm to obtain their best control policies. The two trained modules are then combined through a switching function to realize obstacle avoidance in unknown environments. In the proposed random neural Q-learning algorithm, a single-hidden layer feedforward network is used to approximate the Q-function and estimate the Q-value. The parameters of the single-hidden layer feedforward network are modified using the recently proposed neural algorithm known as the online sequential version of the extreme learning machine, in which the parameters of the hidden nodes are assigned randomly and the sample data can arrive one by one. However, different from the original online sequential extreme learning machine algorithm, the initial output weights are estimated subject to a quadratic inequality constraint to improve the convergence speed. Finally, the simulation results demonstrate that the proposed random neural Q-learning strategy can successfully solve the obstacle avoidance problem. Moreover, higher learning efficiency and better generalization ability are achieved by the proposed random neural Q-learning algorithm compared with Q-learning based on the back-propagation method.
Keywords
Introduction
Obstacle avoidance is one of the primary tasks for autonomous mobile robots (AMRs), whose purpose is to enable mobile robots to arrive at target points without colliding with any obstacles in the working environment. Many approaches have been developed to solve this problem, such as grid methods,1–3 potential field methods,4,5 fuzzy logic methods,6,7 and so on. These methods are usually based on specific environment models and rely heavily on a priori knowledge, such as experience and rules. They also lack the self-learning ability needed to adapt to various unknown environments. Once there is a change in the environment or task, the corresponding strategies need to be updated manually. Thus, it is preferable to incorporate a self-learning function to realize autonomous obstacle avoidance (AOA) in unknown environments. Reinforcement learning (RL) is considered a more appropriate method for accomplishing this task, as it directly interacts with the environment without requiring any prior knowledge of the environment models. RL is a learning process that builds a mapping from environment states to actions by maximizing a value function that estimates the expected cumulative reward in the long term. The most well-known RL method is Q-learning, in which the value function is a function of state–action pairs, called the Q-value. However, ordinary Q-learning can only deal with discrete states and actions. In the AOA task, the states of the environment are generally continuous, so Q-learning cannot be directly applied to it.
To handle this problem, many research studies8–13 have extended the Q-learning method to deal with continuous state spaces by means of neural networks. In all these works, different neural networks, including feedforward and recurrent networks, are used to approximate the Q-values pertaining to specific situations, exploiting their universal approximation capability. The parameters of the networks are updated based on the back-propagation (BP) learning algorithm, where gradients are computed by propagation from the output to the input. As is well known, BP learning algorithms suffer from several practical issues. Specifically, the BP learning method easily converges to a local minimum if the estimated Q-value is not convex with respect to its parameters. 14 It is undesirable for the learning algorithm to stop at a local minimum located far above a global minimum. Meanwhile, the BP learning algorithm has a very slow convergence speed in most applications, including Q-learning. This forces the robot to take a long time, and suffer many collisions, before arriving at the goal.
Recently, a new fast neural learning algorithm referred to as the extreme learning machine (ELM) has been developed for single-hidden layer feedforward networks (SLFNs) in Huang et al., 15 Huang and Siew, 16 and Huang et al.17,18 ELM and its various improvements19–21 have been successfully applied in a range of applications.22–30 It has been shown that ELM can provide better generalization performance at extremely high learning speed.17,18,27,28,30 The universal approximation capability of ELM has been rigorously proved 31 using an incremental method (named the incremental extreme learning machine (I-ELM)). An online sequential learning version of the batch ELM (OS-ELM) has also been developed. 32 In OS-ELM, the parameters of the hidden nodes in the SLFN are randomly generated and fixed; the output weights are then analytically determined. OS-ELM can learn the training data not only one by one but also chunk by chunk (with fixed or varying length), discarding data for which training has already been done.
In this article, a novel random neural Q-learning (RNQL) approach is proposed for collision-free behavior selection of an AMR in unknown environments. In this strategy, the SLFN is used as the function approximator to estimate the Q-value. Different from existing methods, the parameters of the SLFN are adjusted based on OS-ELM. Furthermore, to achieve a fast convergence speed, the initial output weights in the initialization phase of OS-ELM are estimated subject to a quadratic inequality constraint. Simulation results show that the proposed method not only achieves high learning efficiency but also produces good generalization performance compared with existing Q-learning algorithms updated using the BP technique.
This article is organized as follows. In section “Problem definition,” the problem definition for the AOA of an AMR is presented. The details of the proposed RNQL method are described in section “RNQL algorithm.” Section “Performance evaluation” presents a quantitative performance comparison between the RNQL algorithm and the back-propagation Q-learning (BPQL) algorithm for the obstacle avoidance problem of an AMR. Section “Conclusion” summarizes the conclusions from this study.
Problem definition
The study presented in this article is applicable to any AMR, independent of its size and type. Here, we discuss our study with particular reference to an AMR consisting of one unactuated, unsensed steering wheel and two actuated, sensed wheels. The detailed configuration of the robot is shown in Figure 1. This type of chassis provides three degrees of freedom (3-DOF) of locomotion in a vehicle coordinate frame denoted by

Sketch of the AMR.

Configuration of sensors.
When the environments are unknown, the obstacle avoidance problem of an AMR can be considered as a behavior selection task. In this task, the robot must automatically produce a correct action for reaching the destination without collision, according to the environment information perceived by the sensors mounted on the robot. The task details are shown in Figure 3. The blue dots represent the trace of the robot and the red circle dot is the goal, represented by

Diagram of control variables.
In order to navigate the mobile robot to its goal, it is assumed that these variables are always known at each time step

Structure of RL.
However, Q-learning is usually applied to a discrete set of states, and the Q-values of state–action pairs are stored in a table. As the perceived states of an AMR are continuous, a large storage space would be required to store all possible state–action pairs. To solve this problem, the SLFN is applied to approximate the Q-function and used as a Q-value estimator. The state of the robot forms the inputs of the neural network, and the outputs are the Q-values of all the actions. It is worth noting that the AMR in our work operates in changing and unknown environments and needs to explore its environment by consecutively collecting sufficient samples of the necessary experience for learning. Thus, an RNQL method with online learning capability is employed so that the AMR can adapt to the environment without human intervention. In the proposed method, the SLFN is used to approximate the Q-value, and the parameters of the hidden nodes in the SLFN are assigned randomly based on the OS-ELM algorithm. The output weights are then recursively updated in an analytical manner. The design details are described below.
RNQL algorithm
In this section, the design process of the proposed RNQL is presented. The mathematical models of the SLFN and the ELM used in the design are first reviewed.
Mathematical description of unified SLFN
The output of a SLFN with
where
where
For RBF hidden nodes with the Gaussian or triangular activation function,
where
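Since the exact equations are not reproduced above, the following minimal sketch (in Python/NumPy) illustrates how the hidden-layer output of such a unified SLFN can be computed for both node types. The sigmoid additive node and the Gaussian RBF node shown here are the standard choices in the ELM literature and are assumptions for illustration, not details taken from the elided equations.

```python
import numpy as np

def hidden_output(X, A, b, node_type="additive"):
    """Hidden-layer output matrix H of a unified SLFN (sketch).

    X : (N, d) input samples; A : (L, d) hidden-node parameters a_i;
    b : (L,) hidden-node parameters b_i.
    """
    if node_type == "additive":          # G(a, b, x) = g(a . x + b), sigmoid g
        return 1.0 / (1.0 + np.exp(-(X @ A.T + b)))
    if node_type == "rbf":               # G(a, b, x) = exp(-b * ||x - a||^2), Gaussian g
        sq_dist = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-b * sq_dist)
    raise ValueError("unknown node type")
```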
ELM algorithm
For
Equation (5) can be written compactly as follows
where
It has been proven that a SLFN with the ELM algorithm can approximate any nonlinear continuous function to arbitrary accuracy. 18 The following lemma is introduced.
Lemma 1
For a SLFN with additive or RBF hidden nodes and activation function
The ELM algorithm offers a fast neural learning scheme. All the parameters of the hidden nodes
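As an illustration of this point, a minimal batch-ELM sketch is given below, reusing the hidden_output helper sketched earlier: the hidden-node parameters are drawn at random and kept fixed, and the output weights are obtained analytically through the Moore–Penrose generalized inverse. The parameter ranges and function names are assumptions for illustration.

```python
def elm_train(X, T, L, node_type="additive", seed=0):
    """Batch ELM (sketch): random, fixed hidden nodes; analytic output weights."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(L, X.shape[1]))   # random input weights / RBF centres a_i
    b = rng.uniform(0.0, 1.0, size=L)                  # random biases / impact factors b_i
    H = hidden_output(X, A, b, node_type)              # N x L hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T                       # Moore-Penrose solution of H beta = T
    return A, b, beta

def elm_predict(X, A, b, beta, node_type="additive"):
    return hidden_output(X, A, b, node_type) @ beta
```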
RNQL algorithm
In this study, the obstacle avoidance task is achieved by two behaviors: avoidance and goal-seeking. The former behavior is inherently nearsighted, as it only considers how to avoid obstacles and ignores whether this causes the vehicle to deviate from the goal, whereas the latter behavior is inherently farsighted, as it drives the vehicle toward the goal while disregarding potential collisions. Thus, the two behavior modules are independent of each other and their functions may conflict. To achieve their respective goals, they are independently designed using the proposed RNQL algorithm. Although the learning process is similar for both, the state spaces of the two modules are different. For the obstacle avoidance module, the element of the state
The two behaviors are independently designed and then are combined to navigate the robot in a new environment without further learning when their mapping between input state space and output action space is correctly established. To efficiently combine the two behaviors, a switching function is used as the behavior selector to choose one behavior to be used at next action step, which is defined as follows
where
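The exact form of the switching function in equation (9) is not reproduced above. The sketch below therefore only illustrates one plausible threshold-type selector, in which the avoidance behavior is chosen whenever the nearest measured obstacle distance falls below a safe distance d_safe; both the rule and the name d_safe are illustrative assumptions rather than the definition given in equation (9).

```python
def select_behavior(sensor_distances, d_safe):
    """Threshold-type behavior selector (illustrative assumption, not eq. (9))."""
    if min(sensor_distances) < d_safe:
        return "avoidance"      # obstacle nearby: ignore the goal for this step
    return "goal_seeking"       # free space: move toward the target
```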
Estimation of Q-value
In our proposed RNQL algorithm, a SLFN is used to estimate the Q-value, which is given as follows
where
OS-ELM consists of two phases, namely an initialization phase and a sequential learning phase. In the initialization phase, the initial output weights
where
The problem to the solution
The singular value decomposition could be used to solve equation (13). The details are presented in section “Least-square minimization with a quadratic inequality constraint.”
On each step of interaction, the Q-value function is updated with the following equation
where
Based on equation (14), the output weights
where
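The output-weight update described above has the standard OS-ELM (recursive least-squares) form. The sketch below shows one such sequential step in which the regression target for the executed action is taken to be the usual Q-learning target r + gamma * max Q(s', .). The per-step target form, the one-sample chunk, and the variable names are assumptions for illustration, and the hidden_output helper from the earlier sketch is reused.

```python
def oselm_q_step(s, a, r, s_next, A, b, beta, P, gamma=0.9):
    """One sequential OS-ELM (recursive least-squares) update of beta (sketch).

    beta : (L, n_actions) output weights; P : (L, L) matrix maintained by OS-ELM.
    Only the executed action's output receives a new regression target.
    """
    h = hidden_output(s[None, :], A, b)                  # (1, L) hidden activations for state s
    q = h @ beta                                         # current Q-values, (1, n_actions)
    q_next = hidden_output(s_next[None, :], A, b) @ beta
    target = q.copy()
    target[0, a] = r + gamma * np.max(q_next)            # Q-learning target for the executed action

    # standard OS-ELM recursive update for a one-sample chunk
    P = P - (P @ h.T @ h @ P) / (1.0 + (h @ P @ h.T).item())
    beta = beta + P @ h.T @ (target - q)
    return beta, P
```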
After a period of learning, the value of
The optimal policy (equation (18)) is used to choose the best action. According to the state–action value
However, throughout the learning phase of the proposed algorithm, in order to explore all possible actions, a Boltzmann probability distribution is employed to select a possible action
where
where
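A minimal sketch of Boltzmann action selection is given below; since the exact expressions above are not reproduced, the temperature parameter tau and its schedule are treated as given.

```python
def boltzmann_action(q_values, tau, rng=np.random.default_rng()):
    """Boltzmann (softmax) exploration over the discrete action set (sketch)."""
    z = np.asarray(q_values, dtype=float) / tau
    z -= z.max()                       # numerical stabilization
    p = np.exp(z)
    p /= p.sum()
    return rng.choice(len(p), p=p)
```

For example, boltzmann_action(q, tau=0.5) returns an index into the discrete action set; as tau decreases, higher-valued actions are selected with increasing probability, so exploration gradually gives way to exploitation.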
Least-square minimization with a quadratic inequality constraint
In this section, the singular value decomposition 35 will be used to solve equation (13), where
where
and then problem (13) transforms to
where
and the constraint equation
facilitate the analysis of the least-square minimization with a quadratic inequality constraint problem.
Considering equation (24), the vector
is a minimizer of
to equation (13). If this is not the minimum two-norm solution, we assume that
This implies that the solution to the least-square minimization with a quadratic inequality constraint problem occurs on the boundary of the feasible set. Thus, our remaining goal is to
To solve this problem, the method of Lagrange multipliers is commonly used. Defining
we see that the equations
Assuming that the matrix of coefficients is nonsingular, this has a solution
To determine the Lagrange parameter, we define
and seek a solution to
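As stated earlier, the SVD is used to solve equation (13). For illustration, the sketch below solves a least-square problem with a quadratic inequality constraint of the form min ||H*beta - t|| subject to ||beta|| <= alpha by the classical SVD and secular-equation approach outlined above. The constraint bound alpha, the single right-hand-side vector t, and the use of SciPy's brentq root finder are assumptions for illustration; a multi-output target would be handled column by column.

```python
import numpy as np
from scipy.optimize import brentq

def lsqi_svd(H, t, alpha):
    """min ||H b - t||_2 subject to ||b||_2 <= alpha, via the SVD (sketch)."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    c = U.T @ t                                  # projections u_i^T t
    keep = s > 1e-12 * s.max()                   # drop (near-)zero singular values

    b_ls = Vt.T[:, keep] @ (c[keep] / s[keep])   # unconstrained minimum-norm LS solution
    if np.linalg.norm(b_ls) <= alpha:
        return b_ls                              # the constraint is inactive

    # otherwise the minimizer lies on the boundary ||b|| = alpha: find the
    # Lagrange parameter lam > 0 with sum_i (s_i c_i / (s_i^2 + lam))^2 = alpha^2
    def phi(lam):
        return np.sum((s[keep] * c[keep] / (s[keep] ** 2 + lam)) ** 2) - alpha ** 2

    lam = brentq(phi, 0.0, 1e12)                 # phi decreases monotonically in lam
    return Vt.T[:, keep] @ (s[keep] * c[keep] / (s[keep] ** 2 + lam))
```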
Reinforcement signal
The performance of an action
The repulsive potential function
where
where
The attractive potential function
where
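Since the potential functions themselves are not reproduced above, the following sketch only illustrates how a scalar reinforcement signal can be assembled from a repulsive term that is active within an influence distance d0 of the nearest obstacle and an attractive term that decreases as the robot approaches the goal. The particular forms and the constants eta, xi, and d0 are common artificial-potential-field choices used here as assumptions, not the paper's exact definitions.

```python
def reinforcement_signal(d_obs, d_goal, d_goal_prev, eta=1.0, xi=0.05, d0=2.0):
    """Reward built from potential-field terms (illustrative assumption).

    d_obs       : distance to the nearest obstacle after the action,
    d_goal      : distance to the goal after the action,
    d_goal_prev : distance to the goal before the action.
    """
    # repulsive potential: large near obstacles, zero beyond the influence range d0
    u_rep = 0.5 * eta * (1.0 / d_obs - 1.0 / d0) ** 2 if d_obs < d0 else 0.0
    # attractive potential: grows with the squared distance to the goal
    u_att = 0.5 * xi * d_goal ** 2
    u_att_prev = 0.5 * xi * d_goal_prev ** 2
    # reward an action that lowers the total potential, penalize one that raises it
    return (u_att_prev - u_att) - u_rep
```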
Based on the analysis mentioned above, the proposed RNQL algorithm can be summarized as follows:
RNQL algorithm
First, select the type of nodes (additive or RBF) and the corresponding activation function
Step 1: Initialization phase. Initialize the learning using a small chunk of initial training data:
(a) Assign random input weights
(b) Calculate the initial hidden layer output matrix
Step 2: Sequential learning phase:
(a) Obtain the current state
(b) Execute the selected action and obtain the next state of the mobile robot
(c) Calculate the numerical reward/penalty
(d) Calculate the error term
(e) Estimate and update the Q-value
(f) Update
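Putting the pieces together, the sketch below outlines the two-phase structure of the algorithm using the hypothetical helpers introduced earlier (hidden_output, lsqi_svd, boltzmann_action, reinforcement_signal, oselm_q_step). The environment interface, the collect_initial_chunk helper, the size of the initial chunk, and the stopping condition are all assumptions made only to show how the phases fit together.

```python
def rnql_train(env, L, n_actions, gamma=0.9, tau=1.0, alpha=10.0,
               n_init=50, n_steps=5000, seed=0):
    """Two-phase RNQL sketch assembled from the helpers above (assumed names)."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, (L, env.state_dim))      # random, fixed hidden parameters
    b = rng.uniform(0.0, 1.0, L)

    # Step 1: initialization phase on a small chunk of initial training data,
    # with the output weights estimated under the quadratic inequality constraint
    X0, T0 = collect_initial_chunk(env, n_init)          # hypothetical data-collection helper
    H0 = hidden_output(X0, A, b)
    beta = np.column_stack([lsqi_svd(H0, T0[:, j], alpha) for j in range(n_actions)])
    P = np.linalg.inv(H0.T @ H0)                         # assumes H0 has full column rank

    # Step 2: sequential learning phase
    s = env.reset()
    for _ in range(n_steps):
        q = hidden_output(s[None, :], A, b) @ beta       # (a) current state -> Q-values
        a = boltzmann_action(q.ravel(), tau, rng)        #     Boltzmann action selection
        s_next, done = env.step(a)                       # (b) execute the action
        r = reinforcement_signal(env.d_obs, env.d_goal, env.d_goal_prev)  # (c) reward
        beta, P = oselm_q_step(s, a, r, s_next, A, b, beta, P, gamma)     # (d)-(f) updates
        s = env.reset() if done else s_next
    return A, b, beta
```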
Remark 1
Like OS-ELM, the proposed RNQL consists of two phases, namely, an initialization phase and a sequential learning phase. In the initialization phase, the initial weight
Remark 2
The obstacle avoidance task is achieved by an avoidance module and a goal-seeking module. The two modules independently apply the proposed RNQL to find the reference
Performance evaluation
In this section, the performance of the proposed RNQL algorithm is evaluated. First, the avoidance module and the goal-seeking module are independently trained using the RNQL algorithm in different simulation environments. Then, the trained modules are combined to navigate a robot to the goal without colliding with obstacles in unknown environments. Note that all the environments are generated with the simulation software SimRobot (developed by the students and teachers of the Department of Control and Instrumentation, Brno University of Technology).
For the purpose of comparison, the performance of the proposed RNQL algorithm is compared with that of BPQL, which is commonly used in Yang et al., 8 Li et al., 9 Qiao et al., 11 and Ganapathy et al. 12 All the simulations are carried out in the MATLAB 6.5 environment running on an Intel® Core™2 Quad CPU at 2.83 GHz.
Parameter configuration
The parameter values used for all the simulations are summarized in Table 1. For better understanding, the details of how these parameters are determined are described in the following.
Parameter values used for all of the simulation studies.
Performance comparison of avoidance module
In the avoidance module, the AMR is only required to move among many obstacles without collisions and without a specific target. Here, a simple simulation environment with four obstacles is designed to train the robot using RNQL, and the generalization of the RNQL algorithm is then verified in a more complex environment and in a maze after the robot finishes its training. In the avoidance module, the state variable
The performance of the RNQL algorithm is evaluated using two criteria, namely, continuous moving time without collision and training time. Here, 50,000 s is set as the maximum continuous moving time, which is also the maximum simulation time. The comparison of the performance between the RNQL algorithm and the BPQL algorithm is shown in Table 2.
Performance comparison between RNQL and BPQL algorithms for avoidance module.
RNQL: random neural Q-learning; BPQL: back-propagation Q-learning; STD: standard deviation.
For each algorithm, the results are averaged over 20 trials. The average training steps, the average training time, and the trial closest to the mean are shown in Table 2. From Table 2, it can be seen that the mobile robot trained by either the proposed RNQL or the BPQL algorithm can move smoothly for 50,000 s after each trial. However, the training steps required by RNQL are far fewer than those of BPQL. Likewise, the training time of RNQL is much shorter than that of BPQL. For the BPQL algorithm, because the learning parameters of the hidden nodes and the weights connecting the hidden nodes to the output nodes are all tuned within a single training step, the convergence of the algorithm is slow and the learning efficiency is low. It can be concluded that, for the same number of training steps, the robot trained by the proposed RNQL algorithm can move for a longer time without collision than one trained by the BPQL algorithm. Furthermore, a shorter training time is achieved by the proposed RNQL algorithm.
For the purpose of illustration, the results of the RNQL trial closest to the mean are shown in Figure 5(a) and (b). It can be seen from Figure 5(b) that the robot successfully moves in a more complex environment without collision and without any further training. Similarly, the training procedure and the testing result of the BPQL algorithm are shown in Figure 6(a) and (b), respectively. In Figure 6(a), the mobile robot trained by the BPQL algorithm can move for at least 50,000 s without collision in the simple training environment after 6953 training steps. However, in Figure 6(b), it can only move for 6380 s without collision before hitting an obstacle in the complex environment.

Moving trajectories of the robot using RNQL: (a) training environment and (b) testing environment.

Moving trajectories of the robot using BPQL: (a) training environment and (b) testing environment.
The BPQL algorithm is based on the gradient-descent technique for training a SLFN. Gradient-based learning is generally time-consuming and easily becomes stuck at a local minimum if unsuitable initial values of the adjustable parameters are located far from an optimal solution. In Figure 6, the SLFN trained by the BPQL algorithm reaches a sub-optimal solution. In this case, the global state–action mapping cannot be learnt. Thus, the robot collides with the edge of a block in the testing environment, although it can move successfully without collision in the training environment. In the proposed RNQL algorithm, which is based on the ELM algorithm, the problem of training a SLFN is converted into solving a linear system with randomly assigned input-layer parameters. The smallest least-square training error can be reached by the Moore–Penrose generalized inverse approach. In Figure 5, the proposed algorithm can find a better solution than the BPQL algorithm and can therefore move successfully in the complex environments.
To further evaluate the performance of the proposed RNQL algorithm, a more complex environment, namely a maze, is applied here. Figure 7(a) shows that the mobile robot trained by the RNQL algorithm successfully moves for at least 50,000 s without collision in the maze environment. As shown in Figure 7(b), the mobile robot trained by the BPQL algorithm can only move for 9157 s without collision before hitting an obstacle in the environment.

Moving trajectories of the robot in a maze: (a) trained by RNQL and (b) trained by BPQL.
Therefore, from Figures 5–7, it can be concluded that the proposed RNQL algorithm has better generalization ability and higher learning efficiency than BPQL: the mobile robot trained by the proposed algorithm can move without collision in different unknown environments without any further training.
In addition, in the proposed RNQL algorithm, an initialization phase different from that of OS-ELM is presented to estimate the initial value of the output weights in order to improve the learning performance. To verify this point, the performance of the proposed initialization method is compared with that of the original one in Table 3, from which it can be observed that the proposed initialization estimation method has a faster convergence speed, requiring fewer training steps and less training time. It also achieves better generalization ability, moving without collision in the complex environments, whereas with the original OS-ELM initialization phase the robot moves for 41,797 s without collision and then hits an obstacle in the complex environment.
Performance comparison between the proposed initial phase and the original one.
STD: standard deviation.
Performance comparison of goal-seeking module
In the goal-seeking module, the AMR is trained to move to a target point in an environment without any obstacles. In this module, the state variable
where
First, a simulation environment without obstacles, as shown in Figure 8(a), is designed to train the robot using RNQL, where the red circle dot is the target point and the initial position is set as (49, 52, 0). The numerical comparison between the RNQL algorithm and the BPQL algorithm is shown in Table 4. For each algorithm, the results are averaged over 20 trials. The average training steps, the average training time, and the trial closest to the mean are shown in Table 4. The moving trajectories of the robot from the initial training point to the goal are shown in Figure 8; these are the results of the trial closest to the mean. From the table, it can be observed that fewer training steps and less training time are required by the RNQL algorithm than by the BPQL algorithm. This means that the RNQL algorithm has a faster convergence speed than the BPQL algorithm.

Moving trajectories of the robot from the training start position (49, 52, 0) to the goal: (a) RNQL and (b) BPQL.
Performance comparison of RNQL algorithm and BPQL algorithm for goal-seeking module.
RNQL: random neural Q-learning; BPQL: back-propagation Q-learning.
An additional four different initial positions are applied to verify the performance of the goal-seeking behavior module. Figures 9–12 show the moving trajectories from the four different initial positions to the goal based on the RNQL and BPQL algorithms. As shown in these figures, although the initial positions vary, the robot is able to move to the target using both learning algorithms. The numerical performance comparison of the moving steps based on the RNQL algorithm and the BPQL algorithm is shown in Table 5. From the table, it can be observed that fewer moving steps are required by the RNQL algorithm than by the BPQL algorithm. This means that the robot trained by the RNQL algorithm can arrive at the goal faster than the one trained by the BPQL algorithm.

Moving trajectories of the robot from the position (41, 38, 0) to the goal: (a) RNQL and (b) BPQL.

Moving trajectories of the robot from the position (248, 157, 0) to the goal: (a) RNQL and (b) BPQL.

Moving trajectories of the robot from the position (106, 147,

Moving trajectories of the robot from the position (97, 108,
Verification performance comparison of RNQL and BPQL algorithms for goal-seeking module.
RNQL: random neural Q-learning; BPQL: back-propagation Q-learning.
Simulation of obstacle avoidance
In the proposed obstacle avoidance method, once the network parameters of the two behavior modules are completely built through the RNQL algorithm, the two behaviors are combined so that the robot arrives at the given goal position without colliding with obstacles. In this section, the avoidance module and the goal-seeking module are combined to navigate a robot to the target position. When the robot navigates in a given environment, one of the two behaviors must be selected at each action step in order to accomplish its goal. This is performed by the switching function expressed in equation (9). First, the robot is required to move to the target position in an environment with four obstacles from four different initial positions, as shown in Figure 13.

Mobile robot moves to a target from different initial positions without collision in the simple environment: (a) (96, 176, 0); (b) (31, 32, 0); (c) (180, 41, 0) and (d) (28, 219, pi/2).
From the figure, it can be observed that, whatever the initial position of the robot, the robot arrives at the target position along a smooth path while keeping a certain distance from obstacles. A complex environment with many different obstacles, as depicted in Figure 14, is further used to verify the obstacle avoidance performance of the trained robot in reaching the specified target. In addition, a more complex maze environment with different initial and target positions, as shown in Figures 15 and 16, is applied to evaluate its performance. From the figures, one can note that the trained robot can successfully move from different start points to different goals in complex environments. The results of the BPQL algorithm for the three verification environments are given in Figure 17 for the purpose of comparison. From the figure, it can be seen that the robot is navigated successfully to the target in the simple environment but fails in the complex environments. It can be concluded that the proposed obstacle avoidance strategy has good generalization ability and also a good ability to adapt to new environments.

Mobile robot moves to a target from different initial positions without collision in the complex environment: (a) (27, 28, 0); (b) (48, 165, 0); (c) (30, 338, pi/2) and (d) (256, 55, pi/2).

Mobile robot moves to a target closer to the obstacles from different initial positions in a maze: (a) (25, 327, -pi/2); (b) (26, 58, pi/2); (c) (101, 237, pi/2) and (d) (253, 23, pi/2).

Mobile robot moves to a target farther from the obstacles from different initial positions in a maze: (a) (25, 327, -pi/2); (b) (26, 58, pi/2); (c) (101, 237, pi/2) and (d) (253, 23, pi/2).

Moving trajectories of BPQL from different initial positions in the three verification environments: (a) (96, 176, 0); (b) (31, 32, 0); (c) (27, 28, 0); (d) (48, 165, 0); (e) (25, 327, -pi/2) and (f) (26, 58, pi/2).
Conclusion
In this article, a simple and efficient learning strategy for the AMR’s obstacle avoidance problem in uncertain environments has been proposed. The core avoidance and goal-seeking modules of the strategy are trained by a novel RNQL algorithm in which a SLFN is used as a Q-function approximator to estimate the Q-value. In the proposed RNQL algorithm, the parameters of the SLFN are tuned by the OS-ELM algorithm. However, different from the original OS-ELM, the initial output weights of the SLFN are estimated subject to a quadratic inequality constraint. The performance of RNQL is compared with that of the BPQL algorithm in obtaining the optimal learning policies for the avoidance and goal-seeking modules. The results indicate that RNQL produces better learning performance in terms of convergence speed, training time, and generalization ability. The proposed learning strategy can effectively navigate the AMR to the goal without colliding with obstacles in unknown environments.
The proposed algorithm is a general way to guide the mobile robot to a target. However, the effectiveness of the presented algorithm depends on certain sensing information, that is, the position of the robot, the coordinates of the goal, and the distances between the target and the obstacles. Therefore, as long as the mobile robot system has positioning and obstacle-detection sensors, the proposed algorithm is an effective method for guiding the robot to its destination. In the case of lacking positioning sensors, the proposed algorithm can still lead the robot to the goal if the target can be identified via some specific sensors. For example, the mobile robot can detect obstacles and the goal with cameras. However, the mobile robot will move randomly when the target is outside the detection range of the camera; once the robot enters the target detection range, it moves toward the target. This causes a longer navigation time. Moreover, a larger safe distance between the target and the nearest obstacles is required, which ensures enough space for the robot to move to the target from different initial positions. If the safe distance is too small, the avoidance module plays the main role, and the robot wanders around the target without being able to reach it. The determination of the minimum safe distance is not an easy task since it depends on many factors; we will address it in our future work.
Footnotes
Academic Editor: Pak Wong
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is funded in part by National Natural Science Foundation of China (Grant No. 61403300), National Science Council of ShaanXi Province (Grant No. 2014JM8337), and the Fundamental Research Funds for the Central Universities.
