Abstract
Deep reinforcement learning (DRL) provides a new solution for rehabilitation robot trajectory planning in unstructured working environments, which can bring great convenience to patients. Previous research mainly focused on optimization strategies but ignored the construction of reward functions, which leads to low efficiency. Different from the traditional sparse reward function, this paper proposes two dense reward functions. First, the azimuth reward function mainly provides global guidance and reasonable constraints during exploration. To further improve efficiency, a process-oriented aspiration reward function is proposed; it is capable of accelerating the exploration process and avoiding locally optimal solutions. Experiments show that the proposed reward functions accelerate the convergence rate of mainstream DRL methods by 38.4% on average. The convergence mean also increases by 9.5%, and the standard deviation decreases by 21.2%–23.3%. The results show that the proposed reward functions can significantly improve the learning efficiency of DRL methods and thus make automatic trajectory planning of rehabilitation robots practically possible.
Introduction
Trajectory planning is a fundamental problem for a rehabilitation robot. The conventional trajectory planning task of rehabilitation robots has always been completed by doctors. However, the imbalanced doctor-patient ratio and the lack of skilled doctors often cause conflicts and bring inconvenience to patients.1–3 Therefore, autonomous trajectory planning is highly expected. Nevertheless, autonomous trajectory planning for a robot is a challenging task. Patients in rehabilitation training usually have movement restrictions, which requires the robot to avoid points that the patient physically cannot reach (referred to as ban points) during trajectory planning; otherwise, it will cause physical damage to the patient. Traditional trajectory planning methods are usually applicable to structured environments.4,5 However, the working environment of a rehabilitation robot changes with the patient's physical condition, which is difficult to model in advance. In recent years, Deep Reinforcement Learning (DRL) has provided a new solution for trajectory planning tasks in such conditions.6–8 It enables the robot to learn autonomously and plan a feasible trajectory in an unstructured environment. The structure of trajectory planning with DRL is shown in Figure 1. "Trial and error" is the central mechanism of DRL: the agent explores possible motions according to the current state of the work environment and the robot by maximizing the cumulative reward with an optimization strategy. Through the interaction of the agent, the reward function, and the work environment, the robot can accomplish the trajectory planning task in an unstructured environment.9–11

Scheme of Deep Reinforcement Learning for rehabilitation robot trajectory planning.
The representative optimization strategies in DRL include Q-learning, DQN (Deep Q-Network), SARSA (State-Action-Reward-State-Action), and the like.12–14 However, these methods cannot be directly used for the trajectory planning task, because the output action spaces they generate are discrete and cannot meet the needs of trajectory planning with continuous action spaces. To cope with this problem, Lillicrap et al. 15 proposed DDPG (Deep Deterministic Policy Gradient), which makes the output action space continuous through nonlinear approximation. Tai et al. 9 further improved DDPG with a strategy of asynchronous execution. However, the performance of DDPG is restricted by the operation of experience replay. This shortcoming can be overcome by the asynchronous updates in A3C (Asynchronous Advantage Actor-Critic). 16 The multithreaded implementation of A3C also improves learning efficiency observably. However, A3C does not work well in complex environments because of its fixed learning rate, and its robustness is not satisfactory. To solve this problem, DPPO (Distributed Proximal Policy Optimization) 17 was proposed; it introduces a penalty term that reduces the impact of an unreasonable learning rate by providing a more reasonable update proportion. The above methods can solve the trajectory planning problem to some extent. Nevertheless, randomness and blindness remain major problems in DRL methods, and they become more serious when the agent faces an unstructured working environment with ban points. Through previous work, we find that the kernel of this problem is the reward function. Previous research mainly focused on innovation of the optimization strategy but neglected the design of the reward function. Most reward functions used in robot trajectory planning tasks are sparse reward functions.
The value of a sparse reward function is zero everywhere except at a few special places such as the target or a ban point. 18 A sparse reward function always generates a great deal of ineffective exploration and tends to get trapped in locally optimal solutions, which seriously affects the efficiency of DRL methods.19–21 Therefore, this paper mainly focuses on the construction of reward functions; the primary contributions are summarized as follows: (1) Considering the features of the trajectory planning task, this paper proposes two kinds of dense reward functions: the azimuth reward function and the aspiration reward function. Different from a sparse reward function, a dense reward function gives non-zero rewards most of the time. It provides much more feedback after each action, thereby reducing the blindness of exploration in the trajectory planning task. (2) The azimuth reward function is results-oriented; it prompts the agent to choose the action that earns the higher reward, mainly providing reasonable constraints during exploration. According to the characteristics of the trajectory planning task, direction and distance are used to model the azimuth reward function. Experiments prove that the azimuth reward function benefits both convergence speed and robustness. (3) The aspiration reward function is process-oriented; it focuses more on the exploration process than on the final result. In this paper, the agent's familiarity with the environment is defined through aspiration, which is the difference between the predicted features and the actual features. The aspiration reward function stimulates the agent to explore unfamiliar areas; it is therefore capable of accelerating the exploration process and avoiding locally optimal solutions. To predict the features felicitously, a novel feature extraction network, SRU-HM, is also proposed. With the help of SRU-HM, the aspiration reward function performs better with a faster response.
Azimuth reward function
Target searching and ban point avoidance are the two goals of the trajectory planning task. The azimuth reward function, which is composed of a direction reward function and a distance reward function, provides reasonable constraints for the agent from different perspectives. Sections 2.1 and 2.2 explain the two reward functions respectively; section 2.3 introduces the implementation of the azimuth reward function based on them.
Direction reward function
A challenge of trajectory planning in an unstructured environment with ban points is to balance target searching and ban point avoidance. Target searching aims to identify the shortest path, while ban point avoidance puts security first. The two goals can even be in opposition in some cases, since the directions from the rehabilitation robot to the target and to a ban point sometimes overlap. Consequently, a strategy is needed for the agent to choose a reasonable direction, and the direction reward function takes this duty. Inspired by Coulomb's law, 22 this paper regards the relative motion between the target and the end effector of the rehabilitation robot as dissimilar charges attracting each other. Similarly, for ban point avoidance, the relation between a ban point and the end effector can be seen as like charges repelling each other. The direction reward function is built as shown in Figure 2, where

Diagram of direction reward function.
When multiple ban points are involved, each ban point is considered separately, and the results are summed to obtain the final direction reward function.
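Since the exact formulation in Figure 2 is not reproduced here, the following Python sketch only illustrates the Coulomb-inspired idea: the target "attracts" the chosen motion direction and each ban point "repels" it. The dot-product form and the gains `k_attract` and `k_repel` are assumptions, not the paper's coefficients.

```python
import numpy as np

def direction_reward(pos, action_dir, target, ban_points,
                     k_attract=1.0, k_repel=1.0):
    """Illustrative direction reward inspired by Coulomb's law.

    Reward is high when the motion direction points toward the target
    (attraction) and away from every ban point (repulsion). Repulsion
    terms for multiple ban points are summed, mirroring the text.
    """
    a = np.asarray(action_dir, dtype=float)
    a = a / (np.linalg.norm(a) + 1e-8)               # unit motion direction
    to_target = np.asarray(target, float) - np.asarray(pos, float)
    to_target = to_target / (np.linalg.norm(to_target) + 1e-8)
    reward = k_attract * float(a @ to_target)        # attraction to target
    for b in ban_points:                             # one repulsion term per ban point
        to_ban = np.asarray(b, float) - np.asarray(pos, float)
        to_ban = to_ban / (np.linalg.norm(to_ban) + 1e-8)
        reward -= k_repel * float(a @ to_ban)        # penalize moving toward ban point
    return reward
```

Moving straight toward the target with the ban point directly behind yields the maximum reward under this sketch.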
Distance reward function
The distance reward function is also constructed by considering both target searching and ban point avoidance, so it is made up of two parts. Ban point avoidance is a punitive element responsible for keeping the rehabilitation robot a safe distance from ban points. Target guidance provides positive incentives that navigate the rehabilitation robot toward the target.
Ban point avoidance
The characteristic of ban point avoidance is that the closer the robot moves to a ban point, the larger the negative reward becomes. However, if the relative distance is safe enough, ban point avoidance should not interfere with the target guidance task. Clearly, a simple linear function cannot meet these demands, so a Gaussian function is used to model ban point avoidance, as shown in formula (4), where
Target searching
Target searching is a positive motivation that encourages the rehabilitation robot to arrive at the target as quickly as possible. The two cases are shown in formula (5), where
By combining ban point avoidance and target searching, we describe the distance reward function as formula (6).
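As the exact forms of formulas (4)–(6) are not reproduced above, the sketch below only illustrates the described shape: a Gaussian penalty around each ban point plus a distance-based guidance term toward the target. The width `sigma` and the weights `w_ban` and `w_target` are assumptions for illustration.

```python
import numpy as np

def distance_reward(pos, target, ban_points, sigma=0.2,
                    w_ban=1.0, w_target=1.0):
    """Illustrative distance reward combining two parts:

    - ban point avoidance: a Gaussian penalty that grows sharply as the
      end effector nears a ban point but vanishes at a safe distance;
    - target guidance: a term that increases as the target gets closer.
    """
    pos = np.asarray(pos, dtype=float)
    r_ban = 0.0
    for b in ban_points:
        d = np.linalg.norm(pos - np.asarray(b, float))
        r_ban -= w_ban * np.exp(-d ** 2 / (2 * sigma ** 2))  # near zero when d is safe
    d_target = np.linalg.norm(pos - np.asarray(target, float))
    r_target = -w_target * d_target                          # higher when closer to target
    return r_ban + r_target
```

Note how the Gaussian term leaves target guidance essentially undisturbed once the robot is far from the ban point, matching the requirement stated above.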
Implementation of azimuth reward function
In the actual trajectory planning task, distance and direction are both important factors that should be considered comprehensively. However, the working environment of a rehabilitation robot is intricate and involves elements of both the robot and the patient. Consequently, the weights of the two items in the azimuth reward function vary across scenarios. In this paper, we introduce a weight vector

Region division around a ban point: the surrounding region is divided into danger, warning, and safety zones. Darker red indicates a higher risk of collision.
To improve learning efficiency while ensuring safety,
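One plausible way to realize the region-dependent weight vector is sketched below: the weighting shifts toward the distance (safety) term as the robot enters the warning and danger zones of Figure 3. The numeric weights and zone names are assumptions chosen only to illustrate the mechanism.

```python
def azimuth_reward(r_direction, r_distance, region):
    """Illustrative weighted combination of the direction and distance
    rewards. The weight vector depends on which zone around a ban point
    the end effector currently occupies (see the region division figure).
    """
    weights = {
        "safety":  (0.7, 0.3),   # open space: direction guidance dominates
        "warning": (0.4, 0.6),   # approaching a ban point: shift toward safety
        "danger":  (0.1, 0.9),   # very close: the distance penalty dominates
    }
    w_dir, w_dist = weights[region]
    return w_dir * r_direction + w_dist * r_distance
```

With a fixed positive direction reward and negative distance reward, the combined value drops as the robot moves from the safety zone into the danger zone, which is the intended safety-first behavior.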
Aspiration reward function
Structure of aspiration reward function
A locally optimal solution is a common problem that perplexes DRL methods. The reason is that most DRL methods only adopt utility reward functions. In this pattern, positive rewards are given if the actions meet expectations; on the contrary, if the actions deviate from expectations, the agent gets a negative reward. Although the agent can complete the exploration task with a utility reward function in some conditions, the trap of the locally optimal solution is usually unavoidable and the learning efficiency is often not satisfactory. 23 To solve these problems, the aspiration reward function is proposed. Its idea is to increase the agent's desire to explore the unfamiliar environment, since the agent's familiarity with the environment affects its strategy adjustment. Compared with the traditional mode, an agent driven by both the aspiration reward function and a utility reward function is more reasonable: it has higher learning efficiency and is more consistent with human learning habits. In this paper, the aspiration reward for the agent is negatively related to its familiarity with the current working environment. 24
The structure of the aspiration reward function is shown in Figure 4. The core idea is to regard aspiration as the accuracy of the agent's prediction of status feature changes. The aspiration reward function is composed of a feature extractor and an SRU-HM neural network; the former extracts status features and the latter is responsible for feature prediction. The difference between the extracted status feature and the predicted status feature is used to calculate the aspiration reward. Considering that we are calculating the aspiration reward in time note

Structure of aspiration reward function.
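The prediction-error idea above can be sketched in a few lines: the larger the mismatch between the predicted and the actually extracted status feature, the less familiar the state and the larger the bonus. The mean-squared-error form and the coefficient `scale` are assumptions, since the paper's exact formula is not reproduced here.

```python
import numpy as np

def aspiration_reward(predicted_feat, actual_feat, scale=0.5):
    """Illustrative aspiration reward: the agent's 'unfamiliarity' with a
    state is measured as the error of its own feature prediction, so
    poorly predicted (unfamiliar) states earn a larger exploration bonus.
    """
    err = np.asarray(actual_feat, float) - np.asarray(predicted_feat, float)
    return scale * float(np.mean(err ** 2))  # zero when prediction is perfect
```

As the agent revisits a region and its predictions improve, this bonus decays toward zero, which is exactly the negative relation between reward and familiarity described above.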
Recurrent neural network with hierarchical memory (SRU-HM)
In previous works, researchers usually used fully connected networks for feature prediction.25,26 Fully connected networks adopt a stacked structure that is easy to implement, but in practice the number of layers is difficult to determine, since a shallow network cannot predict status features accurately while a deep network is difficult to train and time-consuming. It is challenging to make a suitable status feature prediction with a relatively simple network structure. To cope with this problem, this paper proposes a recurrent neural network with hierarchical memory (SRU-HM). Compared with the traditional stacked structure, its built-in memory mechanism can retain long-term historical information. The inner and outer layers of the hierarchical recurrent neural network are connected so that the inner and outer memory cells can access each other's information. The structure of SRU-HM is shown in Figure 5.

The overall network structure of SRU-HM.
In the hidden layer of SRU-HM, the inner SRU unit is embedded in the outer unit to build a layered network. The input information is sent from the outer unit to the inner unit and returned after processing. The internal process of SRU-HM is shown in Figure 6, where

Internal structure of SRU-HM: The part in the green box is loop nesting.
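Since the gating equations of SRU-HM are given only in Figures 5 and 6, the toy NumPy cell below illustrates just the nesting idea: an inner recurrent unit is embedded in the outer one, and the two hidden states exchange information at every step. The weight shapes, `tanh` activations, and initialization are all assumptions, not the actual SRU-HM design.

```python
import numpy as np

class NestedRecurrentCell:
    """Toy sketch of a hierarchically nested recurrent unit.

    The outer unit consumes the input together with the inner memory;
    the inner unit then updates itself from the fresh outer state plus
    its own memory, so both memories remain mutually accessible.
    """

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W_out = rng.normal(0.0, 0.1, (n_hidden, n_in + n_hidden))
        self.W_in = rng.normal(0.0, 0.1, (n_hidden, 2 * n_hidden))
        self.h_out = np.zeros(n_hidden)   # outer memory
        self.h_in = np.zeros(n_hidden)    # inner memory

    def step(self, x):
        # outer unit: input x plus the inner memory from the last step
        self.h_out = np.tanh(self.W_out @ np.concatenate([x, self.h_in]))
        # inner unit: processes the outer state along with its own memory
        self.h_in = np.tanh(self.W_in @ np.concatenate([self.h_out, self.h_in]))
        return self.h_out
```

Unlike a stacked feed-forward predictor, the recurrence lets earlier observations influence later predictions without adding more layers, which is the motivation stated above.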
Implementation of reward functions in DRL
In this part, we explain how to implement the proposed reward functions in the major DRL methods. From previous work, it can be found that DRL methods with both an actor network and a critic network (A–C frame) perform much better than those using an actor network (A frame) or a critic network (C frame) alone. Therefore, this paper mainly discusses the implementation and comparison of the reward functions on methods with the A–C frame.
The learning process of a DRL method with the proposed reward functions is shown in Figure 7; it comprises four stages: initialization, action selection, reward calculation, and network training. At the first stage, initialization, the actor network

Diagram of DRL in A–C frame with proposed reward functions.
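The four-stage loop of Figure 7 can be summarized in a high-level sketch. The `env` and `agent` interfaces (`reset`, `step`, `act`, `update`) are hypothetical placeholders standing in for the A–C frame method and the V-REP environment; only the control flow reflects the text.

```python
def train(env, agent, azimuth_fn, aspiration_fn, episodes=1000):
    """Sketch of the learning process with the proposed reward functions:
    initialization, action selection by the actor, reward calculation
    (azimuth + aspiration), and actor-critic network training.
    """
    for _ in range(episodes):
        state = env.reset()                                  # stage 1: initialization
        done = False
        while not done:
            action = agent.act(state)                        # stage 2: actor picks an action
            next_state, done = env.step(action)
            reward = (azimuth_fn(state, action, next_state)  # stage 3: dense reward
                      + aspiration_fn(state, next_state))
            agent.update(state, action, reward, next_state)  # stage 4: A-C update
            state = next_state
```

In practice the two reward terms would be the azimuth and aspiration functions defined earlier, with the weighting between them chosen per scenario.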
Experiments and discussion
In this section, three sets of experiments are conducted to verify the performance of the proposed reward functions. Convergence rate, mean value, and standard deviation are selected as evaluation indicators: convergence rate and mean value test the learning efficiency, and standard deviation tests stability and robustness. In the experiments, the proposed reward functions are implemented in the state-of-the-art DRL methods Asynchronous Advantage Actor-Critic (A3C) 16 and Distributed Proximal Policy Optimization (DPPO). 17 The basic reward function is used for comparison; it is a sparse reward function that gives 0 in most cases, except when the robot reaches the target or a ban point. In the first two experiments, the azimuth reward function and the aspiration reward function are put to use respectively, and the last set of experiments is conducted with both reward functions. Simulation experiments are conducted in V-REP.27,28 We simulated two working environments, as shown in Figure 8; the rehabilitation robot needs to reach the target point without touching any ban point to complete the trajectory planning task.

Simulation of the rehabilitation robot trajectory planning task: Scene A has one ban point and Scene B has two. Scenes A and B represent tasks of different complexity.
Each experiment is conducted five times, and the results are averaged to eliminate contingency. In the experiments, the maximal reward for the DRL method is set to 2000. If the accumulated reward of the agent stably reaches 90% of this upper limit, trajectory planning is considered complete. The configuration used in the experiments is summarized in Table 1.
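The stopping criterion above can be stated precisely in a few lines. The stability `window` (how many consecutive episodes must stay above the threshold to count as "stable") is an assumption, since the text does not specify it.

```python
def converged(episode_rewards, max_reward=2000, ratio=0.9, window=10):
    """Convergence test used in the experiments: trajectory planning is
    considered complete once the accumulated reward stays at or above
    90% of the 2000 upper limit for a stable stretch of episodes.
    """
    threshold = ratio * max_reward            # 1800 with the paper's settings
    recent = episode_rewards[-window:]
    return len(recent) == window and all(r >= threshold for r in recent)
```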
Configuration used in the experiments.
Azimuth reward function
In this section, we apply the azimuth reward function to DPPO and A3C; the experimental results are summarized in Table 2. It can be seen that A3C and DPPO with the azimuth reward function both perform better in convergence and robustness than with the basic reward function. The convergence speed of A3C is accelerated by 18.6%–19.9%, and the improvement for DPPO is 24.5%–35.5%. In terms of the mean value, the two methods also advance by 5.2%–6.1%. The improvement in robustness is more significant: the standard deviations of A3C and DPPO decrease by 32.5% on average. Thus the azimuth reward function not only speeds up learning but also greatly increases the convergence stability of the DRL methods. During exploration, the role of the azimuth reward function is to provide global guidance and reasonable constraints for the agent, so it effectively reduces invalid exploration and improves efficiency. The reward curves of A3C and DPPO are visualized in Figure 9. The reward stays at zero or even negative for some episodes at the early stage of exploration because the rehabilitation robot may touch a ban point during random exploration. By contrast, the azimuth reward function greatly shortens this stage and improves exploration efficiency. In the convergence phase, the curves of the azimuth reward function are more stable as well.
Results of different reward functions. R-F is short for Reward Function; A-A denotes Azimuth + Aspiration.

Convergence curves of Different Reward Functions: The curve in blue is Basic, yellow is Azimuth, green is Aspiration, and red is A-A.
Aspiration reward function
In DRL methods, most reward functions are result-oriented, but the aspiration reward is quite different: it mainly focuses on the process rather than the result. Different from general reward functions that give an external evaluation, the aspiration reward function is more like the personality of the agent, which shows more interest in unfamiliar things. This determines that the improvement brought by the aspiration reward function lies mainly in the convergence speed. Table 2 and Figure 9 show the results: convergence speed is improved by up to 38.1%, and the convergence mean improves by 5.5% as well. However, the aspiration reward function has some negative influence on the standard deviation; this phenomenon is consistent with the essence and original intention of the aspiration reward function. It can be seen in Figure 9 that the benefit brought by aspiration is quite obvious at the early stage of exploration: the portion of the curves where the reward is zero or negative is significantly reduced with the aspiration reward function. In addition to accelerating exploration, avoiding locally optimal solutions is another advantage of the aspiration reward function. An agent with the basic reward function sometimes falls into a locally optimal solution; specifically, the reward does not increase within a certain stretch of episodes even though the method clearly has not converged yet. This case is much more obvious in Scene A of Figure 9. The aspiration reward function avoids this problem in most cases.
Azimuth and aspiration reward function
In this section, the azimuth and aspiration reward functions (referred to as the A-A reward function hereinafter) work together. The results are plotted in Figure 9; as can be seen, both A3C and DPPO with the A-A reward function are superior to the others in all cases. The convergence rate of A3C is increased by up to 37.2% compared to the basic reward function, and the improvement is 39.4%–42.9% for DPPO across the scenes. The convergent mean value went up by an average of 171.3; this improvement is more distinct in the scene with two ban points, which shows that the proposed A-A reward function can effectively cope with complex scenarios. For the standard deviation, robustness is also improved, though slightly less than with the azimuth reward function alone because of aspiration. Considering convergence rate, convergent mean, and robustness comprehensively, the efficiency improvement brought by the A-A reward function is significant. From the visualized results, at the beginning of exploration the curve of the aspiration reward function is better than the others; the desire to explore the unknown environment plays an important role at this stage. As exploration progresses, the agent becomes familiar with the working environment, and the azimuth reward function gradually shows its advantages: the azimuth curve gradually overtakes the aspiration curve. In addition, the azimuth reward function also takes on the duty of safety guarantee, ensuring that the rehabilitation robot does not touch a ban point. At this stage, the aspiration reward function is mainly responsible for preventing the agent from falling into a locally optimal solution. In the convergence stage, the aspiration reward function may slightly disturb the standard deviation, but compared with the improvement in convergence performance it brings, this is negligible. Finally, regarding the performance of A3C and DPPO, DPPO performed better than A3C when using the basic reward function in Scene B.
The reason is that the learning rate of A3C is fixed, while DPPO introduces a penalty mechanism for optimization, so it performs better in complex environments. However, when using the proposed A-A reward function, the performance of the two methods is equivalent in the simple environment (Scene A), and DPPO performs slightly better in the complex environment (Scene B). This shows that the proposed reward functions can also make up for some flaws in the optimization method.
Conclusion
To cope with the inefficiency and blindness of the rehabilitation robot trajectory planning task with DRL methods, this paper puts forward two new dense reward functions: the azimuth reward function and the aspiration reward function. The former provides rational constraints for the agent during exploration, while the latter is capable of accelerating exploration and avoiding locally optimal solutions. To improve the efficiency of the aspiration reward function, a new feature prediction network, SRU-HM, is also proposed. Experimental results demonstrate that major DRL methods with the proposed reward functions improve the convergence rate and trajectory planning quality dramatically with respect to accuracy and robustness. In future studies, we will conduct multi-agent exploration experiments on actual rehabilitation robots. Further research on SRU-HM is also a major task: beyond reward calculation, we aim to make SRU-HM part of the brain of the agent so that it plays a more important role.
Footnotes
Handling Editor: Chenhui Liang
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Shanghai Municipal Science and Technology Major Project (No.2021SHZDZX0103).
