Abstract
Reinforcement learning is a promising approach in control and robotics because data-driven learning removes the need for engineering knowledge. However, it usually requires many interactions with the environment to train a controller. This is a practical limitation in some real environments, for example, robots, where interactions with the environment are restricted and time inefficient. Thus, learning is generally conducted in a simulation environment, and the learned policy is then migrated to the real environment. However, differences between the simulation environment and the real environment, for example, friction coefficients at joints or changing loads, may cause undesired results after migration. To solve this problem, most learning approaches concentrate on retraining, system or parameter identification, or adaptive policy training. In this article, we propose an approach in which an adaptive policy is learned by extracting more information from the data. An environmental encoder, which indirectly reflects the parameters of an environment, is trained by explicitly incorporating model uncertainties into long-term planning and policy learning. This approach can identify the environment differences when the learned policy is migrated to a real environment, thus increasing the adaptability of the policy. Moreover, its applicability to autonomous learning in control tasks is also verified.
Introduction
Recently, reinforcement learning (RL) has shown immense potential for handling complex and large-scale tasks. 1–3 In particular, it has become a useful approach to realizing optimal control in robotics because data-driven learning removes the need for engineering knowledge, 4 which is usually difficult to obtain. 5–7 However, learning is prohibitively slow, that is, the required number of interactions with the environment is impractically large. Even in problems with low-dimensional state spaces or fairly benign dynamics, thousands of trials are usually required in learning. This data inefficiency makes it impractical to apply RL to real robots and prohibits RL approaches in more challenging scenarios. Thus, learning is generally conducted with a simulation model, and a migration process from the simulation environment to the real environment is then required. However, the errors between the simulation environment and the real environment, commonly referred to as the reality gap (RG), make it challenging to apply the learned policy to the real environment. 8 In general, adding measurement sensors can increase the adaptability of a learned policy, 9 but this is both cost inefficient and time inefficient, and furthermore, the difference between the simulation environment and the real environment still cannot be clearly understood. Therefore, it is crucial to train an adaptive policy that can be applied to an environment with high uncertainty.
Suppression of RG caused by model difference and/or uncertainty in policy migration has become a hot research topic. Motor primitives have been introduced to accelerate learning speed and reduce task complexity as well as the number of trials, 10,11 but retraining needs to be conducted after migration. Trials focusing on system identification have also been reported, 12 which provide a framework for solving the problem. System identification can help to generalize the knowledge of the system to unobserved states, thus reducing the number of trials for policy optimization. 13 –15 However, the learned policy still relies on the number of trials and the quality of the data.
On the other hand, uncertainties are treated as noise, which can be handled by a robust control policy. Lee et al. 16 used a Bayesian network to estimate the error among environments and developed a robust policy. However, since uncertainty could not be fully considered, this approach was effective only when the dynamic model was a good approximation of the real environment. Unfortunately, this condition usually cannot be satisfied in a complex dynamic system like an actual robot. 17 To solve the problem, Yu et al. 18 built an online system identification model that was able to consider all the uncertain factors, but the accuracy of Q-value would be decreased since the window of motion history was narrowed, and furthermore, it took model parameters as a training target.
In this article, we propose a recurrent RL approach based on the deep deterministic policy gradient (DDPG) architecture. 19 It achieves an adaptive policy by combining an environmental encoder (EE) with a universal policy. Because a recurrent neural network (RNN) can integrate information across time frames, 20 the EE is built with an RNN from the motion history (time series of state–action pairs). The proposed approach is called recurrent DDPG (RDDPG). The critic network and the EE are trained to estimate Q-values in any possible situation of an uncertain environment. The latent variables between the EE and the critic network are defined as meta-parameters, which are used to identify the parameters in continuous state–action domains. Thus, this approach can give an accurate estimation of the Q-values and consequently achieve an adaptive policy.
This article is organized as follows: the related work is described in the second section, and the key ideas of the proposed approach, that is, learning framework, policy improvement, and unsupervised learning of the EE, are given in the third section. The fourth section describes the simulation experiment and discusses the effectiveness of the proposed approach.
Related work
Deep Q-network is a well-known deep RL method proposed by DeepMind. 1 It achieved massive success in higher-dimensional problems with discrete action spaces, such as Atari games. However, in many tasks of interest, especially physical control tasks, the action space is continuous. To solve this problem, DDPG was proposed. It is an algorithm that concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function and uses the Q-function to learn the policy. DDPG gained great success in observable problems, such as the cart-pole swing-up task and the reaching task. It can learn value functions in a stable and robust way for two reasons: the network is trained off-policy with samples from a replay buffer to minimize correlations between samples, and it is trained with a target Q-network to give consistent targets. 19
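The replay buffer and target network described above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the discount factor `gamma`, the blending rate `tau`, and the toy linear "networks" are assumptions made for the example, not values or code from this article.

```python
import random


def td_target(reward, next_state, done, target_actor, target_critic, gamma=0.99):
    """Bellman target y = r + gamma * Q'(s', mu'(s')) for one transition."""
    if done:
        return reward
    next_action = target_actor(next_state)
    return reward + gamma * target_critic(next_state, next_action)


def soft_update(target_weights, online_weights, tau=0.001):
    """Move each target weight a small step toward the online weight."""
    return [(1.0 - tau) * t + tau * w
            for t, w in zip(target_weights, online_weights)]


def sample_minibatch(replay_buffer, batch_size, rng=random):
    """Uniformly sample stored transitions to decorrelate consecutive steps."""
    return rng.sample(replay_buffer, min(batch_size, len(replay_buffer)))


# Toy usage with hypothetical target networks (simple closed-form functions).
target_actor = lambda s: -0.5 * s
target_critic = lambda s, a: -(s * s) - 0.1 * (a * a)
y = td_target(reward=1.0, next_state=2.0, done=False,
              target_actor=target_actor, target_critic=target_critic,
              gamma=0.9)
```

The slowly moving target weights keep the regression target `y` from shifting at every gradient step, which is the "consistent targets" property mentioned in the text.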
However, the task studied in this article can be classified as a partially observable Markov decision process (POMDP) due to the existence of environmental uncertainty. The agent cannot directly observe the parameters of the environment because they fall outside of the history window. RNN can solve this type of problem since it learns across time series data.
In general, POMDP is a sequential decision-making model where the underlying states of an environment are partially available or the observation received by the agent is an incomplete state. It can be described as a 6-tuple
The parameters of an environment can be expressed by the time series of state–action pairs. Therefore, it is possible to extract information from the time series with an RNN to refine the observation. Thus, we introduced recurrence to RL to build meta-parameters so that the actual Q-values could be estimated from a value network. For the RNN, we used long short-term memory (LSTM), which is designed to learn long-term dependencies in time series and to solve the problem that arises when errors propagate back in time. The problems of vanishing/exploding gradients can be prevented in LSTM by using constant error carousels (CECs). 21
Figure 1 shows the architecture of LSTM. It adds or deletes information with gates, which can selectively allow information to pass through a sigmoid layer

Figure 1. The architecture of LSTM. LSTM: long short-term memory.
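To make the gating idea concrete, here is a minimal sketch of one step of a scalar LSTM cell in the commonly used formulation (forget, input, and output gates plus a candidate value). The variant drawn in Figure 1 may differ in detail, and the parameter layout `p` is an assumption chosen for readability.

```python
import math


def sigmoid(x):
    """Logistic function; its output in (0, 1) acts as a soft gate."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell.

    p maps each gate name to (input weight, recurrent weight, bias).
    Returns the new hidden state h and the new cell state c.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much new information to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate value to be written
    c = f * c_prev + i * g           # additive cell-state path
    h = o * math.tanh(c)
    return h, c


# With all-zero parameters every gate is 0.5 and the candidate is 0.
zero_params = {name: (0.0, 0.0, 0.0)
               for name in ("forget", "input", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
```

The additive update of `c` is the path along which error can flow back over many time steps without being repeatedly squashed.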
Since the physical parameters of the environment are changed randomly, the Q-value changes in each episode of learning even though the policy is unchanged. This change causes the uncertainty of the value network. However, an accurate value network is the premise of an optimal policy. To solve this problem, we use meta-parameters, which are generated by LSTM to reflect the change of the Q-value. Taking meta-parameters as additional input in the learning, the value network can be specific in each episode.
RDDPG architecture
In RL, an agent receives a state $s_t$ and takes an action $a_t$ based on the state $s_t$ at time $t$, and then the environment encounters a new state
where
The objective of RL is to maximize the Q-value shown in equation (2). The key problem here is the lack of parameter identification when the environment is changing, which makes the learned policy weak in adaptability. To address this problem, we introduce recurrence to DDPG for parameter identification and call the improved approach RDDPG. In this approach, the policy determines the action depending on time series data rather than on the current state alone.
DDPG is an actor-critic algorithm, 24 which can learn policies in continuous action spaces. The optimization procedure in RDDPG is to update the policy network and the value network alternately. The process is described in Figure 2, where the LSTM, as an EE, yields meta-parameters as an additional input of the value network and the policy network. The value network is trained to estimate the Q-value by minimizing the temporal difference error. The policy network, that is, a nonstationary, meta-conditioned, deterministic policy, 25 then yields a specific action by maximizing the Q-value. The EE has no task objective of its own. Instead, it is optimized by gradient back-propagation through the value network. Hence, the value network gives a relatively accurate estimation of the Q-value, and the policy network takes an accurate action even though uncertainties exist.

Figure 2. The architecture of RDDPG. RDDPG: recurrent deep deterministic policy gradient.
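The data flow of this architecture can be illustrated with a toy forward pass. The sketch below is not the authors' implementation: a plain tanh recurrence stands in for the LSTM encoder, the meta-parameter is a single scalar `z`, and all weights and function names are hypothetical.

```python
import math


def encode_history(history, w_in=0.5, w_rec=0.5):
    """Toy environmental encoder.

    Folds a sequence of (state, action) pairs into one meta-parameter z,
    starting from a zero hidden state. A plain tanh recurrence is used
    here in place of the LSTM.
    """
    z = 0.0
    for state, action in history:
        z = math.tanh(w_in * (state + action) + w_rec * z)
    return z


def actor(state, z, k_state=-1.0, k_meta=0.3):
    """Meta-conditioned deterministic policy: a = mu(s, z)."""
    return k_state * state + k_meta * z


def critic(state, action, z, w=(-1.0, -0.1, 0.2)):
    """Meta-conditioned value estimate: Q(s, a, z)."""
    return w[0] * state ** 2 + w[1] * action ** 2 + w[2] * z


# The motion history is summarized once and then shared by both networks.
history = [(0.1, 0.0), (0.2, -0.1), (0.15, -0.2)]
z = encode_history(history)
a = actor(0.15, z)
q = critic(0.15, a, z)
```

Because `z` enters both `actor` and `critic`, two environments that produce different motion histories lead to different actions and different value estimates for the same current state.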
Compared with a typical training scenario, in which a teacher and a student are deterministic single-task participants, an EE is a processor of time series shared across different environments. It provides meta-parameters to a single value network and a single policy network to deal with different environments. Explicitly, the EE, which is parameterized by $w_p$, takes the transition
The update rule is
in which Q is parameterized by
Recurrent update
DDPG updates the parameters of a network from samples of replay buffer R, a finite-sized cache consisting of a fixed number of transitions
Here, we tested two types of updates. One is the “zero-state update,” which initializes the hidden state parameters of the EE to zero at beginning points, and the other is the “save-state update,” which saves the hidden state parameters of each step in the replay buffer. We used the “zero-state update” in this research because it does not need to save the hidden states of every step and thus decreases the space complexity of RDDPG.
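A minimal sketch of the zero-state idea follows, assuming the buffer stores whole episodes and a sampled window may start anywhere inside an episode. The class and function names are hypothetical, and `step_fn` stands for any recurrent encoder step.

```python
import random


class EpisodeBuffer:
    """Replay buffer that stores whole episodes of (s, a, r, s_next) tuples.

    No recurrent hidden state is stored, which is what keeps the memory
    footprint small compared with a save-state scheme.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.episodes = []

    def add_episode(self, transitions):
        self.episodes.append(list(transitions))
        if len(self.episodes) > self.capacity:
            self.episodes.pop(0)  # drop the oldest episode

    def sample_sequence(self, length, rng=random):
        """Pick a random episode and a random window inside it."""
        episode = rng.choice(self.episodes)
        start = rng.randrange(0, max(1, len(episode) - length + 1))
        return episode[start:start + length]


def replay_zero_state(sequence, step_fn):
    """Re-run the encoder over a sampled sequence from a zero hidden state."""
    hidden = 0.0
    outputs = []
    for s, a, r, s_next in sequence:
        hidden = step_fn(s, a, hidden)
        outputs.append(hidden)
    return outputs


# Toy usage: three 5-step episodes, capacity of two episodes.
buffer = EpisodeBuffer(capacity=2)
for _ in range(3):
    buffer.add_episode([(0.0, 0.0, 0.0, 0.0)] * 5)
window = buffer.sample_sequence(3, random.Random(0))
hidden_trace = replay_zero_state(window, lambda s, a, h: h + 1.0)
```

The trade-off is that the hidden state at the start of a sampled window is only an approximation of what the encoder would have held at that point in the original episode.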
Buffer R stores
where J is a constant number, and T is the maximum number of steps in an episode.
Since
On the other hand, to perform an update, we need to calculate the gradients of the EE at each step. LSTM usually calculates the gradients by minimizing the error between the predicted output and the target output. 27 But we do not use this process since EE does not take the parameters of the environment as targets in training. In this study, the gradients were calculated with equation (5), where the weight
The algorithm
Errors are inevitable due to the difference between the actual Q-value
There is no feedback in DDPG for distinguishing the models. The policy obtained with DDPG may be optimal for an “averaged” model among all the models, but it cannot provide an optimal policy for tasks with different environments because the Q-value is not related to model changes. In RDDPG, by contrast, the EE associates the learned policy with each specific model because the meta-parameters reflect the model parameters as feedback. The details of RDDPG are shown in Algorithm 1. The EE generates meta-parameters from the time series of state–action pairs. As a result, a policy can be obtained by incorporating
Algorithm 1. RDDPG algorithm.
Experiment
Applying a policy learned on a simulation model to a real environment is the same situation as applying a policy learned in one environment to another environment. To confirm the effectiveness of RDDPG, three types of tasks were constructed, each of which contained an environment with a few uncertainties. We conducted the experiments using a low-dimensional state description with joint angles and positions. The characteristic parameters (the weight and length of each link, the damping of each joint, etc.) were changed randomly within a certain band in each episode. Figure 3(a) shows a two-degree-of-freedom (2-DOF) cart-pole model. Figure 3(b) and (c) shows 2-DOF manipulators without and with a load, respectively. The task shown in Figure 3(a) was to control the pole to keep a vertical position. In each training episode, the mass and length of the pole changed randomly. On the other hand, the tasks shown in Figure 3(b) and (c) were to control the manipulators to reach certain positions in their working spaces.
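The per-episode randomization of characteristic parameters can be sketched as follows. The nominal values, the 20% band, and the uniform distribution are assumptions made for illustration; they are not the ranges listed in Table 1.

```python
import random

# Hypothetical nominal values; not the parameter ranges of Table 1.
NOMINAL = {"pole_mass": 0.1, "pole_length": 0.5}


def sample_parameters(nominal, band=0.2, rng=random):
    """Draw each parameter uniformly from nominal * [1 - band, 1 + band]."""
    return {name: value * rng.uniform(1.0 - band, 1.0 + band)
            for name, value in nominal.items()}


rng = random.Random(0)
sampled = []
for episode in range(3):
    params = sample_parameters(NOMINAL, band=0.2, rng=rng)
    # An environment would be rebuilt with `params` here before the rollout.
    sampled.append(params)
```

Resampling at the start of every episode means the agent never sees the true parameters directly and has to infer them from the motion history.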

Figure 3. Model robots as environments for comparing RDDPG and DDPG. (a) Cart-pole: balance task with variable masses of the pole and the cart. (b) Manipulator and (c) load manipulator: positioning tasks of manipulators without and with load, respectively. The lengths and masses of links 1 and 2, the damping at joints 1 and 2, and the load are variables. RDDPG: recurrent deep deterministic policy gradient; DDPG: deep deterministic policy gradient.
The environments shown in Figure 3(a) and (b) could be considered as having uncertain bounded parameters, while the task shown in Figure 3(c) was a task with an external load. The ranges of the parameters are given in Table 1. Figure 4 compares the performances of DDPG and RDDPG for the different tasks. The total reward was defined as
Table 1. Parameter ranges of the environments.

Figure 4. Comparison of total rewards during the learning of RDDPG and DDPG for the three tasks. The parameters of the environments are varied randomly at the beginning of each episode. (a) Cart-pole. (b) Manipulator. (c) Load manipulator. RDDPG: recurrent deep deterministic policy gradient; DDPG: deep deterministic policy gradient.
where
It can be seen from Figure 4 that RDDPG achieved a higher total reward than DDPG during learning. In the cart-pole task, the two algorithms gave almost the same results, as shown in Figure 4(a), but RDDPG was more stable. This result indicates that meta-parameters can efficiently reflect the parameters of an environment. To further confirm the effectiveness of RDDPG, we compared the control performance of the two algorithms for the model robot shown in Figure 3(c). Figure 5(a) and (b) shows the actions (driving torques) at each step, and Figure 5(c) gives the positioning errors of the end of the manipulator at each step. The positioning errors and residual oscillations with the policy learned by RDDPG were relatively small in comparison with DDPG, and the oscillations of joint torques were also limited. The two algorithms showed different features in adapting to an environment. Although optimal control and feedback control invoke different philosophies, 29 the results obtained in the experiment demonstrate that RDDPG could provide robust performance like a feedback controller, which reduces steady-state errors, whereas DDPG behaved like an open-loop controller.

Figure 5. Action and position errors of the robot shown in Figure 3(c) with certain parameters. (a) Action of DDPG and (b) action of RDDPG: torques generated by the policies of DDPG and RDDPG, respectively. (c) Positioning error: position errors of the end of the manipulator under DDPG and RDDPG. RDDPG: recurrent deep deterministic policy gradient; DDPG: deep deterministic policy gradient.
The physical parameters of the model robots shown in Figure 3 were arbitrarily chosen. The same learned policy could provide almost the same manipulation performance even though the parameters of the environment were changed within an “error band.” Figure 6 compares the performance of the policies learned by RDDPG and DDPG for the cart-pole model and the 2-DOF manipulator model. RDDPG provided a higher total reward than DDPG. To further investigate the capability of RDDPG, the learned policy was applied to the model robot shown in Figure 3(c) with a changing external load. The total rewards are shown in Figure 7. RDDPG performed better than DDPG at all loads, demonstrating an excellent capability in dealing with an uncertain environment.

Figure 6. Performances of the learned policies in tasks with random environment parameters. (a) Parameters of the 2-DOF manipulator task and (b) parameters of the cart-pole task in each episode. The unit of each parameter is given in Table 1. (c) Total reward in the manipulator model and (d) total reward in the cart-pole model, obtained by applying the learned policies at each episode. DOF: degree of freedom.

Figure 7. Performance of the learned policies in the load-manipulator task. The mass of the load increased gradually with episodes, while the other parameters were kept constant. At each test episode, the total rewards were obtained by applying the learned policies under the current condition.
Conclusions
To make a policy learned on a simulation model adapt to a real environment with limited uncertainties, an RL approach called RDDPG was proposed. It features the use of the EE and extensive training in which environment parameters are changed randomly within a limited range. Simulation experiments on three model robots demonstrated that the learned policy could adapt to a dynamic model with high uncertainties and, indeed, identified the parameters of the model in real time. They also showed that RDDPG could considerably reduce positioning errors and residual oscillations of both positions and joint torques compared with traditional DDPG. RDDPG differs from DDPG in how it adapts to an uncertain environment: it relies on the EE, which exploits time series data and enables it to deal with uncertainties in the characteristics of an environment. It is believed that the high adaptability of RDDPG comes from the precomputing of the possible models and the identification of parameters with the EE. This process can avoid excessive reliance on the accuracy of the parameters.
In future work, we will apply the proposed approach to an actual robot to confirm its effectiveness on real problems.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Key Research and Development Plan (2016YFE0128700), the Natural Science Foundation of Hebei Province (E2017202270), the Key Research and Development Plan of Hebei Province (18211816D), and the National Key Research and Development Plan (2017YFB1301002).
