Abstract
Rehabilitation devices such as actuated exoskeletons can provide mobility assistance for patients suffering from paralysis or muscle weakness. To improve the well-being of patients, the control design of exoskeletons is of paramount importance. In this paper, we present a reinforcement learning (RL)-based sliding mode control method for an upper-limb exoskeleton, enabling it to learn to follow a desired trajectory in Cartesian space. The deep deterministic policy gradient (DDPG) algorithm, using an actor-critic architecture, is employed to continuously adjust the non-singular terminal sliding mode control (NSTSMC) inputs based on previous experience. The actor network learns the policy, while the critic evaluates the quality of the actions chosen by the actor. The robustness of the proposed approach is studied when the system is subjected to random disturbances. The simulation results demonstrate that the proposed RL-based approach effectively fulfills exoskeleton tracking tasks. Moreover, a comparative analysis with the standard NSTSMC, computed torque (CT), and RL-based CT controllers shows the superiority of the proposed approach in terms of position tracking error. These findings are further confirmed by several performance evaluation metrics.
Introduction
Spinal cord injury 1 is one of the common causes of paralysis. It involves damage to the spinal cord, which is responsible for receiving and sending signals between the brain and the rest of the body. Because the injury cuts the connection between the brain and the body below the injury, it may lead to tetraplegia, which affects all four limbs. For this reason, new rehabilitation techniques using robotic devices such as exoskeletons have been proposed, with the main aim of enhancing the quality of life of the affected patient. A comprehensive systematic review of the classification of upper-limb robotic devices and different control strategies is presented in Narayan et al. 2
In this work, we are interested in the control of the upper-limb exoskeleton. To provide passive arm movement therapy, the role of the controller is to drive the system so that the upper-limb exoskeleton reaches and maintains a desired position. For this purpose, various control approaches have been proposed in the literature, which can be classified into two categories: kinematic control strategies, which are non-model-based, and dynamic control strategies, which take the model of the system into account in the control design. The advantages of kinematic controllers such as proportional–derivative (PD), 3 proportional–integral–derivative, 4 and model-free control 5 are their simplicity and low computational cost. However, classical PD control laws are effective for systems with constant parameters; they may be inadequate and less robust for nonlinear systems with variable parameters and varying dynamic characteristics, leading to poor performance and even instability. Dynamic controllers such as computed torque (CT), 6 sliding mode control (SMC), 7–10 model predictive control, 11,12 Feedforward-RISE, 13 and adaptive approaches 14–16 are known for their robustness to variations in system parameters and external disturbances. However, they require more computation time, few studies consider the model of the upper limb, and some of these approaches may demand a high control input. Recent years have seen a diverse range of successes in reinforcement learning (RL) methods, with video games among the most popular applications. RL is a branch of machine learning whose origins go back to the 1950s, when the British mathematician Turing 17 discussed the possibility of computers being intelligent. It encompasses methods that enable adaptive autonomy by allowing agents to learn policies that optimize rewards through interactions with the environment. 18
The idea was inspired by the sequential decision-making processes observed in living beings such as humans and animals. Q-learning and its variants stand out as some of the most widely used RL algorithms in robotics and path planning. In Lin and Hwang, 19 a Q-learning process was used to account for kinematic constraints and maintain the balance of a biped robot during imitation. During learning, the robot selects its actions according to a learned policy, and a reward function then provides feedback for those actions, giving a positive or negative reward depending on the robot's state (maintained or lost balance) after performing the action. Another Q-learning-based work was proposed in Khlif et al. 20 for the path planning of a mobile robot.
Concerning exoskeleton robot control, few works have been published on this subject. In Yuan et al., 23 a method based on dynamic movement primitives and RL is designed to generate motion for a walking exoskeleton robot and to transform the trajectory from Cartesian space to joint space. The work proposed a Q-learning approach based on a polynomial neural network for estimating the Q-function in order to improve control performance. Rose et al. 24 use DDPG, where the control torque inputs for the exoskeleton hip, knee, and ankle joints are learned directly from the observed joint information.
In contrast to earlier classical control approaches for rehabilitation robots, our study makes a significant contribution by introducing an enhanced nonlinear sliding mode controller that leverages the benefits of SMC and RL within the Cartesian space of manipulation. Our choice is motivated by an interesting variant of SMC, the non-singular terminal SMC (NSTSMC). 25 This method is designed to drive the system states to a predefined set of sliding surfaces, making the closed-loop response insensitive to uncertainties in internal parameters and external disturbances. The work in Jellali et al. 25 presents an SMC for the upper-limb exoskeleton of the Laboratoire Images, Signaux et Systèmes Intelligents (LISSI), formulated in Cartesian space and designed for passive functional rehabilitation. This control law not only ensures the robustness and precision of the system control but also improves the convergence speed by guaranteeing convergence of the system dynamics to the equilibrium state in finite time, with a significant reduction of the chattering phenomenon. The approach also eliminates the singularity problem associated with conventional terminal SMC. Our proposed approach based on DDPG can be considered an extension of the contribution of Jellali et al. 25 Moreover, we demonstrate the advantages of the DDPG agent with another control approach, namely the CT method. The results show an improvement in the system response despite the system being subjected to higher disturbances than those considered during training. The robustness of the proposed method is studied, and evaluation metrics are computed for four controllers, namely RL-based NSTSMC, RL-based CT (RL-CT), NSTSMC, and CT. This paper is organized as follows: section “Control of three degrees of freedom (3 DOF) upper-limb exoskeleton in Cartesian space” describes the general dynamic model of the exoskeleton and the non-singular SMC. Section “Reinforcement learning (RL)” presents an overview of RL and the DDPG algorithm, followed by a detailed description of our proposed approach. The next section presents the simulation results and discussion. The last section concludes the paper and outlines future work.
Control of three degrees of freedom (3 DOF) upper-limb exoskeleton in Cartesian space
To control an exoskeleton in passive rehabilitation, planning trajectories in Cartesian space is often more convenient, as the tasks involve specific end-effector trajectories such as following a line or a circle. Moreover, this makes it easier for the patient to interact with the robot.
In this section, we describe the general dynamic model of the upper-limb exoskeleton (see Figure 1) in Cartesian space, followed by a concise description of the non-singular terminal sliding mode controller proposed in Jellali et al., 25 of which the work in this paper can be considered an enhancement.

Three degrees of freedom (3 DOFs) upper-limb exoskeleton: (a) kinematic diagram and (b) top view.
Exoskeleton modeling
The dynamic model of the upper-limb exoskeleton can be described as follows:
This yields:
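As a point of reference, upper-limb exoskeleton models of this type are conventionally written in the standard rigid-body form and mapped into Cartesian space through the Jacobian J(q) (a sketch under common notation; the paper's exact symbols and terms may differ):

\[ M(q)\,\ddot{q} + C(q,\dot{q})\,\dot{q} + G(q) = \tau + \tau_d \]

where M(q) is the inertia matrix, C(q,\dot{q}) collects Coriolis and centrifugal effects, G(q) is the gravity vector, \tau is the joint torque input, and \tau_d an external disturbance. With the Cartesian coordinates x related to the joint coordinates by \dot{x} = J(q)\,\dot{q}, the same dynamics can be expressed in Cartesian space as

\[ M_x(q)\,\ddot{x} + C_x(q,\dot{q})\,\dot{x} + G_x(q) = F, \qquad \tau = J^{\top}(q)\,F. \]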
Non-singular terminal sliding mode controller
The design of the sliding mode controller involves a two-step process. Initially, a control law denoted by
Figure 2 shows the general scheme of the NSTSMC in Cartesian space. Tuning the NSTSMC to achieve the desired performance in Cartesian space, balancing fast convergence against chattering, can be a delicate and difficult task. To overcome the drawback of robust gains, which may produce a chattering effect when they are significantly larger than the external disturbance, it is necessary to adjust the control gains adaptively with respect to the external disturbance. This can be accomplished automatically by adjusting the input torque. In the next section, we propose a new approach based on an RL algorithm.
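As an illustration of the structure being tuned, a typical non-singular terminal sliding surface for the Cartesian tracking error e = x_d − x takes the form (a sketch following the usual NSTSM formulation; the surface, gains, and exponents used in the paper may differ):

\[ s = e + \frac{1}{\beta}\,\dot{e}^{\,p/q}, \qquad \beta > 0, \quad 1 < \frac{p}{q} < 2, \]

with p and q odd positive integers, and the control input composed of an equivalent term plus a robust switching term of the form \(-K\,\mathrm{sign}(s)\). The switching gain K must dominate the disturbance bound to keep the states on the surface, which is precisely why an oversized K produces chattering and why adapting the control amplitude online is attractive.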

The block diagram of non-singular terminal sliding mode control (NSTSMC) in Cartesian space.
Reinforcement learning (RL)
In this section, we will provide a comprehensive overview of RL, followed by a detailed description of the DDPG algorithm that we intend to adopt in our proposed control approach.
Brief review
The RL problem is a straightforward formulation of the problem of learning from interaction to achieve a goal. The learner and decision-maker is referred to as the agent. The environment is the entity with which it interacts, encompassing everything external to the agent. 26 The agent and environment interact continually: the agent selects actions, and the environment responds to these actions and presents new situations to the agent. The environment also returns special numerical values, namely rewards, that the agent tries to maximize over time. The agent–environment interaction is presented in Figure 3. The basic idea behind RL, as explained in several references, is that at each time step the agent observes the current state of the environment, selects an action, and receives a numerical reward together with the next state.

Agent–environment interaction.
The {state, action, reward} principle can be formulated as a Markov decision process (MDP). An MDP is a mathematical framework for modeling decision-making, described by: the state space, the reward function, the state transition probabilities, the discount factor, the policy, the value function, and the Q-value (action-value) function.
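For reference, with discount factor \(\gamma \in [0,1)\) and policy \(\pi\), the discounted return and the associated value functions are defined in the standard way (the paper's own notation may differ slightly):

\[ G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\,G_t \mid s_t = s\,\right], \qquad Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\,G_t \mid s_t = s,\ a_t = a\,\right]. \]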
RL algorithms can be categorized into three classes, namely (i) value-based, such as State-Action-Reward-State-Action (SARSA), Q-learning, and deep Q-network agents, (ii) policy-based, such as policy gradient (PG) agents, and (iii) actor-critic, such as proximal policy optimization and DDPG. The first class learns an optimal action-value function and derives the policy from it, the second directly optimizes a parameterized policy, and the third combines both: an actor learns the policy while a critic learns a value function that evaluates it.
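As an illustration of the value-based class, the tabular Q-learning update follows the standard rule

\[ Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[\, r_{t+1} + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t) \right], \]

where \(\alpha\) is the learning rate.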
Deep deterministic policy gradient (DDPG)
The DDPG is a model-free, online, off-policy RL algorithm. A model-free RL algorithm is one in which the agent learns to make decisions and take actions without an explicit model of the environment's dynamics, that is, without a description of how the environment will respond to its actions. The difference between off-policy and on-policy learning is that in off-policy learning the agent learns from experiences generated by a policy different from the one being optimized; in other words, the behavior and target policies differ. On-policy learning, on the other hand, involves learning from experiences generated by the current policy being optimized. A DDPG agent is an actor-critic RL agent that searches for an optimal policy maximizing the expected cumulative long-term reward. The main training steps are listed below and summarized in Algorithm 1.
1. Initialize the actor and critic networks with random weights.
2. Initialize a replay buffer to store experiences. At every time step, the DDPG agent adjusts the parameters of the actor and critic and records previous experiences in this circular experience buffer.
3. Select an action using the current policy (actor network), adding exploration noise to the action.
4. Execute the action in the environment and observe the reward and the next state.
5. Store the experience tuple (state, action, reward, next state) in the replay buffer.
6. If the replay buffer holds enough experiences, sample a mini-batch of experiences randomly from the buffer. For each experience in the mini-batch, update the critic network by minimizing the critic loss and update the actor network using the policy gradient (PG) to maximize the expected discounted reward.
7. Update the target actor and critic networks.
8. Repeat the process for a set number of episodes or until convergence. The corresponding update equations are sketched below.
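For concreteness, the updates referred to in these steps take the standard DDPG form, with critic \(Q(s,a\mid\theta^{Q})\), actor \(\mu(s\mid\theta^{\mu})\), and target networks \(Q'\) and \(\mu'\) (standard notation from the DDPG literature; the paper's hyperparameter values are those of its Tables 1 and 2):

\[ y_i = r_i + \gamma\, Q'\!\big(s_{i+1},\, \mu'(s_{i+1}\mid\theta^{\mu'})\,\big|\,\theta^{Q'}\big), \qquad L(\theta^{Q}) = \frac{1}{N}\sum_i \big(y_i - Q(s_i,a_i\mid\theta^{Q})\big)^2, \]
\[ \nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i \nabla_{a} Q(s,a\mid\theta^{Q})\big|_{s=s_i,\,a=\mu(s_i)}\; \nabla_{\theta^{\mu}} \mu(s\mid\theta^{\mu})\big|_{s=s_i}, \]
\[ \theta^{Q'} \leftarrow \tau_s\,\theta^{Q} + (1-\tau_s)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau_s\,\theta^{\mu} + (1-\tau_s)\,\theta^{\mu'}, \]

where \(\tau_s \ll 1\) is the target smoothing factor, written with a subscript here to avoid confusion with the joint torque \(\tau\).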
The pseudocode of DDPG algorithm
The proposed approach
The problem of rehabilitation control is transformed into an optimal control problem, aiming to design an iterative controller that ensures the exoskeleton's motion trajectory follows the desired training movement. Integrating RL with SMC can be beneficial in control systems: RL is used to adapt and learn the optimal control policy, while SMC provides robustness and external disturbance rejection. SMC usually requires an accurate mathematical model of the upper-limb exoskeleton. RL algorithms, in contrast, can operate with less precise dynamic models and in cases where the system dynamics are not fully known. This aligns with our control case, as the model of the exoskeleton may be known a priori, but the model of the upper limb may not be. The RL task in a continuous action space is defined as an MDP, that is, a tuple of states, actions, transition probabilities, policy, rewards, and a discount factor.

Block diagram of the proposed approach.
State space representation
In this case, the DDPG agent receives a state observation from the environment at each time step.
Reward
The reward function, R, plays an important role in RL as it provides evaluative feedback for the system to learn the optimal policy. A well-designed reward function can not only expedite the learning process but also enhance the quality of learning. Therefore, it is highly important to design an appropriate reward function. In this approach, an immediate-reward scheme for the studied robotic system is proposed. Based on the characteristics of the tracking control for the exoskeleton, the immediate reward at each time step is defined as a function of the tracking error.
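Purely as an illustration of such a tracking-based immediate reward (an assumed form, not the paper's exact function), one common choice penalizes the squared Cartesian tracking error and the control effort, with an optional bonus when the error remains within a tolerance:

\[ r_t = -\,e_t^{\top} Q_e\, e_t \;-\; \rho\,\lVert u_t \rVert^{2} \;+\; r_b\,\mathbb{1}\!\left\{\lVert e_t \rVert < \epsilon\right\}, \]

where \(Q_e \succ 0\) and \(\rho \ge 0\) weight the error and effort terms, and \(r_b\) rewards staying close to the desired trajectory.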
Action space representation
In this application, we opt for RL to autonomously adapt the amplitude of the SMC control input introduced in section “Non-singular terminal sliding mode controller.”
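One plausible way to realize this amplitude adaptation (an illustrative assumption rather than the paper's exact coupling) is to let the bounded agent output scale the nominal NSTSMC command at each control step:

\[ \tau_{\text{applied}}(t) = \big(1 + a_t\big)\,\tau_{\text{NSTSMC}}(t), \qquad a_t \in [-a_{\max},\, a_{\max}], \quad 0 < a_{\max} < 1, \]

so that the agent can strengthen or soften the switching action depending on the observed disturbance level.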
DDPG architecture
The DDPG model is implemented using the Reinforcement Learning Toolbox of MATLAB (MathWorks). Training and agent options and hyperparameters are set as indicated in Tables 1 and 2. The critic and actor networks are depicted in Figure 5. The architecture of the critic network consists of fully connected layers and rectified linear unit (ReLU) layers that combine the information from the state and action paths, ultimately leading to a single output representing the Q-value. The dimension of the input layer matches the size of the state observation.

The critic and actor networks.
Parameter settings of both training and agent.
Parameter settings of critic and actor modules.
The architecture of the actor network takes observations as input and produces continuous actions as output. It consists of fully connected layers with ReLU activation functions and a final layer with a hyperbolic tangent (tanh) activation to map the output to a bounded range. The actor is created as a deterministic actor representation over this network, whose input layer matches the dimension of the observation vector.
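As an illustrative sketch of how such a critic and actor can be assembled in MATLAB's Reinforcement Learning Toolbox (the layer widths, observation and action dimensions, and hyperparameter values below are assumptions standing in for Tables 1 and 2; the paper's exact settings may differ):

% Illustrative sketch only: widths (128), observation size (6), action size (3),
% and hyperparameter values are assumptions; API follows the Reinforcement
% Learning Toolbox (R2022a-style object names).
obsInfo = rlNumericSpec([6 1]);                          % assumed observation vector
actInfo = rlNumericSpec([3 1], ...
    'LowerLimit', -1, 'UpperLimit', 1);                  % assumed bounded action

% Critic: state path and action path merged into a single Q-value output
statePath = [featureInputLayer(obsInfo.Dimension(1), 'Name', 'state')
             fullyConnectedLayer(128, 'Name', 'fcState')
             reluLayer('Name', 'reluState')];
actionPath = [featureInputLayer(actInfo.Dimension(1), 'Name', 'action')
              fullyConnectedLayer(128, 'Name', 'fcAction')];
commonPath = [additionLayer(2, 'Name', 'add')
              reluLayer('Name', 'reluCommon')
              fullyConnectedLayer(1, 'Name', 'qValue')];
criticNet = layerGraph(statePath);
criticNet = addLayers(criticNet, actionPath);
criticNet = addLayers(criticNet, commonPath);
criticNet = connectLayers(criticNet, 'reluState', 'add/in1');
criticNet = connectLayers(criticNet, 'fcAction',  'add/in2');
critic = rlQValueFunction(criticNet, obsInfo, actInfo, ...
    'ObservationInputNames', 'state', 'ActionInputNames', 'action');

% Actor: observation in, bounded continuous action out (tanh output layer)
actorNet = [featureInputLayer(obsInfo.Dimension(1), 'Name', 'obs')
            fullyConnectedLayer(128, 'Name', 'fc1')
            reluLayer('Name', 'relu1')
            fullyConnectedLayer(actInfo.Dimension(1), 'Name', 'fcOut')
            tanhLayer('Name', 'tanhOut')];
actor = rlContinuousDeterministicActor(actorNet, obsInfo, actInfo);

% DDPG agent with a circular experience buffer and mini-batch updates
agentOpts = rlDDPGAgentOptions('SampleTime', 0.01, 'DiscountFactor', 0.99, ...
    'ExperienceBufferLength', 1e6, 'MiniBatchSize', 64, ...
    'TargetSmoothFactor', 1e-3);
agent = rlDDPGAgent(actor, critic, agentOpts);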
Results and discussion
In this section, training and test results are provided. The training duration for
The scenario is as follows: before the training process, a uniform random noise is added for
Training results
The evolution of the reward and average reward are depicted in Figure 6. The average and episode reward vary significantly during the initial episodes of training as the agent is learning from experiences. This variation is a known step of the RL process, where the agent initially explores the environment to gather information before gradually shifting toward exploiting the knowledge it has gained to maximize rewards. During these early episodes, the agent undergoes both exploration and exploitation phases, which are critical for effective learning. Exploration allows the agent to try out different actions to discover their effects and gather diverse experiences, leading to fluctuations in the rewards obtained. On the other hand, exploitation involves using the acquired knowledge to select actions that are expected to yield higher rewards, contributing to more stable and increased reward patterns over time. We can notice the episodes where there is a clear distinction between exploration and exploitation by analyzing the reward trends and trajectories. In episodes dominated by exploration, the trajectories may appear more unstable as the agent tests various actions, while in exploitation episodes, the trajectories tend to be more focused and directed toward the goal.

The evolution of episode and average reward.
This behavior is verified by plotting the trajectories of some episodes, as depicted by the trajectory paths in Cartesian space (Figure 7).

Trajectory per episode.
A clearer view of the agent’s performance and learning progression is presented in Figure 8, which illustrates the evolution of the Cartesian tracking error per episode. This figure highlights the changes in tracking accuracy over time, showing how the agent’s performance improves as it learns. We can notice that the best results are obtained in the last episodes of training, where the tracking error converges rapidly to zero, indicating that the agent has effectively learned to follow the desired trajectory with minimal deviation. In contrast, during the exploration process, illustrated by episodes 1, 100, 184, and others, the tracking error is significantly higher. This poor tracking is a reflection of the agent’s exploratory actions, where it prioritizes learning over immediate performance. As training progresses and the agent transitions from exploration to exploitation, the tracking error decreases, demonstrating the agent’s improved capability to perform the task accurately. This progression demonstrates the importance of the exploration phase in the learning process, allowing the agent to acquire the necessary knowledge to achieve optimal performance in later stages.

Evolution of tracking error per episode.
To sum up, some episodes may yield lower rewards, while others display an upward trend in episode rewards. Based on the reward structure in our approach, higher rewards signify that the actual trajectory is closer to the desired one. When this occurs, the agent is in an exploitation phase; otherwise, it is engaged in exploration.
Simulation results and comparative study
In the test process, a comparative study is conducted between four control approaches. The first controller is the CT approach. The control design of CT is presented in the Appendix. The second controller proposed by Jellali et al. 25 is the NSTSMC presented in section “Non-singular terminal sliding mode controller.” In the third and fourth schemes, we used the RL approach to adapt both control input signals for the CT and NSTSMC approaches. The fixed control gains for both approaches are summarized in Table 3. The simulation time is equal to 120 s for all controllers.
Control gain setting.
CT: computed torque; NSTSMC: non-singular terminal sliding mode control.
To validate the effectiveness of control approaches, a higher random noise within the interval [

Random noise included after training for all controllers.
In this section, a comparative study is presented under the same parameters and initial conditions. All controllers are compared for the case of disturbance rejection. We selected desired trajectories different from those used in training, where

Comparative study between non-reinforcement learning (RL)-based methods and RL-based methods.

Cartesian coordinates and position errors of non-reinforcement learning (RL)-based methods and RL-based methods.
Summary of computed metrics.
CT: computed torque; RL: reinforcement learning; NSTSMC: non-singular terminal sliding mode control; ISE: integral squared error; RMSE: root mean square error; ITSE: integral time-weighted square error; IAE: integral absolute error.
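For reference, the reported metrics are defined in the usual way over the tracking error e(t), simulation horizon T, and N samples:

\[ \mathrm{ISE} = \int_{0}^{T} e^{2}(t)\,dt, \qquad \mathrm{IAE} = \int_{0}^{T} \lvert e(t)\rvert\,dt, \qquad \mathrm{ITSE} = \int_{0}^{T} t\,e^{2}(t)\,dt, \qquad \mathrm{RMSE} = \sqrt{\tfrac{1}{N}\sum_{k=1}^{N} e_k^{2}}. \]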
Conclusion
In this paper, we proposed an RL-based sliding mode approach to control a 3-DOF exoskeleton. By interacting with the environment, the proposed DDPG agent provides an action that adjusts the gain of the non-singular SMC by automatically adapting the control input signal. Simulations of the proposed approach have shown promising results. The proposed RL approach allows the upper-limb exoskeleton to adapt to changing environments and conditions and to perform well, even in the presence of large random disturbances. An enhancement has been achieved compared with the original NSTSMC approach. Based on several performance evaluation metrics, we also demonstrated that this approach remains effective when the underlying controller is replaced with another one, such as the CT method. Future work may extend this approach to active rehabilitation by performing a collaborative task between the patient and the exoskeleton using a multi-agent approach.
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
