Abstract
This work presents a new optimal control scheme based on the CACM-RL technique applied to unstable systems such as a Two-Wheeled Inverted Pendulum (TWIP). The main challenge in this work is to verify and validate the behaviour of CACM-RL in this kind of system. Learning while maintaining equilibrium is a complex task: it is easy on stable platforms, because the system never reaches an unstable state, but very difficult on unstable ones. The study also investigates how CACM-RL can coexist with a classic control solution. The results show that the proposed method works well in unstable systems, providing better results than a PID controller.
1. Introduction
Path planning and optimal control algorithms for solving the problem of trajectory generation for vehicles and robots have been addressed in several research projects. These mobile platforms are nonlinear dynamic systems whose motion laws have been widely studied [1–2].
According to [2], the goal of motion planning can be stated as: “given an initial configuration and a desired final configuration of the robot, [to] find a path, starting at the initial configuration and terminating at the final configuration, while avoiding collisions with obstacles”. In our case, the trajectory definition is extended to other systems where there are no obstacles, but there are special states which the system should not reach. Furthermore, if a cost function (time, distance or energy) is minimized, the optimal motion problem is addressed. Besides solving the basic problem, intelligent vehicles must exhibit optimal behaviour in real scenarios. This imposes an additional goal: the design of efficient motion planning algorithms for autonomous vehicles with restricted computational resources.
Global optimal motion planning of vehicles and robots remains an open problem. Research carried out in pursuit of optimal motion planning has relied on solutions based on two different families of trajectory generation methods.
There are other research techniques that focus on achieving optimal motion planning based on a closed-loop approach but without a mathematical model of the vehicle. These techniques are based on reinforcement learning [10], where the optimal motion is estimated through interaction between the vehicle and its environment. However, the achieved solution depends heavily on the sample period. Other approaches, based on cell mapping techniques, offer an effective numerical means of obtaining global and local solutions to the nonlinear optimal control problem [11–16]. Optimal control problems are very difficult to solve both analytically and numerically, but these approaches provide adequate solutions. Taking into account the advantages associated with reinforcement learning methods and cell mapping techniques, in [15] a new optimal motion planning algorithm called CACM-RL is proposed by the authors.
2. Two-wheeled inverted pendulum
The inverted pendulum problem is a classic problem in control engineering. It is frequently used as a practical exercise in teaching laboratories and has even developed into a practical system of transport, which was first marketed in 2001 and is known as Segway®.
The objective of a Two-Wheeled Inverted Pendulum system (TWIP) is to reach a balanced state. This state corresponds to a vertical position with the body at the top. In this work, we plan to balance and control the position of the body by means of a particular optimal control technique developed in our group.
The TWIP used in this work is built from LEGO® NXT components together with two motors, one gyroscope, and the control software. The gyroscopic sensor can detect rotation using a single-axis gyro based on a quartz resonator. The control software is based on the CACM-RL technique [15]. The TWIP model used in this work is represented in Figure 1.

Two-Wheeled Inverted Pendulum built with LEGO® Mindstorms® NXT 2.0 kit.
The developed control software has been designed in compliance with all constraints and memory limits of the hardware. The platform integrates a time processor unit in order to perform timing control actions independently from the rest of the operations; a gyroscope and an encoder are the only sensing systems used to monitor the state variables. The state variables used in this work are X = {x1, x2, x3, x4}, defined in Table 1.
State variables
The angular velocity x2 is obtained by reading directly from the gyroscope, and the linear velocity x4 from the position given by the encoders coupled to the motors. x1 is obtained by integrating the angular velocity. Finally, x3 is updated according to the information obtained from the encoders. The control actions used for acting on the system are the voltages applied to the motors.
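As a minimal sketch of the state update described above, the following Python function integrates the gyro rate for x1 and differentiates the encoder position for x4. The control period DT and all numeric values are illustrative assumptions, not the values used on the real NXT platform.

```python
DT = 0.004  # hypothetical control period in seconds (illustrative)

def update_state(x, gyro_rate, encoder_pos, prev_encoder_pos):
    """Update the TWIP state vector X = {x1, x2, x3, x4}.

    x1: body angle (integrated from the gyro rate)
    x2: body angular velocity (read directly from the gyro)
    x3: position (from the wheel encoders)
    x4: linear velocity (differentiated from the encoder position)
    """
    x1, x2, x3, x4 = x
    x2 = gyro_rate                               # direct gyro reading
    x1 = x1 + x2 * DT                            # numerical integration of the rate
    x4 = (encoder_pos - prev_encoder_pos) / DT   # velocity from the encoder delta
    x3 = encoder_pos                             # position from the encoder
    return (x1, x2, x3, x4)
```

For example, with a gyro reading of 0.5 rad/s and an encoder advance of 2 mm over one period, `update_state((0, 0, 0, 0), 0.5, 0.002, 0.0)` returns the updated four-variable state.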
3. Optimal control techniques
Optimal control techniques can be divided into two classes: (1) those based on a dynamic model of the system and its environment, and (2) those where a model of the system is not available. Cell mapping techniques applied to the control of unstable and nonlinear systems are considered in this study. These techniques need a mathematical model of the system in order to derive an optimal control law. The second class, however, requires techniques that learn from experience, so that once sufficient knowledge of the system has been obtained, an optimal control law can be derived. These techniques are called reinforcement learning methods. They are widely applicable when computationally feasible and robust on unstable and non-linear systems. There are two types of these learning methods: indirect and direct. An indirect method relies on a system identification procedure to create an explicit model of the controlled system, from which the control rule is determined. Direct methods determine the control rule without a model of the system.
The new proposed algorithm for solving the optimal control problem in a TWIP system is based on a particular case of a cell mapping technique applied to the control of unstable and nonlinear systems, combined with another technique based on the direct method of reinforcement learning.
Cell mapping techniques
Cell mapping techniques combine, on the one hand, an efficient application of numerical methods in order to integrate non-linear (even unstable) systems and, on the other, Bellman's Principle of Optimality to find the optimal control efficiently. The result is a new optimal control method based on a cell state space, able to design efficient optimal controllers for highly unstable and nonlinear systems. The design method of controllers based on cell mapping techniques [14] can be divided into two stages:
Obtain a family of cell-to-cell mappings that constitutes the necessary knowledge to calculate the optimal control laws associated with the states of the system.
Search through the optimal control laws, taking into account the Principle of Optimality (intelligent search techniques). The process finishes when an Optimal Control Table (OCT) is found, which acts as a controller.
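The second stage can be sketched as a backward search over the stage-1 knowledge. The following Python fragment is an illustrative sketch, not the authors' implementation: it assumes the cell-to-cell mappings are stored as a dictionary `{(cell, action): (image_cell, cost)}` and applies Bellman's Principle of Optimality with a Dijkstra-style search to fill the OCT.

```python
import heapq

def build_oct(transitions, goal):
    """Search the family of cell-to-cell mappings (stage 1's output) to
    produce an Optimal Control Table: {cell: (best_action, cost_to_goal)}."""
    # Invert the mapping: which (cell, action, cost) triples lead into each cell?
    incoming = {}
    for (cell, action), (image, cost) in transitions.items():
        incoming.setdefault(image, []).append((cell, action, cost))

    oct_table = {goal: (None, 0.0)}
    frontier = [(0.0, goal)]                 # backward search from the goal cell
    while frontier:
        cost_to_goal, cell = heapq.heappop(frontier)
        if cost_to_goal > oct_table[cell][1]:
            continue                         # stale queue entry
        for prev, action, step_cost in incoming.get(cell, []):
            candidate = cost_to_goal + step_cost
            if prev not in oct_table or candidate < oct_table[prev][1]:
                oct_table[prev] = (action, candidate)  # better action found
                heapq.heappush(frontier, (candidate, prev))
    return oct_table
```

By the Principle of Optimality, each cell ends up storing the first action of an optimal path to the goal, which is exactly what the OCT needs to act as a controller.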
Cell-to-cell mapping methods are based on a discretization of the state variables of the system, defining a partition of the state space into cells. A cell-to-cell mapping can be derived from the dynamic evolution of the system. In [14], the Control Adjoining Cell Mapping (CACM) algorithm for optimal control of highly nonlinear systems is proposed. This method is based on the Adjoining Cell Mapping (ACM) technique, whose central concept is the creation of a cell mapping where only transitions between adjoining cells are allowed.
The adjoining property states that the distance between any cell and its map (image) is equal to some predefined integer value k.
The integration time of each transition is determined adaptively to enforce the adjoining property. This property provides substantial improvements with respect to other optimal control techniques, since the CACM algorithm only computes the meaningful information required to obtain a good approximation to the optimal solution. Sometimes it may happen that the transition never maps into a cell at the required distance k.
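The adaptive integration just described can be sketched in one dimension as follows. This is an illustrative sketch under stated assumptions, not the CACM implementation: `f` stands in for the system dynamics, a plain Euler step is used, and the cell grid is a uniform 1-D partition; a transition that never reaches a cell at distance k within `t_max` is reported as unresolved.

```python
def adjoining_map(x0, u, cell_size, k, f, dt=0.001, t_max=1.0):
    """Integrate x' = f(x, u) from x0 until the image cell is exactly k cells
    away from the origin cell (the adjoining property).

    Returns (image_cell, elapsed_time), or (None, None) if no cell at
    distance k is reached within t_max.
    """
    origin_cell = int(x0 // cell_size)
    x, t = x0, 0.0
    while t < t_max:
        x = x + f(x, u) * dt                 # simple Euler integration step
        t += dt
        image_cell = int(x // cell_size)
        if abs(image_cell - origin_cell) == k:
            return image_cell, t             # adjoining property satisfied
    return None, None                        # transition never resolved
```

Because the elapsed time is returned with the image cell, each transition carries its own (adaptive) duration rather than a fixed sampling period.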
When building the OCT, a cost function is defined to indicate the cost for a control action to map a cell to its image. The cost can be defined in terms of time, energy, distance or any other factor. Since a cost function is specific for a cell, it can be used as a local performance measure for controller evaluation. An overall performance measure for a controller can be generated from the local measures by simply averaging all the costs on all controllable cells.
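The overall performance measure described above reduces to averaging the local costs over the controllable cells. A minimal sketch, assuming the hypothetical OCT layout `{cell: (action, cost)}` where uncontrollable or goal cells carry `None` as their action:

```python
def controller_performance(oct_table):
    """Overall performance measure for a controller: the average of the local
    cost-function values over all controllable cells (cells that have a
    defined optimal action)."""
    costs = [cost for action, cost in oct_table.values() if action is not None]
    return sum(costs) / len(costs) if costs else float("inf")
```

A lower average means the controller reaches the goal more cheaply (in time, energy or distance, depending on the chosen cost function).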
Reinforcement Learning
Reinforcement learning methods only require a scalar reward (or punishment) to learn to map situations (states) to actions [10]. They only need to interact with the environment in order to learn from experience. The knowledge is saved in a look-up table that contains an estimation of the accumulated reward for reaching the goal from each state under the applied policy. The objective is to find the actions (policies) that maximize the accumulated reward in each state.
The convergence of this algorithm towards the optimal policy is achieved through the classical Q-learning update rule [10]:

Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]

where α is the learning rate, γ is the discount factor, r is the immediate reward, and s′ is the state reached after applying action a in state s.
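The look-up-table update can be sketched in a few lines of Python. This is an illustrative sketch of the standard tabular Q-learning step, not the authors' code; the table is a dictionary keyed by (state, action), and the values of alpha and gamma are assumptions.

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One step of the tabular Q-learning rule.

    Q is a dict {(state, action): value}; alpha is the learning rate and
    gamma the discount factor (both values here are illustrative).
    """
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)  # max over a'
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)      # update rule
    return Q[(s, a)]
```

Repeated over enough interaction with the environment, the table converges so that the greedy action in each state is the optimal policy.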
CACM-RL
As described above, cell mapping techniques combine an efficient application of numerical methods for integrating nonlinear systems with Bellman's Principle of Optimality, resulting in an efficient optimal control method for nonlinear systems. Cell-to-cell mapping methods discretize the state variables of the system, defining a partition of the state space into cells, from which a cell-to-cell mapping can be derived. In [12–14], solutions based on cell mapping techniques for the design of optimal controllers are proposed. In particular, [14] implements the Control Adjoining Cell Mapping (CACM) method, which creates a cell mapping where only transitions between adjoining cells are allowed.
It is necessary to define a control vector that can only take a finite number of values, Nu. The control is assumed to be kept constant during a time interval, t. When an action is applied to the system during such an interval, the system may move to a new state, remain in the same state or go to the drain (out of the state space). The knowledge of the system is given by the set of transitions from the different origin states. The maximum size of this set is Nc × Nu, where Nc is the total number of cells or states.
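The transition set bounded by Nc × Nu can be sketched as a small data structure. This is an illustrative sketch (class and method names are invented for the example), showing how each (origin cell, action) pair stores at most one observed image cell, which caps the knowledge base at Nc × Nu entries.

```python
class TransitionSet:
    """Knowledge base sketch: one entry per (origin cell, action) pair, so the
    total size is bounded by Nc * Nu (total cells times control values)."""

    def __init__(self, nc, nu):
        self.nc, self.nu = nc, nu
        self.table = {}                       # (cell, action_index) -> image cell

    def record(self, cell, action, image):
        assert 0 <= cell < self.nc and 0 <= action < self.nu
        self.table[(cell, action)] = image    # later observations overwrite

    def coverage(self):
        """Fraction of the maximum Nc * Nu transitions already learned."""
        return len(self.table) / (self.nc * self.nu)
```

The `coverage` ratio gives a direct measure of how much of the system's dynamics has been explored so far.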
As stated in Section 3, reinforcement learning methods learn from a scalar reward obtained through interaction with the environment [10], storing in a look-up table an estimation of the accumulated reward for each state and applied policy, with the objective of finding the actions (policies) that maximize that reward.
The new algorithm proposed by the authors in [15], CACM-RL, combines the cell mapping techniques and the reinforcement learning approach into a single efficient optimal control algorithm.
CACM-RL deals with different data structures to store the partial results of the learning process. These structures are described in Table 2, below.
CACM-RL algorithm
During the learning phase, we must pay special attention to the generic transitions.
A generic transition is defined as one that occurs within the considered state space and whose final state is different from its origin state.
4. New optimal control scheme
In classic control, the controller usually acts on the real platform through a communication bus: it knows the state of the system thanks to the information provided by the sensors, and it commands the actuators in order to modify that state. When the new optimal control scheme operates simultaneously with the classic control system, the CACM-RL technique simply captures the control actions being applied by the classic controller and annotates the state and the evolution of the system. In this way, it autonomously learns the dynamics of the platform, working towards an autonomous optimal controller.
In this work the classic controller is a PID, which acts on the real platform in a nominal way. CACM-RL incorporates a ‘transitions register’ module, which stores in a database the new states reached and the control actions applied to the real platform in order to learn its dynamics. As the knowledge database is filled, the system's capacity for planning optimal control actions increases. At the bottom of Figure 2, we can see that learning grows until the system reaches the balanced state. The goal is to derive an optimal control policy from the learned dynamics model, which is stored in the knowledge database. The key issue in this work is how to achieve the optimal policy in CACM-RL from a nominal operation of the PID controller. Through the controller alone, however, we cannot reach all desired transitions, and therefore some gaps remain in the knowledge database.

Real time optimal controller operation in learning process.
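The passive operation of the ‘transitions register’ can be sketched as a loop in which the PID acts nominally while CACM-RL only watches. This is an illustrative sketch: `plant_step`, `pid` and `quantize` are hypothetical stand-ins for the real platform, the classic controller and the cell discretization.

```python
def learn_while_pid_controls(plant_step, pid, x0, n_steps, quantize):
    """Passive learning sketch: the PID drives the plant while the
    'transitions register' stores each quantized (state, action) -> next-state
    transition in the knowledge database."""
    register = {}
    x = x0
    for _ in range(n_steps):
        u = pid(x)                            # action chosen by the classic PID
        x_next = plant_step(x, u)             # plant evolves; CACM-RL only watches
        register[(quantize(x), quantize(u))] = quantize(x_next)
        x = x_next
    return register
```

Note that only the transitions the PID happens to provoke are recorded, which is exactly why gaps can remain in the knowledge database.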
With this implementation, as soon as the system begins to acquire knowledge of the dynamics, a control policy is developed. In this way, the control actions suggested by CACM-RL become available in real time to be applied. At the beginning, these actions are similar to the ones provided by the PID controller, but with experience they become more efficient, approaching the optimal ones. In this context, it is possible to switch to the new optimal control scheme as the main controller while continuing to learn from experience. It is also possible to introduce an internal agent that decides when to switch to the optimal controller or return to the classic control.
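The internal switching agent mentioned above can be sketched as a simple coverage check. This is an illustrative sketch under assumed data layouts (OCT as `{cell: (action, cost)}`), not the authors' implementation: the learned action is used when the current cell is already covered by the knowledge database, and the PID action otherwise.

```python
def choose_controller(oct_table, cell, pid_action):
    """Switching agent sketch: prefer the learned optimal action when the
    current cell is covered; otherwise fall back to the classic PID action."""
    entry = oct_table.get(cell)
    if entry is not None and entry[0] is not None:
        return entry[0], "optimal"            # learned controller takes over
    return pid_action, "pid"                  # gap in knowledge: keep the PID
```

More elaborate agents could also switch back to the PID when the system leaves the well-learned region of the state space.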
The learning process is summarized in Figure 2 considering the inverted pendulum as an example and the balancing disturbance as the cost function to optimize.
5. Results
In this section, a characterization of the optimal control performed on the TWIP is analysed. Figure 3 represents the balanced motion of the TWIP: the movements (forwards and backwards) of the robot for reaching the balanced state at the same point, x0. The five steps shown in Figure 3 have been obtained by sampling and dumping the state variables specified in Table 1. Together, these steps form a full control cycle, i.e. the period of one oscillation of the inverted pendulum.

Balanced motion of the TWIP.
The objective of this test is to demonstrate how a non-linear and unstable system can be controlled by the new proposed optimal control scheme. This experience can be extrapolated to any non-linear and unstable system that must be controlled in an optimal way, such as Unmanned Vehicles (UVs) or spacecraft.
The simplicity of the hardware together with the complexity of the mathematical model of the TWIP are the reasons why this kind of system has been chosen for verifying the feasibility of the new optimal control scheme.
In Figure 4, the non-linear variables (φ, φ′) are traced over 8 seconds; a window is highlighted to show one control cycle, which is represented in more detail in Figure 5. The top graphic of Figure 5 shows the control actions delivered by the PID controller (blue line) and by the optimal controller (red line). In the red line, two effects can be identified, related to response time and energy: control actions equivalent to those of the classic controller are performed, but with an improved response time and an associated energy reduction. The evolution of the state variables during a control cycle is shown in the bottom graphic of Figure 5.

TWIP Balancing control cycle.

Comparison between the PID control solution and the new optimal control scheme. The top graphic represents a control cycle while the bottom graphic shows the evolution of the state variables.
It is important to highlight the evolution of the non-linear variables (φ, φ′) in the balanced motion of the inverted pendulum: when the pendulum reaches either limit of the oscillation, φ′ = 0, as in a simple pendulum; at the balanced point, φ = 0 and φ′ reaches its maximum value.
6. Conclusions and future work
In this work, the new optimal control scheme based on the CACM-RL technique has been demonstrated to be a feasible complete solution to control non-linear and unstable systems such as the TWIP, without using any mathematical model of the robot. The new scheme is able to operate along with classic control techniques or be completely autonomous. When sharing control with classic techniques, the new scheme learns the dynamics of the system by watching the evolution of the robot from the control actions imposed by the classic solution. In the second case, it can learn by acting directly on the system. This way, it performs an optimal control in order to lead the system to a specific goal (equilibrium condition).
From the results presented in the previous section, we can conclude that the new optimal control scheme is a suitable and efficient solution for unstable systems: the performed control actions are optimal and the balanced state can be reached faster than with the classic method. Furthermore, CACM-RL has the advantage of enabling learning not only near the equilibrium, but also in areas far from it. It must be highlighted that while the PID controller is acting on the TWIP, CACM-RL keeps learning. For this reason, any external disturbance (within specific limits) that pushes the TWIP out of equilibrium generates new transitions: the system moves to new states and CACM-RL increases its knowledge. In this case, we can say that the TWIP has learnt in a non-linear zone.
Figure 6 shows the CACM-RL response when the system is balanced. Comparing the amplitude of the falling angle φ with Figure 4 (obtained with the PID controller), we can see that the oscillation has been reduced. Likewise, the angular velocity φ′ is slower than in Figure 4. The NxtPower variable, in the background, shows the control action that acts on the TWIP while balancing.

Optimal control with CACM-RL in balanced situation.
In this experiment, the inverted pendulum has been demonstrated to be a simple platform with a very complex mathematical model, which has allowed us to verify the feasibility of the proposed technique based on CACM-RL. The behaviour of the TWIP has been sampled in real time while coexisting with a PID control technique. This has been done using two strategies: 1) learning while the PID controller acts nominally on the platform, and 2) acting directly on the system as an autonomous optimal controller.
7. Acknowledgment
This work has been supported by the MICINN under the grant AYA2011-29727-C02-02 and by LEGO-Electricbricks, who provided the prototype and the development environment.
