Abstract
This work presents a new optimal control scheme based on the CACM-RL technique applied to unstable systems such as a Two-Wheeled Inverted Pendulum (TWIP). The main challenge in this work is to verify and validate the behaviour of CACM-RL in this kind of system. Learning while maintaining equilibrium is a complex task: it is easy on stable platforms, because the system never reaches an unstable state, but very difficult on unstable ones. The study also investigates how CACM-RL can coexist with a classic control solution. The results show that the proposed method works well in unstable systems, providing better results than a PID controller.
1. Introduction
Path planning and optimal control algorithms for solving the problem of trajectory generation for vehicles and robots have been addressed in several research projects. These mobile platforms are nonlinear dynamic systems whose motion laws have been widely studied [1–2].
According to [2], the goal of motion planning can be stated as: “given an initial configuration and a desired final configuration of the robot, [to] find a path, starting at the initial configuration and terminating at the final configuration, while avoiding collisions with obstacles”. In our case, the trajectory definition is extended to other systems where there are no obstacles, but there are special states which the system should not reach. Furthermore, if a cost function (time, distance or energy) is minimized, the optimal motion problem is addressed. Besides solving the basic problem, intelligent vehicles must exhibit optimal behaviour in real scenarios. This imposes an additional goal: the design of efficient motion planning algorithms for autonomous vehicles with restricted computational resources.
Global optimal motion planning of vehicles and robots remains an open problem. Research carried out in pursuit of optimal motion planning has relied on solutions based on two different families of trajectory generation methods.
There are other research techniques that focus on achieving optimal motion planning based on a closed-loop approach but without a mathematical model of the vehicle. These techniques are based on reinforcement learning [10], where the optimal motion is estimated through interaction between the vehicle and its environment. However, the achieved solution depends heavily on the sample period. Other approaches, based on cell mapping techniques, offer an effective numerical means of obtaining global and local solutions to the nonlinear optimal control problem [11–16]. Optimal control problems are very difficult to solve both analytically and numerically, but these approaches provide adequate solutions. Taking into account the advantages associated with reinforcement learning methods and cell mapping techniques, in [15] a new optimal motion planning algorithm called CACM-RL is proposed by the authors.
2. Two-wheeled inverted pendulum
The inverted pendulum problem is a classic problem in control engineering. It is frequently used as a practical exercise in teaching laboratories and has even developed into a practical system of transport, which was first marketed in 2001 and is known as Segway®.
The objective of a Two-Wheeled Inverted Pendulum system (TWIP) is to reach a balanced state. This state corresponds to a vertical position with the body at the top. In this work, we plan to balance and control the position of the body by means of a particular optimal control technique developed in our group.
The TWIP used in this work is built from LEGO® NXT components together with two motors, one gyroscope, and the control software. The gyroscopic sensor can detect rotation using a single-axis gyro based on a quartz resonator. The control software is based on the CACM-RL technique [15]. The TWIP model used in this work is represented in Figure 1.

Two-Wheeled Inverted Pendulum built with LEGO® Mindstorms® NXT 2.0 kit.
The developed control software has been designed in compliance with all constraints and memory limits of the hardware. The platform integrates a time processor unit in order to perform timing control actions independently from the rest of the operations; a gyroscope and an encoder are the only sensing systems used to monitor the state variables. The state variables used in this work are X = {x1, x2, x3, x4}, defined in Table 1.
State variables
The angular velocity x2 is obtained by reading directly from the gyroscope, and the linear velocity x4 from the position given by the encoders coupled to the motors. x1 is obtained by integrating the angular velocity. Finally, x3 is updated according to the information obtained from the encoders. The control actions used for acting on the system are the voltages applied to the motors.
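As a minimal sketch of the state update described above, the following Python function integrates the gyro rate for x1 and differentiates the encoder position for x4. The control period DT and all numeric values are illustrative assumptions, not the values used on the real NXT platform.

```python
DT = 0.004  # hypothetical control period in seconds (illustrative)

def update_state(x, gyro_rate, encoder_pos, prev_encoder_pos):
    """Update the TWIP state vector X = {x1, x2, x3, x4}.

    x1: body angle (integrated from the gyro rate)
    x2: body angular velocity (read directly from the gyro)
    x3: position (from the wheel encoders)
    x4: linear velocity (differentiated from the encoder position)
    """
    x1, x2, x3, x4 = x
    x2 = gyro_rate                               # direct gyro reading
    x1 = x1 + x2 * DT                            # numerical integration of the rate
    x4 = (encoder_pos - prev_encoder_pos) / DT   # velocity from the encoder delta
    x3 = encoder_pos                             # position from the encoder
    return (x1, x2, x3, x4)
```

For example, with a gyro reading of 0.5 rad/s and an encoder advance of 2 mm over one period, `update_state((0, 0, 0, 0), 0.5, 0.002, 0.0)` returns the updated four-variable state.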
3. Optimal control techniques
Optimal control techniques can be divided into two classes: (1) those based on a dynamic model of the system and its environment, and (2) those where a model of the system is not available. Cell mapping techniques applied to the control of unstable and nonlinear systems are considered in this study. These techniques need a mathematical model of the system in order to derive an optimal control law. The second class, however, requires techniques that learn from experience, so that once sufficient knowledge of the system has been obtained, an optimal control law can be derived. These techniques are called reinforcement learning methods. They are widely applicable when computationally feasible and robust on unstable and non-linear systems. There are two types of these learning methods: indirect and direct. An indirect method relies on a system identification procedure to create an explicit model of the controlled system, from which the control rule is determined. Direct methods determine the control rule without a model of the system.
The new proposed algorithm for solving the optimal control problem in a TWIP system is based on a particular case of a cell mapping technique applied to the control of unstable and nonlinear systems, combined with another technique based on the direct method of reinforcement learning.
Cell mapping techniques
Cell mapping techniques combine, on the one hand, an efficient application of numerical methods in order to integrate non-linear (even unstable) systems and, on the other, Bellman's Principle of Optimality to find the optimal control efficiently. The result is a new optimal control method based on a cell state space, able to design efficient optimal controllers for highly unstable and nonlinear systems. The design method of controllers based on cell mapping techniques [14] can be divided into two stages:
Obtain a family of cell-to-cell mappings that constitutes the necessary knowledge to calculate the optimal control laws associated with the states of the system.
Search through the optimal control laws, taking into account the Principle of Optimality (intelligent search techniques). The process finishes when an Optimal Control Table (OCT) is found, which acts as a controller.
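The second stage can be sketched as a backward search over the stage-1 knowledge. The following Python fragment is an illustrative sketch, not the authors' implementation: it assumes the cell-to-cell mappings are stored as a dictionary `{(cell, action): (image_cell, cost)}` and applies Bellman's Principle of Optimality with a Dijkstra-style search to fill the OCT.

```python
import heapq

def build_oct(transitions, goal):
    """Search the family of cell-to-cell mappings (stage 1's output) to
    produce an Optimal Control Table: {cell: (best_action, cost_to_goal)}."""
    # Invert the mapping: which (cell, action, cost) triples lead into each cell?
    incoming = {}
    for (cell, action), (image, cost) in transitions.items():
        incoming.setdefault(image, []).append((cell, action, cost))

    oct_table = {goal: (None, 0.0)}
    frontier = [(0.0, goal)]                 # backward search from the goal cell
    while frontier:
        cost_to_goal, cell = heapq.heappop(frontier)
        if cost_to_goal > oct_table[cell][1]:
            continue                         # stale queue entry
        for prev, action, step_cost in incoming.get(cell, []):
            candidate = cost_to_goal + step_cost
            if prev not in oct_table or candidate < oct_table[prev][1]:
                oct_table[prev] = (action, candidate)  # better action found
                heapq.heappush(frontier, (candidate, prev))
    return oct_table
```

By the Principle of Optimality, each cell ends up storing the first action of an optimal path to the goal, which is exactly what the OCT needs to act as a controller.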
Cell-to-cell mapping methods are based on a discretization of the state variables of the system, defining a partition of the state space into cells. A cell-to-cell mapping can be derived from the dynamic evolution of the system. In [14], the Control Adjoining Cell Mapping (CACM) algorithm for optimal control of highly nonlinear systems is proposed. This method is based on the Adjoining Cell Mapping (ACM) technique, whose central concept is the creation of a cell mapping where only transitions between adjoining cells are allowed.
The adjoining property states that the distance between any cell and its map (image) is equal to some predefined integer value k.
The integration time of each transition is determined adaptively to enforce the adjoining property. This property provides substantial improvements with respect to other optimal control techniques, since the CACM algorithm only computes the meaningful information required to obtain a good approximation to the optimal solution. Sometimes it may happen that the transition never maps into a cell at the required distance k.
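The adaptive integration just described can be sketched in one dimension as follows. This is an illustrative sketch under stated assumptions, not the CACM implementation: `f` stands in for the system dynamics, a plain Euler step is used, and the cell grid is a uniform 1-D partition; a transition that never reaches a cell at distance k within `t_max` is reported as unresolved.

```python
def adjoining_map(x0, u, cell_size, k, f, dt=0.001, t_max=1.0):
    """Integrate x' = f(x, u) from x0 until the image cell is exactly k cells
    away from the origin cell (the adjoining property).

    Returns (image_cell, elapsed_time), or (None, None) if no cell at
    distance k is reached within t_max.
    """
    origin_cell = int(x0 // cell_size)
    x, t = x0, 0.0
    while t < t_max:
        x = x + f(x, u) * dt                 # simple Euler integration step
        t += dt
        image_cell = int(x // cell_size)
        if abs(image_cell - origin_cell) == k:
            return image_cell, t             # adjoining property satisfied
    return None, None                        # transition never resolved
```

Because the elapsed time is returned with the image cell, each transition carries its own (adaptive) duration rather than a fixed sampling period.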
When building the OCT, a cost function is defined to indicate the cost for a control action to map a cell to its image. The cost can be defined in terms of time, energy, distance or any other factor. Since a cost function is specific for a cell, it can be used as a local performance measure for controller evaluation. An overall performance measure for a controller can be generated from the local measures by simply averaging all the costs on all controllable cells.
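The overall performance measure described above reduces to averaging the local costs over the controllable cells. A minimal sketch, assuming the hypothetical OCT layout `{cell: (action, cost)}` where uncontrollable or goal cells carry `None` as their action:

```python
def controller_performance(oct_table):
    """Overall performance measure for a controller: the average of the local
    cost-function values over all controllable cells (cells that have a
    defined optimal action)."""
    costs = [cost for action, cost in oct_table.values() if action is not None]
    return sum(costs) / len(costs) if costs else float("inf")
```

A lower average means the controller reaches the goal more cheaply (in time, energy or distance, depending on the chosen cost function).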
Reinforcement Learning
Reinforcement learning methods only require a scalar reward (or punishment) to learn to map situations (states) to actions [10]. They only need to interact with the environment in order to learn from experience. The knowledge is saved in a look-up table that contains an estimation of the accumulated reward for reaching the goal from each state under the applied policy. The objective is to find the actions (policies) that maximize the accumulated reward in each state.
The convergence of this algorithm towards the optimal policy is achieved through the classical Q-learning update rule [10]:

Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]

where α is the learning rate, γ is the discount factor, r is the immediate reward, and s′ is the state reached after applying action a in state s.
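The look-up-table update can be sketched in a few lines of Python. This is an illustrative sketch of the standard tabular Q-learning step, not the authors' code; the table is a dictionary keyed by (state, action), and the values of alpha and gamma are assumptions.

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One step of the tabular Q-learning rule.

    Q is a dict {(state, action): value}; alpha is the learning rate and
    gamma the discount factor (both values here are illustrative).
    """
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)  # max over a'
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)      # update rule
    return Q[(s, a)]
```

Repeated over enough interaction with the environment, the table converges so that the greedy action in each state is the optimal policy.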
CACM-RL
As described above, cell mapping techniques combine an efficient application of numerical methods for integrating nonlinear systems with Bellman's Principle of Optimality, resulting in an efficient optimal control method for nonlinear systems. Cell-to-cell mapping methods discretize the state variables of the system, defining a partition of the state space into cells, from which a cell-to-cell mapping can be derived. In [12–14], solutions based on cell mapping techniques for the design of optimal controllers are proposed. In particular, [14] implements the Control Adjoining Cell Mapping (CACM) method, which creates a cell mapping where only transitions between adjoining cells are allowed.
It is necessary to define a control vector that can only take a finite number of values, Nu. The control is assumed to be kept constant during a time interval, t. When an action is applied to the system during such an interval, the system may move to a new state, remain in the same state or go to the drain (out of the state space). The knowledge of the system is given by the set of transitions from the different origin states. The maximum size of this set is Nc × Nu, where Nc is the total number of cells or states.
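The transition set bounded by Nc × Nu can be sketched as a small data structure. This is an illustrative sketch (class and method names are invented for the example), showing how each (origin cell, action) pair stores at most one observed image cell, which caps the knowledge base at Nc × Nu entries.

```python
class TransitionSet:
    """Knowledge base sketch: one entry per (origin cell, action) pair, so the
    total size is bounded by Nc * Nu (total cells times control values)."""

    def __init__(self, nc, nu):
        self.nc, self.nu = nc, nu
        self.table = {}                       # (cell, action_index) -> image cell

    def record(self, cell, action, image):
        assert 0 <= cell < self.nc and 0 <= action < self.nu
        self.table[(cell, action)] = image    # later observations overwrite

    def coverage(self):
        """Fraction of the maximum Nc * Nu transitions already learned."""
        return len(self.table) / (self.nc * self.nu)
```

The `coverage` ratio gives a direct measure of how much of the system's dynamics has been explored so far.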
As stated in Section 3, reinforcement learning methods learn from a scalar reward obtained through interaction with the environment [10], storing in a look-up table an estimation of the accumulated reward for each state and applied policy, with the objective of finding the actions (policies) that maximize that reward.
The new algorithm proposed by the authors in [15], CACM-RL, combines the cell mapping techniques and the reinforcement learning approach into a single efficient optimal control algorithm.
CACM-RL deals with different data structures to store the partial results of the learning process. These structures are described in Table 2, below.
CACM-RL algorithm
During the learning phase, we must pay special attention to the generic transitions.
A generic transition is defined as one that occurs within the considered state space and whose final state is different from its origin state.
4. New optimal control scheme
In classic control, the controller usually acts on the real platform through a communication bus: it knows the state of the system thanks to the information provided by the sensors, and it commands the actuators in order to modify that state. When the new optimal control scheme operates simultaneously with the classic control system, the CACM-RL technique simply captures the control actions being applied by the classic controller and annotates the state and the evolution of the system. In this way, it autonomously learns the dynamics of the platform, working towards an autonomous optimal controller.
In this work the classic controller is a PID, which acts on the real platform in a nominal way. CACM-RL incorporates a ‘transitions register’ module, which stores in a database the new states reached and the control actions applied to the real platform in order to learn its dynamics. As the knowledge database is filled, the system's capacity for planning optimal control actions increases. At the bottom of Figure 2, we can see that learning grows until the system reaches the balanced state. The goal is to derive an optimal control policy from the learned dynamics model, which is stored in the knowledge database. The key issue in this work is how to achieve the optimal policy in CACM-RL from a nominal operation of the PID controller. Through the controller alone, however, we cannot reach all desired transitions, and therefore some gaps remain in the knowledge database.

Real time optimal controller operation in learning process.
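The passive operation of the ‘transitions register’ can be sketched as a loop in which the PID acts nominally while CACM-RL only watches. This is an illustrative sketch: `plant_step`, `pid` and `quantize` are hypothetical stand-ins for the real platform, the classic controller and the cell discretization.

```python
def learn_while_pid_controls(plant_step, pid, x0, n_steps, quantize):
    """Passive learning sketch: the PID drives the plant while the
    'transitions register' stores each quantized (state, action) -> next-state
    transition in the knowledge database."""
    register = {}
    x = x0
    for _ in range(n_steps):
        u = pid(x)                            # action chosen by the classic PID
        x_next = plant_step(x, u)             # plant evolves; CACM-RL only watches
        register[(quantize(x), quantize(u))] = quantize(x_next)
        x = x_next
    return register
```

Note that only the transitions the PID happens to provoke are recorded, which is exactly why gaps can remain in the knowledge database.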
With this implementation, as soon as the system begins to acquire knowledge of the dynamics, a control policy is developed. In this way, the control actions suggested by CACM-RL become available in real time to be applied. At the beginning, these actions are similar to the ones provided by the PID controller, but with experience they become more efficient, approaching the optimal ones. In this context, it is possible to switch to the new optimal control scheme as the main controller while continuing to learn from experience. It is also possible to introduce an internal agent that decides when to switch to the optimal controller or return to the classic control.
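The internal switching agent mentioned above can be sketched as a simple coverage check. This is an illustrative sketch under assumed data layouts (OCT as `{cell: (action, cost)}`), not the authors' implementation: the learned action is used when the current cell is already covered by the knowledge database, and the PID action otherwise.

```python
def choose_controller(oct_table, cell, pid_action):
    """Switching agent sketch: prefer the learned optimal action when the
    current cell is covered; otherwise fall back to the classic PID action."""
    entry = oct_table.get(cell)
    if entry is not None and entry[0] is not None:
        return entry[0], "optimal"            # learned controller takes over
    return pid_action, "pid"                  # gap in knowledge: keep the PID
```

More elaborate agents could also switch back to the PID when the system leaves the well-learned region of the state space.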
The learning process is summarized in Figure 2 considering the inverted pendulum as an example and the balancing disturbance as the cost function to optimize.
5. Results
In this section, a characterization of the optimal control performed on the TWIP is analysed. Figure 3 represents the balanced motion of the TWIP: the movements (forwards and backwards) of the robot for reaching the balanced state at the same point, x0. The five steps shown in Figure 3 have been obtained by sampling and dumping the state variables specified in Table 1. Together, these steps form a full control cycle, i.e. the period of one oscillation of the inverted pendulum.

Balanced motion of the TWIP.
The objective of this test is to demonstrate how a non-linear and unstable system can be controlled by the new proposed optimal control scheme. This experience can be extrapolated to any non-linear and unstable system that must be controlled in an optimal way, such as Unmanned Vehicles (UVs) or spacecraft.
The simplicity of the hardware together with the complexity of the mathematical model of the TWIP are the reasons why this kind of system has been chosen for verifying the feasibility of the new optimal control scheme.
In Figure 4, the non-linear variables (φ, φ′) are traced over 8 seconds; a window is highlighted to show one control cycle, which is represented in more detail in Figure 5. The top graphic of Figure 5 shows the control actions delivered by the PID controller (blue line) and by the optimal controller (red line). In the red line, two effects can be identified, related to response time and energy: control actions equivalent to those of the classic controller are performed, but with an improved response time and an associated energy reduction. The evolution of the state variables during a control cycle is shown in the bottom graphic of Figure 5.

TWIP Balancing control cycle.

Comparison between the PID control solution and the new optimal control scheme. The top graphic represents a control cycle while the bottom graphic shows the evolution of the state variables.
It is important to highlight the evolution of the non-linear variables (φ, φ′) in the balanced motion of the inverted pendulum: when the pendulum reaches either limit of the oscillation, φ′ = 0, as in a simple pendulum; at the balanced point, φ = 0 and φ′ reaches its maximum value.
6. Conclusions and future work
In this work, the new optimal control scheme based on the CACM-RL technique has been demonstrated to be a feasible complete solution to control non-linear and unstable systems such as the TWIP, without using any mathematical model of the robot. The new scheme is able to operate along with classic control techniques or be completely autonomous. When sharing control with classic techniques, the new scheme learns the dynamics of the system by watching the evolution of the robot from the control actions imposed by the classic solution. In the second case, it can learn by acting directly on the system. This way, it performs an optimal control in order to lead the system to a specific goal (equilibrium condition).
From the results presented in the previous section, we can conclude that the new optimal control scheme is a suitable and efficient solution for unstable systems: the performed control actions are optimal and the balanced state can be reached faster than with the classic method. Furthermore, CACM-RL has the advantage of enabling learning not only near the equilibrium, but also in areas far from it. It must be highlighted that while the PID controller is acting on the TWIP, CACM-RL keeps learning. For this reason, any external disturbance (within specific limits) that pushes the TWIP out of equilibrium generates new transitions: the system moves to new states and CACM-RL increases its knowledge. In this case, we can say that the TWIP has learnt in a non-linear zone.
Figure 6 shows the CACM-RL response when the system is balanced. Comparing the amplitude of the falling angle φ with Figure 4 (obtained with the PID controller), we can see that the oscillation has been reduced. Likewise, the angular velocity φ′ is slower than in Figure 4. The NxtPower variable, in the background, shows the control action that acts on the TWIP while balancing.

Optimal control with CACM-RL in balanced situation.
In this experiment, the inverted pendulum has been demonstrated to be a simple platform with a very complex mathematical model, which has allowed us to verify the feasibility of the proposed technique based on CACM-RL. The behaviour of the TWIP has been sampled in real time while coexisting with a PID control technique. This has been done using two strategies: 1) learning while the PID controller acts nominally on the platform, and 2) acting directly on the system as an autonomous optimal controller.
7. Acknowledgment
This work has been supported by the MICINN under the grant AYA2011-29727-C02-02 and by LEGO-Electricbricks, who provided the prototype and the development environment.
