Nonlinear system control using a fuzzy cerebellar model articulation controller involving reinforcement-strategy-based bacterial foraging optimization

Abstract

This article proposes a fuzzy cerebellar model articulation controller with reinforcement-strategy-based modified bacterial foraging optimization for solving the cart-pole balancing control problem. The proposed reinforcement-strategy-based modified bacterial foraging optimization is used to adjust the parameters of fuzzy receptive field functions and fuzzy weights for improving the accuracy of the fuzzy cerebellar model articulation controller output. An efficient strategic approach is applied in the chemotaxis step in the traditional bacterial foraging optimization algorithm. In the approach, each virtual bacterium swims for different run lengths and increases the bacterial diversity. Experimental results are presented to show the performance and effectiveness of the proposed reinforcement-strategy-based modified bacterial foraging optimization method.

Keywords

Control, reinforcement learning, bacterial foraging optimization, strategy method, cerebellar model articulation controller, fuzzy set

Introduction

The most well-known supervised learning algorithm is back-propagation (BP).^1–3 Because BP uses gradient descent to minimize the cost function, it may become trapped in a local minimum instead of reaching the global optimum. In addition, the initial values of the system parameters strongly influence BP training performance. A supervised learning algorithm performs efficiently when precise training data are available, but this is often not the case in the real world. When the training data are rough and coarse, only “evaluative” rather than “instructive” feedback is available. Training a network with evaluative feedback is called reinforcement learning; the evaluative feedback is a scalar called the reinforcement signal. Because exact training data may be expensive to obtain or even unobtainable, reinforcement learning problems have been widely discussed.^4–8 Barto et al.⁴ used neuron-like adaptive elements to solve difficult learning control problems with only reinforcement signal feedback. Their architecture is called the actor-critic (adaptive heuristic critic) architecture. Berenji and Khedkar⁵ proposed a generalized approximate reasoning-based intelligent control architecture for learning and tuning a fuzzy controller based on reinforcement signals from a dynamical system. The architecture includes a priori control knowledge of expert operators in terms of fuzzy control rules. Lin and Lee⁶ also proposed a reinforcement neural-network-based fuzzy logic control system (RNN-FLCS) for solving various reinforcement learning problems. The RNN-FLCS can find a proper network structure and parameters simultaneously and dynamically. Recently, Anderson et al.⁷ demonstrated that learning a predictive model of state dynamics can result in a pretrained hidden layer structure that reduces the time needed to solve reinforcement learning problems. Lewis et al.⁸ described reinforcement learning methods that can be used to design new types of adaptive controllers that converge to optimal control solutions online in real time by measuring data along the system trajectories. Their method was applied to a power system for optimal adaptive control. The architectures of the above-mentioned methods^4–8 require an extra critic network in addition to the action network.

Recently, evolutionary algorithms, which simulate natural evolutionary processes on the basis of Darwinian principles, have been widely applied to various fields.^9–13 The genetic algorithm (GA),^9,10 a popular evolutionary algorithm, was proposed first. Inspired by animal social behavior, such as the flocking of birds and schooling of fish, as well as swarm theory, researchers have proposed several further algorithms, including particle swarm optimization (PSO),¹¹ differential evolution (DE),¹² bacterial foraging optimization (BFO),¹³ and ant colony optimization (ACO).¹⁴ Because these algorithms search heuristically and stochastically, they are less likely to become trapped in a local minimum. The behavior of the individuals in these algorithms resembles certain biological phenomena, and identifying the features common to the two has driven the development of evolutionary computation. In this study, we focused on BFO, which was inspired by the foraging behavior of Escherichia coli in the intestines. BFO has a parallelizable searching ability. However, for complex optimization problems, the traditional BFO does not easily yield optimal solutions, and its performance depends on the run length of chemotaxis. To overcome this problem, several researchers have proposed adaptive run-length strategies^15–18 and examined the influence of the bacterial run length on the results. The current study therefore focuses on the chemotaxis step and adopts a strategy method to improve the BFO algorithm.

Recently, many researchers^19–21 have used evolutionary algorithms for reinforcement learning. Lin and Jou²⁰ proposed GA-based fuzzy reinforcement learning for controlling magnetic bearing systems. Juang et al.²¹ presented genetic reinforcement learning for designing fuzzy controllers. The GA adopted in Juang et al.,²¹ which is based on traditional symbiotic evolution, complements the local mapping property of a fuzzy rule when applied to fuzzy controller design. Li et al.²² proposed an efficient navigation control method for a mobile robot. According to the relative position between the mobile robot and the environment, a behavior manager switches between toward-goal behavior and wall-following behavior. A novel recurrent fuzzy cerebellar model articulation controller (CMAC) based on a reinforcement improved dynamic artificial bee colony was proposed for performing the wall-following control of the mobile robot.

This study proposes a fuzzy cerebellar model articulation controller (FCMAC) with reinforcement-strategy-based modified bacterial foraging optimization (R-SMBFO) for solving the cart-pole balancing control problem. The proposed R-SMBFO is used to adjust the parameters of fuzzy receptive field functions and fuzzy weights for improving the accuracy of the FCMAC output. An efficient strategic approach was applied to the chemotaxis step in the traditional BFO algorithm. In the strategic approach, each virtual bacterium swims for different run lengths and increases the bacterial diversity.

The remainder of this article is organized as follows: Section “FCMAC” introduces the proposed FCMAC and section “Reinforcement bacterial foraging optimization” describes the proposed R-SMBFO learning algorithm. The control of the cart-pole balancing system is addressed in section “Control of a cart-pole balancing system.” Finally, Section “Conclusion” concludes the paper.

FCMAC

The traditional CMAC, which imitates the function and structure of the human cerebellum, is a popular neural network model.² The input space of a CMAC network is quantized into discrete states, and overlapping areas are called hypercubes. Each hypercube covers many discrete states and is assigned a memory cell for storing information. For an input state, only a few hypercubes are activated, and they contribute to the corresponding network output. The superiority of the CMAC model lies in its quick learning speed, high convergence rate, and simple hardware implementation. However, the model has certain drawbacks, which include substantial computing memory requirements and relatively poor function approximation ability. Fuzzy modeling is recognized as a powerful tool for developing models from various sources. The FCMAC overcomes the disadvantages of the traditional CMAC model using fuzzy membership functions.^23,24

Albus’ CMAC model has three major limitations: the selection of the memory-structure parameters is difficult, a rigorous theory is required for function approximation, and substantial memory is required for solving high-dimensional problems. Therefore, to overcome these limitations, an efficient FCMAC model is proposed here. The structure of the FCMAC model is shown in Figure 1.

Figure 1.

The structure of the FCMAC model.

In the proposed FCMAC model, a Gaussian function is used as the fuzzy receptive field function and the fuzzy weight function for learning. Some learned information is stored in the fuzzy receptive field function and fuzzy weight function, represented using the mean and variance of Gaussian functions. Similar to the traditional CMAC model, the proposed FCMAC model approximates a nonlinear function y = f(x) using two primary mapping functions

S : X \Rightarrow A

(1)

P : A \Rightarrow D

(2)

where X, A, and D are an s-dimensional input space, an N_L-dimensional association space, and a one-dimensional output space, respectively. These two mapping functions are realized through fuzzy operations. The function S(x) maps each state x in the input space X onto an association memory selection vector $α = (α_{1}, α_{2}, \dots, α_{N_{e}}) \in A$ that has N_e nonzero elements (N_e < N_L) and satisfies the condition $0 \leq α_{i} \leq 1$ for all components in α. Furthermore, the function P(α) calculates the crisp value y by mapping the association memory selection vector onto the adjustable fuzzy weights. Then the centroid method is used to defuzzify the partial fuzzy output into a scalar output.

The proposed FCMAC model consists of three layers. Layer 1 is called the input layer. The fuzzy receptive field function uses the Gaussian function

μ (x) = e^{- {((x - m) / σ)}^{2}}

(3)

where x is the input state and m and σ are the center and variance of the fuzzy receptive field function, respectively.

An N_D-dimensional problem is considered in layer 2, and the N_D-dimension Gaussian function is expressed as follows

α_{j} = \prod_{i = 1}^{N_{D}} e^{- {((x_{i} - m_{ij}) / σ_{ij})}^{2}}

(4)

where $α_{j}$ is the jth element of the selection vector for the association memory, ∏ is the product operator, x_i is the input value in the ith dimension, and $m_{ij}$ and $σ_{ij}$ are the center and variance of the receptive field function in the ith dimension corresponding to the jth element of the selection vector.

In Layer 3, the value of the association memory selection vector corresponding to each fuzzy weight is treated as the input matching degree to produce a partial fuzzy output. The Gaussian basis function used in this study is combined with the centroid defuzzification method, which is used to defuzzify the partial fuzzy output into a scalar output. For 2D functions, the crisp output y is derived as follows

y = \frac{\sum_{j = 1}^{N_{L}} α_{j} w_{j}^{m} w_{j}^{σ}}{\sum_{j = 1}^{N_{L}} α_{j} w_{j}^{σ}}

(5)

where $N_{L}$ represents the number of hypercube cells and $w_{j}^{m}$ and $w_{j}^{σ}$ are the mean and variance of the fuzzy weights of association memory j, respectively.
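The forward computation in equations (3)–(5) can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the authors' implementation: the function name `fcmac_output`, the array layout (inputs along rows, hypercube cells along columns), and the use of NumPy are choices made here for illustration only.

```python
import numpy as np

def fcmac_output(x, m, sigma, w_mean, w_sigma):
    """Return the crisp output y and the selection vector alpha.

    x       : (N_D,)      input state
    m       : (N_D, N_L)  receptive-field centers m_ij
    sigma   : (N_D, N_L)  receptive-field widths sigma_ij (assumed nonzero)
    w_mean  : (N_L,)      fuzzy-weight means w_j^m
    w_sigma : (N_L,)      fuzzy-weight variances w_j^sigma
    """
    x = np.asarray(x, dtype=float)
    # Equation (4): the product of Gaussians over the input dimensions
    # equals the exponential of the negative sum of squared terms.
    z = (x[:, None] - m) / sigma
    alpha = np.exp(-np.sum(z ** 2, axis=0))
    # Equation (5): centroid defuzzification over the N_L cells.
    numerator = np.sum(alpha * w_mean * w_sigma)
    denominator = np.sum(alpha * w_sigma)
    return numerator / denominator, alpha

# Illustrative values only: 2 inputs, 3 hypercube cells.
centers = np.zeros((2, 3))
widths = np.ones((2, 3))
y, alpha = fcmac_output([0.0, 0.0], centers, widths,
                        np.array([1.0, 2.0, 3.0]), np.ones(3))
```

With the input placed exactly on every center, all matching degrees equal 1, so the output reduces to the average of the weight means when the weight variances are equal.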

Reinforcement bacterial foraging optimization

Proposed R-SMBFO

In this study, a novel R-SMBFO method is proposed. A strategy method is adopted in the chemotaxis step of the traditional BFO algorithm to reduce the long execution time required for multidimensional problems and to avoid becoming trapped in local optimal solutions.

A bacterium represents a solution of the optimization problem, and it can be expressed as a D-dimensional vector, $θ^{i} = [θ_{1}^{i}, θ_{2}^{i}, \dots, θ_{D}^{i}]$ , $i = 1, 2, \dots, S$ , where S represents the bacterial population size. The formula of the chemotaxis step is as follows

θ^{i} (j + 1, k, l) = θ^{i} (j, k, l) + C_{i} \times \frac{Δ (i)}{\sqrt{Δ^{T} (i) Δ (i)}}

(6)

where $θ^{i} (j, k, l)$ denotes the position of the ith bacterium in the jth chemotaxis, kth reproduction, and lth elimination–dispersal steps, $C_{i}$ represents the run length of the bacterial tumbling movement, and $Δ (i)$ , whose components have a value between −1 and 1, is the direction of random search.
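As an illustration of equation (6), the following sketch moves one bacterium by the run length along a normalized random direction. The function name `tumble_step` and the uniform sampling of the direction components in [−1, 1] are assumptions made for this example.

```python
import numpy as np

def tumble_step(theta, run_length, rng):
    """Equation (6): step of length run_length along a random unit direction."""
    # Random direction Delta(i) with components between -1 and 1.
    delta = rng.uniform(-1.0, 1.0, size=theta.shape)
    norm = np.sqrt(delta @ delta)
    if norm == 0.0:
        # Practically impossible with continuous sampling; avoids division by zero.
        return theta.copy()
    return theta + run_length * delta / norm

rng = np.random.default_rng(0)
theta = np.zeros(4)
new_theta = tumble_step(theta, 0.1, rng)
```

Because the direction is normalized, the displacement always has length equal to the run length, whatever the dimension D of the position vector.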

The bacterial population can be expressed as $P (j, k, l) = \{θ^{i} (j, k, l) | i = 1, 2, \dots, S\}$ , and the cell-to-cell signaling that makes the bacteria cluster is represented by the following equation

\begin{matrix} J_{cc} (θ, P (j, k, l)) = \sum_{i = 1}^{S} J_{cc} (θ, θ^{i} (j, k, l)) = \\ \sum_{i = 1}^{S} [- d_{attractant} \exp (- w_{attractant} \sum_{n = 1}^{D} {(θ_{n} - θ_{n}^{i})}^{2})] + \\ \sum_{i = 1}^{S} [h_{repellant} \exp (- w_{repellant} \sum_{n = 1}^{D} {(θ_{n} - θ_{n}^{i})}^{2})] \end{matrix}

(7)

where $d_{attractant}$ and $w_{attractant}$ are the coefficients of the attractant term, $h_{repellant}$ and $w_{repellant}$ are the coefficients of the repellent term, and J_cc is the value added to the fitness function to account for this signaling. The new fitness function of each bacterium can be expressed as follows

J (i, j + 1, k, l) = J (i, j, k, l) + J_{cc} (θ^{i} (j + 1, k, l), P (j + 1, k, l))

(8)
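Equations (7) and (8) can be sketched as follows. The function names are hypothetical, and the default coefficient values are taken from Table 2; the code is an illustration of the formulas, not the authors' implementation.

```python
import numpy as np

def cell_to_cell(theta, population, d_att=0.1, w_att=0.2, h_rep=0.1, w_rep=10.0):
    """Equation (7): attractant plus repellent signal felt at position theta.

    theta      : (D,)   position being evaluated
    population : (S, D) positions of all bacteria
    """
    # Squared Euclidean distance from theta to every bacterium.
    sq_dist = np.sum((population - theta) ** 2, axis=1)
    attract = -d_att * np.exp(-w_att * sq_dist)
    repel = h_rep * np.exp(-w_rep * sq_dist)
    return float(np.sum(attract + repel))

def updated_fitness(j_value, theta, population):
    """Equation (8): add the signaling term to the fitness value."""
    return j_value + cell_to_cell(theta, population)

# Illustrative population of three bacteria in two dimensions.
population = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
j_new = updated_fitness(5.0, population[0], population)
```

With these coefficients the repellent term decays much faster with distance than the attractant term, so bacteria at moderate distances contribute a net negative (attractive) value.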

In Oentaryo et al.,²⁵ the solution of the optimization problem depends on the bacterial run length. In the strategy method, the previous and current fitness values of each bacterium are stored and compared. If the current fitness value is better than the previous one, a “+” sign is recorded. If it is worse, a “−” sign is recorded. Otherwise, the two values are equally good and an “=” sign is recorded. The status of a bacterium is obtained from three consecutive fitness values and is a combination of two consecutive signs drawn from the three signs above (i.e. “+,” “−,” and “=”). For example, the status (= +) denotes that the fitness values $J_{j - 2}^{i}$ and $J_{j - 1}^{i}$ of bacterium i are identical and that $J_{j}^{i}$ is better than $J_{j - 1}^{i}$ . Table 1 shows all statuses.

Table 1.

The strategy method.

Case	Status	Strategy method
1	(= +), (+ +)	Moving to the global optimal direction
2	(+ =), (– +)	Moving to the local optimal direction
3	(– –), (= –), (+ –), (– =), (= =)	Moving around itself

Deterioration: –; status quo: =; improvement: +.

As shown in Table 1, the statuses in the proposed strategy method are divided into three cases. The position and run length ( $C_{i}$ ) of bacterium i are adjusted according to the corresponding strategy. In Case 1, a bacterium in the (= +) or (+ +) status has consistently discovered a more advantageous region in which to approach the optimal solution. This bacterium moves toward the global best position ( $θ_{Gbest}$ ) in the current generation. In Case 2, a bacterium in the (+ =) or (– +) status has not continually discovered a more advantageous region, but it has not fallen into a less advantageous one either. Such a bacterium is therefore directed toward both its local best position ( $θ_{Pbest}^{i}$ ) and the global best position ( $θ_{Gbest}$ ). In Case 3, a bacterium in the (– –), (= –), (+ –), (– =), or (= =) status has a fitness value that has fluctuated, worsened, or stayed the same over three consecutive evaluations. Such a bacterium therefore swims in its own random direction.
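The mapping from three consecutive fitness values to a status and a case in Table 1 can be sketched as a small lookup. The function names are hypothetical, and the example assumes that a smaller fitness value is better (as in the initialization step of the pseudocode, where the smallest value is stored as the global best); pass `minimize=False` for the opposite convention. An ASCII hyphen stands in for the “−” sign.

```python
def compare(previous, current, minimize=True):
    """Return '+', '-' or '=' for improvement, deterioration or status quo."""
    if current == previous:
        return "="
    improved = current < previous if minimize else current > previous
    return "+" if improved else "-"

# Table 1: every pair of consecutive signs mapped to its case.
CASE_BY_STATUS = {
    ("=", "+"): 1, ("+", "+"): 1,
    ("+", "="): 2, ("-", "+"): 2,
    ("-", "-"): 3, ("=", "-"): 3, ("+", "-"): 3, ("-", "="): 3, ("=", "="): 3,
}

def strategy_case(j_prev2, j_prev1, j_now, minimize=True):
    """Classify a bacterium from its last three fitness values."""
    status = (compare(j_prev2, j_prev1, minimize),
              compare(j_prev1, j_now, minimize))
    return CASE_BY_STATUS[status], status

case, status = strategy_case(5.0, 5.0, 4.0)
```

The dictionary covers all nine sign combinations, so every bacterium is assigned exactly one case.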

The strategy method in the chemotaxis step of the BFO was adopted. The formulations of the bacterial position and the bacterial run length in different strategies are as follows:

Case 1

C_{i} = \frac{J_{Gbest}}{(J^{i} + J_{Gbest} + J_{Pbest}^{i})}

(9)

θ^{i, m + 1} = θ^{i, m} + C_{i} \times (θ_{Gbest} - θ^{i, m})

(10)

Case 2

C_{i} = \frac{J_{Pbest}^{i}}{(J^{i} + J_{Gbest} + J_{Pbest}^{i})}

(11)

θ^{i, m + 1} = θ^{i, m} + C_{i} \times (θ_{Pbest}^{i} - θ^{i, m}) + C_{i} \times (θ_{Gbest} - θ^{i, m})

(12)

Case 3

C_{i} = \frac{J^{i}}{(J^{i} + J_{Gbest} + J_{Pbest}^{i})}

(13)

θ^{i, m + 1} = θ^{i, m} + C_{i} \times \frac{Δ (i)}{\sqrt{Δ^{T} (i) Δ (i)}}

(14)

where $θ^{i, m}$ denotes the current position of bacterium i, $C_{i}$ is the current bacterial run length, $θ_{Pbest}^{i}$ and $J_{Pbest}^{i}$ are the local best position and local best fitness value of bacterium i, respectively, $J^{i}$ is the current fitness value of bacterium i, and $θ_{Gbest}$ and $J_{Gbest}$ are the best position and best fitness value, respectively, among all bacteria. The bacterial run length is updated on the basis of the current positions of the bacteria, and the formulas for the run length and position differ in each case of the strategy method. In every case, the three fitness values $J^{i}$ , $J_{Gbest}$ , and $J_{Pbest}^{i}$ are summed in the denominator, whereas the fitness value associated with that case is the numerator. In Case 1, the bacterium has found a more advantageous region; it moves toward $θ_{Gbest}$ and uses $J_{Gbest}$ as the numerator in equation (9), thereby shortening $C_{i}$ .
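The three run-length and position updates in equations (9)–(14) can be collected in one function, sketched below. The name `strategy_move` and its argument order are hypothetical, and the sketch assumes that the fitness values are positive so that the denominator is nonzero.

```python
import numpy as np

def strategy_move(case, theta, theta_pbest, theta_gbest, j, j_pbest, j_gbest, rng):
    """Return (new_position, run_length) for one swimming step."""
    total = j + j_gbest + j_pbest          # shared denominator
    if case == 1:                          # equations (9) and (10)
        c = j_gbest / total
        new_theta = theta + c * (theta_gbest - theta)
    elif case == 2:                        # equations (11) and (12)
        c = j_pbest / total
        new_theta = theta + c * (theta_pbest - theta) + c * (theta_gbest - theta)
    elif case == 3:                        # equations (13) and (14)
        c = j / total
        delta = rng.uniform(-1.0, 1.0, size=theta.shape)
        new_theta = theta + c * delta / np.sqrt(delta @ delta)
    else:
        raise ValueError("case must be 1, 2 or 3")
    return new_theta, c

rng = np.random.default_rng(0)
theta = np.array([0.0, 0.0])
pbest = np.array([3.0, 0.0])
gbest = np.array([6.0, 6.0])
moved, c = strategy_move(1, theta, pbest, gbest, j=3.0, j_pbest=2.0, j_gbest=1.0, rng=rng)
```

With J = 3, J_Pbest = 2, and J_Gbest = 1, the run lengths for Cases 1, 2, and 3 are 1/6, 1/3, and 1/2, respectively, which illustrates how the smallest fitness value in the numerator gives the shortest run length.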

Consider the following definitions of variables: $N_{s}$ is the number of swimming steps in the chemotaxis step, $N_{c}$ denotes the number of chemotaxis steps, $N_{re}$ represents the number of reproduction steps, $N_{ed}$ is the number of elimination-dispersal steps, $p_{ed}$ is the elimination-dispersal probability, $C_{i}$ is the run length of each bacterium, $J^{i}$ denotes the fitness function of each bacterium, $J_{Gbest}$ represents the global optimal fitness function among all bacteria, and $J_{Pbest}^{i}$ is the local best fitness function of bacterium i. The pseudocode of the strategy method is as follows:

Step 1: Initialize the parameters S, $C_{i}$ , $N_{s}$ , $N_{c}$ , $N_{re}$ , $N_{ed}$ , and $p_{ed}$ . The positions of the bacteria in the solution space are generated randomly, and the local best position $θ_{Pbest}^{i}$ of each bacterium is stored. The fitness function ( $J_{Pbest}^{i}$ ) is then evaluated, and the smallest fitness value ( $J_{Gbest}$ ) and its position ( $θ_{Gbest}$ ) are stored. Initially, $C_{i}$ is set to 0.1, so the run lengths of all bacteria are equal. The adaptation status $SA_{i}$ of each bacterium is initially marked as “+” to indicate an advantageous search direction.

Step 2: Elimination-dispersal loop: $l = l + 1$ ;

Step 3: Reproduction loop: $k = k + 1$ ;

Step 4: Chemotaxis loop: $j = j + 1$ ;

Step 5: Chemotaxis step:

For bacterium i, $i = 1, 2, \dots, S$

The new position of the bacterium is updated and the fitness function is estimated using equation (6).

For swimming step m, $m = 1, 2, \dots, N_{s}$

According to Cases 1–3 of the strategy method, a strategy is selected and $C_{i}$ is updated using equation (9), (11), or (13).

Compute the new position of the bacterium using equation (10), (12), or (14).

Evaluate the fitness function and update the strategy status $SA_{i}$ for the next swimming step.

Next m

Next i

Step 6: If $j < N_{c}$ , go to Step 4 (the chemotaxis loop).

Step 7: Reproduction: Evaluate the health of the bacteria. $S_{r}$ is half the population size. After the fitness values are sorted, the $S_{r}$ bacteria with the higher fitness values die, and each of the remaining bacteria, which have smaller fitness values, splits into two, keeping the population size constant.

Step 8: If $k < N_{re}$ , go to Step 3. In this case, the specified number of reproduction steps has not been reached, so the chemotaxis loop of the next generation begins.

Step 9: Elimination-dispersal: For each bacterium, a random value is generated; if it is less than the probability $p_{ed}$ , the bacterium is randomly assigned to a new position and its parameter $SA_{i}$ is set to “+.”

If $l < N_{ed}$ , go to Step 2. Otherwise, the algorithm is terminated. In the traditional chemotaxis step, a bacterium stops swimming if its fitness value worsens. By contrast, in the proposed chemotaxis step each bacterium continues the swimming operation regardless of whether its fitness value worsens. In the elimination-dispersal step, if a bacterium is eliminated and randomly dispersed, its adaptation status $SA_{i}$ is reset to “+,” exactly as in the initialization step. The flowchart of the proposed chemotaxis method is presented in Figure 2.
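The reproduction and elimination-dispersal steps (Steps 7 and 9) can be sketched as two small functions. This is an illustration under stated assumptions: the function names are hypothetical, the population size S is assumed even, the health of a bacterium is taken to be the fitness value passed in (smaller is better), and new positions are drawn uniformly from a box [low, high] that the text does not specify.

```python
import numpy as np

def reproduce(positions, fitness):
    """Step 7: the S/2 bacteria with the smallest fitness each split in two."""
    s_r = len(positions) // 2
    order = np.argsort(fitness)            # ascending: best first
    survivors = positions[order[:s_r]]
    # Each survivor is duplicated, so the population size stays S.
    return np.concatenate([survivors, survivors.copy()], axis=0)

def eliminate_disperse(positions, status, p_ed, low, high, rng):
    """Step 9: relocate each bacterium with probability p_ed and reset SA_i."""
    positions = positions.copy()
    status = list(status)
    for i in range(len(positions)):
        if rng.random() < p_ed:
            positions[i] = rng.uniform(low, high, size=positions.shape[1])
            status[i] = "+"                # same value as at initialization
    return positions, status

rng = np.random.default_rng(0)
positions = np.arange(8.0).reshape(4, 2)
children = reproduce(positions, np.array([3.0, 1.0, 2.0, 0.0]))
dispersed, new_status = eliminate_disperse(children, ["-"] * 4, 0.25, -1.0, 1.0, rng)
```

Returning copies rather than modifying the inputs keeps each step easy to test in isolation; an in-place version would work equally well inside the nested loops of Steps 2–4.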

Figure 2.

Flowchart of the proposed chemotaxis step.

Reinforcement learning for FCMAC

In a supervised learning problem, the desired output for each input is provided, whereas a reinforcement learning problem requires only simple “evaluative” or “critical” information for learning; only a small amount of information is available to indicate whether the output is right or wrong. Figure 3 shows the R-SMBFO. In this study, the reinforcement signal indicates whether the control has succeeded or failed.

Figure 3.

Schematic diagram of the R-SMBFO for the FCMAC controller.

The proposed FCMAC, shown in Figure 3, is the control network that initiates appropriate actions according to the current input vector. Whereas the actor–critic architecture of Barto et al.⁴ consists of a control network and a critic network, the proposed method uses only the FCMAC as the control network. Its input is the state of the plant, and its output is the control action for that state, denoted by f. The reinforcement signal is generated only when a failure occurs, and it is the only feedback available to the FCMAC.

Figure 3 shows a schematic diagram of the R-SMBFO for the FCMAC controller. It includes an accumulator, which serves as a relative performance measure by accumulating the number of time steps before a failure occurs. The accumulator thus measures how long the experiment remains in a “success” state, and this feedback is treated as the fitness in the proposed R-SMBFO method; in other words, the accumulator is the indicator of the “fitness” of the current FCMAC. The R-SMBFO therefore uses the number of time steps elapsed before a failure occurs as its fitness function. In the proposed method, no critic network is required as a multistep or single-step predictor.

Figure 4 describes the R-SMBFO method for the FCMAC. The method runs the FCMAC in a feed-forward manner and monitors the environment (plant) until a failure occurs. The fitness is the duration for which the experiment remains in a “success” state, as measured by the aforementioned accumulator. A fitness value is assigned to each string in the population; the higher the fitness value, the better the solution represented by the corresponding string. In this study, the number of time steps before a failure occurred was used to define the fitness function. The R-SMBFO maximizes the fitness value, and the fitness function is given as follows

\mathrm{Fitness\_Function} (i) = \mathrm{TIME\_STEP} (i)

(15)

where TIME_STEP(i) records the number of time steps for which the experiment remains in a “success” state for the ith string in the population. Equation (15) indicates that a larger number of time steps (i.e. keeping the desired control state longer) implies a higher fitness in the R-SMBFO method.
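The accumulator behind equation (15) amounts to counting time steps until the failure signal arrives. The sketch below uses hypothetical callables (`controller`, `step`, `failed`) standing in for the FCMAC, the plant, and the failure test; the cap of 100,000 steps matches the success criterion used in the cart-pole experiment.

```python
def time_steps_until_failure(controller, step, failed, state, max_steps=100_000):
    """Equation (15): count the time steps spent in the 'success' state."""
    for t in range(max_steps):
        if failed(state):
            return t                      # reinforcement signal: failure
        state = step(state, controller(state))
    return max_steps                      # never failed within the cap

# Toy plant for illustration only: the state is a number that fails at 5.
steps = time_steps_until_failure(
    controller=lambda s: 1,               # always push by +1
    step=lambda s, action: s + action,
    failed=lambda s: s >= 5,
    state=0,
)
```

Each candidate parameter vector is scored by one such rollout, so no critic network or desired output is needed, only the failure test.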

Figure 4.

Flowchart of the proposed reinforcement learning method.

Control of a cart-pole balancing system

This section discusses the control of a cart-pole balancing system,⁴ which was used to evaluate the FCMAC with the R-SMBFO method. The experiment was run on a computer with a 3.2 GHz Pentium(R) 4 processor and 1 GB of memory, and the simulation was written in Visual C++ 6.0. The initial parameters of the R-SMBFO before training are presented in Table 2.

Table 2.

Initial parameters before training.

Parameter	Value
S	30
$N_{s}$	2
$N_{c}$	5
$N_{re}$	2
$N_{ed}$	1
$p_{ed}$	0.25
$d_{attractant}$	0.1
$w_{attractant}$	0.2
$w_{repellant}$	10
$h_{repellant}$	0.1

The classical cart-pole balancing control problem was examined in this study by applying the proposed R-SMBFO. As shown in Figure 5, the cart-pole balancing problem involves learning how to keep a pole upright on a moving cart. Both the cart and the pole move only within the vertical plane, and each of them has 1 degree of freedom. The parameters $θ$ and $\overset{\cdot}{θ}$ represent the angle and angular velocity of the pole, respectively, and x and $\overset{\cdot}{x}$ represent the horizontal position and velocity of the cart, respectively. Furthermore, f is the force (in newtons) applied to the cart and is the control action that moves the cart toward the left or right. Two failure states are defined as follows: (1) the pole falls past ±12° and (2) the cart reaches either bound of the track, 2.4 m from the center. The objective is to determine a sequence of forces that, when applied to the cart, keeps the pole upright. The first-order numerical (i.e. Euler’s method) update equations of the cart-pole balancing system are as follows

θ (t + 1) = θ (t) + Δ \overset{\cdot}{θ} (t)

(16)

\begin{matrix} \overset{\cdot}{θ} (t + 1) = \overset{\cdot}{θ} (t) + Δ [ \frac{(m + m_{p}) g \sin θ (t)}{(4 / 3) (m + m_{p}) l - m_{p} l \cos^{2} θ (t)} \\ - \frac{\cos θ (t) [f (t) + m_{p} l \overset{\cdot}{θ} {(t)}^{2} \sin θ (t) - μ_{c} \operatorname{sgn} (\overset{\cdot}{x} (t))]}{(4 / 3) (m + m_{p}) l - m_{p} l \cos^{2} θ (t)} \\ - \frac{\frac{μ_{p} (m + m_{p}) \overset{\cdot}{θ} (t)}{m_{p} l}}{(4 / 3) (m + m_{p}) l - m_{p} l \cos^{2} θ (t)} ] \end{matrix}

(17)

x (t + 1) = x (t) + Δ \overset{\cdot}{x} (t)

(18)

\begin{matrix} \overset{\cdot}{x} (t + 1) = \overset{\cdot}{x} (t) \\ + Δ [ \frac{f (t) + m_{p} l [\overset{\cdot}{θ} {(t)}^{2} \sin θ (t) - \overset{\cdot\cdot}{θ} (t) \cos θ (t)]}{(m + m_{p})} \\ - \frac{μ_{c} \operatorname{sgn} (\overset{\cdot}{x} (t))}{(m + m_{p})} ] \end{matrix}

(19)

where the length of the pole is l = 0.5 m, the combined mass of the cart and pole is m = 1.1 kg, the mass of the pole is m_p = 0.1 kg, the acceleration due to gravity is g = 9.8 m/s², the friction coefficient of the cart on the track is $μ_{c} = 0.0005$ , the friction coefficient of the pole on the cart is $μ_{p} = 0.000002$ , the sampling time is Δ = 0.02 s, and $\overset{\cdot\cdot}{θ} (t)$ in equation (19) is the angular acceleration of the pole. The constraints on the variables were set as $- 12^{\circ} \leq θ \leq 12^{\circ}$ , $- 2.4 m \leq x \leq 2.4 m$ , and $- 10 N \leq f \leq 10 N$ . A control strategy was deemed successful if it could balance the pole for 100,000 time steps.
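One Euler step of equations (16)–(19) can be sketched as follows. The sketch makes an assumption that should be checked against the original simulation: because the combined mass of the cart and pole is stated as 1.1 kg, the factor written as (m + m_p) in the equations is taken here to be that total mass of 1.1 kg, which matches the widely used form of Barto et al.⁴ The function and constant names are hypothetical, and angles are in radians inside the code.

```python
import math

G = 9.8            # gravity, m/s^2
M_TOTAL = 1.1      # assumed total mass of cart and pole, kg
M_POLE = 0.1       # mass of the pole, kg
L_POLE = 0.5       # pole length parameter l, m
MU_C = 0.0005      # friction of the cart on the track
MU_P = 0.000002    # friction of the pole on the cart
DT = 0.02          # sampling time, s

def sgn(v):
    return (v > 0) - (v < 0)

def cartpole_step(state, f, dt=DT):
    """Advance (theta, theta_dot, x, x_dot) by one Euler step under force f."""
    theta, theta_dot, x, x_dot = state
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    denom = (4.0 / 3.0) * M_TOTAL * L_POLE - M_POLE * L_POLE * cos_t ** 2
    theta_acc = (M_TOTAL * G * sin_t
                 - cos_t * (f + M_POLE * L_POLE * theta_dot ** 2 * sin_t
                            - MU_C * sgn(x_dot))
                 - MU_P * M_TOTAL * theta_dot / (M_POLE * L_POLE)) / denom
    x_acc = (f + M_POLE * L_POLE * (theta_dot ** 2 * sin_t - theta_acc * cos_t)
             - MU_C * sgn(x_dot)) / M_TOTAL
    return (theta + dt * theta_dot, theta_dot + dt * theta_acc,
            x + dt * x_dot, x_dot + dt * x_acc)

def failed(state):
    """Failure: pole beyond 12 degrees or cart beyond 2.4 m from the center."""
    theta, _, x, _ = state
    return abs(theta) > math.radians(12.0) or abs(x) > 2.4

next_state = cartpole_step((0.0, 0.0, 0.0, 0.0), 10.0)
```

Starting from rest with the pole upright, a positive force accelerates the cart in the positive direction and gives the pole a negative angular acceleration, while the position and angle themselves change only from the next step onward, as expected for a first-order Euler update.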

Figure 5.

The cart-pole balancing system.

The four input variables $(θ, \overset{\cdot}{θ}, x, \overset{\cdot}{x})$ and the output f were normalized to values between 0 and 1 from the following ranges: $θ$ : [−12, 12], $\overset{\cdot}{θ}$ : [−60, 60], x: [−2.4, 2.4], $\overset{\cdot}{x}$ : [−3, 3], f: [−10, 10]. The fitness function in equation (15) was used to train the FCMAC. This fitness is the number of time steps elapsed before the cart-pole balancing system fails; the system receives a penalty signal of −1 when the pole deviates beyond the allowed angle ( $| θ | > 12^{\circ}$ ) or the cart runs into the bounds of its track ( $| x | > 2.4 m$ ).
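The normalization described above is a linear rescaling of each variable from its range to [0, 1]. The sketch below is illustrative: the names are hypothetical, the angle and angular velocity ranges are assumed to be in degrees and degrees per second, and clipping of out-of-range values is an added assumption that the text does not state.

```python
RANGES = {
    "theta": (-12.0, 12.0),       # assumed degrees
    "theta_dot": (-60.0, 60.0),   # assumed degrees per second
    "x": (-2.4, 2.4),             # m
    "x_dot": (-3.0, 3.0),         # m/s
    "f": (-10.0, 10.0),           # N
}

def normalize(name, value):
    """Map a physical value to [0, 1], clipping to the stated range."""
    low, high = RANGES[name]
    clipped = min(max(value, low), high)
    return (clipped - low) / (high - low)

def denormalize(name, unit_value):
    """Map a value in [0, 1] back to the physical range."""
    low, high = RANGES[name]
    return low + unit_value * (high - low)

inputs = [normalize("theta", 0.0), normalize("theta_dot", 0.0),
          normalize("x", 0.0), normalize("x_dot", 0.0)]
force = denormalize("f", 0.5)
```

Under this mapping, the all-zero initial state used in the experiment corresponds to an input of 0.5 in every dimension, and a controller output of 0.5 corresponds to zero force.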

The initial values of the input variables in the experiment were set to (0, 0, 0, 0). Each experiment consisted of 30 runs, and each run started with the same initial state. Figure 6(a) shows that the average number of generations required for balancing the pole using the FCMAC and R-SMBFO learning method is 9.82. In this figure, the largest fitness value of each run in the current generation is selected before the cart-pole system fails. When R-SMBFO learning stops, the best strings of the swarm in the final generation are selected and tested on the cart-pole balancing system.

Figure 6.

The performance of (a) the proposed R-SMBFO method, (b) the R-PSO²⁶ method, and (c) the R-GA²⁷ method on the cart-pole balancing system.

Figure 7(a) shows the angular deviation of the pole when the cart-pole balancing system was controlled by the FCMAC with the R-SMBFO learning method; the system started from the initial state given by $x (0) = 0, \overset{\cdot}{x} (0) = 0, θ (0) = 0, \overset{\cdot}{θ} (0) = 0$ . The average angular deviation was 0.0151. Experimental results showed that the proposed FCMAC with the R-SMBFO learning method had good control over the cart-pole balancing system. In addition, to examine the effect of the sampling time, we repeated the experiment with a sampling time of 0.002 s. Figure 8 shows the resulting angular deviation of the pole when the cart-pole balancing system was controlled by the FCMAC with the R-SMBFO learning method. The average angular deviation was 0.0028.

Figure 7.

Angular deviation of the pole under a controller trained by (a) the proposed R-SMBFO method, (b) the R-PSO²⁶ method, and (c) the R-GA²⁷ method.

Figure 8.

Angular deviation of the pole with a sampling time of 0.002 s under a controller trained by the proposed R-SMBFO method.

We also compared the control performance of the proposed method with that of reinforcement particle swarm optimization (R-PSO)²⁶ and the reinforcement genetic algorithm (R-GA)²⁷ by applying all these methods to the same problem. Figure 6(b) and (c) show that the average numbers of generations required for balancing the pole using the R-PSO and R-GA methods are 12.94 and 15.47, respectively. Figure 7(b) and (c) show the angular deviations of the pole obtained with the methods of Kennedy and Eberhart²⁶ and Karr.²⁷ The average angular deviations of the R-PSO and R-GA methods are 0.0506 and 0.0481, respectively. Thus, in the experiment, the proposed R-SMBFO method required fewer generations on average (9.82) and produced a smaller average angular deviation (0.0151) than these two methods.^26,27

The GENetic ImplemenTOR (GENITOR),¹⁹ reinforcement symbiotic evolution (R-SE),²¹ symbiotic adaptive neuro-evolution (SANE),²⁸ the temporal difference and genetic algorithm-based reinforcement (TDGAR),²⁰ and the clustering- and Q-value-based genetic algorithm learning schemes for fuzzy system design (CQGAF)²⁹ have also been tested on the same control problem, and Table 3 shows the simulation results. In the table, the number of trials represents the number of training episodes required. The structure of the neural network considered in Whitley et al.¹⁹ included five input nodes, five hidden nodes, and one output node, and the authors adopted a GA to adjust the weights of the neural network. In Juang et al.,²¹ R-SE was used to adjust the parameters of a recurrent neural fuzzy network, and the population size, crossover rate, and mutation rate in R-SE were set to 200, 0.5, and 0.3, respectively. The use of SANE in Juang et al.²¹ involved the application of a symbiotic evolution algorithm to a neural network with five input nodes, eight hidden nodes, and two output nodes. In TDGAR,²⁰ a hybrid learning algorithm integrates the temporal difference (TD) forecasting method and the GA to perform reinforcement learning. In the CQGAF,²⁹ a weak reinforcement signal is used to perform GA-based fuzzy system design. Table 3 shows that the proposed R-SMBFO required the fewest generations on average (9.82) among the compared methods.

Table 3.

Performance comparisons of various existing models.

Methods	Mean (generations)	Best (generations)	Worst (generations)
R-SMBFO	9.82	2	26
R-PSO²⁶	12.94	4	39
R-GA²⁷	15.47	2	87
R-SE²¹	214	15	380
GENITOR¹⁹	3268	415	18,743
SANE²⁸	1984	46	5865
TDGAR²⁰	186	18	310
CQGAF²⁹	133	12	288

R-SMBFO: reinforcement-strategy-based modified bacterial foraging optimization; R-PSO: reinforcement particle swarm optimization; R-GA: reinforcement genetic algorithm; R-SE: reinforcement symbiotic evolution; SANE: symbiotic adaptive neuro-evolution; TDGAR: temporal difference and genetic algorithm-based reinforcement; GENITOR: GENetic ImplemenTOR; CQGAF: clustering- and Q-value-based genetic algorithm learning schemes for fuzzy system design.

In this study, CPU times were also used to compare the proposed method with existing methods.^{19–21,26–29} The comparison results are shown in Table 4. The average CPU time of the R-SMBFO method was 12.57 s. Table 4 shows that the proposed R-SMBFO method had the shortest mean, best, and worst CPU times among the compared methods.

Table 4.

The comparison of CPU time for various existing models.

Methods	Mean (s)	Best (s)	Worst (s)
R-SMBFO	12.57	3.26	25.59
R-PSO²⁵	17.13	3.78	27.23
R-GA²⁷	375.36	4.87	4790.6
R-SE²¹	38.85	8.53	90.78
GENITOR¹⁹	70.95	33.34	246.36
SANE²⁸	43.56	16.54	156.84
TDGAR²⁰	30.23	7.97	76.25
CQGAF²⁹	28.15	6.39	61.37
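The CPU-time comparison in Table 4 can be summarized in the same spirit. The sketch below, again only an illustration, computes the percentage reduction in mean CPU time of R-SMBFO relative to each other method and checks which method is fastest in each column. The values are copied from Table 4; `CpuTime`, `percent_reduction`, and the other names are assumptions made for this example.

```python
from typing import NamedTuple


class CpuTime(NamedTuple):
    mean: float   # seconds
    best: float   # seconds
    worst: float  # seconds


# Values copied from Table 4.
cpu_seconds = {
    "R-SMBFO": CpuTime(12.57, 3.26, 25.59),
    "R-PSO": CpuTime(17.13, 3.78, 27.23),
    "R-GA": CpuTime(375.36, 4.87, 4790.6),
    "R-SE": CpuTime(38.85, 8.53, 90.78),
    "GENITOR": CpuTime(70.95, 33.34, 246.36),
    "SANE": CpuTime(43.56, 16.54, 156.84),
    "TDGAR": CpuTime(30.23, 7.97, 76.25),
    "CQGAF": CpuTime(28.15, 6.39, 61.37),
}


def percent_reduction(baseline: float, other: float) -> float:
    """Percentage by which `baseline` is smaller than `other`."""
    return 100.0 * (other - baseline) / other


# Reduction in mean CPU time of R-SMBFO relative to every other method.
mean_reduction = {
    name: percent_reduction(cpu_seconds["R-SMBFO"].mean, t.mean)
    for name, t in cpu_seconds.items()
    if name != "R-SMBFO"
}

# Method with the smallest value in each column (mean, best, worst).
fastest_per_column = {
    field: min(cpu_seconds, key=lambda n: getattr(cpu_seconds[n], field))
    for field in CpuTime._fields
}

if __name__ == "__main__":
    for name, pct in mean_reduction.items():
        print(f"R-SMBFO vs {name:8s}: {pct:5.1f}% less mean CPU time")
    print(fastest_per_column)
```

Under these values, the smallest reduction in mean CPU time is relative to R-PSO (about 27%), and R-SMBFO has the lowest entry in all three columns.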

Conclusion

This study proposes an FCMAC with an R-SMBFO learning method and applies it to the cart-pole balancing control problem. The proposed R-SMBFO uses a strategy method in the chemotaxis step. The advantage of the proposed method is that each bacterium swims for a different run length, which increases the bacterial diversity. The R-SMBFO learning method was used to adjust the parameters of the FCMAC. Experimental results showed that the FCMAC with R-SMBFO learning required fewer generations on average and less CPU time than the other methods compared.

Footnotes

Handling Editor: Silvia Rodrigo

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors would like to thank the Ministry of Science and Technology of the Republic of China, Taiwan for financially supporting this research under Contract No. MOST 106-2221-E-167-016.

References

1. Rubio-Solis, Panoutsos. Interval type-2 radial basis function neural network: a modeling framework. IEEE T Fuzzy Syst 2015; 23: 457–473.

2. Chan F-T. Self-learning complex neuro-fuzzy system with complex fuzzy sets and its application to adaptive image noise canceling. Neurocomputing 2012; 94: 121–139.

3. Wang J-G, Tai S-C, Lin C-J. The application of an interactively recurrent self-evolving fuzzy CMAC classifier on face detection in color images. Neural Comput Appl 2018; 29: 201–213.

4. Barto, Sutton, Anderson CW. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE T Syst Man Cyb 1983; 13: 834–847.

5. Berenji, Khedkar. Learning and tuning fuzzy logic controllers through reinforcements. IEEE T Neural Networ 1992; 3: 724–740.

6. Lin, Lee CSG. Reinforcement structure/parameter learning for neural-network-based fuzzy logic control systems. IEEE T Fuzzy Syst 1994; 2: 46–63.

7. Anderson, Lee, Elliott DL. Faster reinforcement learning after pretraining deep networks to predict state dynamics. In: International joint conference on neural networks (IJCNN), Killarney, 12–17 July 2015. New York: IEEE.

8. Lewis, Vrabie, Vamvoudakis KG. Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers. IEEE T Control Syst 2012; 32: 76–105.

9. Holland JH. Adaptation in natural and artificial systems. Ann Arbor, MI: University of Michigan Press, 1975.

10. Fogel. Evolutionary optimization. In: Conference record of the twenty-sixth Asilomar conference on signals, systems & computers, Pacific Grove, CA, 26–28 October 1992. New York: IEEE.

11. Lin, Peng CC. Chord recognition using neural networks based on particle swarm optimization. Cybernet Syst 2011; 42: 264–282.

12. Storn, Price. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. J Global Optim 1997; 11: 341–359.

13. Passino KM. Biomimicry of bacterial foraging for distributed optimization and control. IEEE Contr Syst Mag 2002; 22: 52–67.

14. Dorigo, Caro GD. Ant colony optimization: a new meta-heuristic. In: Proceedings of the congress on evolutionary computation-CEC99, Washington, DC, 6–9 July 1999, vol. 2, pp.1470–1477. New York: IEEE.

15. Yan, Zhu, Chen, et al. Improved bacterial foraging optimization with social cooperation and adaptive step size. In: Huang, Jiang, Bevilacqua, et al. (eds) Intelligent computing technology, vol. 7389. Berlin: Springer, 2012, pp.634–640.

16. Majhi, Panda, Majhi, et al. Efficient prediction of stock market indices using adaptive bacterial foraging optimization (ABFO) and BFO based techniques. Expert Syst Appl 2009; 36: 10097–10104.

17. Chen, Zhu. Self-adaptation in bacterial foraging optimization algorithm. In: 3rd international conference on intelligent system and knowledge engineering, Xiamen, China, 17–19 November 2008, pp.1026–1031. New York: IEEE.

18. Chen, Zhu. Adaptive bacterial foraging optimization. Abstr Appl Anal 2011; 2011: 108269.

19. Whitley, Dominic, Das, et al. Genetic reinforcement learning for neurocontrol problems. Mach Learn 1993; 13: 259–284.

20. Lin, Jou CP. GA-based fuzzy reinforcement learning for control of a magnetic bearing system. IEEE T Syst Man Cy B 2000; 30: 276–289.

21. Juang, Lin CT. Genetic reinforcement learning through symbiotic evolution for fuzzy controller design. IEEE T Syst Man Cy B 2000; 30: 290–302.

22. Lin, Huang, et al. Mobile robot navigation control using recurrent fuzzy cerebellar model articulation controller based on improved dynamic artificial bee colony. Adv Mech Eng 2011; 8: 1–10.

23. Lin HY. TSK fuzzy CMAC-based robust adaptive backstepping control for uncertain nonlinear systems. IEEE T Fuzzy Syst 2012; 2: 1147–1154.

24. Lee, Lin, Chen HJ. A self-constructing fuzzy CMAC model and its applications. Inform Sci 2007; 177: 264–280.

25. Oentaryo, Pasquier, Quek. RFCMAC: a novel reduced localized neuro-fuzzy system approach to knowledge extraction. Expert Syst Appl 2011; 38: 12066–12084.

26. Kennedy, Eberhart. Particle swarm optimization. In: IEEE international conference on neural networks, Perth, WA, Australia, 27 November–1 December 1995, pp.1942–1948. New York: IEEE.

27. Karr. Design of an adaptive fuzzy logic controller using a genetic algorithm. In: Proceedings of the fourth international conference on genetic algorithms, San Diego, CA, 13–16 July 1991, pp.450–457. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

28. Moriarty, Miikkulainen. Efficient reinforcement learning through symbiotic evolution. Mach Learn 1996; 22: 11–32.

29. Juang CF. Combination of online clustering and Q-value based GA for reinforcement fuzzy system design. IEEE T Fuzzy Syst 2005; 13: 289–302.