Abstract
An approach to cooperative hunting of multiple mobile targets by a multi-robot team is presented, which divides the pursuit process into two stages: forming the pursuit teams and capturing the targets. A data set of attribute relationships is built from the factors relevant to capturing evaders; interesting rules are then found by data mining over this data set and used to build the pursuit teams. By predicting the positions of the targets, the pursuit game is transformed into a multi-robot path-planning problem, and reinforcement learning is used to find the best path. Simulation results show that the mobile evaders can be captured effectively and efficiently, and demonstrate the feasibility and validity of the given algorithm in a dynamic environment.
Introduction
In multiagent systems it is an important and interesting endeavor to study the development of cooperative behavior among agents, and multi-robot cooperative pursuit is a popular multiagent problem. According to the number of targets, pursuit games can be classified into single-object pursuit and multi-object pursuit. The single-object case has been considered in many works (Yamaguchi, H., 1991; Swastik, K. & Chinya, V., 2005; Kok, J. K. & Vlassis, N., 2004; Yuko, I.; Sato, T. & Kakazu, Y. 2003; Vidal, R.; Shakernia, O.; Kim, H. J.; Shim, D. H. & Sastry, S. 2002). For the multi-object case, Grinton, C. (1996) used the Commitments & Conventions cooperative mechanism to explore an algorithm in which predators form consistent commitments to capture many static evaders. Schenato, L. et al. (2005) proposed a general framework for the design of a hierarchical control architecture that exploits the advantages of a sensor network by combining centralized and decentralized real-time control algorithms to accomplish multi-object pursuit. A method for forming coordinated multi-robot teams for capturing multiple mobile targets, based on dynamic role assignment, has been proposed (Li, S. Q.; Wang, H.; Li, W. & Yang, J. Y. 2006). Cai et al. (2008) proposed a multi-robot cooperative pursuit algorithm based on task-bundle auctions that forms dynamic alliances and tolerates fault injection. Building on these researches, in this paper we propose an approach to cooperative hunting of multiple mobile targets by a multi-robot team.
The present approach has two parts. In the first, a sample data set is created from the attribute relationships between predators and evaders; association-rule mining is used to find interesting rules, and a pursuit team is built for every evader according to these rules. In the second, every pursuit team forecasts the position of its object to determine the predators' target positions; the pursuit problem is thereby transformed into a path-planning problem, and multi-robot reinforcement learning is used to choose the pursuit teams' optimal action strategies so as to capture the evaders in the shortest time.
Description of pursuit problems
In this section, we concentrate on a collaborative problem in which n predators in a bounded toroidal grid environment X have to capture m different kinds of evaders. P = {P1, …, Pn} and E = {E1, …, Em} denote the sets of n predators and m evaders respectively; predators and evaders together are called agents. Every evader has a style REj, j ∈ m, REj ∈ {I, II, III, IV}, which states how many predators are needed to capture it, and we assume that predators can judge the evaders' styles. Fixed obstacles of random shape and size lie in X. Occupancy is given by the mapping m: X → {0, 1}: if m(x) = 1 for x ∈ X, then x is an obstacle. Two grids sharing a side are neighbors; we write the neighbor set of x as A(x), and in the rectangular plane |A(x)| ⩽ 4. The positions of the predators and evaders at time t, t ∈ T = {1, 2, …}, are denoted by XP(t) = (xP1(t), …, xPn(t)) and XE(t) = (xE1(t), …, xEm(t)) respectively. No two agents may occupy the same cell, and at each time t every agent takes exactly one action: stay in place or move to an unoccupied neighbor grid, that is, Xk(t+1) ∈ A(Xk(t)) ∪ {Xk(t)}, k ∈ P ∪ E. We assume that all agents know the environment map but do not know the objects' positions and styles at t = 0, so all predators begin by searching the pursuit area. Here we adopt a circular search: the predators do their best to reach grids that have not yet been scouted, and as time passes the probability of coming back to search a grid grows. Hence every grid of the pursuit area carries a value called its searching expectation; for any grid x ∈ X at time t it is v(x, t). While predators are searching, v(x, t) is zero for grids occupied by obstacles or lying in the perceptive area of a predator; for all other grids, v(x, t) is updated by
where ΔNRi(t) denotes the change in the number of grids the predator can see between consecutive time steps and NX is the total number of grids in the searching area. Suppose Pi is at grid x; the sum of v(x, t) over all grids in the perceptive area of Pi when it moves to a target grid y is denoted G(y, t). In this paper we adopt a local-max search strategy: Pi chooses as its next position the grid whose G(y, t) is maximal. Let NA(xRi(t)) denote the set of accessible grids at time t; clearly NA(xRi(t)) = {x | x ∈ A(xRi(t)) and m(x) = 0}. Let NRi(y) be the set of all grids that Pi can perceive from grid y. Then, when Pi moves from x to y, its G(y, t) is updated as follows:
Pi can choose from NA(xRi(t)) the grid whose G(y, t) is maximal as its expected target position at time t+1. We consider two final successful capturing configurations: in the first, the number of predators in neighbor grids of the evader is at least the value of its style; in the second, evaders are surrounded by predators or blocked against the border of X or a gap between obstacles. Fig. 1 shows initial and final states of the second configuration, in which seven predators hunt two evaders. The cooperative team gains the reward SEj after it captures the evader Ej.

Sketch map of multi-robot successfully capturing the evaders: (a) an initial state before capture; (b) the goal state in which the evaders are captured
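The local-max search step described above can be sketched as follows. The expectation array v is assumed to have been updated elsewhere according to the update rule; the perceptive area is taken to be a square (Chebyshev) neighborhood, an illustrative assumption.

```python
import numpy as np

def local_max_search_step(pos, v, obstacles, view_radius=3):
    """One search step for a predator at grid `pos`.

    v: 2-D array of searching expectations; obstacle and currently
    visible cells are held at zero, all others grow over time so the
    predator eventually revisits them.
    """
    rows, cols = v.shape

    def visible(center):
        # Cells within the (square) perceptive area of `center`.
        r, c = center
        return [(i, j)
                for i in range(max(0, r - view_radius), min(rows, r + view_radius + 1))
                for j in range(max(0, c - view_radius), min(cols, c + view_radius + 1))]

    def neighbors(cell):
        r, c = cell
        cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        return [(i, j) for i, j in cand
                if 0 <= i < rows and 0 <= j < cols and (i, j) not in obstacles]

    # G(y, t): sum of expectations the predator would perceive from y.
    def G(y):
        return sum(v[cell] for cell in visible(y))

    # Local-max strategy: move to the accessible neighbor (or stay)
    # with maximal G(y, t).
    return max(neighbors(pos) + [pos], key=G)
```

The predator simply evaluates G over its accessible cells and greedily picks the maximum, which matches the local-max strategy of the text.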
There are many predators and evaders in the pursuit area. When a predator finds an evader, it informs the other predators. In this section, we use association-rule data mining to distribute the predators among the different evaders effectively, creating several pursuit teams; the members of each team then work together to hunt their goal.
Definitions of association rule data mining with pursuit problems
Association rules were originally proposed by Agrawal and Srikant in 1993 and later generalized (Agrawal, R. & Srikant, R. 1995; 1996). The aim of association-rule mining is to find relationships between different commodities (or item sets) in a transaction database, that is, to look for interesting relationships in a given data collection. By describing the potential relations between item sets in the database, we can discover dependence relations between domains that satisfy preset thresholds of support and confidence. For convenience of discussion, we give several definitions of association rules in terms of the pursuit problem.
Definition 1
Let Dj = {P1, P2, …, Pi, …, Pn} be the association-rule mining data collection for predators Pi (i ⩽ n) hunting evader Ej (j ⩽ m), where Pi is a pursuit robot. As a transaction, Pi = {k1, k2, …, kr, …, kp} (i = 1, 2, …, n); each element of Pi is called an item and describes an attribute relation between Pi and Ej.
Definition 2
We assume that I = {k1, k2, …, kq} (p ⩽ q) is composed of all items in Dj. Any subset X of I is called an item set. An association rule is an implication of the form
Definition 3
The number of transactions in Dj containing item set X is called the support number of X, denoted σX. The support rate of X is written support(X),
where |Dj| denotes the number of transactions in Dj. If support(X) is not smaller than the appointed smallest support, called minsupport, X is a large item set; otherwise it is a small item set. The support rate of X ∪ Y is the support rate of the association rule
The smallest confidence, designated by the user, is written minconfidence.
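The support and confidence measures of Definition 3 can be computed directly over transaction sets; a minimal sketch, in which the sample transactions and item codes are illustrative:

```python
def support(itemset, transactions):
    """support(X) = |{T in D : X is a subset of T}| / |D|."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """confidence(X => Y) = support(X union Y) / support(X)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Illustrative transaction database: one transaction per predator,
# items drawn from the Boolean attribute codes used later in the text.
D = [frozenset(t) for t in (
    {"a1", "b2", "e1", "c3"},
    {"a1", "b2", "e1", "c3"},
    {"a1", "b1", "e2", "c1"},
    {"a2", "b2", "e1", "c3"},
)]
```

A rule X ⇒ Y is kept when `support(X | Y) >= minsupport` and `confidence(X, Y) >= minconfidence`, matching the thresholds above.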
The process of creating pursuit teams consists of: (1) finding all large item sets; (2) generating association rules from the large item sets; (3) building the pursuit teams.
(1) Finding all large item sets: By definition, a large item set appears at least as frequently as the appointed smallest support number. First, we create the sample data set DEj for hunting Ej, shown in Table 1.
Pursuit sample data set DEj
where kr (r ⩽ q) is an item of DEj. In this paper we define the items as follows.

k1 is the resultant force Fij on Pi, which is attracted by Ej and repelled by obstacles, Fij = Fx + Fc. According to the literature (Rimon, E. 1992), the attraction potential function is Ux(Pi) = ½ ξ d^m(Pi, Ej), where ξ is the position-gain coefficient, m = 2 and d(Pi, Ej) is the distance between Pi and Ej. The attraction Fx is the negative gradient of the attraction potential, that is, Fx = −∇Ux(Pi) = −ξ d(Pi, Ej). The repulsion potential function is Eq. (5),
where ρ(Pi, Obs) is the shortest distance between Pi and the obstacles Obs, ρ0 is the influence threshold of the obstacle distance (in general ρ0 ⩽ min(d1, d2), where d1 is half of the shortest distance between obstacles and d2 is the shortest distance between Ej and the obstacles) and η is the position-gain coefficient. The corresponding repulsion Fc is given by Eq. (6).

k2 is Creij, the credibility of Pi hunting Ej in the eyes of the other predators. If Pi fails to complete its task for some reason, its Creij is lowered, which directly affects the other predators' decisions. Its updating rule is as follows:
where Cs is the number of times Pi has captured Ej, Cf is the number of failures, Ct is the number of times Pi has taken part in hunting Ej, and ε1 and ε2 are updating thresholds.

k3 is Proij, the probability of Pi capturing Ej; we use the recent frequency and effectiveness of Pi capturing Ej to describe its value, defined by Eq. (8),
where t is the current time, t0 is the most recent time at which Ej was captured, T is the longest effective period after capturing Ej and Ca is the number of times Pi has hunted different evaders.

k4 is Revij, the retained profit of Pi after capturing Ej, that is, the margin between the reward Rewij and the cost Cosij: Revij = Rewij − Cosij, where Cosij is the sum of the communication cost, resource consumption, penalties for hunting failure or for leaving a pursuit team, and so on.

k5 is Eliij, an estimate of whether Pi has the ability to capture Ej; here Eliij = α·Cabij + β, where Cabij is the ability value of Pi hunting Ej and α and β are the corresponding coefficients.

k6 is the current load state Pur_si: if Pi has no task, Pur_si is "spare" and β > 0; if Pi is hunting a goal, Pur_si is "busy" and β < 0.

Pi stands for predator i, and Vir is the value of item kr for Pi. From the definitions of the kr, Vir can be either continuous or discrete. In this paper we use the Apriori algorithm to mine Boolean association rules, so categorical and quantitative attribute values must be transformed into Boolean ones. The transformation is as follows:
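The k1 force follows the standard artificial-potential-field construction; a sketch under the stated definitions, with ξ, η and ρ0 given illustrative values:

```python
import math

def attraction(p, e, xi=1.0):
    """F_x = -xi * (p - e): pulls the predator at p toward the evader at e."""
    return (-xi * (p[0] - e[0]), -xi * (p[1] - e[1]))

def repulsion(p, obs, rho0=3.0, eta=1.0):
    """Standard repulsive force: zero beyond the influence threshold rho0."""
    rho = math.dist(p, obs)
    if rho > rho0 or rho == 0:
        return (0.0, 0.0)
    mag = eta * (1.0 / rho - 1.0 / rho0) / rho ** 2
    # Direction: away from the obstacle.
    return (mag * (p[0] - obs[0]) / rho, mag * (p[1] - obs[1]) / rho)

def resultant(p, e, obstacles, xi=1.0, rho0=3.0, eta=1.0):
    """F_ij = F_x + F_c, summing the repulsion over all obstacles."""
    fx, fy = attraction(p, e, xi)
    for o in obstacles:
        rx, ry = repulsion(p, o, rho0, eta)
        fx, fy = fx + rx, fy + ry
    return (fx, fy)
```

The sign of the component of Fij toward Ej (positive or negative) is what the binary attribute k1 records (a1 versus a2).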
Discretizing the quantitative attributes. k1–k5 are quantitative attributes; for consistency of the record counts in every group, we discretize the five attributes by the equal-length or equal-distance method:
k1 is categorized as a1 {Vi1 ∈ [0, ∞)}, a2 {Vi1 ∈ (−∞, 0)};
k2 is categorized as b1 {Vi2 ∈ [0.9, 1)}, b2 {Vi2 ∈ [0.5, 0.9)}, b3 {Vi2 ∈ [0.1, 0.5)};
k3 is categorized as c1 {Vi3 ∈ [0, 0.3)}, c2 {Vi3 ∈ [0.3, 0.6)}, c3 {Vi3 ∈ [0.6, 1)};
k4 is categorized as d1 {Vi4 ∈ (−∞, 0]}, d2 {Vi4 ∈ (0, 200]}, d3 {Vi4 ∈ (200, 1000]};
k5 is categorized as e1 {Vi5 ∈ [0.5, 1]}, e2 {Vi5 ∈ [0, 0.5)}.
Transforming the categorical attribute. Pur_si is a categorical attribute and must be transformed to Boolean type: f1 {Vi6 = “spare”}, f2 {Vi6 = “busy”}.
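The binning above can be sketched as a direct lookup; for simplicity this sketch folds out-of-range values into the nearest listed bin, an assumption the text does not spell out:

```python
def to_items(V):
    """Map a predator's attribute vector V = (Vi1, ..., Vi6) to its
    Boolean items, using the interval boundaries from the text."""
    v1, v2, v3, v4, v5, v6 = V
    items = set()
    items.add("a1" if v1 >= 0 else "a2")
    items.add("b1" if v2 >= 0.9 else "b2" if v2 >= 0.5 else "b3")  # v2 < 0.1 folded into b3
    items.add("c3" if v3 >= 0.6 else "c2" if v3 >= 0.3 else "c1")
    items.add("d1" if v4 <= 0 else "d2" if v4 <= 200 else "d3")    # v4 > 1000 folded into d3
    items.add("e1" if v5 >= 0.5 else "e2")
    items.add("f1" if v6 == "spare" else "f2")
    return items
```

Applied to the example of the next paragraph (Vi1 > 0, Vi2 = 0.63, Vi3 = 0.35, Vi4 = 500, Vi5 = 0.61, Vi6 = “spare”), this yields the items a1, b2, c2, d3, e1, f1.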
For Pi, if Vi1 > 0, Vi2 = 0.63, Vi3 = 0.35, Vi4 = 500, Vi5 = 0.61 and Vi6 = “spare”, then Table 1 is transformed into Table 2 after the above processing.
Transaction data transformed from Table 1
The transaction database of all predator states is obtained by the same process.
(2) Generating association rules from large item sets: Here we employ the conventional Apriori algorithm, and we set minsupport according to the type of the evader so that a pursuit team is never too small to capture its goal: if the type REj of some evader is IV, it must be captured by at least four predators. Since our transaction database has twelve sample records, minsupport = 4/12 ≈ 33%. To ensure the confidence of the obtained rules, minconfidence should be larger than the maximum of Cij. Cij is updated constantly, and if it is too low we obtain many uninteresting rules, which makes it hard to build the pursuit teams. So the lowest threshold of minconfidence is set to 0.5, that is, minconfidence = max(0.5, max Cij). For any frequent item set l and any nonempty subset s of l, we require the following formula:
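A compact sketch of the Apriori frequent-itemset pass with a minsupport threshold; the transaction contents below are illustrative:

```python
from itertools import combinations

def apriori_frequent(transactions, minsupport):
    """Return all item sets whose support is >= minsupport (Apriori)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent, level = {}, [frozenset([i]) for i in items]
    while level:
        # Count support of each candidate at this level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: k / n for c, k in counts.items() if k / n >= minsupport}
        frequent.update(survivors)
        # Candidate generation: join k-itemsets into (k+1)-itemsets,
        # keeping only those whose k-subsets are all frequent.
        keys = sorted(survivors, key=sorted)
        cands = {a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1}
        level = [c for c in cands
                 if all(frozenset(s) in survivors for s in combinations(c, len(c) - 1))]
    return frequent
```

Rules are then generated from each frequent item set by splitting it into antecedent and consequent and checking minconfidence, as in Definition 3.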
(3) Building the pursuit teams: From the obtained association rules we can judge whether Pi chooses Ej as its goal. In this paper we assume that k1, k2, k5 are the antecedent attributes; k3, k4 are the consequent attributes, which should theoretically lie in {c2, c3, d2, d3}; and k6 is a reference attribute. If mining yields an interesting rule such as A: a1∧b2∧e1 ⇒ c3∧d2 [35%, 72%], it can be read as follows: 35% of the predators are pulled toward Ej more strongly than they are pushed by obstacles, have the ability to hunt the goal and have good credibility; with confidence 72%, if such a predator hunts Ej, its probability of success exceeds 0.6 and its reward lies in (0, 200]. We scan the original sample data set DEj in the light of these rules and pick out the 35% of predators satisfying a1, b2, e1. In the above case, 12 × 0.35 ≈ 4, so a cooperative team of these four eligible predators is formed to hunt the same goal. If the number of suitable predators exceeds REj, partners are chosen so as to keep the system stable: a predator whose Pur_si is spare has priority; if their Pur_si values are the same, we consider Proij and Revij. Letting Vij = ω1·Proij + ω2·Revij, where ω1 and ω2 are scale coefficients, the predator with the larger Vij is chosen first. If the same predator fits several teams, it is assigned to the team whose corresponding association rule has the higher confidence. Rule A is an ideal case. If we mine a rule containing items outside our assumed bounds, such as B: a2∧b2∧e1 ⇒ c1∧d2 [40%, 72%], we also scan the data set DEj, find the predators that fit B, and exclude them. We then mine interesting rules satisfying the hypothesis, with the same support and confidence, from the remaining predators (12 − 12 × 0.4 ≈ 8).
If we find C: a1∧b2∧e2 ⇒ c2∧d2 [37%, 75%], we add the predators satisfying C to the pursuit team; that gives about 3 more predators (8 × 0.4 ≈ 3). If REj = IV, we let Uij = ω3·Fij + ω4·Creij + ω5·Eliij, where ω3, ω4 and ω5 are scale coefficients, and choose from the remaining predators in DEj those with the highest Uij to join the pursuit team. Iterating this process, we build the cooperative team for Ej, denoted CG(Ej).
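The tie-breaking preference described above (spare predators first, then larger Vij) can be sketched as follows; the dictionary field names and default weights are illustrative assumptions:

```python
def select_members(candidates, team_size, w1=0.5, w2=0.5):
    """Pick `team_size` predators that matched a rule, preferring spare
    predators, then the higher V_ij = w1*Pro_ij + w2*Rev_ij.

    Each candidate is a dict like
    {"id": 3, "state": "spare", "pro": 0.7, "rev": 120}.
    """
    def key(c):
        # Spare predators rank before busy ones; within a group,
        # a larger V_ij wins (hence the negation).
        return (c["state"] != "spare", -(w1 * c["pro"] + w2 * c["rev"]))
    return [c["id"] for c in sorted(candidates, key=key)[:team_size]]
```

A busy predator with a high score is still passed over in favor of any spare one, which is the stability preference stated in the text.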
Forecasting the mobile positions of pursuit objects
In this part we propose a multi-robot cooperative pursuit algorithm. When dynamic evaders appear, we use association-rule mining to build cooperative teams for the different evaders, and the members of CG(Ej) coordinate their actions to hunt Ej. We assume that evaders move randomly when under no pursuit threat, and evade intelligently otherwise. To capture the intelligent evaders, predators must forecast the positions of their goals. The aim of Pi ∈ CG(Ej) is to reach a neighbor grid of the goal, making the distance d(xPi(t), xEj(t)) from itself to the goal shortest, while Ej does its best to evade the predators. According to the pursuit definition, the contribution probability of every predator in A(xEj(t)) is 1/REj. Within the alert range of Ej, we define the distance, obstruction and surrounding contributions of the pursuit members as DC = Σ 1/d(xPi(t), xEj(t)), OC = 1/NA(xEj(t)) and EC = Σ θi,i+1(t)/2 respectively, where NA(xEj(t)) is the number of neighbor grids that Ej can reach and θi,i+1(t) is the anticlockwise angle between neighboring predators and Ej at time t. The probability of capturing Ej is then computed as follows:
Pi anticipates that the intelligent Ej will choose as its position at time t+1 the cell from A(xEj(t)) ∪ {xEj(t)} that makes the capture probability Procap(Ej) smallest. Accordingly, Pi moves toward the position in A(xPi(t)) ∪ {xPi(t)} for which d(xPi(t+1), xEj(t+1)) is shortest.
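Since the capture-probability equation itself is not reproduced here, the sketch below combines DC, OC and EC as a weighted sum, an illustrative stand-in rather than the paper's Eq.; the weights are assumptions, and the evader is assumed to minimize the result:

```python
import math

def capture_probability(evader, predators, free_neighbors, weights=(0.4, 0.3, 0.3)):
    """Illustrative capture probability combining the distance (DC),
    obstruction (OC) and surrounding (EC) contributions."""
    w_d, w_o, w_e = weights
    dc = sum(1.0 / math.dist(evader, p) for p in predators)
    oc = 1.0 / max(free_neighbors, 1)          # fewer escape cells -> higher
    # EC: half the summed angular gaps between consecutive predators
    # as seen from the evader.
    angles = sorted(math.atan2(p[1] - evader[1], p[0] - evader[0]) for p in predators)
    ec = sum(b - a for a, b in zip(angles, angles[1:])) / 2.0
    return w_d * dc + w_o * oc + w_e * ec

def predict_evader_move(evader, moves, predators, free_neighbors):
    """The evader picks the reachable cell (or stays) minimising the
    capture probability at t+1."""
    return min(moves + [evader],
               key=lambda m: capture_probability(m, predators, free_neighbors))
```

Each Pi evaluates this prediction and then heads for the cell that shortens d(xPi(t+1), xEj(t+1)), as described above.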
To avoid conflicts among predators: if two predators in the same team choose the same grid as their target position, the predator nearer to the evader has priority to hold that target at the next step. If members of different teams choose the same cell, the predator inside its evader's alert range has priority. We let P = k1·ΔDC + k2·ΔOC + k3·ΔEC be the contribution of each predator action to capturing the evader; if two predators from different teams are both within the alert ranges of their objects, the predator with the higher P has priority, and otherwise the predator nearer to its object does. After the target positions are confirmed, the pursuit problem is transformed into a multi-robot path-planning problem in a known environment. One of the key problems for pursuit teams is coordination: how to ensure that the individual decisions of the predators result in jointly optimal decisions for the team.
Reinforcement learning (Sutton, R. S. & Barto, A. G. 1998; Kaelbling, L. P.; Littman, M. L. & Moore, A. W. 1996) addresses the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment, and it has been applied successfully in many single-agent systems for learning an agent's policy. Learning from the environment is robust because agents are directly affected by its dynamics. In principle, a multiagent system can be treated as one 'big' single agent and the optimal joint policy learned with standard single-agent reinforcement learning techniques; in this paper we are therefore interested in fully cooperative pursuit systems in which the predators learn to optimize a global performance measure. Q-learning (Watkins, C. J. & Dayan, P. 1992) is an important model-free reinforcement learning algorithm based on dynamic programming; in essence it is a temporal-difference method. We use Q-learning to learn the coordinated actions of the groups of cooperative predators so as to find the best capture paths without collisions between agents, and to this end we treat each pursuit team as one big agent. Q-learning directly computes an approximation of the optimal action-value function, the Q-value, with the following update rule:
where Qt is the estimate for the predators' state-action pair at time t, and st, at, αt, γ and rt are the state, action, learning rate, discount factor and reward respectively;

where Λ1 = 1, Λ2 = 0 if 0 < θi,i+1(t) ⩽ π/2, and Λ1 = 0, Λ2 = 1 if π/2 < θi,i+1(t) ⩽ π.
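The update rule above is the standard Q-learning temporal-difference step; a minimal tabular sketch follows, in which the state/action encoding and the ε-greedy exploration policy are illustrative:

```python
from collections import defaultdict
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions) if actions else 0.0
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

def epsilon_greedy(Q, s, actions, eps=0.1, rng=random):
    """Explore with probability eps, otherwise act greedily on Q."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Joint-team state-action values, initialised to zero.
Q = defaultdict(float)
```

Because the team is treated as one big agent, s encodes the joint positions of the team members and a is a joint action, which is what makes the state and action spaces grow exponentially with the number of robots, as noted in the conclusion.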
To evaluate the proposed approach, we perform simulation experiments on an instance of the pursuit problem in a 360 × 250 rectangular area divided into 900 toroidal grids of 10 × 10. The obstacles are fixed solid areas. The predators and evaders are all circular bodies of size 10 × 10, and all agents have the same alert radius of 3. Every agent is marked with an ID number. When an evader is captured, another evader of the same style is placed at a new random position. The agents have a perfect communication system and the same speed (one cell per time step). After capturing its object, a cooperative team is rewarded with the third power of the evader's style, distributed equally among its members, and the team is then dismissed.
Under the above environment (ten predators hunting four evaders) and with the same initial conditions, we compare our algorithm after convergence, which we call A, with the algorithm proposed by Zhou, P. C.; Hong, B. R. and Wang, Y. H. (2005), which we call B. A sample configuration is shown in Fig. 2(a). We take the capturing time of the pursuit teams in each experiment as the evaluation index. The results show that for either algorithm the capturing time varies little across experiments, but the average times differ considerably between them. Fig. 2(b) shows the reward comparison, sampled once every ten experiments for clarity. Overall, our algorithm yields a holistic performance increase of 18 percent. This can be explained as follows: 1) in A we choose the pursuit members by contrasting the current state with historical factors, which is relatively general, whereas case-based reasoning that relies on historical cases only adapts poorly in a dynamic multi-robot environment; 2) B assumes that evaders move randomly and that predators have no learning ability, while A considers both random movement and intelligent evasion and uses reinforcement learning to choose the best actions, which matches practice better. So A outperforms B.

Graphs of different pursuit algorithms in the same environment
To test the effect of different team-building decision mechanisms on the capturing results, we compare the stochastic selection strategy (1), the selection strategy based on case reasoning and an assistant decision matrix (2), and the selection strategy based on association rules (3) proposed in this paper. Under the same experimental environment, we increase the number of evaders step by step from four to ten. The results for the three algorithms are displayed in Fig. 3. Clearly, the association-rule-based strategy is the best, taking the least time, and the stochastic selection strategy is the worst. This is because many factors are considered, such as the predators' ability, credibility, force and cost, so that reasonable cooperative teams are built for each goal, improving efficiency while avoiding wasted resources.

Results of different choice strategies
Finally, we use the same pursuit team to test two pursuit algorithms, one based on reinforcement learning and the other a greedy action strategy. We take the average pursuit benefit as the evaluation index, that is, how much a predator gains each time it moves one grid. Table 3 shows the results: the reinforcement-learning-based algorithm is better. Because the predators are selfish agents, the greedy mechanism leaves them short of cooperation and cannot maximize the team's profit, whereas our algorithm uses the learning mechanism to coordinate the pursuit members with one another, ensuring that the team reward is optimal when every predator chooses its best path.
Pursuit benefits of different learning mechanisms
In this paper, a multi-robot cooperative pursuit algorithm is proposed. First, we use association-rule data mining to choose a goal for every predator; the cooperative team for a given goal is then composed of the predators sharing that goal. Second, the members of the cooperative team coordinate to hunt their goal: by forecasting the position of the evader, we transform the pursuit problem into a path-planning problem and use multi-robot reinforcement learning to choose the best action strategies. Experimental results show that the proposed algorithm can achieve the optimal solution. However, the confidence threshold affects the result of team building: if it is too high, we cannot find suitable interesting rules to build pursuit teams; if too low, we may build improper teams. Future work should examine more appropriate confidence settings. Furthermore, although we use reinforcement learning to solve the cooperative problems in the course of hunting evaders, both the state and the action space scale exponentially with the number of predators and evaders, rendering the approach infeasible for too many robots and reducing hunting efficiency. Consequently, we should find a new method to reduce the number of state-action pairs and so improve the learning speed. In addition, we plan to extend our algorithm further to unknown pursuit environments. Resolving these problems remains a challenge.
