Sage Journals: Discover world-class research

Abstract

Currently, agents based purely on deep learning struggle to make optimal decisions within a short timeframe in problems with a vast decision-making space. Human planning knowledge is required to help agents make better decisions. This manuscript proposes a novel knowledge-guided and data-driven decision-making framework that uses a hierarchical task network as the carrier of knowledge, deep learning as the trainer for data, and Monte Carlo tree search as the connector between the hierarchical task network and deep learning. Experiments in the MiniRTS environment show that the proposed framework can replace humans in collecting high-quality data, and that it can train neural networks that perform as well as the compared network with only 20% of the available data, which provides a new direction for future research.

Keywords

Decision making, Monte Carlo tree search, hierarchical task network, deep learning, games

Introduction

Advances in computer hardware have brought powerful computing capability and made collecting large amounts of data easier than in the past. As a result, Deep Learning (DL) has achieved superhuman results in multiple fields, such as image processing,^1,2 speech recognition,^3,4 natural language processing,^5,6 and Go.^7,8 In recent years, due to the increasing demand for mental entertainment, a growing number of high-level computer games have been developed. Among them, Real Time Strategy (RTS) games have gained a significant share. As a result, DL has started to be applied in the development of interactive intelligent agents,^9–11 providing opportunities for human players to compete and train and, to some extent, addressing challenging issues in intelligent task planning.

Due to the characteristics of a large decision-making space and short response time in RTS games, traditional DL methods have struggled to achieve high accuracy, resulting in weak performance of trained intelligent agents. With the ease of data collection and breakthroughs in key technologies such as attention mechanisms, deep neural networks (DNNs) trained with large amounts of data have achieved superhuman performance in some RTS games, such as defeating professional players in StarCraft⁹ and surpassing professional players in the game Honour of Kings.¹⁰

However, the aforementioned data-driven intelligent planning methods still cannot completely solve the problem of making decisions in a super large decision space within a short response time. The main reason is that such methods cannot be widely applied. In terms of real-time performance, high-performance networks often come with larger depths. Therefore, for complex game environments and decision tasks, more computing resources and time are needed to make accurate decisions under real-time requirements. In terms of the large decision space, a huge amount of data is required to train a super intelligent agent, and the hardware resources required for training are also high, which general research institutions cannot meet. Moreover, when the conditions of training and decision-making environments are insufficient, the speed of training and using the intelligent agent will be slow, and the performance of the intelligent agent will also decrease. Therefore, it is urgent to find a new decision-making framework to enhance the planning capabilities of DL methods in problems with large decision-making spaces and short response time.

Traditional knowledge-based decision-making methods have been widely applied in various fields, such as emergency planning in government departments,¹² security-durability evaluating of web applications,^13–16 and portfolio optimization problems.^17,18 These decisions are mainly solved using existing traditional decision-making methods to address specific domain challenges. For example, Kumar R's team used the analytic hierarchy process (AHP) to solve a series of security issues in web application development. However, in the field of task planning, traditional knowledge-based decision-making methods primarily involve converting expert knowledge into domain-specific knowledge usable by planning algorithms to generate concrete action sequences, known as courses of action (COA). Although the performance of these planning methods relies heavily on the richness of domain knowledge and falls short of superhuman performance, classical task planning methods have the advantages of fast planning speed and no need for pre-training. Therefore, by combining knowledge-driven planning methods with data-driven approaches, a novel planning method can be developed guided by knowledge and data, resulting in more efficient and accurate interactive agents.

Researchers have combined knowledge-driven planning with data-driven planning by incorporating human prior knowledge into DL models, with some success. For example, Hu and Yarats¹⁹ and Xu and Wang²⁰ use human natural language instructions to guide DNNs to fit faster and more efficiently. However, the drawback is that human instructions need to be manually provided, which incurs significant data collection costs. Similarly, Chen and Gupta²¹ utilize demonstration data to guide deep reinforcement learning, enabling faster convergence of trained intelligent agents. Nonetheless, the disadvantage lies in the substantial human and resource costs associated with collecting demonstration data.

To address the challenge of making decisions in a large decision-making space within short response times and better incorporate human prior knowledge into DL, this manuscript presents a novel decision-making framework that integrates knowledge guidance and data-driven approaches, leveraging the Monte Carlo tree search (MCTS) as a connector between hierarchical task network (HTN) planning and DL. Experiments conducted in the MiniRTS environment show that it is possible to train an average-level interactive intelligent agent by solely using MCTS and HTN guidance for self-play with minimal human-generated data. This illustrates that utilizing MCTS as a connector between HTN and DL can effectively integrate knowledge into data, providing a new perspective for future research.

The remaining structure of this manuscript is as follows: Section 2 provides a brief introduction to related work and presents an overall framework diagram of the proposed method. Section 3 describes the decision-making framework that combines HTN and DL using MCTS as a connector in detail. Section 4 validates the reliability of the proposed method through experiments conducted in the MiniRTS environment. Section 5 summarizes the work presented in this manuscript and provides future research prospects.

Related work

Using demonstration data to guide DNNs training

Since AlphaGo was proposed, DNNs have been widely used and have made some progress in developing interactive intelligent agents. As in AlphaGo, the role of DNNs in developing interactive intelligent agents is mainly reflected in two aspects: training value networks to evaluate the environment of intelligent agents, and training policy networks to directly guide the next action of the intelligent agents. Although these two types of networks achieve superhuman performance in AlphaGo with the assistance of MCTS, they do not generalize readily to applications such as RTS games, which have a large decision-making space and short response times. For example, the training time and cost required for AlphaStar are enormous,⁹ and Tencent needs to train a specific agent for each hero in Honour of Kings.¹⁰ Therefore, researchers have begun to use human demonstration data as a basis for guiding DNNs training.

Work combining intelligent agent behavior with natural language instructions was proposed as early as the last century.²² With the development of DL, this technology has gradually demonstrated its advantages. It started with the success of template-based compositional language in solving simple navigation problems for intelligent agents.²³ Then, it achieved success in utilizing human demonstration data to train intelligent agents in maze exploration,^21,24 visual navigation,²⁵ and robot control.²⁶ Subsequently, researchers trained high-level intelligent agents in high-dimensional environments like MiniRTS using human demonstration data.¹⁹ Finally, ChatGPT⁶ appeared. All of these demonstrate that human knowledge based on demonstration data can enhance the capability of DL in training interactive intelligent agents.²⁰

HTN planning

HTN is a type of intelligent planning technology that seeks feasible solutions for tasks through task decomposition and conflict resolution. The basic idea is to extract specialized knowledge from the planning domain, which is used to recursively decompose complex abstract tasks into increasingly smaller subtasks until the decomposed subtasks can be directly completed through specific planning actions. Due to the similarity between the HTN planning process and human problem-solving thinking process, it has a wide range of applications in the planning domain, including emergency plan formulation in government departments,²⁷ intelligent robot task planning in industry,²⁸ web service composition,²⁹ and first-person shooter games and RTS games in the gaming field.^30,31

In recent years, there have been advancements in utilizing the MCTS method to guide HTN planning. Wichlacz and Holler³² proposed applying MCTS to HTN planning, and experiments demonstrated that using MCTS as the search algorithm for HTN leads to better or faster planning results compared to traditional search algorithms or heuristic search algorithms. Around the same time, Shao and Zhang³³ also proposed an MCTS-based HTN planning approach, using MCTS to select the best decomposition method for compound tasks. They addressed the issue of HTN planning relying on the order of decomposition methods and extended the approach to planning problems with uncertain action outcomes. Goldman³⁴ also employed the idea of using MCTS to select decomposition methods and applied it to the online HTN planning algorithm for partially observable Markov decision processes. The application of MCTS in HTN essentially leverages its look-ahead technique to provide trial-and-error possibilities for search-based HTN planning, thereby reducing the search branches of HTN.³⁵

Motivation

Although using demonstration data to guide DNNs training has achieved some success, its drawback lies in the fact that collecting human demonstration data requires a significant amount of manpower and resources. Additionally, due to the subjectivity of humans, the collected demonstration data may hinder the convergence of the network during the training process. If it is possible to use other intelligent means instead of humans to collect demonstration data with human knowledge, the cost of data collection can be greatly reduced, and the negative impact caused by human subjectivity on the dataset can be avoided.

On the other hand, although MCTS was used in AlphaGo to assist the value network and policy network in providing more accurate results, it did not have specific human knowledge to guide the network training. Therefore, this method is poorly suited to high-dimensional spaces: even where it performs well in a high-dimensional environment, it lacks universality and generalization. If knowledge-driven planning methods can be integrated into the MCTS-based DNNs, it may further enhance the performance of DNNs in high-dimensional applications such as RTS games.

To solve the aforementioned problems, this manuscript proposes a novel decision-making framework that integrates knowledge-guided and data-driven techniques, which leverages the MCTS as a connector between HTN and DL. By integrating MCTS with HTN, the planning performance of HTN can be improved. Additionally, combining MCTS with DL enhances the prediction accuracy of DNNs. Therefore, by using MCTS as a connector between HTN and DL, the strengths of both approaches can be utilized, allowing knowledge to more effectively guide the training of DNNs. Furthermore, challenging problems in intelligent decision-making can be addressed.

The diagram in Figure 1 illustrates the model trained using human natural language demonstration data to guide DNNs. The high-level instruction network is used to generate a natural language instruction, which is then inputted along with the state encoding to the low-level execution network. This network is responsible for generating specific actions and parameters that can be directly executed by the agent.

Figure 1.

Schematic diagram of natural language embedding in deep learning.

The proposed method in this manuscript is built upon the model shown in Figure 1, and the proposed model diagram is illustrated in Figure 2.

Figure 2.

Schematic diagram of a model combining knowledge and data.

The innovation of the proposed method lies in two aspects: Firstly, in terms of training, unlike the use of human natural language demonstration data as shown in Figure 1, the model depicted in Figure 2 incorporates domain knowledge of HTN, and knowledge-containing data can be collected through self-play, which replaces the need for human demonstration data and reduces data collection costs. Secondly, in terms of usage, unlike directly using the network output as the execution action in Figure 1, the model depicted in Figure 2 leverages HTN and MCTS to handle multiple “tasks” generated by the high-level network. It provides corresponding COAs, then MCTS and the low-level network are utilized to simulate and evaluate these COAs, and finally the optimal sequence of execution actions is analyzed. The existence of action sequences gives the proposed method an advantage in terms of overall planning.

Method

This section describes the technical details of the proposed knowledge-guided and data-driven decision-making framework. First, it introduces the knowledge-based planning method, HTN planning. Then, it explains how MCTS-based HTN planning is used for high-level task decomposition. Finally, it discusses how low-level COAs are evaluated using MCTS-based neural networks.

Technical details of HTN planning

As a method of intelligent planning, HTN takes the given initial state, initial task network, and domain knowledge as input, and provides specific COA as output. The formal description of HTN typically includes three elements: state space, task network, and domain knowledge.

Definition 1 (State Space):

The state space is the representation, inside the computer, of the real-world space of a planning problem. In HTN, it is typically represented using predicate logic. A predicate expression $p$ consists of terms $τ_{1}, τ_{2}, \dots, τ_{n}$, where terms can be variables or constants. Let $S$ denote the state space and $P$ be the set of predicate expressions. $S$ can be represented as the conjunction (logical AND) of predicate expressions in $P$:

$$S = p_{1} \land p_{2} \land \dots \land p_{2^{| P |}} \tag{1}$$

Definition 2 (Task Network):

In HTN, the task network $T_{n}$ can be defined as a binary tuple:

$$T_{n} = ⟨ T, ψ ⟩ \tag{2}$$

where $T$ represents a set of tasks, including two types of tasks in HTN: atomic tasks ($A$) and compound tasks ($C$). Atomic tasks are tasks that can be directly executed by an agent and can be understood as low-level actions. Compound tasks are tasks that cannot be directly executed by an agent and must be further decomposed into subtasks, which can be understood as middle-level or high-level tasks. $ψ$ represents the constraint relationship between tasks, which are the constraints on $T$ that must be satisfied during planning and execution.

Definition 3 (Domain Knowledge):

Domain knowledge in HTN is the core of its encoding and planning, and it can be defined as a binary tuple $D$:

$$D = ⟨ O, M ⟩ \tag{3}$$

where $O$ represents a finite set of HTN operators, and $M$ represents a finite set of HTN decomposition methods.

HTN operators are actions used to execute atomic tasks, and they have a certain impact on the state space. Define an operator $o$ as a quadruple:

$$o = ⟨ a, \mathit{pre}(o), \mathit{eff}(o), \mathit{cost} ⟩ \tag{4}$$

where $a \in A$ represents the atomic task corresponding to the operator $o$, i.e. executing $o$ can accomplish the atomic task $a$. $\mathit{pre}(o)$ and $\mathit{eff}(o)$ respectively represent the preconditions for executing $o$ and the effects it has on the state space after its execution. $\mathit{eff}(o)$ has two subsets, $\mathit{eff}(o)^{+}$ and $\mathit{eff}(o)^{-}$, which represent the predicate expressions added to or deleted from the state space after the operator is executed. $\mathit{cost}$ represents the cost of applying the operator to complete the atomic task.

HTN decomposition methods are the rules for decomposing a compound task into a network of subtasks. Define a decomposition method $m$ as a triplet:

$$m = ⟨ c, \mathit{pre}(m), T_{n}(m) ⟩ \tag{5}$$

where $c$ represents the compound task corresponding to the decomposition method $m$, i.e. using $m$ can decompose the compound task $c$. $\mathit{pre}(m)$ represents the preconditions for using $m$, and $T_{n}(m)$ represents the sub-task network obtained after decomposing $c$ by using $m$.

With these three elements mentioned above, HTN can define a specific planning task as $P_{b} = ⟨ S_{0}, T_{n 0}, D ⟩$ , where $S_{0}$ and $T_{n 0}$ represent the initial state space and initial task network, respectively. For a specific planning task $P_{b}$ , the HTN method needs to decompose and combine tasks in order to find an atomic task combination, called a plan $π = ⟨ a_{1}, a_{2}, \dots, a_{n} ⟩$ , or COA, that can complete all tasks in the initial task network $T_{n 0}$ .
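Definitions 1 to 3 can be illustrated with a minimal Python sketch. It is not the planner used in this manuscript: it assumes a totally ordered task network with no further constraints $ψ$, represents a state as a set of ground predicates, and uses a plain depth-first search over methods. The predicate and task names (`has_peasant`, `build_barrack`, `destroy_enemy_cavalry`, and so on) are hypothetical and chosen only to resemble the MiniRTS setting.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple

State = FrozenSet[str]  # the ground predicates that currently hold


@dataclass(frozen=True)
class Operator:
    task: str                              # atomic task a
    pre: FrozenSet[str]                    # pre(o)
    add: FrozenSet[str]                    # eff(o)+, predicates added
    delete: FrozenSet[str] = frozenset()   # eff(o)-, predicates deleted
    cost: float = 1.0


@dataclass(frozen=True)
class Method:
    task: str                   # compound task c
    pre: FrozenSet[str]         # pre(m)
    subtasks: Tuple[str, ...]   # T_n(m), totally ordered in this sketch


def plan(state: State, tasks: Tuple[str, ...],
         operators: Dict[str, Operator],
         methods: Dict[str, List[Method]]) -> Optional[List[str]]:
    """Depth-first decomposition; returns a COA (list of atomic tasks) or None."""
    if not tasks:
        return []
    head, rest = tasks[0], tasks[1:]
    if head in operators:                      # atomic task: apply its operator
        op = operators[head]
        if not op.pre <= state:
            return None
        tail = plan((state - op.delete) | op.add, rest, operators, methods)
        return None if tail is None else [head] + tail
    for m in methods.get(head, []):            # compound task: try each method
        if m.pre <= state:
            result = plan(state, m.subtasks + rest, operators, methods)
            if result is not None:
                return result
    return None


fs = frozenset
# Hypothetical domain knowledge D = <O, M>.
operators = {
    "build_barrack": Operator("build_barrack", fs({"has_peasant"}), fs({"has_barrack"})),
    "train_spearman": Operator("train_spearman", fs({"has_barrack"}), fs({"has_spearman"})),
    "attack_cavalry": Operator("attack_cavalry", fs({"has_spearman"}), fs({"cavalry_destroyed"})),
}
methods = {
    "destroy_enemy_cavalry": [
        Method("destroy_enemy_cavalry", fs({"has_spearman"}), ("attack_cavalry",)),
        Method("destroy_enemy_cavalry", fs({"has_barrack"}),
               ("train_spearman", "attack_cavalry")),
        Method("destroy_enemy_cavalry", fs({"has_peasant"}),
               ("build_barrack", "train_spearman", "attack_cavalry")),
    ]
}

# Planning task P_b = <S_0, T_n0, D> with S_0 = {has_peasant}.
coa = plan(fs({"has_peasant"}), ("destroy_enemy_cavalry",), operators, methods)
```

Because this sketch tries methods in a fixed order and returns the first feasible plan, its result depends on how the methods are listed, which is the kind of order dependence that MCTS-based method selection is meant to address.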

High-level task decomposition in HTN based on MCTS

The core of HTN planning lies in hierarchical decomposition. Therefore, if we need to use the HTN method to guide DNNs to obtain better actions, we also need to train the corresponding high-level neural network, whose output corresponds to the compound tasks in HTN.

MCTS is a simulation method based on probability and mathematical statistics, which has been widely applied in Atari games, RTS games, Go, and other practical applications. The MCTS method consists of four steps: selection, expansion, simulation, and backpropagation. Figure 3 illustrates one iteration process of MCTS.

Figure 3.

Schematic diagram of iteration process of MCTS.³⁶

Selection refers to MCTS traversing the search tree starting from the root node and continuously choosing the next lower-level node based on the tree policy until a leaf node (or terminal node) is selected.

Expansion refers to MCTS probabilistically simulating the execution of an action at the leaf node, reaching a new state, and adding the newly represented state as a child node.

Simulation refers to MCTS simulating and executing the complete problem based on the default policy (rollout policy) at the expansion node, until a terminal state (or a specified depth) is reached, and obtaining specific rewards.

Backpropagation refers to MCTS propagating the obtained rewards from the simulation phase, starting from the expansion node to the root node, and updating the statistical information along the entire path.
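The four steps above can be sketched in a few lines of Python. This is a generic illustration rather than the implementation used in the experiments: it assumes UCT as the tree policy and a uniformly random rollout policy, and the toy problem (choosing three bits, with the reward being the fraction of ones) is hypothetical.

```python
import math
import random


class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.visits = 0
        self.total_reward = 0.0


def uct_child(node, c=1.4):
    # Tree policy (assumed here to be UCT): mean reward plus exploration bonus.
    return max(node.children,
               key=lambda ch: ch.total_reward / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))


def mcts(root_state, actions, step, is_terminal, reward, iterations=1000, rng=None):
    rng = rng or random.Random(0)
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded and non-terminal.
        while (not is_terminal(node.state)
               and len(node.children) == len(actions(node.state))):
            node = uct_child(node)
        # 2. Expansion: add one untried action as a new child node.
        if not is_terminal(node.state):
            tried = {ch.action for ch in node.children}
            action = rng.choice([a for a in actions(node.state) if a not in tried])
            child = Node(step(node.state, action), parent=node, action=action)
            node.children.append(child)
            node = child
        # 3. Simulation: random rollout from the expanded node to a terminal state.
        state = node.state
        while not is_terminal(state):
            state = step(state, rng.choice(actions(state)))
        r = reward(state)
        # 4. Backpropagation: update statistics from the expanded node to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += r
            node = node.parent
    # Recommend the most visited action at the root.
    return max(root.children, key=lambda ch: ch.visits).action


# Hypothetical toy problem: pick three bits; reward is the fraction of ones.
def actions(state):
    return [0, 1]

def step(state, action):
    return state + (action,)

def is_terminal(state):
    return len(state) == 3

def reward(state):
    return sum(state) / 3


best_first_action = mcts((), actions, step, is_terminal, reward)
```

In this toy problem the search concentrates its visits on the branch that starts with 1, since every complete sequence below it scores higher than its counterpart that starts with 0.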

Figure 4 represents a schematic diagram of high-level task decomposition in HTN based on MCTS.

Figure 4.

Schematic diagram of high-level task decomposition in HTN based on MCTS.

From Figure 4, it can be seen that after the state encoding is inputted into the high-level neural network, it will probabilistically output compound tasks. By selecting multiple compound tasks with higher probabilities as inputs to the HTN planner, specific COAs can be obtained to complete these compound tasks. In Figure 4, the role of the HTN planner is similar to a value network, which evaluates the impact (or reward) on the current state when completing a certain compound task given the state s. The difference is that a value network is a data-driven model, which requires a large amount of data for training, while HTN planning is knowledge-driven, which utilizes pre-encoded domain knowledge.
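The step of choosing several high-probability compound tasks can be sketched as a softmax followed by a top-k selection. The task names and logits below are hypothetical placeholders for the output of the high-level network, and the value of k is a free parameter.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_tasks(task_names, logits, k):
    """Return the k (task, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(task_names, probs), key=lambda tp: tp[1], reverse=True)
    return ranked[:k]


# Hypothetical compound tasks and network logits.
task_names = ["gather_resources", "destroy_enemy_cavalry", "build_workshop", "scout"]
logits = [0.1, 2.0, 1.0, -1.0]
candidates = top_k_tasks(task_names, logits, k=2)
```

Each selected task would then be passed to the HTN planner, which returns a COA for it.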

Since a compound task in HTN can have multiple planning solutions, in Figure 4, HTN planning can also utilize MCTS to obtain the optimal COA. In this manuscript, the MCTS-HTN algorithm proposed in reference³³ is used to simulate the decomposition planning of compound tasks. By applying this algorithm, the optimal planning solution for compound task c under the current state s can be obtained, which is the optimal COA. This process is illustrated in Figure 5.

Figure 5.

Schematic diagram of MCTS-HTN.

The time complexity formula for the MCTS-HTN algorithm is:

$$O_{h} = O\left( m \cdot n \cdot n_{2}^{\frac{\lg N}{\lg (n_{1} + n_{2}) - \lg 2}} \right) \tag{6}$$

where $n$ represents the number of tasks to be decomposed, $m$ represents the number of simulations, $n_{1}$ and $n_{2}$ represent the average number of methods per task and the number of subtasks after decomposition, respectively, and $N$ is the total number of nodes in the MCTS search tree.

Evaluation of low-level network based on MCTS

High-level task decomposition based on HTN can decompose several compound tasks outputted by the high-level neural network into the optimal COA for each compound task. However, during actual execution, only one compound task and its corresponding COA can be selected (or even just the first action in the COA). Therefore, it is necessary to evaluate each compound task and its corresponding COA, which can be accomplished using MCTS and the low-level neural network.

For RTS games, low-level actions typically involve parameters. For example, in StarCraft, each unit needs to select an enemy unit as a parameter when choosing an attack action, or select a specific coordinate as the destination for gathering resources. If these parameters are included as part of the COA in HTN, it would result in a vast amount of domain knowledge, thus having difficulty in coding. Therefore, the COA generated by the HTN planner only contains specific low-level actions, while the specific parameters can be provided by the low-level neural network.

Figure 6 represents one iteration of the MCTS process in the evaluation of the low-level network based on MCTS. For each compound task and its corresponding $\mathit{best\_COA}_{n} = ⟨ a_{1}, a_{2}, \dots, a_{k} ⟩$, multiple simulations are performed using MCTS. Each simulation takes the specific actions from the $\mathit{best\_COA}_{n}$ as input to the low-level neural network, which provides the specific parameter values for the input actions. The action along with its parameters combines to form the final action, which is then executed in the simulation environment. The resulting environment is used to receive the next final action composed of an action and parameters. After all the actions in the $\mathit{best\_COA}_{n}$ have obtained parameters through the low-level neural network and executed in the simulation environment, a rollout policy is used to evaluate the quality of the final simulated state.

Figure 6.

Schematic diagram of one evaluation of low-level network based on MCTS ($a_{1} \sim a_{k}$ are from $\mathit{best\_COA}$ in Figure 4).

Figure 6 represents one iteration of MCTS, which is a branch from the root node to a leaf node in the MCTS search tree. It is well known that for classification problems, the results provided by DNNs are typically probabilistic. The advantage of MCTS is that it can select several parameters with higher probabilities as branches for expansion, as shown in Figure 7.

Figure 7.

Schematic diagram of search tree construction.

The search tree shown in Figure 7 is constructed by branching with the top three parameters in terms of probability output by the low-level network. Each node represents a state, and the edges represent parameters. The search tree becomes richer as the number of MCTS simulations increases, until the number of simulations approaches infinity, at which point the Q values of the tree nodes tend to converge. When computational resources are exhausted or a specified number of simulations is reached, the MCTS simulations terminate. Then, the path from the root node to a leaf node with the highest Q value is selected as the COA with parameters, which corresponds to the optimal actions for the compound task.

After conducting MCTS simulations for all compound tasks, the best COAs with parameters (each COA having its corresponding Q value obtained from the search tree) are compared. The overall best COA with parameters among all compound tasks is selected as the final action to be executed in the real environment.
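The construction of the parameter tree and the choice of the best path can be illustrated with a small Python sketch. Note the simplification: instead of sampling the tree with MCTS, it exhaustively enumerates every path through the top-k parameter branches, which is only feasible for short COAs, and it scores each leaf with a single deterministic evaluation rather than an averaged Q value. The functions `propose`, `step`, and `evaluate` are hypothetical stand-ins for the low-level network, the simulation environment, and the rollout-based evaluation.

```python
def best_parameterised_coa(coa, state, propose, step, evaluate, top_k=3):
    """Enumerate the top-k parameter tree for a COA; return (best Q, best path)."""
    best_q, best_path = float("-inf"), None

    def expand(s, depth, path):
        nonlocal best_q, best_path
        if depth == len(coa):                  # leaf: all actions have parameters
            q = evaluate(s)
            if q > best_q:
                best_q, best_path = q, list(path)
            return
        action = coa[depth]
        # Branch on the top_k most probable parameters for this action.
        ranked = sorted(propose(s, action).items(),
                        key=lambda kv: kv[1], reverse=True)[:top_k]
        for param, _prob in ranked:
            path.append((action, param))
            expand(step(s, action, param), depth + 1, path)
            path.pop()

    expand(state, 0, [])
    return best_q, best_path


# Hypothetical toy setting: the state is a position on a line, each "move"
# action takes an offset as its parameter, and the goal is to reach position 3.
def propose(state, action):
    return {-1: 0.1, 0: 0.2, 1: 0.3, 2: 0.4}   # parameter -> probability

def step(state, action, param):
    return state + param

def evaluate(state):
    return -abs(state - 3)


q, path = best_parameterised_coa(["move", "move"], 0, propose, step, evaluate)
```

Running the same procedure for the COA of every candidate compound task and keeping the result with the highest Q value gives the final choice described above.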

The time complexity formula for evaluating action sequences based on MCTS is:

$$O_{l} = O\left( m \cdot n \cdot k \cdot O(NN_{l}) \right) \tag{7}$$

where $n$ represents the number of action sequences to be simulated, $m$ represents the number of simulations, $k$ represents the length of the action sequence, and $O(NN_{l})$ represents the time complexity required to invoke the low-level network.

This section elaborates on the specific details of the framework shown in Figure 2. First, HTN planning and MCTS are used to evaluate each task output from the high-level neural network instead of a value network and generate the optimal COA. Then, MCTS and the low-level neural network evaluate the COA of each task to obtain the final optimal COA for execution in the real environment. As shown in Figure 8, unlike previous methods, this framework can output a COA rather than just a single action when receiving a state as input. This is one of the advantages of the HTN approach. The COA enables the agent to have some foresight. Moreover, when there is human intervention, it can provide better assistance and stronger interpretability in decision-making compared to single actions.

Figure 8.

Contrastive diagram of output from different frameworks.

Evaluation of experimental results

To demonstrate the effectiveness of the framework proposed in Section 3 in enhancing the decision-making capability of DNNs, we conducted experiments in the MiniRTS environment and compared it with a planning framework based on human natural language instructions.

Experimental environment

MiniRTS is a grid-based strategic adversarial environment, as shown in Figure 9. It captures the key features of complex RTS games and, to some extent, represents the problem of decision-making within a large decision-making space with short response times.

Figure 9.

Schematic diagram of MiniRTS environment.

In MiniRTS, there are two agents: Blue and Red. Both agents can be controlled by humans, AI strategies, or built-in bots. The opposing agents gather resources, construct buildings, train various types of units, and engage in combat. The ultimate goal is to destroy the enemy's base and win the game.

There are seven types of units in MiniRTS that can attack enemy units, including six types of offensive units and defensive tower structures. The attack rules resemble a simplified version of the game “Jungle” (or “Dou Shou Qi”), as shown in Figure 10. For example, swordsmen can defeat spearmen, spearmen can defeat cavalry, and cavalry can defeat swordsmen. Apart from the defensive tower structures, there are also five types of building structures used to train different offensive units in MiniRTS. The training rules for these units are illustrated in Table 1. Among them, the workshop is the only building capable of producing three types of offensive units. Additionally, MiniRTS includes a unit called the peasant, which is used to build buildings. All building structures can only be constructed by peasants, and peasants can only be produced by town hall (base).

Figure 10.

Schematic diagram of attacking rules.²⁰

Table 1.

Table of training rules.

Building names	Trainable unit names
Town hall	Peasant
Blacksmith	Swordsman
Barrack	Spearman
Stable	Cavalry
Workshop	Archer, dragon, catapult

The experiments in this manuscript are built upon the work of Hu and Yarats, as shown in Figure 1. They collected training data from nearly 5400 games and trained two DNNs: an instruction network and an execution network. The specific network structures can be found in reference.¹⁹ We selected the RNN (recurrent neural network) structure and trained a network with a relatively small amount of training data generated based on HTN, which achieved comparable performance to the contrast network. Our experiments were run on a Tesla V100 server.

The top 5 probability outputs from the high-level network are considered. When writing HTN domain knowledge, each compound task is given 3 decomposition methods, and the average number of subtasks after decomposition is 3. Therefore, $O_{h} = O (5 m N)$, $O_{l} = O (5 m k O (N N_{l}))$, and this time complexity meets the real-time requirements in actual experiments.
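These expressions can be checked by substituting the stated settings into equations (6) and (7), under the assumption that the five selected outputs correspond to $n = 5$ and that $n_{1} = n_{2} = 3$:

$$
\frac{\lg N}{\lg (n_{1} + n_{2}) - \lg 2} = \frac{\lg N}{\lg 6 - \lg 2} = \frac{\lg N}{\lg 3} = \log_{3} N, \qquad n_{2}^{\log_{3} N} = 3^{\log_{3} N} = N
$$

so that $O_{h} = O(m \cdot 5 \cdot N) = O(5 m N)$. Equation (7) with $n = 5$ gives $O_{l} = O(m \cdot 5 \cdot k \cdot O(NN_{l})) = O(5 m k \, O(NN_{l}))$ directly.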

Figure 11 shows an example of written HTN domain knowledge. When the compound task “destroy the enemy's cavalry” needs to be completed, it can be decomposed into subtasks through methods, ultimately leading to different action sequences.

Figure 11.

HTN domain knowledge example.

This manuscript demonstrates the effectiveness of the proposed framework through four experiments. Experiment 1 uses MCTS to select the optimal network output, demonstrating that integrating the MCTS method into the network can improve its performance to some extent. Based on the results of Experiment 1, Experiment 2 verifies that a neural network integrated with MCTS can self-adversarially generate high-quality training data. Building on the first two experiments, Experiment 3 embeds HTN planning into the network following the procedure shown in Figure 2. This experiment successfully demonstrates that the proposed framework can effectively integrate knowledge and data, thereby enhancing the agent's performance. Experiment 4 is conducted to verify whether the proposed framework can operate effectively in complex environments with incomplete information.

To validate the effectiveness of the framework proposed in this manuscript, the subsequent experiments cover four aspects: MCTS-assisted neural network decision-making, collecting high-quality replay data based on MCTS, the effectiveness of the collaborative-driven planning method, and the effectiveness of the framework under incomplete information.

MCTS-assisted neural network decision making

Reference¹⁹ collected data from nearly 5400 matches by letting humans play against the rule-based built-in AI in MiniRTS. Each competition required two individuals, where the high-level decision makers provided a natural language instruction to the lower-level executors, instructing them to perform specific actions, and the lower-level executors controlled specific units to carry out the actions based on understanding the instructions. Each match yielded multiple training data pairs in the format of $⟨ \text{state-instruction-action-parameters} ⟩$. Based on the collected data, they proposed a hierarchical network framework consisting of an instruction layer and an execution layer. They trained three types of DNNs: ONEHOT, BOW, and RNN, depending on different encoding methods for the instructions. Among them, the RNN-based model achieved the best adversarial performance when the instruction library contained 500 instructions.

The experiments conducted by reference¹⁹ demonstrated that a hierarchical neural network guided by human natural language instructions outperformed a non-hierarchical neural network mapping states directly to actions. However, the drawback of their approach is the difficulty in collecting human natural language instructions, as well as the inconsistency in the collected data quality, which hindered the network training process.

To demonstrate the effectiveness of the knowledge-guided and data-driven decision-making framework proposed in this manuscript, a series of comparative experiments was conducted on top of the experimental setup described above. The main baseline was the hierarchical model with RNN instruction encoding and a 500-instruction library. This network is referred to as $Net\_cont$.

One objective of the experiments in this manuscript was to demonstrate that adding HTN planning improves the quality of training data. This has two benefits: training efficiency improves, and humans can be replaced in data collection. To this end, an initial network, $Net\_init$, was trained on the same training data collected in reference,¹⁹ but with the original 5400 matches reduced to 500 matches, which is less than 10% of the original amount. The network structure and training hyperparameters were kept the same as those of $Net\_cont$.

A further objective was to demonstrate that adding the MCTS method can itself improve network performance to some extent. To this end, this experiment used the MCTS method to select among the DNN's outputs. The pseudocode is shown below.

ADD_MCTS (net, state, real_game, num)

1: replay ← net.forward(state)

2: simulate_game ← copy(real_game)

3: tree ← MonteCarloTree(simulate_game)

4: MCTS_simulate(tree, replay, num)

5: best_replay ← tree.make_choice()

6: execute(real_game, best_replay)

End ADD_MCTS

Here, $state$ denotes the state encoding, $real\_game$ the actual game environment, and $num$ the number of simulations. The state encoding is input into the network to obtain $replay$, which contains multiple candidate instructions and their corresponding actions. The purpose of MCTS is to select the best instruction-action pair from $replay$. First, a simulation environment is constructed by copying the real game (line 2). Then, a search tree is built on this simulated environment (line 3). The search tree is progressively refined by the MCTS_simulate function (line 4). Finally, the $best\_replay$ is selected from the tree (line 5) and passed to the real environment for execution (line 6). The pseudocode for the MCTS_simulate function is as follows.

MCTS_simulate (tree, replay, num)

1: For i in range(num)

2:     current_node ← tree.root_node

3:     child_num ← current_node.child_num

4:     While(child_num == max_node_num)

5:         current_node ← current_node.prefer_child()

6:         child_num ← current_node.child_num

7:     new_node ← tree.expand(current_node, replay)

8:     update_value(tree, new_node)

9: Endfor

End MCTS_simulate

The MCTS_simulate function performs multiple simulations. Each simulation starts from the root node and keeps descending to the preferred child until it reaches a node that still has unexpanded children (lines 2 to 6). That node is then expanded (line 7), which includes the rollout process and the evaluation process. Finally, the evaluation result of the new child node is backpropagated to update the entire tree (line 8), and the next simulation begins. The model that uses the MCTS method to filter the outputs of the $Net\_init$ network is named $Net\_init\_mcts$.
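The two procedures above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: UCB1 is assumed as the child-preference rule, the game state is assumed to be the tuple of moves played so far, only the first move is searched, and the `evaluate` callable with its fixed scores is a made-up stand-in for the rollout and evaluation step.

```python
import copy
import math


class Node:
    """One search-tree node; `action` is the candidate that led to it."""

    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits, self.value = [], 0, 0.0


def prefer_child(node, c=1.4):
    # Assumption: UCB1 balances the mean value against an exploration bonus.
    return max(
        node.children,
        key=lambda ch: ch.value / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def mcts_simulate(root, actions, step, evaluate, num):
    for _ in range(num):
        node = root
        # Selection: descend while the current node is fully expanded.
        while node.children and len(node.children) == len(actions(node.state)):
            node = prefer_child(node)
        # Expansion: add the first untried child, if any remain.
        tried = {ch.action for ch in node.children}
        untried = [a for a in actions(node.state) if a not in tried]
        if untried:
            child = Node(step(node.state, untried[0]), node, untried[0])
            node.children.append(child)
            node = child
        # Evaluation, then backpropagation up to the root.
        reward = evaluate(node.state)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent


def add_mcts(net_forward, state, real_game, num, step, evaluate):
    replay = net_forward(state)               # candidate instruction-action pairs
    root = Node(copy.deepcopy(real_game))     # search runs on a copy only
    # Toy assumption: candidates are offered at the root and nowhere deeper.
    actions = lambda s: replay if len(s) == len(root.state) else []
    mcts_simulate(root, actions, step, evaluate, num)
    return max(root.children, key=lambda ch: ch.visits).action


# Hypothetical candidates with made-up evaluation scores.
scores = {"attack": 0.2, "build": 0.5, "scout": 0.8}
best = add_mcts(
    net_forward=lambda s: list(scores),
    state=None,
    real_game=(),
    num=300,
    step=lambda s, a: s + (a,),
    evaluate=lambda s: scores[s[-1]],
)
```

In this toy setting the most-visited root child is the candidate with the highest score, which mirrors how the search filters the network's outputs.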

The $Net\_init$ network and the $Net\_init\_mcts$ network were each played against the $Net\_cont$ network, and the results are shown in Figure 12. Win, loss, and draw denote the proportions of matches in which the respective network defeated, lost to, or drew with the $Net\_cont$ network. $No\_hierarchical$ is the control network from reference,¹⁹ which is a non-hierarchical model.

Figure 12.

The confrontation results of different networks against $Net\_cont$.

From Figure 12, it can be seen that the $Net\_cont$ network had a win rate of 57.9% against the non-hierarchical network $No\_hierarchical$, with an 11.7% draw rate, highlighting the importance of hierarchical structures in the development of decision-making agents. The win, loss, and draw rates of the $Net\_init$ network against the $Net\_cont$ network were 27.5%, 58.3%, and 14.2%, respectively. This is expected, because $Net\_init$ was trained on far less data than $Net\_cont$, and network performance improves as the amount of training data increases. Compared with the $Net\_init$ network, the $Net\_init\_mcts$ network's win rate against $Net\_cont$ improved to 32.6%, while its loss rate decreased to 56.7%. The win rate is still lower than the loss rate, which is expected given the large difference in training data size. The improvement of 5.1 percentage points in win rate indicates that the MCTS method helps the neural network make better decisions, thereby enhancing its competitive ability.

Collecting high-quality replay data based on MCTS

From the results of the previous experiment, it can be observed that the MCTS method can assist the DNNs in achieving better outputs. However, the drawback is that incorporating the MCTS method makes the adversarial process slower because simulations require a significant amount of time. This does not meet the real-time requirements of RTS games, making it difficult to expand and generalize in practical applications.

Although the MCTS method sometimes cannot meet strict real-time requirements, it can be applied to offline adversarial scenarios to generate high-quality training data. Based on the results of the previous experiment, $Net\_init\_mcts$, which incorporates the MCTS method, obtains better outputs than $Net\_init$. Therefore, $Net\_init\_mcts$ can be used for self-play, and its adversarial data can be saved. These data can be regarded as high-quality data obtained through simulation. Given the same amount of data, a new network trained on these high-quality data should theoretically outperform a network trained on human-collected data.

The purpose of this experiment is to demonstrate that DNNs enhanced with the MCTS method can indeed obtain higher-quality training data. In this experiment, the $Net\_init\_mcts$ network was used for self-play to generate new training data, and new DNNs were trained iteratively, as shown in Figure 13. During training, each newly trained network replaces the network used for self-play in the next iteration.

Figure 13.

Schematic diagram of iterative training.

In this experiment, each time the number of adversarial matches reached 200, a new DNN was trained, and the new DNN was then used for further offline adversarial matches, thus iteratively improving network performance. The iteratively trained networks are named $Net\_mcts\_i$, where $i$ denotes the iteration generation.
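The iterative schedule can be sketched as a short loop. This is an illustrative reading rather than the authors' code: `play_match` and `train` are hypothetical callables, and the data set is assumed to accumulate across generations (200 games after the first generation, 1000 after the fifth), which is inferred from the data volumes reported in the text.

```python
def iterate_self_play(net, play_match, train, generations, matches_per_generation=200):
    """Alternate offline self-play and retraining for a number of generations."""
    dataset, sizes = [], []
    for _ in range(generations):
        # Offline self-play with the current (MCTS-filtered) network.
        dataset.extend(play_match(net) for _ in range(matches_per_generation))
        # Assumption: each new network is trained on all games collected so far.
        net = train(dataset)
        sizes.append(len(dataset))
    return net, sizes


# Stub usage: a "network" is just a label, and a match record is a placeholder.
final_net, sizes = iterate_self_play(
    net="Net_init_mcts",
    play_match=lambda net: {"played_by": net},
    train=lambda data: f"Net_mcts_{len(data) // 200}",
    generations=6,
)
```

With these stubs, `sizes` lists the cumulative number of self-play games available to each generation.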

The $Net\_mcts\_i$ network was played against the $Net\_init$ network and the $Net\_cont$ network, and the results are shown in Figure 14, where the x-axis represents the iteration count and the y-axis represents the win, loss, and draw rates of the $Net\_mcts\_i$ network against the other networks.

Figure 14.

Results of $Net\_mcts\_i$ against $Net\_init$ and $Net\_cont$.

From Figure 14, it can be seen that initially the win rate of the $Net\_mcts\_i$ network against the $Net\_init$ network was below 40%. This was because the former's training data volume was less than half of the latter's. However, as the iteration count increased, the win rate of the $Net\_mcts\_i$ network gradually improved. By the 6th generation, it had significantly surpassed the $Net\_init$ network. This was due not only to the increased training data volume but also to the better quality of the $Net\_mcts\_i$ training data. Additionally, the $Net\_mcts\_i$ network improved from being defeated by the $Net\_cont$ network to achieving a win rate close to that of the $Net\_cont$ network after the 5th iteration. This demonstrates that neural networks enhanced with the MCTS method can self-play to generate high-quality training data. With only 1000 self-play games, it is possible to train a network comparable to one trained on the roughly 5400 human-collected games, that is, with less than 20% of the original data volume.
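The data-volume claim can be checked with simple arithmetic. The sketch below assumes five generations of 200 self-play games each and uses the figure of nearly 5400 human-played matches reported for reference 19; the function name is illustrative only.

```python
def data_fraction(generations, matches_per_generation, human_matches):
    """Share of the human data volume that the self-play games amount to."""
    return generations * matches_per_generation / human_matches


self_play_games = 5 * 200
fraction = data_fraction(5, 200, 5400)
```

Here `fraction` is 1000 / 5400, which is below the 20% threshold stated in the text.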

For a like-for-like comparison, human demonstration data with the same data volume as $Net\_mcts\_i$ was used to train a neural network, which was named $Net\_human\_i$. The $Net\_mcts\_i$ network was played against the $Net\_human\_i$ network, and the results are shown in Table 2.

Table 2.

Adversarial results of $Net\_mcts\_i$ against $Net\_human\_i$ (win, loss, and draw rates of Network 1).

Network 1	Network 2	Win	Loss	Draw
$Net\_mcts\_1$	$Net\_human\_1$	53.4%	35.1%	11.5%
$Net\_mcts\_2$	$Net\_human\_2$	55.5%	32.6%	11.9%
$Net\_mcts\_3$	$Net\_human\_3$	55.9%	31.8%	12.3%
$Net\_mcts\_4$	$Net\_human\_4$	56.2%	29.4%	14.4%
$Net\_mcts\_5$	$Net\_human\_5$	57.1%	27.6%	15.3%
$Net\_mcts\_6$	$Net\_human\_6$	59.7%	23.4%	16.9%

From Table 2, it can be seen that the win rate of $Net\_mcts\_i$ is higher than that of $Net\_human\_i$ in every generation. This indicates that, at the same data volume, the data obtained using the MCTS method is of higher quality than that collected by humans. Furthermore, as the data volume increases, the gap between the two becomes more pronounced. This further supports the claim that neural networks enhanced with the MCTS method can obtain higher-quality training data: models trained on this data achieved a clear win-rate advantage over models trained on the same volume of human-collected data.
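The widening gap can be read directly off Table 2 by computing the win-minus-loss margin per generation. The short sketch below only restates the table values; the variable names are illustrative.

```python
# (win %, loss %, draw %) of Net_mcts_i against Net_human_i, from Table 2.
TABLE_2 = [
    (53.4, 35.1, 11.5),
    (55.5, 32.6, 11.9),
    (55.9, 31.8, 12.3),
    (56.2, 29.4, 14.4),
    (57.1, 27.6, 15.3),
    (59.7, 23.4, 16.9),
]

# Margin in percentage points between winning and losing, per generation.
margins = [round(win - loss, 1) for win, loss, _ in TABLE_2]
widening = all(a < b for a, b in zip(margins, margins[1:]))
```

The margin grows from 18.3 points in the first generation to 36.3 points in the sixth.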

Effectiveness of collaborative-driven planning method

To demonstrate the validity and effectiveness of the framework proposed in this manuscript, this experiment follows the flow outlined in Figure 2. It treats the natural language instructions generated by the high-level neural network as compound tasks in HTN planning. By writing decomposition methods for the different instructions, HTN planning is embedded into the original framework of reference.¹⁹ The pseudocode for this integration is given below.

ADD_HTN(net, state, real_game, domain)

1: replay ← net.forward(state)

2: COA ← NULL; Q ← NULL

3: For replay_ in replay

4:     coa ← get_bestcoa(replay_, domain)

5:     simulate_game ← copy(real_game)

6:     q ← evaluate(simulate_game, coa)

7:     COA.append(coa)

8:     Q.append(q)

9: Endfor

10: best_coa ← make_choice(COA, Q)

11: best_replay ← create_replay(best_coa)

12: execute(real_game, best_replay)

End ADD_HTN

Here, $domain$ denotes the HTN domain knowledge. For each instruction-action pair in $replay$, the instruction is treated as a compound task and, using the HTN domain knowledge, is decomposed into a concrete $coa$ (line 4). In the implementation, the MCTS-HTN algorithm described in reference³³ is used, which applies the MCTS method to select the best decomposition result for each compound task. Once the best decomposition result is obtained, it is evaluated using the MCTS method in a simulated environment, and the evaluation result $q$ is saved (lines 5 to 6). After all the instruction-action pairs in $replay$ have been decomposed, the $coa$ with the maximum $q$ value is chosen as $best\_coa$ (line 10). Finally, $best\_coa$ is used to construct $best\_replay$, which is then passed to the actual environment for execution (lines 11 to 12).
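The control flow of ADD_HTN can be sketched in Python as below. Everything concrete here is hypothetical: the toy `domain`, the instructions, and the two stand-in callables are assumptions for illustration. In particular, `get_bestcoa` simply picks the shortest decomposition instead of running MCTS-HTN, and `evaluate` is a made-up scoring rule instead of an MCTS evaluation.

```python
import copy


def add_htn(replay, real_game, domain, get_bestcoa, evaluate):
    """Decompose every candidate, score it on a copy, and keep the best."""
    coas, qs = [], []
    for pair in replay:
        coa = get_bestcoa(pair, domain)              # HTN decomposition
        simulate_game = copy.deepcopy(real_game)     # never touch the real game
        qs.append(evaluate(simulate_game, coa))
        coas.append(coa)
    best = max(range(len(qs)), key=qs.__getitem__)   # index of the maximum q
    return replay[best], coas[best]


# Hypothetical domain: instruction -> alternative decompositions.
domain = {
    "attack the base": [["gather_army", "move_to_enemy", "attack"]],
    "build a barracks": [["collect_gold", "build_barracks"], ["build_barracks"]],
}
replay = [("attack the base", "attack"), ("build a barracks", "build")]


def get_bestcoa(pair, domain):
    # Stand-in for MCTS-HTN: choose the shortest decomposition.
    return min(domain[pair[0]], key=len)


def evaluate(game, coa):
    # Stand-in for the MCTS evaluation: each primitive action costs 10 gold.
    game["gold"] -= 10 * len(coa)
    return game["gold"]


real_game = {"gold": 100}
best_pair, best_coa = add_htn(replay, real_game, domain, get_bestcoa, evaluate)
```

Because each candidate is scored on a deep copy, the real game state is unchanged until the selected plan is executed.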

After incorporating HTN planning, offline self-play is still conducted according to the flow shown in Figure 13 to generate high-quality training data and train the DNNs. The iteratively trained networks are named $Net\_htn\_i$, where $i$ denotes the iteration count.

As in the previous experiment, the $Net\_htn\_i$ network was played against the $Net\_init$ network and the $Net\_cont$ network, as well as against the $Net\_human\_i$ network. The experimental results are shown in Figure 15 and Table 3.

Figure 15.

Results of $Net\_htn\_i$ against $Net\_init$ and $Net\_cont$.

Table 3.

Adversarial results of $Net\_htn\_i$ against $Net\_human\_i$ (win, loss, and draw rates of Network 1).

Network 1	Network 2	Win	Loss	Draw
$Net\_htn\_1$	$Net\_human\_1$	56.7%	36.2%	7.1%
$Net\_htn\_2$	$Net\_human\_2$	57.2%	30.7%	12.1%
$Net\_htn\_3$	$Net\_human\_3$	58.4%	31.5%	10.1%
$Net\_htn\_4$	$Net\_human\_4$	60.1%	30.2%	9.7%
$Net\_htn\_5$	$Net\_human\_5$	62.3%	28.5%	9.2%
$Net\_htn\_6$	$Net\_human\_6$	65.7%	28.1%	6.2%

The trends of the curves in Figure 15 are very similar to those in Figure 14, but the performance of the $Net\_htn\_i$ network improves faster and reaches a higher upper limit. The win rates of the $Net\_htn\_i$ network in Table 3 are also higher than those of the $Net\_mcts\_i$ network in Table 2 in every generation. This indicates that adding HTN during self-play data collection yields better results than using MCTS alone to guide the neural network. In other words, incorporating HTN knowledge further enhances the quality of the self-play data, which supports the effectiveness of the proposed knowledge-driven forward search action sequence generation method.
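The per-generation advantage of the HTN variant can be computed from the win-rate columns of Tables 2 and 3. The sketch below only restates those published values; the names are illustrative.

```python
# Win rates (%) against Net_human_i for generations 1 to 6.
MCTS_WIN = [53.4, 55.5, 55.9, 56.2, 57.1, 59.7]   # Table 2
HTN_WIN = [56.7, 57.2, 58.4, 60.1, 62.3, 65.7]    # Table 3

# Gain in percentage points from adding HTN knowledge, per generation.
htn_gain = [round(h - m, 1) for h, m in zip(HTN_WIN, MCTS_WIN)]
```

The gain is positive in every generation and largest in the sixth, at 6.0 percentage points.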

Figure 16 shows a comparison of the training data volume required for the $Net\_mcts\_i$ network and the $Net\_htn\_i$ network to reach the performance of the $Net\_cont$ network. The y-axis represents the win-loss ratio of the two networks at different data volumes.

Figure 16.

Comparison of the data volume required to reach $Net\_cont$ performance.

From Figure 16, it can be seen that the $Net\_htn\_i$ network requires less training data to reach the performance level of the $Net\_cont$ network (a win-loss ratio of 1) than the network that uses only the MCTS method (i.e. the $Net\_mcts\_i$ network). It does so with 800 games, at the 4th iteration. This indicates that the HTN knowledge-driven MCTS forward search model effectively embeds expert knowledge into the deep neural network, allowing it to obtain post-game data of much higher quality than human demonstration data in adversarial training, which supports the proposed collaborative planning framework.

Effectiveness of the framework under incomplete information

Another feature of the MiniRTS platform is the ability to change the adversarial environment to be partially observable, meaning that the states received by both adversaries are not complete, as shown in Figure 17. The black parts represent unexplored unknown environments, while the gray parts represent areas that have been explored but are currently not visible. As the intelligent unit explores, the environment will gradually become visible.

Figure 17.

Schematic diagram of partially observable environments.

In this experiment, the $Net\_mcts\_6$ network and the $Net\_htn\_6$ network from the previous experiments are played against the $Net\_cont$ network and the $Net\_human\_6$ network in different partially observable environments. The results are shown in Table 4. In the first four rows, the intelligent units can observe the states within 3 tiles around them; in the middle four rows, within 5 tiles; and in the last four rows, within 7 tiles.
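The three tile states described above (unexplored, explored but not currently visible, and visible) can be illustrated with a small fog-of-war sketch. This is not MiniRTS code: the square neighbourhood, the grid size, and the function names are assumptions made for illustration.

```python
def update_visibility(explored, units, radius):
    """Return the currently visible mask and mark those tiles as explored."""
    height, width = len(explored), len(explored[0])
    visible = [[False] * width for _ in range(height)]
    for row, col in units:
        # Assumption: a unit sees a square of side 2 * radius + 1 around itself.
        for r in range(max(0, row - radius), min(height, row + radius + 1)):
            for c in range(max(0, col - radius), min(width, col + radius + 1)):
                visible[r][c] = True
                explored[r][c] = True
    return visible


def tile_status(explored, visible, r, c):
    if visible[r][c]:
        return "visible"
    # "explored" corresponds to the gray area, "unknown" to the black area.
    return "explored" if explored[r][c] else "unknown"


explored = [[False] * 10 for _ in range(10)]
update_visibility(explored, units=[(2, 2)], radius=3)            # unit starts here
visible = update_visibility(explored, units=[(8, 8)], radius=3)  # unit has moved
```

After the move, the tile at (2, 2) is explored but no longer visible, (8, 8) is visible, and (0, 9) is still unknown.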

Table 4.

Adversarial results table for partially observable environment.

Visible range	Network 1	Network 2	Win	Loss	Draw
3 tiles	$Net\_mcts\_6$	$Net\_cont$	51.7%	37.5%	10.8%
3 tiles	$Net\_mcts\_6$	$Net\_human\_6$	63.8%	21.2%	15.0%
3 tiles	$Net\_htn\_6$	$Net\_cont$	52.0%	36.7%	11.3%
3 tiles	$Net\_htn\_6$	$Net\_human\_6$	68.3%	24.5%	7.2%
5 tiles	$Net\_mcts\_6$	$Net\_cont$	50.3%	37.9%	11.8%
5 tiles	$Net\_mcts\_6$	$Net\_human\_6$	62.4%	21.4%	16.2%
5 tiles	$Net\_htn\_6$	$Net\_cont$	50.6%	38.4%	11.0%
5 tiles	$Net\_htn\_6$	$Net\_human\_6$	67.1%	25.6%	7.3%
7 tiles	$Net\_mcts\_6$	$Net\_cont$	48.2%	43.7%	8.1%
7 tiles	$Net\_mcts\_6$	$Net\_human\_6$	60.1%	25.7%	14.2%
7 tiles	$Net\_htn\_6$	$Net\_cont$	49.8%	41.5%	8.7%
7 tiles	$Net\_htn\_6$	$Net\_human\_6$	66.3%	28.5%	5.2%

From the results in Table 4, it can be seen that, compared with the previous three experiments, the win rates of the $Net\_mcts\_6$ network and the $Net\_htn\_6$ network do not decrease in the partially observable environment. In fact, the win rates are higher when the visible range is smaller, that is, when the environment is more strongly partially observable, indicating that networks trained according to this framework can effectively handle incomplete information. This suggests that the knowledge and data collaborative framework provides a feasible approach for planning in complex environments.
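The trend across visible ranges can be verified from Table 4. The sketch below restates the win-rate column grouped by matchup and visible range; the dictionary layout and names are illustrative.

```python
# Win rates (%) from Table 4, keyed by matchup and then by visible range in tiles.
TABLE_4_WIN = {
    ("Net_mcts_6", "Net_cont"): {3: 51.7, 5: 50.3, 7: 48.2},
    ("Net_mcts_6", "Net_human_6"): {3: 63.8, 5: 62.4, 7: 60.1},
    ("Net_htn_6", "Net_cont"): {3: 52.0, 5: 50.6, 7: 49.8},
    ("Net_htn_6", "Net_human_6"): {3: 68.3, 5: 67.1, 7: 66.3},
}

# Win rate falls as the visible range widens, for every matchup.
monotone = all(r[3] > r[5] > r[7] for r in TABLE_4_WIN.values())

# Percentage points gained when moving from a 7-tile to a 3-tile range.
gain_when_narrower = {
    pair: round(r[3] - r[7], 1) for pair, r in TABLE_4_WIN.items()
}
```

For all four matchups the win rate is highest at the 3-tile range, which is the setting with the least information.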

Although the framework proposed in this manuscript has only been experimentally validated in the MiniRTS environment, it can in principle be extended to most game environments and to many real-world problems. For different environments and problems, the only part of the framework that needs to be modified is the writing and encoding of HTN domain knowledge. If a problem can be solved by training a neural network and HTN domain knowledge is available for it, the framework can quickly combine knowledge-based and data-driven algorithms, thereby enhancing the problem-solving capability. If the environment complexity is very high, the framework can also make the neural network easier to train to convergence by adding domain knowledge. How to balance knowledge and data in this way will be an important area of future research.

Conclusion

Studies have shown that decision-making is one of the top five directions for current researchers.³⁷ In this manuscript, we propose a new decision-making framework that combines knowledge-driven and data-driven approaches, using the MCTS method as a connector between HTN and DL. It provides a new approach to decision-making in a large decision space with a short response time. We conducted experiments on the MiniRTS environment and found that, compared with networks trained on high-quality replay data collected by humans, networks trained with our proposed framework achieved equivalent performance with less than 20% of the training data. This indicates that HTN, as a carrier of expert knowledge, can be combined effectively with DL with the help of MCTS. Our proposed framework provides researchers with a knowledge and data-driven planning approach and offers new research perspectives for efficient decision-making.

Our proposed method has two limitations. First, training the initial network still relies on a small amount of high-quality data collected by humans, which lies outside the scope of the proposed framework. Further investigation is needed to train the initial network without relying on human-collected data at all. Second, compiling HTN domain knowledge still requires the involvement of domain experts, who may not be proficient in writing HTN code. If domain knowledge cannot be encoded effectively, the quality of the self-play data will suffer. Therefore, a key direction for future research is to explore how to better represent expert knowledge in HTN form.

Beyond addressing these two limitations, future research can focus on two aspects. First, the model's generalization and interpretability could be improved by exploring whether a single neural network can adapt to multiple planning scenarios with the addition of HTN knowledge, and by interpreting the output of the neural network from a hierarchical perspective. Second, sample efficiency could be improved by developing strategies for selecting high-quality self-play samples, since the quality of these samples varies. This would speed up network training and improve its efficiency.

Footnotes

ORCID iDs

Tianhao Shao

Ke Zhang

Kai Cheng

Hongjun Zhang

Author contribution statement

Tianhao Shao: Conceptualization, Methodology, Writing - original draft. Ke Zhang: Conceptualization, Methodology, Writing - review & editing. Kai Cheng: Data curation, Software, Funding support. Hongjun Zhang: Validation, Supervision.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (No. 61806221) and the Young Scientists Fund of Army Engineering University of PLA.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

The underlying data used in this study are publicly available and can be accessed via [ data.tgz]. The new data generated in this study can be obtained from the corresponding author upon reasonable request.

References

1. Fei, Zhang, et al. Complete region of interest for unconstrained palmprint recognition. IEEE Trans Image Process 2024; 33: 3662–3675.
2. Zhou, Cheng, et al. Deep learning methods for medical image fusion: A review. Comput Biol Med 2023; 160: 106959.
3. Zhu, Lei, et al. Vec-Tok speech: speech vectorization and tokenization for neural speech generation. IEEE Trans Audio Speech Lang Process 2025; 33: 1243–1254.
4. Tanveer, Rastogi, Paliwal, et al. Ensemble deep learning in speech signal tasks: A review. Neurocomputing 2023; 550: 126436.
5. Liu, Huang, et al. Aligning, autoencoding and prompting large language models for novel disease reporting. IEEE Trans Pattern Anal Mach Intell 2025; 47: 3332–3343.
6. OpenAI. GPT-4 technical report. 2023, arXiv:2303.08774.
7. Silver, Huang, Maddison, et al. Mastering the game of go with deep neural networks and tree search. Nature 2016; 529: 484–489.
8. Zhao, Wang. Understanding human and machine interaction from decision perspective: an empirical study based on the game of go. J Syst Sci Complex 2024; 37: 647–667.
9. Vinyals, Babuschkin, Czarnecki, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019; 575: 350–354.
10. Chen, Zhao, et al. Supervised learning achieves human-level performance in MOBA games: a case study of honor of kings. IEEE Trans Neural Netw Learn Syst 2022; 33: 908–918.
11. Amitai, Amir, Avni. ASQ-IT: interactive explanations for reinforcement-learning agents. Artif Intell 2024; 335: 104182.
12. Wang, Liu, Zhao, et al. Review on hierarchical task network planning under uncertainty. Acta Autom Sin 2016; 42: 655–667.
13. Kumar, Baz, Alhakami, et al. A hybrid fuzzy rule-based multi-criteria framework for sustainable-security assessment of web application. Ain Shams Eng J 2021; 12: 2227–2240.
14. Kumar, Pandey, Baz, et al. Fuzzy-based symmetrical multi-criteria decision-making procedure for evaluating the impact of harmful factors of healthcare information security. Symmetry (Basel) 2020; 12: 664.
15. Kumar, Baz, Alhakami, et al. A hybrid model of hesitant fuzzy decision-making analysis for estimating usable-security of software. IEEE Access 2020; 8: 72694–72712.
16. Kumar, Zamar, Alenezi, et al. Measuring security durability of software through fuzzy-based decision-making process. Int J Comput Intell Syst 2019; 12: 627–642.
17. Horcas, Galindo, Heradio, et al. A Monte Carlo tree search conceptual framework for feature model analyses. J Syst Softw 2023; 195: 111551.
18. Cao, Zhou, et al. Multi-cloud service provision based on decision tree and two-layer restricted Monte Carlo tree search. Internet Things 2023; 22: 100751.
19. Yarats, Gong, et al. Hierarchical decision making by generating and following natural language instructions. In: Proc. International Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019, pp. 10025–10034.
20. Wang. Grounded reinforcement learning: learning to win the game under human commands. In: Proc. International Conference on Neural Information Processing Systems (NeurIPS), Virtual, 2022, pp. 1–28.
21. Chen, Gupta, Marino. Ask your humans: using human instructions to improve generalization in reinforcement learning. In: Proc. International Conference on Learning Representations, Virtual, 2021, pp. 1–22.
22. Winograd. Understanding natural language. Cognit Psychol 1972; 3: 1–191.
23. Ruis, Andreas, Baroni, et al. A benchmark for systematic generalization in grounded language understanding. In: Proc. International Conference on Neural Information Processing Systems (NeurIPS), Virtual, 2020, pp. 19861–19872.
24. Chevalier-Boisvert, Bahdanau, Lahlou, et al. BabyAI: First steps towards grounded language learning with a human in the loop. In: Proc. International Conference on Learning Representations, 2019.
25. Zhao, et al. A novel UAV visual navigation method using online customizable image reference. IEEE Geosci Remote Sens Lett 2024; 21: 1–5.
26. Barhaghtalab, Sepestanaki, Mobayen, et al. Design of an adaptive fuzzy-neural inference system-based control approach for robotic manipulators. Appl Soft Comput 2024; 149: 110970.
27. Tang, Wang. Emergency response action plan development based on hierarchical task network planning. Manage Rev 2016; 28: 43–50.
28. Liu, Jiang, et al. A novel hierarchical task network planning approach for multi-objective optimization. Expert Syst Appl 2024; 251: 124058.
29. Zhuo, Chen, et al. Hierarchical task network-enhanced multi-agent reinforcement learning: toward efficient cooperative strategies. Neural Netw 2025; 186: 107254.
30. Laaveri. Integrating AI for turn-based 4X strategy game. Finland: Helsinki Metropolia University of Applied Sciences, 2017.
31. Wichlacz, Torralba, Hoffmann. Construction-planning models in minecraft. In: Proc. ICAPS Workshop on Hierarchical Planning, California, USA, 2019, pp. 1–5.
32. Wang, Huang, et al. Hierarchical task planning for power line flow regulation. CSEE J Power Energy Syst 2024; 10: 29–40.
33. Shao, Zhang, Cheng, et al. The hierarchical task network planning method based on Monte Carlo tree search. Knowl Based Syst 2021; 225: 107067.
34. Mannucci, Zimmermann, Frese. Extending reward-based hierarchical task network planning to partially observable environments. In: 2024 10th International Conference on Automation, Robotics and Applications (ICARA), Athens, Greece, 2024, pp. 178–184.
35. Olz, Bercher. A look-ahead technique for search-based HTN planning: reducing the branching factor by identifying inevitable task refinements. In: Proc. 16th International Symposium on Combinatorial Search (SoCS), Prague, Czech, 2023, pp. 65–73.
36. Browne, Powley, Whitehouse, et al. A survey of Monte Carlo tree search methods. IEEE Trans Comput Intell AI Games 2012; 4: 1–43.
37. Abad-Segura, González-Zamar, Squillante. Examining the research on business information-entropy correlation in the accounting process of organizations. Entropy 2021; 23: 1493.

A decision-making framework using MCTS as a hierarchical task network and deep learning connector
