Abstract
Dynamic job shop scheduling problems with multiple order disturbances present significant challenges in manufacturing systems. This paper proposes a novel approach using Independent Proximal Policy Optimization (IPPO), a multiagent deep reinforcement learning algorithm, to address these challenges. We introduce a five-channel two-dimensional image to represent system states and design a reward function that minimizes both total tardiness and makespan. Experimental results across 72 diverse production scenarios demonstrate that our IPPO-based approach outperforms traditional deep reinforcement learning algorithms and dispatching rules in most cases. The proposed method shows strong optimization and exploration capabilities, offering a promising solution for complex, multiobjective scheduling in dynamic manufacturing environments.
Introduction
Production scheduling is crucial for adapting quickly to market changes, managing efficient production processes, and accommodating diverse enterprise requirements. Most scholars have focused on the static job shop scheduling problem (JSSP). 1 However, real-world production environments are subject to uncertain disturbances, such as random order arrivals, order cancellations, and due date changes, which can disrupt the execution of the original scheduling plan. 2 In-depth research on dynamic production scheduling under multiple order disturbances is therefore essential to respond to these disturbances in a timely manner and to meet the current production needs of enterprises.
Recent research on dynamic scheduling with multiple order disturbances primarily uses heuristic algorithms3–6 and dispatching rules. 7 Zhang et al. 8 proposed a hybrid intelligent algorithm that combined genetic and tabu search algorithms for production scheduling under dynamic order arrivals and equipment breakdowns. Gao et al. 9 proposed an improved two-stage artificial bee colony algorithm for the dynamic scheduling problem under the uncertain disturbances of new order arrivals and variable processing times. In addition, dispatching rules have been researched extensively. Sharma et al. 10 developed nine dispatching rules to address the stochastic dynamic JSSP (DJSSP) with sequence-dependent setup times. Teymourifar et al. 11 proposed a gene expression programming method combined with a simulation model to extract effective dispatching rules for dynamic production scheduling problems. However, most research on order disturbances focuses on single, one-time disturbances rather than comprehensively studying multiple disturbances at a larger scale. In terms of algorithmic approaches, heuristic algorithms achieve high solution accuracy at the cost of long computation times; conversely, dispatching rule-based methods offer fast computation, but finding a generalized dispatching rule adaptable to most scenarios remains a significant challenge.
Deep reinforcement learning (DRL) algorithms have been widely used for dynamic production scheduling because of their robust perception and decision-making capabilities. Wang et al. 12 applied the Proximal Policy Optimization (PPO) algorithm to address the DJSSP under machine breakdowns and job rework, where the system state was represented by a set of three-channel two-dimensional images. Yuan et al. 13 introduced the PPO algorithm to tackle the JSSP, employing the Invalid Action Mask technique to reduce the search space. Huang et al. 14 developed a method that combines a Graph Neural Network with the PPO algorithm to solve the distributed production scheduling problem. Wu et al. 15 examined the DJSSP with uncertain processing times and proposed a DRL method based on PPO and hybrid priority experience replay for training the agent. Han et al. 16 combined a convolutional neural network (CNN) with DRL for the production scheduling problem and designed three two-dimensional matrices to represent the system state.
Order disturbances, as significant dynamic disruptions, have been extensively studied. Yang et al. 17 investigated intelligent scheduling with dynamic job arrivals and the reconstruction of reconfigurable flow lines using DRL. Luo 18 proposed a DQN algorithm to address the production scheduling problem under dynamic job arrivals. Liu et al. 19 established a hierarchical and distributed architecture to tackle the dynamic flexible JSSP with constant job arrivals. Wang et al. 20 presented the PPO algorithm to solve the DJSSP with random job arrivals. Zhao et al. 21 proposed a PPO algorithm based on an attention strategy network to address the JSSP with dynamically arriving jobs. Experimental results indicate that the proposed algorithms exhibit superior performance.
In summary, DRL has yielded notable results in dynamic production scheduling; however, its application in scenarios involving multiple order disturbances remains underexplored. Regarding research methodologies, the PPO algorithm has been widely applied in production scheduling because it directly optimizes the action policy without needing to compute state values, making it well suited to production scheduling problems characterized by path-optimization properties. Additionally, the PPO algorithm employs a mechanism to limit the magnitude of policy updates, which addresses issues related to training stability and efficiency. However, this algorithm tends to be less exploratory and may fall into local optima. The Independent Proximal Policy Optimization (IPPO) algorithm22,23 is a multiagent DRL approach based on PPO, where agents collaborate during training while collecting data independently. This greatly enhances the diversity and reduces the correlation of the training data, balancing the stability and exploratory nature of training, and is better suited for dynamic scheduling problems involving multiple order disturbances.
This paper explores a DRL method based on IPPO to address the DJSSP under multiple order disturbances. The main contributions of this study are as follows: (1) the introduction of the multiagent DRL algorithm IPPO to tackle scheduling problems. Although many DRL algorithms have been applied to production scheduling problems, most suffer from training instability or insufficient exploration. This paper presents IPPO, a multiagent algorithm that has not been previously explored in this context; (2) a solution for multiobjective dynamic scheduling problems under multiple order disturbances; and (3) a comprehensive analysis and comparison of experimental results across various parameter settings and scenarios.
Independent Proximal Policy Optimization scheduling framework
In this paper, an IPPO algorithm is proposed to address the DJSSP under multiple order disturbances. The IPPO algorithm is a multiagent DRL approach, where each agent consists of two Actor-Critic (AC) networks: one holds the new policy and the other maintains the old policy. The structure of both AC networks is identical, with CNN layers and fully connected layers. To address the dynamic changes in the state-image size caused by multiple order disturbances, a Spatial Pyramid Pooling (SPP) 24 layer is added between the last convolutional layer and the first fully connected layer to ensure a consistent output size from the CNN. The application of the IPPO algorithm to production scheduling involves a Semi-Markov Decision Process (SMDP), where agents continuously interact with the production environment during offline training, and the trained agents then solve various problems online. The DRL framework is divided into two phases, offline training and online application, as illustrated in Figure 1.

Figure 1. Scheduling framework with Independent Proximal Policy Optimization (IPPO).
In the offline training phase, the IPPO algorithm trains each individual agent separately while using a parameter-sharing mechanism. Each agent collects data with the old policy network, calculates the advantage function and system state value, and stores these data in a common storage queue. During training, these data are sampled for model updates. The IPPO algorithm trains and updates the parameters of the AC network containing the new policy and then copies the new policy network's parameters to the old policy network.
The online application phase mainly utilizes the offline-trained model to solve new problems. Although the agent requires a long offline training time, once training is complete, the optimal result for a new problem can be obtained in a very short time, requiring only simple forward computations rather than the lengthy recalculation demanded by traditional heuristic algorithms.
Independent Proximal Policy Optimization for scheduling
Problem formulation
The multiobjective dynamic scheduling problem in intelligent workshops under multiple order disturbances can be described as follows: the production system has N orders to be processed on M machines, where each order comprises a sequence of operations, each of which must be processed on a designated machine for a given processing time.
The notations and indices used for problem formulation are listed in Table 1.
Notations and indices for problem formulation.
The mathematical description of the multiobjective dynamic scheduling problem is presented in equations (1)–(6):
Equation (1) is to minimize the weighted sum of the total tardiness and the makespan, where the weight coefficients reflect the relative importance of the two objectives.
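Although the original equations are not reproduced here, the weighted objective of equation (1) plausibly takes the following standard form, assuming $w_1$ and $w_2$ are the two weight coefficients, $C_i$ and $D_i$ the completion time and due date of order $i$, and $C_{\max}$ the makespan:

```latex
\min F \;=\; w_1 \sum_{i=1}^{N} \max\left(C_i - D_i,\ 0\right) \;+\; w_2\, C_{\max},
\qquad w_1 + w_2 = 1
```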
Principles of IPPO
The PPO is a policy-based reinforcement learning algorithm derived from improvements to the trust region policy optimization algorithm. 25
The PPO achieves performance improvement by constraining the distributional variance between new and old policies, thus ensuring monotonic policy updates. Moreover, it addresses the challenge of low data utilization in the traditional policy gradient algorithm by employing small-batch updates. The PPO has two main variants, which differ in their approach for restricting changes between new and old policies: the PPO-penalty version and the PPO-clip version. This study focuses on the PPO-clip version, which employs a clipping function to limit the extent of policy changes. It prevents algorithmic instability stemming from excessive policy changes. The objective function is represented using equation (7):
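The equation itself is not reproduced here; for reference, the PPO-clip objective of equation (7) conventionally takes the form

```latex
L^{\mathrm{CLIP}}(\theta) \;=\; \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad r_t(\theta) \;=\; \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

where $\hat{A}_t$ is the estimated advantage and $\epsilon$ the clipping coefficient; the paper's own notation may differ slightly.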
Essentially, the PPO-clip algorithm limits the difference between the new and old policies to the range [1 − ε, 1 + ε], where ε is the clipping coefficient that controls how far the new policy may deviate from the old one.
The IPPO algorithm is a multiagent DRL framework that represents an application of the PPO algorithm in the multiagent domain. As opposed to single-agent algorithms, multiagent algorithms facilitate the achievement of common objectives through collaborations among a group of agents. The IPPO algorithm is a decentralized extension of the PPO algorithm in the context of multiagent systems, where each PPO agent has its own AC network, enabling independent training and updates. The policy network of each agent in the IPPO algorithm is optimized using the objective function described in equation (8):
where the probability ratio and advantage estimate are computed by each agent from the data it collects independently.
Transformation between scheduling problems and algorithm design
Deep reinforcement learning requires three critical components to be designed in its application: system state, action space, and system reward.
System state feature description
The design of the system state features in this study is primarily based on the literature. 16 In that work, three global feature channels were established to represent the system state, while more sensitive local state features were overlooked. To represent the system state more accurately, this study optimizes and extends the production system state features to two global channels and three local channels. All channels are two-dimensional matrices, where rows represent operations and columns represent orders. The first global feature channel is the channel of operations to be processed, with its initial values set to the processing times of all operations; once an operation is completed, the corresponding position in this channel is updated to 0. The second global feature channel is the completed-operation channel, initially set to 0; it is updated to the processing time once an operation is finished. The first, second, and third local feature channels represent, respectively, the remaining processing time of the operation currently being processed, the processing times of the operations waiting in the queue, and the waiting times of the operations in the queue.
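As a minimal sketch of how these five channels could be assembled (the matrix names and the simulator bookkeeping below are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def build_state(proc_time, completed, in_progress_remaining,
                queued_proc_time, queued_wait_time):
    """Assemble the five-channel state image.

    Every argument is a (max_ops x n_orders) matrix, following the layout
    rows = operations, columns = orders described above.
    """
    ch1 = np.where(completed, 0.0, proc_time)   # global: operations still to be processed
    ch2 = np.where(completed, proc_time, 0.0)   # global: operations already completed
    ch3 = in_progress_remaining                 # local: remaining time of operations in process
    ch4 = queued_proc_time                      # local: processing time of queued operations
    ch5 = queued_wait_time                      # local: waiting time of queued operations
    return np.stack([ch1, ch2, ch3, ch4, ch5])  # shape: (5, max_ops, n_orders)
```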
Figure 2 illustrates the evolution of the system state across each channel in a 3 × 3 job shop scheduling instance. In the initial state, the first global channel contains the processing times of all operations, while the remaining channels are zero.

Figure 2. System state transition process.
It should be noted that, to enhance the feature extraction performance of the neural network, this study uses a four-frame joint representation (the four most recent state images stacked together) to represent the state of the production system, which serves as the input to the CNN.
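A small sketch of such frame stacking (the class and the padding behavior at episode start are illustrative assumptions):

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keep the four most recent five-channel state images and concatenate
    them along the channel axis, yielding a (4 * 5, max_ops, n_orders) input."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def push(self, state):
        if not self.frames:                        # pad with the first frame at episode start
            self.frames.extend([state] * self.frames.maxlen)
        self.frames.append(state)
        return np.concatenate(list(self.frames), axis=0)
```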
Action space
In DRL algorithms for production scheduling, the action space consists of production dispatching rules. In this study, 16 production dispatching rules were selected as the action space for the DRL algorithm. These are listed as follows: shortest processing time (SPT), longest processing time (LPT), least work remaining, most work remaining, shortest subsequent operation (SSO), longest subsequent operation (LSO), shortest remaining operation except current, longest remaining operation except current, first in first out, earliest due date, minimum sum of current and subsequent operation (SPT + SSO), maximum sum of current and subsequent operation (LPT + LSO), minimum ratio of current operation to total operations (SPT/TWK), maximum ratio of current operation to total operations (LPT/TWK), minimum product of current operation and total operations (SPT × TWK), and maximum product of current operation and total operations (LPT × TWK). The diversity of dispatching rules is increased to enable the agent to fully learn to select dispatching rules adaptively.
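As an illustration, an agent's discrete action can be mapped to a rule that picks the next operation from a machine queue. The sketch below shows a few of the 16 rules; the operation attributes (proc_time, arrival_time, due_date, total_work) are hypothetical simulator fields, not the paper's implementation:

```python
def select_next_operation(queue, rule):
    """Apply one dispatching rule to the operations waiting in a machine queue."""
    if rule == "SPT":          # shortest processing time
        return min(queue, key=lambda op: op.proc_time)
    if rule == "LPT":          # longest processing time
        return max(queue, key=lambda op: op.proc_time)
    if rule == "FIFO":         # first in, first out
        return min(queue, key=lambda op: op.arrival_time)
    if rule == "EDD":          # earliest due date
        return min(queue, key=lambda op: op.due_date)
    if rule == "SPT*TWK":      # minimum product of current and total work
        return min(queue, key=lambda op: op.proc_time * op.total_work)
    raise ValueError(f"unknown dispatching rule: {rule}")
```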
Reward function
For the multiobjective scheduling problem, this paper considers two objective functions: minimizing the total tardiness and minimizing the makespan. The reward function for minimizing the total tardiness is expressed as follows:
In the above formula, the reward is aligned with the tardiness objective, so that actions reducing the estimated total tardiness of the orders receive larger rewards.
The reward function with the minimum makespan is expressed as follows:
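Because the two reward formulas are not reproduced here, the sketch below only illustrates one common design consistent with the stated objectives: rewarding the reduction in estimated total tardiness and penalizing growth of the makespan between consecutive decision points, weighted by the objective weights w1 and w2. The paper's exact formulas may differ.

```python
def step_reward(prev_est_tardiness, est_tardiness,
                prev_makespan, makespan, w1=0.5, w2=0.5):
    """Weighted reward at a decision point (illustrative, not the paper's formula)."""
    r_tardiness = prev_est_tardiness - est_tardiness  # positive if the tardiness estimate drops
    r_makespan = prev_makespan - makespan             # negative if the schedule grows longer
    return w1 * r_tardiness + w2 * r_makespan
```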
Independent Proximal Policy Optimization algorithm process
The execution process of the IPPO algorithm follows an SMDP with decision-triggering events such as the completion of an operation on any machine, the arrival of a new order, order cancellation, changes in order due dates, and so on. The algorithmic procedure is illustrated in Algorithm 1:
Algorithm 1. IPPO algorithm.
Algorithm 1 shows the entire scheduling execution process. It starts with parameter and variable initialization and undergoes training for EP_MAX loops, eventually achieving adaptive action selection capabilities across different states. Each training loop comprises data collection and parameter training phases. Lines 4–16 describe the data collection process of each agent. Starting from the initial state, each agent interacts with the production environment at every decision point: it selects a dispatching rule according to the old policy, observes the resulting reward and the next system state, and stores the transition in the common storage queue until the scheduling episode terminates. The parameter training phase then updates the shared network parameters with the collected data and synchronizes the old policy with the new one.
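A condensed sketch of this loop is given below, assuming PyTorch; env_fn(), the policy's act()/evaluate() methods, and the process() helper that computes returns and advantages are hypothetical placeholders for the scheduling simulation and the usual PPO bookkeeping.

```python
import torch

def train_ippo(env_fn, new_policy, old_policy, optimizer, process,
               n_agents=8, ep_max=2000, k_epochs=10, clip_eps=0.2, ent_coef=0.02):
    """Offline IPPO training loop with shared parameters and a common buffer."""
    for episode in range(ep_max):
        buffer = []                                    # common storage queue
        for _ in range(n_agents):                      # each agent collects data independently
            env, done = env_fn(), False
            state = env.reset()
            while not done:
                with torch.no_grad():
                    action, logp = old_policy.act(state)
                next_state, reward, done = env.step(action)
                buffer.append((state, action, logp, reward, done))
                state = next_state
        states, actions, old_logps, returns, advantages = process(buffer)
        for _ in range(k_epochs):                      # PPO-clip update of the new policy
            logps, values, entropy = new_policy.evaluate(states, actions)
            ratio = torch.exp(logps - old_logps)
            surrogate = torch.min(ratio * advantages,
                                  torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
            loss = (-surrogate.mean()
                    + 0.5 * (returns - values).pow(2).mean()
                    - ent_coef * entropy.mean())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        old_policy.load_state_dict(new_policy.state_dict())  # copy new policy to old policy
```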
Simulation experiment and results
In this section, the data generation method outlined in Luo 18 is adopted to randomly generate multiple sets of training, validation, and testing data under various scenarios. The parameters utilized for generating simulation data are listed in Table 2.
Parameter settings in different scenarios.
In Table 2, the due date of order i is calculated from its arrival time, its total processing time, and the due date tightness factor.
Network structure and parameter settings
In this study, the IPPO algorithm employed an AC network architecture, where the actor policy network and the critic value network of each agent have an identical structure. Each network comprises two CNN layers, one SPP layer, and two fully connected layers. The CNN layers have kernel sizes of 4 × 4 and 2 × 2, with strides of 2 and 1, and output channel numbers of 40 and 80, respectively. The first fully connected layer has 512 nodes, while the number of nodes in the second fully connected layer of the actor network corresponds to the size of the action space. The second fully connected layer of the critic network contains one node, which outputs the system state value. The CNN convolutional layers use the tanh activation function, while the fully connected layers use the ReLU activation function. The neural network parameters are trained and updated using the Adam optimizer. Since the system state space, action space, and optimization objectives of each agent are identical, a parameter-sharing strategy is adopted to reduce training complexity: different agents share the same set of network parameters, making training more stable.
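A PyTorch sketch of such a network is shown below. For brevity it uses a shared convolutional trunk with separate actor and critic heads rather than two fully separate networks; the SPP pyramid levels (4, 2, 1) are an assumption, and the input channel count of 20 assumes the four-frame stack of five-channel images.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPP(nn.Module):
    """Spatial pyramid pooling: pool the feature map to fixed grids so the
    flattened feature length is independent of the state-image size."""
    def __init__(self, levels=(4, 2, 1)):
        super().__init__()
        self.levels = levels

    def forward(self, x):
        pooled = [F.adaptive_max_pool2d(x, size).flatten(1) for size in self.levels]
        return torch.cat(pooled, dim=1)

class ActorCritic(nn.Module):
    """Two conv layers (tanh), an SPP layer, and fully connected layers (ReLU)."""
    def __init__(self, in_channels=20, n_actions=16, levels=(4, 2, 1)):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 40, kernel_size=4, stride=2)
        self.conv2 = nn.Conv2d(40, 80, kernel_size=2, stride=1)
        self.spp = SPP(levels)
        feat_len = 80 * sum(l * l for l in levels)   # fixed length regardless of image size
        self.fc = nn.Linear(feat_len, 512)
        self.actor_head = nn.Linear(512, n_actions)  # action probabilities
        self.critic_head = nn.Linear(512, 1)         # system state value

    def forward(self, x):
        x = torch.tanh(self.conv1(x))
        x = torch.tanh(self.conv2(x))
        x = F.relu(self.fc(self.spp(x)))
        return torch.softmax(self.actor_head(x), dim=-1), self.critic_head(x)
```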
In the training process of the IPPO, parameter configuration plays a crucial role. This study evaluated parameter sensitivity in a production scenario characterized by five machines, 10 order disturbances, a due date tightness of 1.0, an average order-disturbance time interval of 100, and weight coefficients of 0.5 for both total tardiness and makespan in the multiobjective function. The main parameters validated were the training batch size, the entropy coefficient, the number of training iterations per dataset, and the learning rate. A total of 2000 training episodes were carried out, and the results are depicted in Figure 3.

Figure 3. Validation results for each hyperparameter: (a) training batch size, (b) entropy coefficient, (c) number of training iterations per dataset, and (d) learning rate.
The horizontal axis in Figure 3 represents the number of training episodes, while the vertical axis is the value of the objective function. It can be observed that the optimal performance is achieved with a batch size of 16, an entropy coefficient of 0.02, 10 training iterations per dataset, and a learning rate of 0.0001.
Based on the aforementioned experimental results, the following parameters were finally determined for training the IPPO algorithm model, as shown in Table 3.
Setting of IPPO hyperparameters.
Training process of IPPO algorithm
Independent Proximal Policy Optimization aims to train a model with extensive generalization ability, and the trained model is tested using test data. This study classifies production scenarios based on the weight coefficients of the multiobjective function, the number of machines, and the time interval between order disturbances. For each scenario, 12 groups of training and validation data were randomly generated. The model employed eight agents and underwent 2000 training episodes. Validation data were used to assess the performance of the model after each episode, and the optimal model was selected based on the validation results. Lastly, 30 sets of test data were randomly generated to evaluate the trained model. Figure 4 illustrates the training process with a total tardiness weight of 0.7, a makespan weight of 0.3, five machines, an order-disturbance time interval of 100, 10 disturbances, and a due date tightness of 1.0.
Figure 4(a)–(c) shows the variation of the total objective value, the variation of the rewards, and the changes in the objective values of the validation data, respectively. The training curves for the reward value and the total objective value were closely aligned, exhibiting a high degree of correlation.

Figure 4. Training process of the Independent Proximal Policy Optimization (IPPO) algorithm: (a) average total objective value, (b) average reward value, and (c) average validation value.
Comparison of experimental results
To validate the effectiveness and generalization capabilities of the IPPO model, a comprehensive comparison was made among the IPPO algorithm, the classical PPO algorithm, and the 16 dispatching rules. Test data were randomly generated according to the parameter settings in Table 2. A total of 72 diverse production scenarios were designed for algorithmic comparison, accounting for factors such as the number of machines, the time interval between order disturbances, the number of order disturbances, the due date tightness, and the weight coefficients of the multiobjective function. Under each production scenario, 30 sets of data were randomly generated, with the weight coefficients for total tardiness and makespan set to three combinations: 0.3 and 0.7; 0.5 and 0.5; and 0.7 and 0.3, respectively. Tables 4–6 show the test results. Because of the large amount of test data, only the integer part of each value is displayed, with the optimal value for each dataset highlighted in bold.
Test results with weight coefficients of 0.3 and 0.7.
Test results with weight coefficients of 0.5 and 0.5.
Test results with weight coefficients of 0.7 and 0.3.
These results demonstrate the excellent performance of the IPPO algorithm in different scenarios, indicating that the algorithm has learned to adaptively select actions based on different system states. Compared with the PPO algorithm, the IPPO algorithm demonstrates superior performance across various scenarios, showcasing its robust generalization capability and proficiency in selecting optimal actions in diverse system states. Furthermore, the IPPO algorithm integrates a multiagent structure with a shared experience pool mechanism, enhancing its exploratory capacity. However, the IPPO algorithm does not yield favorable results in all scenarios; this is primarily due to limitations in the design of the system state, action space, and reward function. The test results also clearly show that finding a single dispatching rule that consistently performs well in different environments is challenging.
Discussion
From the above analysis, it is evident that the trained IPPO model is the most efficient, consistently yielding the best results in the majority of scenarios. First, the optimal hyperparameters were selected through parameter sensitivity experiments. Subsequently, the SPP layer was incorporated into the network structure for model training, taking into account the unique characteristics of multiple order disturbances. The performance of the trained model was then compared in detail with the PPO algorithm and traditional dispatching rules. The IPPO model consistently demonstrated superior performance across most scenarios.
The effectiveness of the algorithm proposed in this paper can be attributed to several key factors. First, the five-channel system state incorporates both global features and critical local features, and, as illustrated by the training curves in Figure 4, the designed reward function aligns well with the scheduling objectives. Second, to address the challenge of numerous parameters and the training difficulty associated with DRL algorithms, a parameter-sharing strategy is implemented, significantly reducing the training complexity. Third, the IPPO algorithm operates as a multiagent algorithm in which agents collaborate primarily through an experience-sharing mechanism; by leveraging shared experiences among multiple agents, this collaboration and exploration substantially decrease the correlation of the training data, thereby enhancing training efficacy. This effectively resolves the inherent limitation of traditional PPO algorithms, which, although stable during training, frequently exhibit inadequate exploratory capability. Moreover, the state space in dynamic scheduling problems is large, and minor changes between adjacent system states can yield identical extracted features, resulting in unstable training and a tendency to converge to local optima; the multiframe joint representation used in this study to express the system state improves training stability.
It is evident that the DRL algorithm does not yield favorable results in all scenarios. This is primarily due to limitations in the design of the system state, action space, and reward function. In large-scale and complex dynamic scheduling problems, the state space is vast, making it crucial to identify a general and accurate method for representing the system state. Additionally, a well-defined, high-quality action space can significantly enhance the algorithm's effectiveness.
Conclusion
This study introduces an IPPO-based approach for solving DJSSPs under multiple order disturbances. Our method, utilizing a novel state representation and reward function design, demonstrates superior performance across a wide range of production scenarios compared with traditional PPO algorithms and dispatching rules. The proposed approach shows strong optimization and exploration capabilities, offering a promising solution for complex, multiobjective scheduling in dynamic manufacturing environments. However, limitations remain, such as the computational complexity of the method for very large-scale problems and its limited ability to address more complex production environments.
Future work should focus on addressing more intricate production disturbances, such as uncertain processing times and machine failures, and on improving the scalability of the approach for larger manufacturing systems, especially on improving the design of system states, action spaces, and reward functions in complex production environments.
Footnotes
Conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
