Sage Journals: Discover world-class research

Abstract

Batch processing machines are often the bottleneck in semiconductor manufacturing, and their scheduling plays a key role in production management. Pioneering research on multi-objective batch machine scheduling has focused mainly on evolutionary algorithms, which fail to meet the demands of online scheduling. To deal with the challenges posed by incompatible job families, dynamic job arrivals, capacitated machines, and multiple objectives, we propose a clustering-aided multi-agent deep reinforcement learning approach (CA-MADRL) for the scheduling problem. Specifically, to obtain diverse nondominated solutions, an offline multi-objective scheduling algorithm named Multi-Subpopulation fast elitist Non-Dominated Sorting Genetic Algorithm (MS-NSGA-II) is first developed to obtain the Pareto fronts, and a clustering algorithm based on cosine distance is employed to analyze the distribution of the Pareto frontier solutions, which is then used to guide reward function design in multi-agent deep reinforcement learning. To realize multi-objective optimization, several reinforcement learning base models are trained for different optimization directions, each composed of a batch forming agent and a batch scheduling agent. To reduce the time cost of model training, a parameter sharing strategy is introduced between the different reinforcement learning base models. Validation on 16 instances designed from actual production data of a semiconductor manufacturing company demonstrates that the approach not only meets the high-frequency scheduling requirements of manufacturing systems with parallel batch processing machines but also effectively reduces total job tardiness and machine energy consumption.

Keywords

Parallel batch processing machines, dynamic scheduling, multi-objective optimization, parameter sharing strategy, reinforcement learning

Introduction

Batch Processing Machines (BPMs) are widely used in semiconductor manufacturing systems to improve productivity by handling multiple lots simultaneously. Many processes in semiconductor manufacturing, such as cleaning, diffusion, oxidation, etching, and ion implantation, are efficiently handled by BPMs.^1–3 Moreover, BPMs often represent a critical bottleneck in manufacturing systems due to their high acquisition cost and long processing time. The implementation of effective scheduling strategies for BPMs can yield substantial improvements in on-time delivery and a reduction in production costs.^4–6 The current developments in scheduling BPMs are usually classified as single BPM versus parallel BPMs, incompatible versus compatible job families, identical versus non-identical job sizes, identical versus arbitrary release times, and capacitated versus non-capacitated machines.⁷ The scheduling of the parallel BPMs for semiconductor wafer fabrication studied in this paper involves incompatible job families, non-identical job sizes, arbitrary release times and capacitated machines to maximally approximate the real scenario. To be specific, dynamically arrived jobs are first formed into batches based on job families. The size of each batch cannot exceed the capacity of the BPM, and the processing time of each batch is equal to the processing time of the largest job within the batch. In wafer fabrication systems, on-time order delivery and production cost are two very important optimization objectives, and in this paper we consider both minimizing the total tardiness as well as the overall energy consumption.⁸

Over the years, researchers have proposed various methods to solve the parallel BPMs scheduling problem, including heuristic rules, evolutionary algorithms, and reinforcement learning (RL). Ikura and Gimple⁹ first proposed the use of heuristic rules to solve the scheduling problem of BPMs. They introduced priority rules for the $1 \mid r_{j}, p\text{-batch}, p_{j} = p \mid C_{\max}$ problem to minimize the makespan. Uzsoy¹⁰ addressed the $1 \mid p\text{-batch}, s_{j} \mid C_{\max}$ problem by sorting the jobs based on the longest processing time (LPT). Vepsalainen and Morton¹¹ developed an Apparent Tardiness Cost (ATC) heuristic rule for scheduling a single BPM with minimizing $\sum w_{i} T_{i}$ as the performance objective. Building on the work of Uzsoy,¹⁰ Dupont et al.¹² proposed two additional methods, BFLPT and SKP. Of these, SKP has a higher computational cost but yields better scheduling solutions. Li et al.¹³ proposed several heuristic rules based on the best-fit longest processing time (BFLPT) for unrelated parallel batch processing scheduling, aiming to minimize makespan. Heuristic rules are typically designed from experience in static environments, without considering the changes and uncertainties of dynamic environments. While they can effectively assign priorities to jobs and respond quickly to dynamically arriving jobs, they explore only a limited part of the solution space and cannot provide diverse solutions for multi-objective scheduling problems.

When facing more complex or multi-objective scheduling problems, evolutionary algorithms such as genetic algorithms,¹⁴ particle swarm optimization,¹⁵ ant colony optimization,¹⁶ greedy algorithms,¹⁷ etc., can lead to better performance in obtaining scheduling solutions. Damodaran et al.¹⁸ proposed a greedy randomized adaptive search procedure (GRASP) to minimize makespan for the single BPM scheduling problem with non-identical job sizes. Their results showed that this approach outperformed simulated annealing and genetic algorithms. Lu et al.¹⁹ designed a hybrid multi-objective optimization algorithm integrating iterative greedy and efficient local search for solving an energy-efficient scheduling problem for a distributed flow shop in a heterogeneous factory, with the objective of minimizing the completion time and total energy consumption. Shahvari and Logendran²⁰ proposed four algorithms based on bi-objective particle swarm optimization to solve the dual-resource scheduling problem on unrelated parallel machines. Li et al.²¹ addressed the parallel BPMs scheduling problem with incompatible job families and dynamic job arrivals. They used an ant colony optimization algorithm to minimize TWT (total weighted tardiness) and makespan, and the solutions demonstrated significant improvements over the conventional ATC (Apparent Tardiness Cost) rule. However, the performance of their algorithm is strongly influenced by the distribution of job arrival times and the number of jobs. Scheduling problems in dynamic environments are characterized by uncertainty and time variability. While evolutionary algorithms perform well in offline scheduling problems, where they can fully explore the solution space, they cannot offer swift responses to dynamically incoming jobs. Responding to each new arrival requires adjustments not only to the algorithm parameters but also to the iterations and evaluations, which leads to a notable increase in computational complexity and cost.

In conclusion, both heuristic rules and evolutionary algorithms have certain limitations in solving dynamic multi-objective parallel BPMs scheduling problems. Recent research in the field of machine learning indicates that RL has potential and applicability in complex scheduling in the manufacturing industry.²² In dynamic environments, RL algorithms learn optimal action policies by actively interacting with the environment. These algorithms dynamically adapt their strategies in response to real-time feedback and environmental fluctuations, enabling them to handle uncertain elements such as job arrivals and resource availability through iterative experimentation and exploratory learning.^23,24 Consequently, RL algorithms are widely used in various complex production scheduling problems. Huang²⁵ designed a double-deep Q-learning network for the problem of scheduling massively parallel batch machines by defining four job sequencing rules and three scheduling time windows, thus generating twelve action rules in the action space. Luo et al.²⁶ proposed a Deep Reinforcement Learning (DRL)-based approach for the Flow Shop Scheduling Problem with Batch Machines (FSSP-BPM), which mainly consists of a basic scheduling framework based on an encoder-decoder model and an attention mechanism. Wang et al.,²⁷ on the other hand, studied the non-permutation flow-shop scheduling problem and proposed a DRL algorithm based on LSTM to minimize the total completion time. Wang et al.²⁸ proposed a multi-agent RL method based on double deep Q-networks to address the online hybrid flow shop scheduling problem with parallel BPMs, effectively reducing the total tardiness. The RL algorithms proposed in the aforementioned literature exhibit satisfactory performance in the context of single-objective scheduling problems.
However, when confronted with multi-objective scheduling problems, the convergence of the RL algorithm during the training process becomes challenging, as does the design of the reward function. Thus, further adjustments are necessary in order to address these problems. Wang et al.²⁹ design a composite multi-objective reward function based on weighted inventory cost and tardiness penalty cost in their RL algorithm, aiming to achieve multi-objective optimization scheduling under weighted optimization combinations. The key point of this approach lies in transforming the combination of multiple objectives into a single objective and incorporating it into the reward function of RL, in order to achieve a balance between different objectives at each decision step. Furthermore, Li et al.³⁰ utilize multi-strategy RL algorithms to design multiple reward functions by weighting the multi-objective functions with a uniform distribution of weights. Although this method allows for training multiple RL models and obtaining multiple scheduling solutions in different optimization directions for achieving dynamic multi-objective optimization, it poses a challenge in terms of ensuring the diversity of the Pareto frontier solution when using uniformly distributed weights. The issue arises from the fact that selecting weight values uniformly from a given range fails to cover the true Pareto front in the objective space. Consequently, this approach restricts the algorithm’s ability to explore the solution space, thereby potentially overlooking high-quality solutions that are not considered due to their associated weight combinations. This suggests that there is scope for further improvement.³¹

To sum up, we propose a clustering-aided multi-agent deep reinforcement learning approach for multi-objective parallel BPMs scheduling. The main contributions of this paper can be summarized as follows:

(1) In order to explore the real problem solution space distribution, an evolutionary algorithm called MS-NSGA-II was developed to sample the multi-objective solution space of the problem. Subsequently, the distribution of Pareto frontier solutions is analyzed employing the cosine distance based K-means clustering algorithm, which decomposes the original multi-objective optimization problem into a number of sub-problems in different optimization directions.

(2) A multi-agent deep reinforcement learning model has been devised to address the two subproblems of batch forming and batch scheduling in the parallel BPMs scheduling problem. To ensure effective memory and prediction of shop floor states, and to facilitate collaborative communication between agents, a Long Short-Term Memory (LSTM) network has also been incorporated.

(3) To accelerate the algorithm’s convergence and guide the agents’ learning toward the optimal strategy, the weight parameters obtained from the clustering analysis are innovatively used to design the reward function of the deep reinforcement learning base model in different optimization directions. Additionally, a parameter sharing strategy is employed to speed up training and enhance training quality.

The remainder of this paper is organized as follows. Section 2 describes the parallel BPMs problem and formulates the optimization model. Section 3 proposes a clustering-aided multi-agent reinforcement learning algorithm for multi-objective parallel BPMs scheduling. Section 4 presents the numerical experiments and a study of practical instances to verify the effectiveness of the proposed approach. Section 5 discusses the conclusions and future research work.

Problem formulation

For the convenience of formal description, notations, sets, and parameters are defined as shown in Table 1.

Table 1.

Sets, parameters, and decision variables.

Category	Notation	Meaning
Sets	$M$	The set of batch machines, $m = 1, 2, \dots, M$
	$J$	The set of lots, $j = 1, 2, \dots, N$
	$B$	The set of batches, $b = 1, 2, \dots, B$
	$F$	The set of lot families, $f, g = 1, 2, \dots, F$
	$B_{b}$	The set of lots of batch $b$
	$F_{f}$	The set of lots of family $f$
Parameters	$Q_{\max}$	The maximum quantity of a batch
	$d_{j}$	The delivery date of lot $j$
	$r_{j}$	The release time of lot $j$
	$C_{j}$	The completion time of lot $j$
	$PE_{m}$	The energy consumption per unit time of batch machine $m$ during processing
	$SE_{m}$	The energy consumption per unit time of batch machine $m$ during process switching
	$BC_{m}$	The maximum capacity of batch machine $m$
	$S_{f,g}$	The setup time of switching between lot family $f$ and lot family $g$
	$SN_{m,f,g}$	The total number of switches between lot family $f$ and lot family $g$ on machine $m$
	$P_{f}$	The processing time of lot family $f$
	$t_{bm}$	The start time of batch $b$ on batch machine $m$
Decision variables	$X_{jbm}$	$= 1$ if lot $j$ belongs to batch $b$ and is processed on batch machine $m$; $= 0$ otherwise
	$Y_{bfm}$	$= 1$ if batch $b$ belongs to family $f$ and is processed on batch machine $m$; $= 0$ otherwise

Problem description

The scheduling problem of parallel BPMs in semiconductor manufacturing systems is a well-known NP-hard problem.³² It can be represented as $P_{m} \mid \text{batch}, r_{j}, P_{f}, e_{j}, d_{j}, \text{incompatible}, S_{fg} \mid TT, TEC$, which takes into account incompatible lot families, arbitrary lot release times, non-identical batch sizes, and capacitated machines.

Figure 1 shows parallel BPMs scheduling in a semiconductor manufacturing system in two main steps, batch forming and batch scheduling. Batch forming refers to combining multiple lots into a single batch for processing on a batch machine, while batch scheduling refers to determining the order and timing of processing for each batch. The entire scheduling process is as follows: N wafer lots arrive dynamically and are formed into batches in the buffer area based on their lot families. Lots within a batch belong to the same lot family, and the size of a batch cannot exceed the maximum capacity $Q_{\max}$ of the BPM that processes it. Only lots from the same lot family can be processed simultaneously as a batch on a BPM. Non-equivalent parallel BPMs consist of two or more different BPMs, and the machines are non-preemptive, which means that once a machine starts processing a batch, it cannot be interrupted. If the lot families of the current and subsequent batches differ, the BPM undergoes a process type switch. This switch changes the physical conditions of the machine and therefore incurs a setup time.

Figure 1.

Parallel BPMs scheduling in semiconductor manufacturing systems.

Optimization model

To better address the parallel BPMs scheduling problem, this section establishes some reasonable assumptions and constraints that need to be met.

1) Problem assumptions:

Assumption: All wafer lots arrive dynamically. There are no special cases such as cancellations. Different BPMs have different maximum capacity limits. Batches with different process types are incompatible. Lots within the same family have the same processing time and can be batch processed. There are different setup times and energy consumption between consecutive batches due to different process types. A batch can only be processed by one machine at any given time.

\min f_{1} = TT, \quad \min f_{2} = TEC

(1)

TT = \sum_{j = 1}^{N} \max (C_{j} - d_{j}, 0)

(2)

\begin{matrix} TEC = \sum_{b = 1}^{B} \sum_{f = 1}^{F} \sum_{m = 1}^{M} P E_{m} * Y_{bfm} * P_{f} \\ + \sum_{m = 1}^{M} \sum_{f, g = 1}^{F} S E_{m} * S_{f, g} * S N_{m, f, g} (f \neq g) \end{matrix}

(3)

Equation (1) states that the objectives are to minimize the total tardiness (TT) of the lots and the total energy consumption (TEC).

Equations (2) and (3) represent the specific calculation steps for TT and TEC.
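The two objectives can be evaluated for a given schedule with a short sketch such as the one below. It is only an illustration under stated assumptions: the `Batch` record, the function names, and the small instance are hypothetical, the completion time of a lot is taken as the start time of its batch plus $P_{f}$, and the number of switches $SN_{m,f,g}$ is obtained by counting consecutive batches of different families on each machine.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    family: int    # lot family f of the batch
    machine: int   # machine m that processes the batch
    start: float   # start time t_bm
    lots: list     # indices j of the lots in the batch


def completion_times(batches, proc_time, n_lots):
    """C_j = t_bm + P_f for every lot j in batch b (assumption: equality in (9))."""
    completion = [0.0] * n_lots
    for b in batches:
        for j in b.lots:
            completion[j] = b.start + proc_time[b.family]
    return completion


def total_tardiness(completion, due):
    """Equation (2): sum over lots of max(C_j - d_j, 0)."""
    return sum(max(c - d, 0.0) for c, d in zip(completion, due))


def total_energy(batches, proc_time, pe, se, setup):
    """Equation (3): processing energy plus family-switching energy."""
    processing = sum(pe[b.machine] * proc_time[b.family] for b in batches)
    switching = 0.0
    for m in {b.machine for b in batches}:
        # Batches on machine m in processing order.
        seq = sorted((b for b in batches if b.machine == m), key=lambda b: b.start)
        for prev, nxt in zip(seq, seq[1:]):
            if prev.family != nxt.family:  # one switch from family f to family g
                switching += se[m] * setup[prev.family][nxt.family]
    return processing + switching


# Hypothetical instance: 4 lots, 2 families, 2 machines.
proc_time = {0: 4.0, 1: 6.0}           # P_f
pe = {0: 2.0, 1: 3.0}                  # PE_m
se = {0: 1.0, 1: 1.5}                  # SE_m
setup = {0: {1: 2.0}, 1: {0: 3.0}}     # S_{f,g}
due = [5.0, 3.0, 10.0, 7.0]            # d_j
batches = [
    Batch(family=0, machine=0, start=0.0, lots=[0, 1]),
    Batch(family=1, machine=0, start=6.0, lots=[2]),
    Batch(family=1, machine=1, start=1.0, lots=[3]),
]
completion = completion_times(batches, proc_time, n_lots=4)
tt = total_tardiness(completion, due)
tec = total_energy(batches, proc_time, pe, se, setup)
print(tt, tec)
```

In this instance only lots 1 and 2 are late, and the single family switch occurs on machine 0.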

\sum_{b = 1}^{B} \sum_{m = 1}^{M} X_{jbm} = 1, \forall j \in J

(4)

Equation (4) ensures that a lot can only participate in forming one batch and be processed on one machine.

\sum_{j = 1}^{N} X_{jbm} \leq B C_{m}, \forall b \in B, \forall m \in M

(5)

Equation (5) requires that the size of batch b not exceed the maximum capacity of the machine m processing that batch.

\sum_{f = 1}^{F} Y_{bfm} = 1, \forall b \in B, \forall m \in M

(6)

Equation (6) represents the lot family constraint: all lots in batch b processed on machine m belong to the same lot family.

t_{bm} * Y_{bfm} \geq t_{b' m} + (P_{b'} + S_{f, g}) * Y_{b' fm}, \forall b, b' \in B, \forall m \in M

(7)

Equation (7) represents the constraint on the lot family switching time between adjacent batches. Batch b belongs to lot family g, and batch b′ belongs to family f.

t_{bm} \geq \max \{ r_{j} \mid j \in B_{b} \} + S_{f, g}, \forall b \in B

(8)

Equation (8) represents the constraint on the start time of batch b on machine m.

C_{j} \geq t_{bm} + P_{f} * X_{jbm} * Y_{bfm}, \forall b \in B, \forall m \in M

(9)

Equation (9) represents the constraint on the completion time of lot j.

r_{j}, t_{bm} > 0, \forall j \in J, \forall m \in M, \forall b \in B

(10)

Equation (10) represents the constraint that the release time of lot j and the start time of batch b on machine m are greater than 0.
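The batch-level constraints can be checked for a candidate batch with a few lines of code. The sketch below is a simplified illustration, not the paper's implementation: the function name and arguments are hypothetical, batch size is counted in lots as in equation (5), and the setup term of equation (8) is left out so that only the release-time part is checked.

```python
def batch_violations(lots, machine_capacity, start, family_of, release):
    """Return the list of violated constraints for one candidate batch."""
    problems = []
    # Lot family constraint: all lots of the batch share one family.
    if len({family_of[j] for j in lots}) > 1:
        problems.append("mixed lot families")
    # Capacity constraint (5): number of lots must not exceed BC_m.
    if len(lots) > machine_capacity:
        problems.append("capacity exceeded")
    # Release-time part of (8): the batch cannot start before its last lot arrives.
    if start < max(release[j] for j in lots):
        problems.append("starts before all lots are released")
    return problems


family_of = {0: "A", 1: "A", 2: "B"}   # hypothetical lot families
release = {0: 1.0, 1: 4.0, 2: 2.0}     # hypothetical release times r_j

ok = batch_violations([0, 1], machine_capacity=2, start=4.0,
                      family_of=family_of, release=release)
bad = batch_violations([0, 1, 2], machine_capacity=2, start=3.0,
                       family_of=family_of, release=release)
print(ok, bad)
```

A feasible batch yields an empty list, while the second batch mixes families, exceeds the capacity, and starts too early.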

Parallel BPMs scheduling approach

For the dynamic multi-objective scheduling problem of parallel BPMs, the scheduling framework based on CA-MADRL is shown in Figure 2. This framework comprises three main components: the offline scheduling algorithm MS-NSGA-II, the K-means clustering algorithm, and the multi-agent deep reinforcement learning (MADRL) model. First, the offline solving capability of MS-NSGA-II is employed to sample the Pareto frontier points in the multi-objective space. Next, the sampled points are analyzed using the K-means clustering algorithm, which divides the objective space into multiple subspaces based on cosine distance. The central direction of each subspace is used as an optimization direction and is translated into the combination of objective-function weights for the corresponding multi-objective optimization subproblem. Furthermore, considering the specific characteristics of parallel BPMs problems, a MADRL base model is constructed. This model includes a batch forming agent and a batch scheduling agent, with interaction between the agents achieved through an LSTM network. Finally, the weight combinations for the different optimization directions obtained from clustering are used to design the reward functions of the agents. A parameter sharing strategy is then employed to train multiple MADRL base models, which are triggered in parallel during scheduling to achieve multi-objective optimization.

Figure 2.

The overall framework of CA-MADRL approach.

MS-NSGA-II Pareto front sampling method

Metaheuristic algorithms offer distinct advantages in solving static multi-objective optimization problems: through an iterative process that consumes a large amount of time, they generate several optimized solutions.^33,34 Although this approach is not suitable for solving the dynamic BPMs scheduling problem, it can still provide valuable data and experience for training RL models. Decomposing the multi-objective optimization problem with uniformly distributed weight vectors can degrade the quality of the Pareto solution set. To avoid this, it is crucial to devise an evolutionary algorithm that can comprehensively explore the solution space of the multi-objective parallel BPMs scheduling problem. Relative to NSGA-II, MS-NSGA-II is able to explore a wider solution space while better maintaining population diversity through the parallel evolution of multiple independent populations. On this basis, we design batch forming chromosomes and batch scheduling chromosomes according to the characteristics of parallel BPMs problems to improve the evolution speed of the populations. Through iterative optimization of the multiple populations, which includes non-dominated sorting and crowding distance calculation, we obtain the Pareto frontier solutions for the problem. This lays a solid foundation for addressing dynamic parallel BPMs multi-objective scheduling.

Encoding and initialization

For the two subproblems of batch forming and batch scheduling in parallel BPM scheduling, the chromosome is divided into two segments. The first segment describes the batch forming plan for lots, while the second segment describes the batch scheduling plan for batches. The batch forming segment uses integer encoding; its length equals the number of lots, and each gene position represents a lot. The batch scheduling segment uses floating-point encoding; its length equals the number of batches, and each gene position represents a batch.

For n lots and b batches, each gene in the first segment takes a value in $\{1, 2, \dots, b\}$, indicating the batch to which the corresponding lot is assigned. Each gene in the second segment takes a value in $[1, m + 1)$, where the integer part represents the machine assigned to the batch and the decimal part represents the processing priority of the batch. Figure 3 presents a small-scale instance. In the batch forming segment, the six lots are divided into three batches. In the batch scheduling segment, the integer parts represent the machines assigned to the batches, which are machines 2, 3, 1, and the fractional parts represent the processing priority of the batches, which are batches 3, 1, 2.

Figure 3.

An example of chromosome initialization.
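The two-segment encoding can be decoded as in the following minimal sketch. The function name and the gene values are hypothetical, and the sketch assumes that a smaller fractional part means the batch is processed earlier on its machine; the text does not state which direction the priority runs.

```python
import math
from collections import defaultdict


def decode(forming, scheduling):
    """Decode a two-segment chromosome into batches and per-machine queues."""
    # Batch forming segment: gene i holds the batch number of lot i (1-based).
    batches = defaultdict(list)
    for lot, batch in enumerate(forming, start=1):
        batches[batch].append(lot)

    # Batch scheduling segment: integer part is the machine, fraction the priority.
    queues = defaultdict(list)
    for batch, gene in enumerate(scheduling, start=1):
        machine = math.floor(gene)
        priority = gene - machine
        queues[machine].append((priority, batch))

    # Assumption: a smaller fractional part is processed first.
    plan = {m: [b for _, b in sorted(q)] for m, q in queues.items()}
    return dict(batches), plan


forming = [1, 1, 2, 2, 3, 3]        # six lots assigned to three batches
scheduling = [2.71, 2.15, 1.42]     # batches 1 and 2 on machine 2, batch 3 on machine 1
batches, plan = decode(forming, scheduling)
print(batches, plan)
```

Here batch 2 precedes batch 1 on machine 2 because its fractional part is smaller.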

Crossover and mutation operators

In the batch forming chromosome, we randomly select one lot family and exchange its genes between the two parents. The crossover process in the batch forming segment is shown in Figure 4. In this particular example, there are three lot families, and lot family 2 is randomly selected for crossover, resulting in the exchange of its genes between the two parents. However, this crossover operation may lead to conflicts between incompatible lot families in the grouping process. For instance, in offspring 2, lots 3, 4, and 5 belong to different lot families but are assigned to the same batch 3. To avoid generating a large number of illegal solutions after crossover, the chromosome segments are re-encoded after crossover to ensure compliance with the constraints on incompatible lot families. We split the conflicting batch and re-encode the batch numbers using a right-shift strategy, as shown in the re-encoding in the lower part of Figure 4.

Figure 4.

Example of crossover in batch forming chromosome.

In the batch scheduling chromosome, we use single-point crossover, as shown in Figure 5. When a batch is scheduled to a machine and the batch size exceeds the maximum capacity of that machine, we repair the gene by selecting, uniformly at random, a machine number from the set of machines $\{ M_{m} \mid B C_{m} > \sum_{j = 1}^{N} X_{jbm} \}$ that meet the capacity requirement. Then, we add a random number between 0 and 1 to the machine number to obtain the new gene value.

Figure 5.

Example of crossover in batch scheduling chromosome.
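The capacity repair applied after crossover in the batch scheduling segment can be sketched as follows. The function name, the capacity values, and the use of Python's `random` module are assumptions for illustration; the sketch also treats a machine whose capacity equals the batch size as feasible, in line with constraint (5).

```python
import random


def repair_gene(gene, batch_size, capacity, rng=random):
    """Return a scheduling gene whose machine can hold the batch."""
    machine = int(gene)  # integer part of the gene is the machine number
    if batch_size <= capacity[machine]:
        return gene  # already feasible: keep machine and priority
    feasible = [m for m, cap in capacity.items() if cap >= batch_size]
    if not feasible:
        raise ValueError("no machine can hold a batch of this size")
    # Uniform choice among feasible machines plus a fresh priority in [0, 1).
    return rng.choice(feasible) + rng.random()


capacity = {1: 4, 2: 6, 3: 8}   # hypothetical BC_m values
kept = repair_gene(2.5, batch_size=5, capacity=capacity)
moved = repair_gene(1.3, batch_size=7, capacity=capacity, rng=random.Random(0))
print(kept, int(moved))
```

The first gene is left untouched, while the second is moved to machine 3, the only machine that can hold seven lots.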

Mutation in the batch forming chromosome serves the purpose of ensuring solution legality, increasing solution diversity, and escaping local optima. To achieve this, the mutation proceeds in two steps: first, a random lot family is selected; then, two genes of that lot family are randomly chosen and their values are exchanged. This exchange, as shown in Figure 6, occurs within the same lot family, which avoids batch incompatibility issues and speeds up the evolution of the population. Mutation in the batch scheduling chromosome, on the other hand, involves randomly selecting one gene and regenerating it using the initialization method. This process is shown in Figure 7.

Figure 6.

Example of mutation in batch forming chromosome.

Figure 7.

Example of mutation in batch scheduling chromosome.

Multi-subpopulation

The convergence and diversity of the population search process of genetic algorithms are affected by crossover rates, mutation rates, and elite strategies. A higher mutation rate can increase the diversity of the solution set, but it may also prolong the convergence time of the algorithm. To address this issue, the single population of the traditional NSGA-II algorithm is divided into multiple subpopulations, each of which uses different crossover and mutation parameters. With this division, convergence and diversity vary among the subpopulations, which increases the diversity of solutions and helps the search escape local optima. With the MS-NSGA-II algorithm designed above, the parallel BPMs scheduling problem is solved offline by iterative optimization to obtain a Pareto frontier solution set. Previous studies have shown that sampling both the optimal and suboptimal solutions can better reflect the distribution of the Pareto frontier solution set.³⁵ Therefore, in this section, the output sampling points are the solutions in the first three non-dominated layers obtained by the algorithm.
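Extracting the first three non-dominated layers from a set of objective vectors can be sketched as below. This is a simple quadratic-time peeling procedure written for clarity, not the fast non-dominated sort of NSGA-II, and the objective values are hypothetical (TT, TEC) pairs to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def nondominated_layers(points, max_layers=3):
    """Peel off successive non-dominated fronts, returned as lists of indices."""
    remaining = list(range(len(points)))
    layers = []
    while remaining and len(layers) < max_layers:
        front = [
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        ]
        layers.append(front)
        remaining = [i for i in remaining if i not in front]
    return layers


# Hypothetical (TT, TEC) values of eight schedules.
points = [(1, 9), (3, 5), (6, 2), (2, 10), (4, 6), (7, 3), (5, 8), (8, 8)]
layers = nondominated_layers(points)
samples = [points[i] for layer in layers for i in layer]
print(layers)
```

The last point is dominated within the third peeling round and is therefore not part of the sampled set.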

K-Means clustering algorithm based on cosine distance for multi-objective space decomposition

After obtaining the Pareto front sampling points using the MS-NSGA-II algorithm, the distribution of the solution set can be analyzed by decomposing the objective space into several subspaces. Traditional methods use the systematic decomposition technique proposed by Das and Dennis³⁶ to generate a set of uniformly distributed reference points on the unit hyperplane. However, solutions generated from uniformly distributed subspaces perform poorly when the subspaces are inconsistent with the distribution of the Pareto front. K-Means clustering is a widely used clustering algorithm with fast convergence and high interpretability. Some researchers have used reference vectors combined with the traditional Euclidean distance K-Means clustering method to select solutions.³⁷ However, the diversity of Pareto front solutions is essentially the diversity in the direction of weight vectors, and cosine distance is more suitable for measuring differences in direction, as shown in Figure 8(a).

Figure 8.

(a) Illustration of Euclidean distance and cosine distance and (b) illustration of clustering based on cosine distance.

When the angle between two Pareto front points is small in a multi-objective optimization space, the directions of their weight vectors are roughly the same, as shown in Figure 8(b). If clustering is performed directly based on Euclidean distance, the result would be three clusters: $(x_{1}, x_{2}), (x_{3}, x_{4})$, and $(x_{5}, x_{6})$. However, from the graph, it can be observed that $(x_{1}, x_{2}, x_{3})$ follows the weight vector $e_{1}$, and $(x_{4}, x_{5}, x_{6})$ follows the weight vector $e_{2}$. Therefore, Euclidean distance is not the most suitable metric for clustering Pareto front points, and cosine distance is a better choice.

The expression for the cosine of the angle between two weight vectors $e_{1}$ and $e_{2}$ is as follows:

\cos (e_{1}, e_{2}) = \frac{e_{1} \cdot e_{2}}{\| e_{1} \| \| e_{2} \|}

(11)

Since the cosine value of the angle between two vectors approaches 1 as the angle approaches 0, equation (12) is used as the quantified expression of cosine distance:

d = 1 - \cos (e_{1}, e_{2})

(12)

The cost function of the K-Means clustering algorithm based on cosine distance can be defined as the sum of the cosine distances between the directions of each Pareto front point and the direction of the centroid in its corresponding subspace, expressed as equation (13).

J (c, μ) = \min_{c} \min_{μ} \sum_{i = 1}^{M} \left( 1 - \cos (x_{i}, μ_{c_{i}}) \right)

(13)

In equation (13), $x_{i}$ represents the ith Pareto front point, $c_{i}$ is the subspace to which $x_{i}$ belongs, M is the total number of Pareto sampling points, and $μ_{c_{i}}$ represents the weight combination of the optimization direction corresponding to the subspace. Let $μ_{c_{i}} = (obj_{1}, obj_{2})$, where $obj_{k}$ represents the optimization value of the kth objective. Based on trigonometric relationships, the direction angle $θ_{c_{i}}$ is calculated and the weight vector is normalized to unit length as $(\cos θ_{c_{i}}, \sin θ_{c_{i}})$, as given by equations (14) and (15):

\sin θ_{c_{i}} = \frac{obj_{1}}{\sqrt{obj_{1}^{2} + obj_{2}^{2}}}

(14)

\cos θ_{c_{i}} = \frac{obj_{2}}{\sqrt{obj_{1}^{2} + obj_{2}^{2}}}

(15)
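As a small numerical illustration of this normalization, the sketch below divides a subspace center by its Euclidean norm so that the two weights form a unit vector. The function name and the center values are hypothetical.

```python
import math


def direction_weights(obj1, obj2):
    """Return (sin_theta, cos_theta) for a subspace center (obj1, obj2)."""
    norm = math.hypot(obj1, obj2)  # Euclidean norm of the center
    return obj1 / norm, obj2 / norm


sin_theta, cos_theta = direction_weights(3.0, 4.0)
print(sin_theta, cos_theta)
```

For the center (3, 4) the norm is 5, so the weights are 0.6 and 0.8 and their squares sum to 1.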

By iterating through subspace allocation and updating the subspace centers, the Pareto front points are divided into multiple subspaces, where the Pareto front solutions are concentrated in the direction of the subspace centers, which represents the optimal optimization direction. The pseudocode for the algorithm is shown in Algorithm 1.

Algorithm 1: K-Means clustering algorithm based on cosine distance
Input: Pareto frontier sampling points
Output: Optimization direction of each decomposed subspace's center
1: Randomly select K center directions in the objective space, denoted $μ_{1}^{(0)}, μ_{2}^{(0)}, \dots, μ_{K}^{(0)}$
2: Define the cost function $J (c, μ) = \min_{c} \min_{μ} \sum_{i = 1}^{M} \left( 1 - \cos (x_{i}, μ_{c_{i}}) \right)$
3: for $t \leftarrow 0$ to epoch_max:
4: Assign each sample $x_{i}$ to the nearest subspace based on cosine distance:
5: $c_{i}^{(t)} \leftarrow \underset{k}{\arg\min} \left( 1 - \cos (x_{i}, μ_{k}^{(t)}) \right)$
6: Recompute the center of each cluster k:
7: $μ_{k}^{(t + 1)} \leftarrow \underset{μ}{\arg\min} \sum_{i : c_{i}^{(t)} = k} \left( 1 - \cos (x_{i}, μ) \right)$
8: end for
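The loop of Algorithm 1 can be sketched in plain Python as follows. Several choices here are assumptions rather than details given in the text: the initial centers are drawn from the sample directions, the center update uses the normalized sum of the member directions (which minimizes the summed cosine distance for unit vectors), the loop stops early once the assignment no longer changes, and no sample is the zero vector. The sample points are hypothetical.

```python
import math
import random


def unit(v):
    """Scale a vector to unit length."""
    n = math.hypot(*v)
    return tuple(c / n for c in v)


def cosine_kmeans(points, k, epochs=100, seed=0):
    """Cluster points by direction; returns (labels, unit center directions)."""
    rng = random.Random(seed)
    dirs = [unit(p) for p in points]
    centers = rng.sample(dirs, k)  # assumption: start from k sample directions
    labels = None
    for _ in range(epochs):
        # For unit vectors the cosine similarity is the dot product, so the
        # smallest cosine distance is the largest dot product.
        new_labels = [
            max(range(k), key=lambda c: sum(a * b for a, b in zip(d, centers[c])))
            for d in dirs
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [d for d, lab in zip(dirs, labels) if lab == c]
            if members:
                # Normalized sum of member directions is the new center direction.
                centers[c] = unit(tuple(map(sum, zip(*members))))
    return labels, centers


# Hypothetical Pareto sampling points in the (obj_1, obj_2) plane.
points = [(1, 9), (2, 10), (1.5, 8), (9, 1), (10, 2), (8, 1.5)]
labels, centers = cosine_kmeans(points, k=2)
print(labels, centers)
```

Each returned center is a unit vector, so its two components can be read directly as the weight combination of that subspace.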

After obtaining the optimal direction of the centers for multiple subspaces, the parallel batch processing multi-objective scheduling problem is transformed into a finite number of fixed-weight optimization subproblems. The target weights for each subspace center direction can be represented as $(\cos θ, \sin θ)$ . The subspace centers obtained through cosine distance clustering can reflect the true distribution of the Pareto front solution set, simplifying the model while preserving the diversity of solutions. The number of divided subspaces, K, is a key parameter. A smaller K value leads to a larger loss, resulting in a greater deviation in reflecting the distribution of the objective space. A larger K value leads to a smaller loss and a more accurate reflection of the objective space, but the computational complexity increases with the increase of K. Therefore, in this section, the elbow method is used to determine the value of K through multiple experiments.³⁸

Parameter sharing based multi-objective deep reinforcement learning scheduling

After Pareto front points have been sampled with MS-NSGA-II and the objective space has been decomposed with the K-means clustering algorithm, the resulting weight combinations for the objective functions are used to design the reward functions of the agents in the MADRL base models. The base models are trained using a parameter sharing strategy: after the sub-model for the current direction has been trained, its parameters are used directly as the initialization parameters for the model in the subsequent direction. This strategy significantly reduces training time and enhances the overall training quality. Essentially, this represents an improved multi-strategy RL approach. As shown in Figure 9, when scheduling is triggered, the dynamic scheduling base models of the multiple subspaces are computed in parallel. This computation yields multiple solutions in the dynamic environment, thereby achieving multi-objective optimization in RL.

Figure 9.

Implementing multi-objective scheduling using reinforcement learning.

State space and action space

To address the two sub-problems of parallel BPMs scheduling, batch forming and batch scheduling, we design a batch forming agent and a batch scheduling agent. The agents perceive dynamic changes in the scheduling environment through its state information and make their scheduling decisions based on it.

1) State space

The state matrix $F = [f_{1, j, n}, f_{2, b, k}, f_{3, m, l}]$ is designed from the state characteristics related to the scheduling constraints and optimization objectives of parallel BPMs. Here $f_{1, j, n}$ denotes the states of the lots waiting to be formed into a batch, where $j$ is the lot index and $n$ is the index of the lot state feature. Similarly, $f_{2, b, k}$ denotes the states of the batches waiting to be scheduled, where $b$ is the batch index and $k$ is the index of the batch state feature, and $f_{3, m, l}$ denotes the states of the machines, where $m$ is the machine index and $l$ is the index of the machine state feature. Table 2 lists the specific state parameters.
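As a hedged illustration of how the lot and batch features of Table 2 could be computed at decision time $t$, consider the sketch below. The dictionary keys (`proc_time`, `release`, `due`, `family`, `size`) and the assumption that the batch size is the sum of the lot sizes are hypothetical choices made for this example.

```python
from statistics import mean

def lot_features(lot, t):
    """Lot state vector: process time, release time, waiting time, family, relax time."""
    waiting = t - lot["release"]                   # wt_j = t - r_j
    relax = lot["due"] - t - lot["proc_time"]      # rwt_j = d_j - t - P_j
    return [lot["proc_time"], lot["release"], waiting, lot["family"], relax]

def batch_features(batch, t):
    """Batch state vector: process time, family, size, min / mean / max relax time."""
    relax = [lot["due"] - t - lot["proc_time"] for lot in batch]
    size = sum(lot["size"] for lot in batch)       # assumed: batch size = sum of lot sizes
    # All lots of a batch share one family, hence one family process time P_f.
    return [batch[0]["proc_time"], batch[0]["family"], size,
            min(relax), mean(relax), max(relax)]

# Hypothetical lots of family 2, observed at t = 5 h.
lots = [
    {"proc_time": 4.0, "release": 1.0, "due": 12.0, "family": 2, "size": 3},
    {"proc_time": 4.0, "release": 2.0, "due": 9.0, "family": 2, "size": 2},
]
f_lot = lot_features(lots[0], t=5.0)
f_batch = batch_features(lots, t=5.0)
```

A negative relax time would indicate that the lot can no longer finish before its due date, which is why these features are natural inputs for a tardiness-aware agent.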

Table 2.

State parameters for parallel BPMs scheduling.

Parameter type	Expression	Meanings
Lot state parameters	$f_{1, j, 1} = P_{j} = P_{f}, \forall j \in F_{f}$	The process time of lot j
	$f_{1, j, 2} = r_{j}$	The release time of lot j
	$f_{1, j, 3} = wt_{j} = t - r_{j}$	The waiting time of lot j
	$f_{1, j, 4} = F_{f}$	The family type of lot j
	$f_{1, j, 5} = rwt_{j} = d_{j} - t - P_{j}$	The relax time of lot j
Batch state parameters	$f_{2, b, 1} = P_{f}$	The process time of batch b
	$f_{2, b, 2} = F_{f}$	The family type of batch b
	$f_{2, b, 3} = Q_{b}$	The size of batch b
	$f_{2, b, 4} = \min \{rwt_{j} \mid j \in B_{b}\}$	The minimum relax time of lots within batch b
	$f_{2, b, 5} = \operatorname{mean} \{rwt_{j} \mid j \in B_{b}\}$	The mean relax time of lots within batch b
	$f_{2, b, 6} = \max \{rwt_{j} \mid j \in B_{b}\}$	The maximum relax time of lots within batch b
Machine state parameters	$f_{3, m, 1} = \begin{cases} rt_{m, b}, & \text{if } Y_{bfm} = 1 \\ 0, & \text{otherwise} \end{cases}$	The remaining processing time of batch b on machine m
	$f_{3, m, 2} = F_{f}$	The lot family type processed by machine m
	$f_{3, m, 3} = BC_{m}$	The maximum capacity of machine m
	$f_{3, m, 4} = it_{m}$	The idle time of machine m

2) Action space

The action space of parallel BPMs scheduling comprises the action space of the batch forming agent and the action space of the batch scheduling agent.

In the action space of batch forming agent, a batch buffer is set up, and the action determines whether to form a batch with the current pending lots or keep lots in the buffer. The action space of batch forming agent is defined as follows:

Action 1: Selecting a certain number of lots for batch forming

$$a_{t}^{B} = \{j_{1}, j_{2}, \dots, j_{Q_{\max}}\} \tag{16}$$

Action 2: Wait

$$a_{t}^{B} = 0 \tag{17}$$

In the action space of the batch scheduling agent, when a machine is idle, the action selects an appropriate batch for processing; choosing to wait means that no batch is selected. The machine matching rule selects, from the set of machines that meet the capacity requirement, the machine with the minimum switching time.

Action 1: Selecting batches for processing on machines

$$a_{t}^{S} = \{b_{1}, b_{2}, \dots, b_{idm}\}, \quad idm \in \{m_{1}, m_{2}, \dots, m_{M}\} \tag{18}$$

Action 2: Wait

$$a_{t}^{S} = 0 \tag{19}$$
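The machine matching rule can be illustrated with a short sketch. It assumes that the switching time is the family setup time of Table 3 and that each idle machine records the family it processed last; the machine records and field names are hypothetical.

```python
# Setup times in hours between lot families, as listed in Table 3.
SETUP = {(1, 1): 0.0, (1, 2): 0.5, (1, 3): 1.0,
         (2, 1): 0.5, (2, 2): 0.0, (2, 3): 2.0,
         (3, 1): 1.0, (3, 2): 2.0, (3, 3): 0.0}

def match_machine(batch_family, batch_size, idle_machines):
    """Pick the idle machine with the smallest setup time among those with enough capacity."""
    feasible = [m for m in idle_machines if m["capacity"] >= batch_size]
    if not feasible:
        return None  # no machine can take the batch, so the batch has to wait
    return min(feasible, key=lambda m: SETUP[(m["last_family"], batch_family)])

# Hypothetical idle machines.
machines = [
    {"id": "m1", "capacity": 6, "last_family": 3},
    {"id": "m2", "capacity": 8, "last_family": 1},
    {"id": "m3", "capacity": 4, "last_family": 2},
]
chosen = match_machine(batch_family=2, batch_size=5, idle_machines=machines)
too_big = match_machine(batch_family=2, batch_size=9, idle_machines=machines)
```

Here `m3` would need no setup but is too small for the batch, so the rule falls back to `m2`, whose switch from family 1 to family 2 costs 0.5 h.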

Reward function

The reward function should reflect both objectives considered in this paper. Based on the obtained weight vector, the objectives are weighted to form the composite objective function in equation (20):

$$obj = \cos θ \cdot TT + \sin θ \cdot TEC \tag{20}$$

1) Reward function of batch forming agent:

$$R^{B} = - \sin θ \cdot {tt}_{dc}^{B}, \quad {tt}_{dc}^{B} = \sum_{j = 1}^{n} \left({dt}_{dc}^{B} - \max (d_{j}, {dt}_{dc - 1}^{B})\right) \cdot sw_{j} (t) \tag{21}$$

$$sw_{j} (t) = \begin{cases} 1, & \text{job } j \text{ is waiting to be formed at time } t \\ 0, & \text{otherwise} \end{cases} \tag{22}$$

${tt}_{dc}^{B}$ is the total tardiness accumulated by all lots waiting to be batched when the $dc$-th batch forming decision is made, and ${dt}_{dc}^{B}$ is the time of the $dc$-th batch forming decision.

2) Reward function of batch scheduling agent:

$$R^{S} = - \sin θ \cdot {tt}_{dc}^{B} - \cos θ \cdot {tec}_{dc}^{S} \tag{23}$$

$${tec}_{dc}^{S} = \sum_{m = 1}^{M} \left(PE_{m} \cdot P_{f} \cdot Y_{bfm} + SE_{m} \cdot S_{f, g} \cdot ew_{m} (dc)\right) \tag{24}$$

$$ew_{m} (dc) = \begin{cases} 1, & \text{job family transition on machine } m \text{ during the } dc\text{-th scheduling} \\ 0, & \text{otherwise} \end{cases} \tag{25}$$

${tec}_{dc}^{S}$ is the total energy consumed by the machines in processing the batches assigned to them at the $dc$-th batch scheduling decision.
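The two reward functions can be evaluated as in the following sketch, which follows equations (21) to (25) term by term. Two points are assumptions of this example rather than statements of the paper: a lot that is not yet due contributes zero rather than a negative amount, and the field names and numbers are hypothetical.

```python
import math

def forming_tardiness(lots, dt_now, dt_prev):
    """tt_dc^B: tardiness accrued since the previous forming decision by waiting lots."""
    total = 0.0
    for lot in lots:
        if lot["waiting"]:                                   # sw_j(t) = 1
            # Assumption: clip at zero so that lots not yet due add nothing.
            total += max(0.0, dt_now - max(lot["due"], dt_prev))
    return total

def scheduling_energy(assignments):
    """tec_dc^S: processing energy plus setup energy when the family changes."""
    total = 0.0
    for a in assignments:                                    # one entry per machine with Y_bfm = 1
        total += a["proc_power"] * a["proc_time"]            # PE_m * P_f
        if a["family_switch"]:                               # ew_m(dc) = 1
            total += a["setup_power"] * a["setup_time"]      # SE_m * S_{f,g}
    return total

def rewards(theta_deg, tt, tec):
    """Return (R^B, R^S) for the direction angle theta, as in equations (21) and (23)."""
    theta = math.radians(theta_deg)
    r_b = -math.sin(theta) * tt
    r_s = -math.sin(theta) * tt - math.cos(theta) * tec
    return r_b, r_s

# Hypothetical decision data.
lots = [{"due": 8.0, "waiting": True}, {"due": 12.0, "waiting": True},
        {"due": 5.0, "waiting": False}]
jobs = [{"proc_power": 10.0, "proc_time": 4.0, "family_switch": True,
         "setup_power": 2.0, "setup_time": 0.5},
        {"proc_power": 8.0, "proc_time": 3.0, "family_switch": False,
         "setup_power": 2.0, "setup_time": 0.0}]
tt = forming_tardiness(lots, dt_now=10.0, dt_prev=7.0)
tec = scheduling_energy(jobs)
r_b, r_s = rewards(30.0, tt, tec)
```

Both rewards are negative costs, so a policy that maximizes them reduces the weighted tardiness and energy terms for the chosen direction.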

Multi-agent reinforcement learning

The interaction between the RL agents and the parallel BPMs can be described as a Markov Decision Process (MDP). When a scheduling event is triggered, the agents first observe the current state of the scheduling environment, denoted as $s_{t}^{B}, s_{t}^{S}$ . The agents then select scheduling decisions, denoted as $a_{t}^{B}, a_{t}^{S}$ , from the set of executable scheduling decisions. The reward values, denoted as $r_{t}^{B}, r_{t}^{S}$ , are assigned to the agents based on the contribution of the selected scheduling decisions to the overall objective. These reward values serve as feedback to guide the learning process of the agents. The model is updated using the large amount of scheduling experience data collected through this interaction, and this iterative process allows the RL agents to improve their decision-making over time.

Figure 10 illustrates the multi-agent RL model. The model comprises two agents: the batch forming agent and the batch scheduling agent. Each agent is equipped with an Actor module responsible for generating scheduling strategies. The agents engage in sequential scheduling using a dynamic scheduling mechanism. They interact with the parallel BPMs scheduling environment and continuously optimize their scheduling strategies based on the learned scheduling experiences. Both agents share a global Critic and a global LSTM network. The Critic represents the action-value function, which evaluates the quality of scheduling decisions. The global LSTM network maps the environment’s global state and the agents’ scheduling decisions to the strategy evaluation. This multi-agent RL model allows the agents to collaborate and learn from each other’s experiences, leading to improved scheduling strategies and overall performance.

Figure 10.

Multi-agent reinforcement learning model.

Parameter sharing based model training

Figure 2 and Algorithm 2 define the interaction of the agents with the MDP. After the individual reward functions have been designed for the different optimization weights, the models must be trained so that each base model finds optimal or near-optimal solutions in its optimization direction. A parameter sharing strategy is used during training: the base models are trained in order of θ from largest to smallest. Once the first base model has been trained, its network parameters are near optimal for its direction, and the next base model is trained with those parameters as its starting point, which speeds up training and improves training quality. In short, the network parameters are transferred sequentially from one subproblem to the next.
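The effect of this sequential transfer can be illustrated with a deliberately simple stand-in. In the sketch below, "training" is plain gradient descent on a quadratic whose optimum depends on θ; this toy objective, the angles, and the initial parameters are all assumptions chosen for illustration and have nothing to do with the actual policy networks.

```python
import math

def train(params, theta_deg, lr=0.1, tol=1e-3, max_steps=10_000):
    """Toy stand-in for RL training: gradient descent towards a theta-dependent optimum."""
    theta = math.radians(theta_deg)
    target = [math.cos(theta), math.sin(theta)]
    w = list(params)
    for step in range(max_steps):
        grad = [2 * (wi - ti) for wi, ti in zip(w, target)]
        if max(abs(g) for g in grad) < tol:
            return w, step                        # converged after `step` updates
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w, max_steps

def train_with_sharing(thetas, init):
    """Train base models from largest to smallest theta, warm-starting each from the last."""
    models, steps = {}, {}
    params = init
    for theta in sorted(thetas, reverse=True):
        params, steps[theta] = train(params, theta)
        models[theta] = list(params)
    return models, steps

thetas = [75, 55, 35, 15]                         # hypothetical direction angles in degrees
init = [5.0, -5.0]                                # hypothetical random initialization
models, shared_steps = train_with_sharing(thetas, init)
cold_steps = {t: train(init, t)[1] for t in thetas}
```

In this toy setting every model after the first starts close to its own optimum, so `shared_steps` is smaller than `cold_steps` for those directions; whether the same holds for the real networks depends on how similar neighboring subproblems are.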

Algorithm 2: CA-MADRL
Load Optimal weight combinations $[(\sin θ_{1}, \cos θ_{1}), \dots, (\sin θ_{K}, \cos θ_{K})]$ , LSTM network.
Initialize agent parameters: $θ^{1}, θ^{2}, ϕ, ψ$
$θ_{old}^{1}, θ_{old}^{2}, ϕ_{old}, ψ_{old} \leftarrow θ^{1}, θ^{2}, ϕ, ψ$
For each episode do:
Initialize decision time $dt = 0$ and decision count $dc = 0$ .
Initialize lot sequence, batch forming experience pool BP, batch scheduling experience pool SP, interaction vector $m_{dc}$ and global state $S_{dc}$ .
While not done do:
for lot in waiting lots:
Initialize the reward functions $R_{k}^{B}, R_{k}^{S}$ based on the weights $(\sin θ_{k}, \cos θ_{k})$ .
Observe the local state of the batch forming: $s_{dc}^{B}$ ;
for $k \leftarrow 0$ to $K$:
The base model k outputs the batch forming action:
$a_{dc, k}^{B} ~ π_{k}^{0} (m_{dc - 1}, s_{dc}^{B}; θ_{k}^{0}, ψ)$ ;
Execute action $a_{dc, k}^{B}$ to form batch;
Refresh the state after batch forming: $S_{dc + 1}, r_{dc}^{B} \leftarrow Env (S_{dc}, a_{dc, k}^{B})$ ;
Refresh the interaction vector: $m_{dc + 1} = LSTM (S_{dc}, a_{dc, k}^{B}, γ)$
Save $[S_{dc + 1}, a_{dc, k}^{B}, r_{dc}^{B}, m_{dc}]$ to BP
$dc \leftarrow dc + 1$ ;
While True,
Observe the local state of the batch scheduling: $s_{dc}^{S}$
for $k \leftarrow 0$ to K:
The base model k outputs the batch scheduling action:
$a_{dc, k}^{S} ~ π_{k}^{1} (m_{dc - 1}, s_{dc}^{S}; θ_{k}^{1}, ψ)$
Execute action $a_{dc, k}^{S}$ to schedule batch;
Refresh the state after batch scheduling: $S_{dc + 1}, r_{dc}^{S} \leftarrow Env (S_{dc}, a_{dc, k}^{S})$ ;
Refresh the interaction vector: $m_{dc + 1} = LSTM (S_{dc}, a_{dc, k}^{S}, γ)$ ;
Save $[S_{dc + 1}, a_{dc, k}^{S}, r_{dc}^{S}, m_{dc}]$ to SP;
$dc \leftarrow dc + 1$ ;
If the batch scheduling result is waiting:
Break;
End while
Wait until the next batch forming or batch scheduling is triggered, and modify the current time: t
End while
Calculate the global discount: $Q (s_{dc}, a_{dc}), \forall k$
For agent = batch forming agent $A^{B}$ , batch scheduling agent $A^{S}$ do:
For epoch = 1,2,…,N do:
Calculate policy network gradient: $\nabla J (θ^{i}, ψ)$ and value network gradient: $\nabla L (ω, ψ)$
Update policy network: $(θ^{i}, ψ) \leftarrow (θ^{i}, ψ) + α_{θ} \nabla J (θ^{i}, ψ)$
Update value network: $(ω, ψ) \leftarrow (ω, ψ) - α_{ω} \nabla L (ω, ψ)$
End for
$θ_{old}, ω_{old}, ψ_{old} \leftarrow θ, ω, ψ$
End for
End for

The main steps of Algorithm 2 are: initializing the weight combinations and agent parameters, initializing the training parameters and experience pools, the batch forming phase, the batch scheduling phase, calculating the global discounts, and updating the policy and value networks. Through this process, CA-MADRL optimizes the batch forming and batch scheduling tasks through multi-agent deep reinforcement learning and continuously improves its strategies to obtain better solutions. The LSTM network also plays a crucial role in storing environment states, making predictions, and facilitating communication among agents. Hence, prior to model training, the LSTM network is frozen and the LSTM modules of the multiple base models are consolidated into a shared module, which allows the LSTM network to consistently capture the scheduling records of the different base models during training.

Numerical experiments

To verify the effectiveness of the proposed CA-MADRL approach, we conducted experiments on 16 instances designed from actual historical production data of the diffusion area of a semiconductor manufacturing company. The diffusion area is a typical parallel BPMs scheduling workshop, where wafers are usually processed in batches on carriers. These carriers can accommodate multiple wafers and can be processed simultaneously on the parallel BPMs. The lot data of the workshop include the lot families, the processing times of the lot families, and the due dates of the lots. The machine data include the lot families that each machine can process, the maximum capacity of each machine, and the processing and idle power consumption of the machines. The setup times incurred when switching lot families are shown in Table 3. We set up four combinations of machine and lot numbers $(m \times n) \in \{5 \times 50, 5 \times 100, 10 \times 50, 10 \times 100\}$ . The arrival times of lots follow a normal distribution with the following density:

$$f (x) = \frac{1}{\sqrt{2 π} σ} e^{- \frac{(x - μ)^{2}}{2 σ^{2}}} \tag{26}$$

Table 3.

Machine setup time for switching between lot families.

	Lot family 1	Lot family 2	Lot family 3
Lot family 1	0 h	0.5 h	1 h
Lot family 2	0.5 h	0 h	2 h
Lot family 3	1 h	2 h	0 h

Here $μ$ is the mean arrival time of lots and $σ^{2}$ is the variance of lot arrivals. A larger value of $σ^{2}$ indicates sparser lot arrivals, while a smaller value indicates more compact lot arrivals. We set up four combinations of time arrival coefficients $(T_{ave} = μ \times σ^{2}) \in \{10 \times 3, 10 \times 6, 20 \times 3, 20 \times 6\}$ based on the experimental tests conducted by Chang et al.³⁹ In total, combining $m \times n$ with $T_{ave}$ yields 16 instances for evaluating the effectiveness of the proposed algorithm.
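The 16 instance configurations and their arrival times can be generated along the following lines. This is a sketch under stated assumptions: the instance ordering, the random seed, the clipping of negative arrival times to zero, and the dictionary layout are hypothetical and are not specified in the paper.

```python
import itertools
import random

SHOP_SIZES = [(5, 50), (5, 100), (10, 50), (10, 100)]   # (machines m, lots n)
ARRIVALS = [(10, 3), (10, 6), (20, 3), (20, 6)]         # (mu, sigma^2)

def make_instances(seed=0):
    """Enumerate the 4 x 4 = 16 configurations and sample normally distributed arrivals."""
    rng = random.Random(seed)
    instances = []
    for (m, n), (mu, var) in itertools.product(SHOP_SIZES, ARRIVALS):
        sigma = var ** 0.5                               # random.gauss expects the std deviation
        # Assumption: clip at zero so that no lot arrives before the horizon starts.
        arrivals = sorted(max(0.0, rng.gauss(mu, sigma)) for _ in range(n))
        instances.append({"machines": m, "lots": n, "mu": mu, "var": var,
                          "arrivals": arrivals})
    return instances

instances = make_instances()
```

Sorting the sampled times gives the release order in which lots become visible to the batch forming agent.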

Parameters tuning

The designed MS-NSGA-II algorithm is used to sample the Pareto frontier for multiple test instances. The parameters of MS-NSGA-II are set according to the experience with genetic algorithm parameter setting reported in the literature¹⁴ and the number of decision variables in batch machine scheduling, taking into account population diversity and exploration ability. The population size is set to 100 and the maximum number of iterations to 500. The crossover and mutation rates of the three subpopulations are set to (0.8, 0.02), (0.9, 0.01), and (0.7, 0.03), respectively. Running the algorithm on each instance yields a set of sampled frontier points for that instance.

In this section, the elbow method³⁸ is used to optimize the value of K. The results are shown in Figure 11(a). As K increases, the clustering error gradually decreases. When K > 4, the curve tends to flatten, indicating that increasing the number of clusters no longer significantly improves the clustering effect. Therefore, K = 4 is determined to achieve a small clustering error with a relatively small number of clusters, ensuring diversity among the clusters. The visualization of the clustering results using Instance 1 is shown in Figure 11(b).
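One simple way to automate the visual reading of an elbow curve is sketched below. The stopping criterion (the first K whose next improvement falls below a fraction of the loss at K = 1), the threshold, and the loss values are assumptions made for illustration; they are not the procedure or the data behind Figure 11(a).

```python
def elbow(losses, threshold=0.1):
    """losses[i] is the clustering loss for K = i + 1.

    Return the smallest K for which moving to K + 1 improves the loss by less
    than `threshold` times the loss at K = 1.
    """
    base = losses[0]
    for i in range(len(losses) - 1):
        if losses[i] - losses[i + 1] < threshold * base:
            return i + 1
    return len(losses)

# Hypothetical loss curve that flattens after K = 4.
curve = [1.00, 0.55, 0.30, 0.12, 0.10, 0.09]
best_k = elbow(curve)
```

With this hypothetical curve the rule selects K = 4, because the step from four to five clusters reduces the loss by only 0.02.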

Figure 11.

(a) Curve for optimizing the parameter K and (b) clustering results.

In the RL experiments, several key parameters must be determined to optimize the performance of the algorithm. First, probability matching is used to randomly sample candidate scheduling decisions according to the weights of the policy outputs,⁴⁰ which keeps the agent from getting stuck in local optima and improves the diversity of the solution set. A suffix of 0 or 1 is appended to the state matrix so that the global Critic can distinguish batch forming from batch scheduling. Two agents with the same network structure are built, and parameters such as the learning rates $\{α^{B}, α^{S}\}$ , the discount factor $γ$ , and the interaction vector length $m$ are tuned. The results of the orthogonal experiments for some key parameters are shown in Figure 12.
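Probability matching can be contrasted with greedy selection in a few lines. The sketch below assumes that the policy outputs unnormalized scores that are turned into probabilities with a softmax; the scores and the seed are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def probability_matching(logits, rng):
    """Sample an action index with probability equal to its policy weight."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.0]              # hypothetical policy scores for three decisions
counts = [0, 0, 0]
for _ in range(5000):
    counts[probability_matching(logits, rng)] += 1
```

A greedy rule would always pick the first decision here, whereas sampling still tries the other two at a rate proportional to their policy weights, which is the source of the additional solution diversity.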

Figure 12.

Orthogonal experiments in reinforcement learning.

The batch size and experience buffer capacity are determined based on the interaction data during the algorithm scheduling process. Since the interaction data can be reused after parameter updates, the experience buffer capacity is set to 3–5 times the number of interaction data obtained in one round, which improves the efficiency of agent scheduling interactions and accelerates the learning process. The time window size and time scaling factor $bt$ are set according to the scheduling process. The optimizer, discount factor, and number of hidden neurons in the hidden layer are determined based on experience. The final determination of the relevant parameters is shown in Table 4.

Table 4.

Reinforcement learning experimental parameters.

Parameters	Value	Parameters	Value
Number of hidden layers in the networks	4	Time window $tw$	2 h
Number of neurons in the hidden layers of the networks	128	$α_{θ}$	$10^{- 6}$
Activation function in the hidden layers of the networks	ReLU	$α_{ω}$	$10^{- 6}$
Activation function in the output layer of the networks	Softmax	Batch size	4096
Discount factor for rewards $γ$	0.99	Length of interaction vector m	8
Optimizer	Adam	bt	48 h
Number of neurons in the hidden layers of LSTM	128	Capacity of experience replay buffer	409,600

The training process of multi-agent

The evolution of the training metrics reflects the correctness of the algorithm design and the effectiveness of the improvements. The algorithm is configured with the parameters in Table 4 and trained on the training set, and the changes in the agents’ parameters are recorded over the iterations. Figure 13(a) shows that the global cumulative discounted reward of CA-MADRL gradually increases during training. The average cumulative discounted rewards of the batch forming agent (Figure 13(b)) and the batch scheduling agent (Figure 13(c)) also trend upward. This indicates that the two agents have formed a cooperative relationship through information interaction centered on the LSTM units, achieving global optimization. The total tardiness and total energy consumption in Figure 13(d) gradually decrease, validating the consistency between the reward function designed with the optimal weights and the global optimization objective. The value network loss in Figure 13(e) converges gradually, indicating that the global Critic’s evaluation error for the scheduling decisions of the two agents decreases over time, establishing a global evaluation system for parallel BPMs scheduling. In Figure 13(f), the global Critic’s estimate of the agents’ action values gradually increases, indicating that the scheduling performance of the agents improves as the evaluation error decreases.

Figure 13.

(a) Global cumulative discounted reward, (b) reward of batch forming agent, (c) reward of batch scheduling agent, (d) weighted objective function value, (e) loss of value network, and (f) estimated values of the value network.

Algorithm performance comparison experiment

To evaluate the performance of the proposed CA-MADRL algorithm, a traditional multi-policy reinforcement learning (MP-RL) algorithm and the MS-NSGA-II proposed in this paper are used as comparison algorithms. MS-NSGA-II is run with a population size of 100 and a maximum of 300 or 400 iterations. The traditional MP-RL uses weight combinations with uniformly distributed optimization directions, and the number of optimization directions equals the number obtained from the clustering algorithm. The other parameters of its base models are the same as in the proposed approach. Since an RL base model obtains only one solution per run, the base model for each optimization direction is run independently 5 times to obtain multiple solutions.

The results of multi-objective optimization cannot be judged simply from the objective values obtained. Following the evaluation system for multi-objective solution sets established in existing research,⁴¹ the results are evaluated mainly with three indicators: $Δ$ , GD, and IGD.⁴² Since the true Pareto front is unknown, the computationally expensive MS-NSGA-II algorithm is run on each test instance with a maximum of 500 iterations, and the resulting approximate Pareto front $PF^{*}$ is used to represent the true front. For each test instance, we run CA-MADRL and MP-RL 5 times in each direction and compare the obtained solution sets with $PF^{*}$ . To reduce the influence of randomness, each instance was solved 20 times and the mean values were compared. Table 5 reports the comparative results of the three algorithms in terms of the three metrics. CA-MADRL outperforms the other algorithms on all 16 instances, which indicates that both the diversity and the convergence of its solution sets are improved. This also supports the strategy of using the approximate Pareto solution set obtained by MS-NSGA-II together with the optimization directions obtained by cosine-based K-means clustering. Table 5 also gives the average CPU time required by each algorithm to generate its solutions: 8.82 s for CA-MADRL, 9.29 s for MP-RL, 63.06 s for MS-NSGA-II-300, and 81.79 s for MS-NSGA-II-400. On average, CA-MADRL is thus roughly seven to nine times faster than the two MS-NSGA-II configurations and slightly faster than MP-RL, which meets the requirements of dynamic scheduling of real semiconductor batch machines, especially for large-scale problems.
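For orientation, GD and IGD can be computed as in the sketch below. It assumes the common mean-distance variants of both indicators (other definitions of GD use a root of summed squared distances), Euclidean distance in a normalized objective space, and hypothetical points; the exact formulas used in the experiments are not given in the text.

```python
import math

def _nearest(p, reference):
    """Euclidean distance from point p to its closest point in `reference`."""
    return min(math.dist(p, q) for q in reference)

def gd(solutions, pf):
    """Generational distance: mean distance from each obtained solution to the front."""
    return sum(_nearest(s, pf) for s in solutions) / len(solutions)

def igd(solutions, pf):
    """Inverted generational distance: mean distance from each front point to the solutions."""
    return sum(_nearest(q, solutions) for q in pf) / len(pf)

# Hypothetical reference front and obtained solutions in normalized objective space.
pf_star = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
obtained = [(0.0, 1.0), (1.0, 0.0)]
gd_value = gd(obtained, pf_star)
igd_value = igd(obtained, pf_star)
```

In this example GD is zero because both obtained solutions lie on the reference front, while IGD is positive because the middle of the front is not covered, which is why the two indicators are usually read together.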

Table 5.

Comparison results of multiple indicators.

Algorithms	Instance	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
CA-MADRL	Δ	6.18E-01	5.97E-01	6.42E-01	6.30E-01	6.22E-01	5.98E-01	6.12E-01	5.87E-01	6.55E-01	6.38E-01	6.32E-01	6.42E-01	6.78E-01	7.23E-01	6.58E-01	6.81E-01
	GD	2.33E-03	2.44E-03	2.07E-03	2.15E-03	2.21E-03	2.41E-03	2.28E-03	2.46E-03	2.04E-03	2.10E-03	2.14E-03	2.08E-03	1.96E-03	1.71E-03	1.97E-03	1.98E-03
	IGD	2.20E-02	2.81E-02	2.55E-02	2.49E-02	2.68E-02	2.73E-02	2.77E-02	2.99E-02	2.47E-02	2.61E-02	2.63E-02	2.59E-02	2.10E-02	1.85E-02	2.35E-02	2.02E-02
	CPU/s	6.56	5.58	6.61	5.13	13.02	10.41	12.35	11.24	7.31	7.53	5.74	5.81	13.89	10.00	10.84	9.05
MP-RL	Δ	6.94E-01	6.33E-01	6.93E-01	7.06E-01	7.01E-01	6.77E-01	6.85E-01	6.16E-01	7.12E-01	6.65E-01	6.76E-01	7.02E-01	6.94E-01	7.34E-01	6.85E-01	7.08E-01
	GD	2.88E-03	3.15E-03	2.57E-03	3.00E-03	4.01E-03	3.23E-03	3.12E-03	3.26E-03	2.78E-03	2.76E-03	2.92E-03	3.02E-03	3.24E-03	2.47E-03	2.88E-03	2.73E-03
	IGD	3.07E-02	3.91E-02	4.28E-02	5.62E-02	3.69E-02	4.21E-02	3.52E-02	4.52E-02	3.76E-02	3.29E-02	3.41E-02	4.36E-02	2.73E-02	2.53E-02	3.17E-02	2.83E-02
	CPU/s	7.82	7.53	6.55	5.74	12.38	11.43	13.12	10.84	7.73	9.88	7.51	6.06	12.94	9.53	11.65	7.98
MS-NSGA-II-300	Δ	7.93E-01	6.83E-01	7.39E-01	7.12E-01	7.09E-01	6.76E-01	6.99E-01	6.72E-01	7.29E-01	7.04E-01	7.11E-01	7.09E-01	7.39E-01	7.85E-01	6.98E-01	7.53E-01
	GD	3.67E-03	3.17E-03	2.72E-03	2.78E-03	2.99E-03	3.12E-03	2.83E-03	3.13E-03	2.73E-03	2.71E-03	2.73E-03	2.77E-03	2.55E-03	2.28E-03	2.37E-03	2.61E-03
	IGD	3.88E-02	3.58E-02	3.03E-02	3.10E-02	3.34E-02	3.59E-02	3.39E-02	3.78E-02	3.12E-02	3.13E-02	3.08E-02	3.01E-02	2.72E-02	2.24E-02	2.73E-02	2.59E-02
	CPU/s	51.28	52.33	54.87	51.35	80.59	72.73	78.69	71.29	59.92	62.33	53.51	53.97	77.83	65.17	64.85	58.27
MS-NSGA-II-400	Δ	7.42E-01	6.47E-01	6.97E-01	6.82E-01	6.71E-01	6.42E-01	6.61E-01	6.38E-01	6.93E-01	6.72E-01	6.79E-01	6.72E-01	7.04E-01	7.77E-01	6.77E-01	7.14E-01
	GD	3.23E-03	2.81E-03	2.44E-03	2.56E-03	2.69E-03	2.83E-03	2.59E-03	2.82E-03	2.47E-03	2.41E-03	2.41E-03	2.48E-03	2.19E-03	2.38E-03	2.14E-03	2.36E-03
	IGD	3.47E-02	3.02E-02	2.81E-02	2.79E-02	3.01E-02	3.07E-02	3.17E-02	3.45E-02	2.89E-02	2.86E-02	2.85E-02	2.79E-02	2.42E-02	2.32E-02	2.59E-02	2.27E-02
	CPU/s	63.34	64.88	66.92	62.32	101.37	95.62	97.34	90.58	82.34	88.47	70.21	71.17	99.57	86.65	92.17	78.73

To visualize the impact of clustering assistance on the multi-objective scheduling of the RL agents, we plot the Pareto frontiers obtained by CA-MADRL and MP-RL for Instance 5. In Figure 14, the solutions obtained by MP-RL, which is guided by uniformly distributed directions, deviate significantly from $PF^{*}$ , especially the solutions obtained in the $18^{\circ}$ direction. In contrast, the proposed CA-MADRL algorithm uses the optimal direction angles derived from cosine-based clustering analysis as its guiding directions, and its solution distribution aligns closely with $PF^{*}$ . This finding suggests that cosine-based clustering can effectively guide the learning direction of agents in reinforcement learning, improving the optimization efficiency of the algorithm and the diversity of the solution set.

Figure 14.

Pareto frontier comparison.

Conclusion

This paper presents a novel approach called CA-MADRL to solve multi-objective scheduling problems. It addresses both the lack of solution diversity of RL algorithms in multi-objective optimization and the long computation time of evolutionary algorithms in dynamic environments. The multi-objective scheduling problem is first solved by the MS-NSGA-II algorithm, and the results are analyzed by a cosine-based clustering algorithm to obtain the distribution of Pareto solutions in the real solution space. The original problem is then decomposed into multiple sub-problems, and multiple MADRL base models are trained to solve them, where the weights obtained from the clustering analysis are used to design the reward functions of the agents and a parameter sharing strategy is used to speed up training.

The proposed multi-objective BPMs scheduling approach is validated on 16 test instances designed from actual production data of a semiconductor manufacturing company. The results show that the trained multi-objective scheduling model can effectively approximate the true Pareto front, reducing total tardiness and total energy consumption. Furthermore, the response time of the method is much shorter than that of the time-consuming iterative solution process of evolutionary algorithms, which makes it more suitable for practical scheduling applications in factories. The proposed approach provides multiple optimized solutions; how to automatically select, among the aggregated optimization directions, the optimal weights for workshop scheduling based on the actual production status is of great practical significance and is a research direction toward fully automated multi-objective scheduling of parallel BPMs. In addition, big data from the production process could be combined with artificial intelligence technology to predict future production demand and machine status for advance scheduling and optimization, further improving the performance of the scheduling system.

Footnotes

Acknowledgements

We thank the editors and the anonymous reviewers for their fruitful comments and suggestions in improving the quality of this paper.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the National Key R&D Program of China [grant number 2022YFB3305003], the Open Project of Henan Key Laboratory of Intelligent Manufacturing of Mechanical Equipment, Zhengzhou University of Light Industry [grant number IM202303], National Natural Science Foundation of China [grant number 52005099], and the Shanghai Science and Technology Innovation Action Plan [grant number 22511101903].

ORCID iD

Peng Zhang

Data availability statement

The data that has been used is confidential.

References

Zhong

Liu

Bao

A job-priority based soft scheduling approach for uncertain work area scheduling in semiconductor manufacturing. Int J Prod Res 2022; 60(16): 5012–5028.

Mönch

Fowler

Mason

SJ.

Production Planning and control for semiconductor wafer fabrication facilities: modeling, analysis, and Systems. New York, NY: Springer New York, 2013 (Operations Research/Computer Science Interfaces Series, vol. 52).

Arroyo

JEC

Leung

JYT

. Scheduling unrelated parallel batch processing machines with non-identical job sizes and unequal ready times. Comput Oper Res 2017; 78: 117–128.

Zhou

Wang

Zhang

, et al. Research on dyeing workshop scheduling methods for knitted fabric production based on a multi-objective hybrid genetic algorithm. Meas Control 2020; 53(7-8): 1529–1539.

Gokhale

Mathirajan

. Heuristic algorithms for scheduling of a batch processor in automobile gear manufacturing. Int J Prod Res 2011; 49(10): 2705–2728.

Shen

Zhu

. A parallel-machine scheduling with periodic constraints under uncertainty. Adv Mech Eng 2019; 11(12): 1687814019892430.

Fowler

Mönch

. A survey of scheduling with parallel batch (p-batch) processing. Eur J Oper Res 2022; 298(1): 1–24.

Huang

Meng

, et al. A Pareto-based collaborative multi-objective optimization algorithm for energy-efficient scheduling of distributed permutation flow-shop with limited buffers. Robot ComputIntegr Manuf 2022; 74: 102277.

Ikura

Gimple

. Efficient scheduling algorithms for a single batch processing machine. Oper Res Lett 1986; 5(2): 61–65.

10.

Uzsoy

Scheduling a single batch processing machine with non-identical job sizes. Int J Prod Res 1994; 32(7): 1615–1635.

11.

Vepsalainen

APJ

Morton

. Priority rules for job shops with weighted tardiness costs. Manage Sci 1987; 33(8): 1035–1047.

12.

Dupont

Ghazvini

. Minimizing makespan on a single batch processing machine with nonidentical job sizes. J Eur Syst Autom 1998; 32(4): 431–440.

13.

Huang

Tan

, et al. Scheduling unrelated parallel batch processing machines with non-identical job sizes. Comput Oper Res 2013; 40(12): 2983–2990.

14.

Jijun

Daogang

. Application research on improved genetic algorithm and active disturbance rejection control on quadcopters. Meas Control 2024: 1–11.

15.

Chen

Shen

. Dynamic search control-based particle swarm optimization for project scheduling problems. Adv Mech Eng 2016; 8(4): 1687814016641837.

16.

Cui

Huang

, et al. Multi-strategy adaptable ant colony optimization algorithm and its application in robot path planning. KnowlBased Syst 2024; 288: 111459.

17.

Niu

Nie

Zhang

, et al. A greedy randomized adaptive search procedure (GRASP) for minimum weakly connected dominating set problem. Expert Syst Appl 2023; 215: 119338.

18.

Damodaran

Ghrayeb

Guttikonda

MC.

GRASP to minimize makespan for a capacitated batch-processing machine. Int J Adv Manuf Technol 2013; 68(1–4): 407–414.

19.

Gao

, et al. Energy-efficient scheduling of distributed flow shop with heterogeneous factories: a real-world case from automobile industry in China. IEEE Trans Ind Inform 2021; 17(10): 6687–6696.

20.

Shahvari

Logendran

. A bi-objective batch processing problem with dual-resources on unrelated-parallel machines. Appl Soft Comput 2017; 61: 174–192.

21.

Qiao

. ACO-based multi-objective scheduling of parallel batch processing machines with advanced process control constraints. Int J Adv Manuf Technol 2009; 44(9-10): 985–994.

22.

Jiang

Yuan

Xiong

, et al. Obstacle avoidance USV in multi-static obstacle environments based on a deep reinforcement learning approach. Meas Control 2024; 57(4): 415–427.

23.

Wang

Xie

Guo

, et al. Deep reinforcement learning-based rehabilitation robot trajectory planning with optimized reward functions. Adv Mech Eng 2021; 13(12): 16878140211067011.

24.

Liu

Wang

, et al. Task-level decision-making for dynamic and stochastic human-robot collaboration based on dual agents deep reinforcement learning. Int J Adv Manuf Technol 2021; 115(11–12): 3533–3552.

25.

Huang

Mixed-batch scheduling to minimize total tardiness using deep reinforcement learning. Appl Soft Comput 2024; 160: 111699.

26.

Luo

Jiang

Liu

, et al. Flow-shop scheduling problem with batch processing machines via deep reinforcement learning for Industrial Internet of Things. IEEE Trans Emerg Top Comput Intell 2024; 1–16.

27.

Wang

Cai

, et al. Solving non-permutation flow-shop scheduling problem via a novel deep reinforcement learning approach. Comput Oper Res 2023; 151: 106095.

28.

Wang

Zhang

, et al. Independent double DQN-based multi-agent reinforcement learning approach for online two-stage hybrid flow shop scheduling with batch machines. J Manuf Syst 2022; 65: 694–708.

29.

Wang

Zhang

A reinforcement learning method to optimize the priority of product for scheduling the large-scale complex manufacturing systems. Auckland, New zealand: Curran Associates Inc, 2018.

30. Pan, Liang YC. An effective hybrid tabu search algorithm for multi-objective flexible job-shop scheduling problems. Comput Ind Eng 2010; 59(4): 647–662.

31. Nguyen, Vamplew, et al. A multi-objective deep reinforcement learning framework. Eng Appl Artif Intell 2020; 96: 103915.

32. Brucker, Gladky, Hoogeveen, et al. Scheduling a batching machine. J Scheduling 1998; 1(1): 31–54.

33. Loukil, Teghem, Tuyttens. Solving multi-objective production scheduling problems using metaheuristics. Eur J Oper Res 2005; 161(1): 42–61.

34. Arroyo JEC, Armentano. Genetic local search for multi-objective flowshop scheduling problems. Eur J Oper Res 2005; 167(3): 717–738.

35. Nouiri, Bekrar, Trentesaux. Towards energy efficient scheduling and rescheduling for dynamic flexible job shop problem. IFAC-PapersOnLine 2018; 51(11): 1275–1280.

36. Das, Dennis. Normal-boundary intersection: A new method for generating the Pareto surface in nonlinear multicriteria optimization problems. SIAM J Optim 1998; 8(3): 631–657.

37. Yuan, Yang. Research on K-value selection method of K-means clustering algorithm. J 2019; 2(2): 226–235.

38. Syakur, Khotimah, Rochman EMS, et al. Integration K-means clustering method and elbow method for identification of the best customer profile cluster. IOP Conf Ser Mater Sci Eng 2018; 336(1): 012017.

39. Chang P-Y, Damodaran, Melouk. Minimizing makespan on parallel batch processing machines. Int J Prod Res 2004; 42(19): 4211–4220.

40. Rivas. Probability matching and reinforcement learning. J Math Econ 2013; 49(1): 17–21.

41. Gao, et al. An effective multiobjective algorithm for energy-efficient scheduling in a real-life welding shop. IEEE Trans Ind Inform 2018; 14(12): 5400–5409.

42. Kesireddy, Medrano. Elite multi-criteria decision making—Pareto front optimization in multi-objective optimization. Algorithms 2024; 17(5): 206.

A clustering-aided multi-agent deep reinforcement learning for multi-objective parallel batch processing machines scheduling in semiconductor manufacturing
