Abstract
The job-shop scheduling problem (JSSP) is a complex combinatorial problem, especially in dynamic environments. Low-volume-high-mix orders contain various design specifications that introduce a large number of uncertainties into manufacturing systems. Traditional scheduling methods are limited in handling diverse manufacturing resources in a dynamic environment. In recent years, artificial intelligence (AI) has attracted growing interest from researchers for solving dynamic scheduling problems. However, it is difficult to optimize scheduling policies for online decision making while considering multiple objectives. Therefore, this paper proposes a smart scheduler to handle real-time jobs and unexpected events in smart manufacturing factories. New composite reward functions are formulated to improve the decision-making ability and learning efficiency of the smart scheduler. Based on deep reinforcement learning (RL), the smart scheduler autonomously learns to schedule manufacturing resources in real time and improves its decision-making ability dynamically. We evaluate and validate the proposed scheduling model with a series of experiments on a smart factory testbed. Experimental results show that the smart scheduler not only achieves good learning and scheduling performance by optimizing the composite reward functions, but also copes with unexpected events (e.g. urgent or simultaneous orders, machine failures) and balances efficiency against profit.
Introduction
In the last three decades, the world has experienced three industrial revolutions, equipping industry with automated machines and information systems that support rapid and innovative product design. The manufacturing industry plays an important role in the development of the economy and technology. Today, manufacturing companies face a large number of challenges in a turbulent environment originating from saturated markets, abrupt and unprecedented changes in market demands, an increasing number of product variants, and smaller lot sizes.1,2 This trend is intensified by production modes such as mass customization and individualization, which promise the creation of unique products to meet the requirements of almost every customer. 3 The transformation of traditional industrial markets inevitably leads to the evolution of manufacturing modes.
Production scheduling is a complex combinatorial problem, more specifically, a non-deterministic polynomial-time hard (NP-hard) problem, which coordinates manufacturing resources to obtain optimal sequences and combinations. It is traditionally elaborated in a centralized manner using offline methods, under the assumption that the problems are static and deterministic. 4 However, initial production plans often become invalid due to many unexpected events that come from internal (e.g. machine failures, operator absence), external (e.g. rush orders, unavailability of raw materials), or composite factors (e.g. time fluctuation, resource uncertainty). Therefore, scheduling algorithms are required to modify the original schedules to absorb the disturbances. Early studies focus on rescheduling resources offline to obtain new schedules, which may suspend the manufacturing processes and increase the lead time. 5 Simulation technologies (e.g. multi-agent systems) build scheduling algorithms with empirical rules and historical data, but they are limited in achieving optimal schedules in a changeable manufacturing environment. 6 Recently, artificial intelligence (AI) algorithms such as reinforcement learning (RL) have been applied to industrial manufacturing for online scheduling. However, most RL-based scheduling algorithms are inefficient at learning to optimize decision-making policies when multiple objectives are considered in a smart manufacturing factory. 7 The overall performance of a manufacturing system is influenced by many factors such as order requirements, machine properties, and supply chain profits, which can be transformed into composite reward functions in RL-based scheduling systems. This paper realizes online scheduling based on RL with composite reward functions, which makes manufacturing systems more efficient and robust. Scheduling decisions are made online according to the operation attributes and machine states together.
Composite rewards are given to the smart scheduler at all action moments, equipping smart manufacturing systems with self-learning and adaptation capabilities.
The remaining sections of this paper are organized as follows. Section 2 gives an overview of existing manufacturing scheduling approaches. Based on RL, Section 3 introduces composite rewards and online methods to develop a system with self-learning abilities for job-shop scheduling. Section 4 describes an experimental testbed used to conduct case studies on online scheduling. Section 5 presents detailed experimental results and discusses the general relationships between composite rewards and manufacturing scheduling. Finally, Section 6 summarizes the paper and outlines future prospects.
Literature review
The production scheduling problem has been widely studied, mainly due to its dynamic nature, highly combinatorial aspects, and applicability in manufacturing systems. 8 In past years, researchers proposed many scheduling methods such as mathematical programming,9,10 rule-based dispatching,11–13 expert systems,14,15 heuristic search,16,17 or biomimetic algorithms. 18 For example, Hou et al. 19 formulated an integrated distributed production and distribution problem with a mixed integer programming model and developed an enhanced brain storm optimization algorithm to solve the model.
Fu et al. 20 proposed an evolutionary algorithm and a local search method to solve a dual-objective stochastic hybrid flow shop scheduling problem considering job deterioration and uncertain job processing times. However, these methods handle jobs or uncertainties with continuous iterations or periodic optimization, which takes extra computing time to adjust the initial plans offline and obtain new optimal schedules. In addition, duplicated calculations for rescheduling resources may cause the whole system to shut down when the environmental states change frequently. 21 When dealing with scheduling and rescheduling problems, some researchers used swarm intelligence algorithms 22 to obtain optimal solutions from a large action space. In dynamic and uncertain environments, the optimized schedule produced by traditional scheduling methods can quickly become unacceptable, so dynamic rescheduling is required as fast as possible.
The fourth industrial revolution, marked by intelligent manufacturing, attracts researchers to a new research area of intelligent scheduling. 23 Agent technologies provide a natural way to conduct production scheduling in distributed manufacturing environments.24–27 Many socialized interactive mechanisms, such as the contract net protocol, 28 game theory, 29 and auction mechanisms, 30 are used to coordinate machines in multi-agent systems, where resources can acquire rational dynamic schedules autonomously based on specific rules. 31 In traditional multi-agent systems, machines negotiate and interact with each other according to specific rules. Such agent-based or holonic 4 scheduling systems implement distributed manufacturing, but they require the support of sophisticated rules and have poor adaptability. Rule-based scheduling methods provide a platform for machines to realize real-time job-shop scheduling; however, historical data cannot be stored and future circumstances cannot be anticipated. Hence, some evaluation factors are used to record operation logs, such as capability, self-confidence, activity, and robustness. These factors provide guidance for prospective decisions, but they merely record how often machines participate in the decision platform. Thus, such improvements are weak in providing comprehensive and rapid decision mechanisms for machine interaction, and it is difficult to improve the scheduling performance substantially. Distributed scheduling integrates discrete units or factories and enables them to operate independently, which improves the operating efficiency of the whole manufacturing system. 32 However, the distributed manufacturing manner may cause environmental fluctuation and local optima.
Currently, innovative online scheduling methods are needed to follow the trend of mass customization of products. These methods have one or more typical features such as self-organized operation, self-adaptive adjustment, and self-learning optimization. With the success of AlphaGo,33,34 RL aroused the interest of researchers, and many of them tried to apply RL to production scheduling. Zhang et al. 35 addressed a dynamic unrelated parallel machine scheduling problem with a mean weighted tardiness objective, but unexpected events, order details, and rewards other than tardiness are not involved. Shiue et al. 36 proposed a real-time scheduling method based on RL using multiple dispatching rules by incorporating two main mechanisms, that is, an offline learning module and a Q-learning-based module. Shahrabi et al. 37 used RL to improve the scheduling performance for dynamic job-shop scheduling problems, considering random job arrivals and machine failures. Kardos et al. 38 designed a scheduling algorithm based on Q-learning to solve the dynamic job-shop scheduling problem for reducing the average lead time of production orders. However, it is difficult for basic Q-learning algorithms to adapt to problems with a large action space. Wang et al. 39 used correlated equilibrium to propose a multi-agent RL algorithm for makespan and cost optimization to guide the scheduling of multi-workflows over clouds, but many real events and factors existing in real manufacturing conditions are neglected. Such research brought new approaches to manufacturing scheduling, dealing with dynamic task assignments and solving uncertainties well in job-shop scheduling problems. Nevertheless, the studies mentioned above are mainly concerned with optimizing makespan and costs from the operational research perspective, without an overall study of various disturbances, reward-scheduling mechanisms, and industrial implementation.
Meanwhile, most approaches to job-shop scheduling are only nominally real-time and lack true online features, as their responsiveness is largely limited by job numbers and computing efficiency.
This paper concentrates on establishing an online scheduling system to address the trend toward globalized and individualized manufacturing. A smart factory testbed was built to realize communication between machines and the smart scheduler. RL is used to establish a smart scheduler for online scheduling of orders and handling of unexpected events, which equips the self-organized manufacturing system with self-adaptive abilities. The smart scheduler improves its comprehensive capabilities through operating processes guided by innovative composite rewards. An instant reward is given to the smart scheduler after each scheduling action rather than after all orders are completed, which increases the training efficiency of the smart scheduler.
Research methodology
Model description
An order contains only one job denoted by

The timeline of job
The total number of machines is denoted by
Some constraints are set up to keep the shop floor operating robustly. First, the chosen machine is capable of undertaking the corresponding operations, that is
Jobs are generated in real time and are automatically added to the job list, ensuring that the actual machining conditions are met. As shown in Figure 2, each block represents an operation of a job marked by a time label with initialization time and completion time, and the boundaries of blocks in the job list represent scheduling moments. As the former operation of a job is completed, the next operation of this job is initialized and added to the operation list. The scheduler handles operations along the dashed line in the operation list of Figure 2. For example, job

Online scheduling with RL according to the operation list:
RL for online scheduling
Reinforcement learning
RL is a series of algorithms to learn how to map situations to actions so as to maximize a numerical reward signal.
40
RL can be formalized as a Markov decision process (MDP), which is a straightforward framework for learning from interactions with a dynamic environment to maximize the total reward. A typical RL model comprises two parts: an agent and an environment. The agent learns to choose an action
where
Many algorithms are designed to obtain optimal policies of MDPs, such as temporal difference (TD), dynamic programming (DP), and Monte Carlo (MC) methods. Like DP, TD methods update estimates incrementally rather than waiting for an eventual result, which increases the efficiency of calculation. Like MC methods, TD methods interact directly with environments to acquire their statuses, which avoids depending on the transition probabilities of models. Industrial environments need high stabilization, swift reaction, and low latency, so TD methods are appropriate for the operation of smart factories. Q-learning 41 is an off-policy TD algorithm that incorporates the advantages of DP and MC methods. The iterative equation of Q-learning is shown as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t)]

where α is the learning rate, γ is the discount factor, s_t and a_t are the state and action at time step t, and r_{t+1} is the reward received after taking the action.
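As an illustration, the TD update of Q-learning can be sketched in a few lines of Python; the state/action encoding, learning rate, and discount factor below are illustrative assumptions, not the paper's actual parameters.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning (temporal-difference) update on a tabular Q-function.
    Q is a dict mapping each state to a list of action values."""
    td_target = r + gamma * max(Q[s_next])    # bootstrap from the best next action
    Q[s][a] += alpha * (td_target - Q[s][a])  # move the estimate towards the TD target
    return Q
```

Because the update bootstraps from `max(Q[s_next])` rather than from the action actually taken next, the learned policy can differ from the behavior policy, which is what makes Q-learning off-policy.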
In a Q-learning algorithm, the maps from the state space to the action space are recorded in a Q-table. If the state space is too large, a neural network can be used in place of the Q-table; this is the Deep Q-Network (DQN). 42 The DQN has two neural networks: a target network and an evaluation network. The loss function of DQN, shown in equation (3), is used to update the parameters of the evaluation network. Parameters of the target network are copied from the evaluation network every few steps:

L(θ) = E[(r + γ max_{a'} Q(s', a'; θ⁻) - Q(s, a; θ))^2]

where θ denotes the parameters of the evaluation network and θ⁻ those of the target network.
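The DQN loss can be sketched as follows; the networks are stubbed as plain callables, since the paper's actual architecture is described later.

```python
def dqn_loss(q_eval, q_target, batch, gamma=0.9):
    """Mean squared TD error: targets are bootstrapped with the frozen target
    network, while gradients would flow only through the evaluation network.
    q_eval / q_target are callables mapping a state to a list of Q-values."""
    errors = []
    for s, a, r, s_next, done in batch:
        y = r if done else r + gamma * max(q_target(s_next))
        errors.append((y - q_eval(s)[a]) ** 2)
    return sum(errors) / len(errors)
```

Freezing the target network between copy steps keeps the bootstrap targets stable, which is the main reason DQN trains more reliably than naive neural Q-learning.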
From RL to job-shop scheduling problem
The precondition of solving scheduling problems with RL methods is turning scheduling problems into multi-step decision problems. In a job-shop scheduling problem, there are three main roles: jobs, machines, and schedulers. Jobs are assigned to appropriate machines by schedulers. To implement online scheduling, schedulers should make decisions as soon as jobs are generated, which is a continuous single-step decision manner. Online scheduling depends on the current state of the shop floor, and this scheduling approach turns a conventional JSSP into an RL model. The relationships between the RL and JSSP models are listed in Table 1. The environment of a shop floor contains machines, material handlers, warehouses, etc. All jobs come from warehouses, and they are brought back to them when the last operation is finished. Material handlers transport jobs between machines or between machines and warehouses. Machines play an important role in the processing of jobs.
The relationships between RL and JSSP models.
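Under this mapping, the shop floor acts as the RL environment and each ready operation triggers a decision. A minimal sketch follows; the machine attributes and the waiting-time reward are illustrative assumptions, not the paper's exact formulation.

```python
from dataclasses import dataclass

@dataclass
class Machine:
    speed: float             # machining speed factor (assumed attribute)
    busy_until: float = 0.0  # time when the machine becomes free

class ShopFloorEnv:
    """Toy shop-floor environment: the agent picks a machine for each ready
    operation; the reward here is the negative waiting time of the job."""
    def __init__(self, machines):
        self.machines = machines
        self.clock = 0.0

    def step(self, op_duration, machine_id):
        m = self.machines[machine_id]
        start = max(self.clock, m.busy_until)       # wait if the machine is busy
        m.busy_until = start + op_duration / m.speed
        return -(start - self.clock)                # reward: penalize waiting
```

A scheduler interacting with this environment receives an instant reward after every assignment, mirroring the paper's per-action reward design rather than a single end-of-episode reward.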
State space of the shop floor
In a job-shop scheduling system, all machines are interconnected to build an environment for machining jobs and giving back rewards for the scheduling actions. Each action receives a reward, and the accumulated rewards of a workpiece constitute a value for estimating its trajectory. A smart scheduler helps workpieces choose the appropriate machines according to the current state of the shop floor. The state space is made up of the attributes of schedulable operations and machines. The detailed state features of the shop-floor environment are listed in Table 2. At time step
State features of the shop-floor environment.
Rewards for scheduling optimization
Evaluation criteria of manufacturing scheduling methods come from three aspects: productivity (e.g. time and flow efficiency), cost (e.g. resource consumption), and customer satisfaction (e.g. tardiness and quality). Most studies pursue a single objective, minimizing the makespan, which belongs to the first aspect. In practice, resource consumption should be considered along with shortening the machining period, which not only improves customer satisfaction but also saves manufacturing costs. Increasing the utilization rate of machines or choosing high-efficiency machines can minimize the makespan of jobs. Generally, it is necessary to balance increasing efficiency against saving costs to obtain higher profits.
Shortening makespan
One of the optimization objectives is to minimize the mean tardiness rate of all jobs as follows:
where
where
Saving production costs
The other optimization objective is to minimize the mean consumption of resources. The amount of resource consumption rises with machining efficiency. The resource consumption
Profits increase with the reduction of costs. Machines with higher speed can complete operations swiftly, but consume more resources. The prices of orders depend on the workload time
where
The profit
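The profit relation described above can be sketched as order price minus energy cost; the rate parameters below are hypothetical, since the paper's exact coefficients appear in its equations.

```python
def profit(workload_time, price_rate, energy_used, energy_price):
    """Order price scales with workload time; profit is price minus the
    cost of the energy consumed (all rates are illustrative)."""
    return workload_time * price_rate - energy_used * energy_price
```

The trade-off is visible directly: a faster machine shortens `workload_time` for the same price but raises `energy_used`, so profit does not always favor the fastest machine.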
Increasing machine utilization rates
Different machines have different properties, and schedulers tend to choose efficient machines that complete tasks quickly. If a large number of tasks are assigned to a few specific capable machines, it may cause system congestion and workload imbalance. Bottleneck resources can also slow down the operating efficiency of manufacturing systems. It is necessary to balance the workloads among the targeted machines. The standard deviation of the utilization rates of machines of type
where
The utilization rate of machine
where
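The workload-balance measure described above (the standard deviation of utilization rates over the machines of one type) can be computed, for example, with Python's statistics module; the inputs are assumed to be per-machine operating times over a common elapsed time.

```python
from statistics import pstdev

def workload_deviation(operating_times, elapsed_time):
    """Population standard deviation of machine utilization rates for one
    machine type; utilization = operating time / elapsed time."""
    rates = [t / elapsed_time for t in operating_times]
    return pstdev(rates)
```

A value of zero means perfectly balanced workloads; larger values indicate that a few machines are doing most of the work.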
Composite reward functions
There are many evaluation criteria and optimization objectives in job-shop scheduling for different working conditions, including time, cost, machine utilization, and system capacity. The machining time of each workpiece is recorded. One important objective is to minimize the makespan to guarantee the delivery time of orders. Many methods aim to reduce the flow time of workpieces, such as improving machining efficiency and shortening waiting time. High-speed machines can complete tasks rapidly, but incur higher costs. Low-speed machines incur lower equipment costs, but need more time for machining workpieces. A workload balance should be maintained among machines with different working abilities. The operating time and idle time of each machine are recorded to calculate the utilization rate of each machine, and a comparison is made among machines. It is necessary to balance workloads among machines and improve their utilization rates. The warehouse downloads orders from the order system and stores them in its buffers. If there are spare targeted machines, the warehouse will assign tasks to them. If the shop floor is always busy and the task buffers are full of tasks all the time, the scheduling system can provide improvement suggestions to entrepreneurs.
According to the above criteria, the optimization objectives of job-shop scheduling problems mainly fall into three types: lead time, resource consumption, and machine utilization rates. Most research on job-shop scheduling optimization is concerned with time parameters such as makespan, tardiness, switching time, and deadlines. However, under actual conditions, resource consumption should be taken into consideration as well. Thus, a multi-objective optimization measure is introduced into the RL-based scheduling method, where makespan and resource consumption are considered together. Such approaches are not restricted to minimizing the makespan of orders to guarantee the target completion time; they also conform to actual machining conditions.
When a scheduler takes an action at time step
where
The weights
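A composite reward of this kind is simply a weighted sum of the individual terms; the component names and weights below are placeholders for the paper's actual definitions.

```python
def composite_reward(components, weights):
    """Weighted sum of reward terms, e.g. time saving rate, energy saving
    rate, machine utilization, and workload balance (names assumed)."""
    assert len(components) == len(weights)
    return sum(w * c for w, c in zip(weights, components))
```

Adjusting the weights shifts the scheduler's priorities, for example trading a faster makespan against lower energy consumption, without changing the learning algorithm itself.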
Scheduling actions
Scheduling methods aim to achieve a balance between jobs and machines to obtain optimized combinations and sequences of these two resources. In general, the actions of a scheduler include two types, scheduling and machining, which can change the state of the shop floor. Detailed steps are shown in Figure 3 to illustrate the online scheduling processes based on DQN algorithms. The scheduler takes actions at scheduling moments, that is, the operation initialization and completion moments. At these moments, the scheduler collects all the states of related machines and makes decisions based on operation attributes and machine states. Once a scheduling action is chosen, the job is transported to the target machine. The states of machines are updated and a reward for the action is given. Between two scheduling moments, the scheduler and machines do nothing but process operations. If there is no operation to do or all the operations are finished, machines are in a standby mode and take no action. If all machine buffers are occupied, scheduled jobs have to wait until a job is finished and a buffer is released. In these operation actions, the scheduler advances the timeline and updates all machine states to the next scheduling moment.

Online scheduling processes of the smart scheduler based on RL.
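At each scheduling moment, the DQN scheduler picks among the feasible machines (capable type, non-full buffer). A common choice for this step, assumed here rather than stated in the paper, is epsilon-greedy selection over the network's Q-values:

```python
import random

def choose_machine(q_values, feasible, epsilon=0.1):
    """Epsilon-greedy action: explore a random feasible machine with
    probability epsilon, otherwise exploit the highest-valued one."""
    if random.random() < epsilon:
        return random.choice(feasible)
    return max(feasible, key=lambda m: q_values[m])
```

Restricting the argmax to the `feasible` set enforces the buffer and capability constraints without needing to penalize infeasible actions in the reward.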
Regarding scheduling actions, when a new order is generated or the current operation is finished, the scheduler chooses a machine for the next operation. Apart from scheduling actions, the environment does not move until the operations on a machine are finished. The RL method simplifies scheduling models without ignoring any conditions, as shown in Figure 4. Figure 4 is a real-time Gantt chart used to explain the action types. It is assumed that there are three millers, three lathes, and a warehouse on the shop floor, and each machine has a buffer of length 4. The horizontal axis is a timeline and the vertical axis represents machines. Above the timeline, operations that have been scheduled are marked by rectangles in bold lines, and operations that will be scheduled are in double dot-dash lines. The operations waiting to be scheduled are below the timeline, and the solid bricks indicate that they are processed without delay. At time t1, an operation is initialized on the shop floor. If it is a milling operation, it can be scheduled on Miller 2 and processed immediately, that is, ActM2. It can also queue up behind the operation lists of Miller 1 or Miller 3. Miller 3 needs more waiting time than Miller 1, but it needs less machining time for the same operation. If it is a lathing operation, the optional machines are Lathe 1 and Lathe 2. It cannot choose Lathe 3 because the remaining buffer length of Lathe 3 is 0. At time

Action types of the smart scheduler.
Neural networks are used to establish a mapping relationship between operation-machine states

The neural network of DQN for getting
Experimental design
Manufacturing resource specifications

The layout of the smart manufacturing factory.
The details of orders for scheduling.
The inherent attributes of machines.
Case studies for the scheduling model
Experimental results and discussion
Learning performance of the smart scheduler
Composite reward functions are designed to help the smart scheduler improve its decision-making ability. Different reward functions bring different scheduling results, and combinations of reward functions may obtain better improvements than a single component. The composite reward

Learning curves of all the components of composite rewards: (a) reward for time saving rate, (b) reward for energy saving rate, (c) reward for machine utilization rate, (d) reward for workload distribution deviation, and (e) composite rewards.
At the very beginning, the scheduler may take many attempts and many steps to schedule the orders in an episode. With the accumulation of experience, its decision-making abilities improve considerably during these training episodes, and it can make scheduling decisions instantly with its rich experience. Figure 7(a) to (d) are learning curves of the model based on
After an order is completed, reward

Learning processes of the scheduler based on time saving rates: (a) time saving rates and (b) scheduling processes of the scheduler.
The reward

Learning processes of the scheduler based on machine utilization rates: (a) waiting time of jobs and (b) scheduling processes of the scheduler.
The workload balance of machines is a significant criterion for evaluating the performance of shop floors. The standard deviation of the machine utilization rates, given by equation (12), is a useful parameter for illustrating the workload balance of machines. Standard deviations of machine operating time are calculated separately for lathes and millers, and the results of the first 300 episodes are shown in Figure 10(a). The fitting curves of both lathes and millers show a decreasing trend. The standard deviations of lathe operating time decrease from 70.48 to 23.92, by 66.06%, and those of millers decrease from 92.82 to 30.34, by 67.31%. As shown in Figure 10(b), the scheduler trained with reward

Learning processes of the scheduler based on workload distribution deviations: (a) the standard deviations of workloads and (b) scheduling processes of the scheduler.
The composite reward

Scheduling results of the scheduler based on composite rewards: (a) results of the untrained scheduler and (b) results of the trained scheduler.
Scheduling performance for resource variations
According to the above experimental results, the trained scheduler performs well in scheduling processes. Thirty new orders are created stochastically to evaluate the adaptive abilities of a trained scheduler. The smart scheduler is trained with composite rewards

Scheduling performances for new orders of the scheduler based on composite rewards: (a) results of the untrained scheduler and (b) results of the trained scheduler.
Self-adaptive performances for new orders, failure machines, and simultaneous jobs.
Machine failures are common in production processes; such problems are inevitable and may disturb shop floors. The scheduler faces great challenges in dealing with these unexpected technical failures. According to the Gantt chart in Figure 13, the smart scheduler perceived the failure of Miller 1 and scheduled the subsequent milling operations to the other normal millers. In comparison with Figure 12(b), the utilization rates of the normal millers increased considerably while the actions of the three lathes remained almost the same. Standard deviations of lathe utilization rates change slightly from 26.86 to 27.40, and those of millers increase from 12.25 to 22.00, by 79.59%. The total waiting time of jobs increases from 77 to 250, by 224.68%. The total time saving rate decreases from 19.46 to 18.26, by 6.17%. Detailed results are recorded in the third row of Table 5. Evidently, the smart scheduler can mitigate the effects of machine failures and maintain the stability of the manufacturing system.

Scheduling performance for machine failures.
Large numbers of orders may be generated simultaneously. As shown in Figure 14, 30 orders are assumed to be generated at the same time. Jobs queue to be processed without delay, and the utilization rates of all machines are very high. The detailed scheduling results of simultaneous orders are recorded in the last row of Table 5. Although the total waiting time is long, the target completion time can still be guaranteed. Although the scheduler has not met such events before, it still maintains a preferable and robust performance in handling them. Hence, the smart scheduler has good adaptability for scheduling different kinds of orders online according to the real-time statuses of machines on a shop floor.

Scheduling for simultaneous orders.
Adaptive performance for different objectives
On most occasions, there are various machines with different machining abilities and efficiency on the shop floor. The efficiency of machines is denoted by the machining speed factor

Scheduling results considering order urgency and machine efficiency: (a) operating time of machines and (b) dealing with urgent orders.
As shown in Table 6, the total operating time of machines changes during the training period of the smart scheduler. The operating time of Lathe 1 and Miller 1 increases dramatically, by 324.41%–750.81%. Meanwhile, the operating time of the other machines decreases to some extent. The total utilization time of lathes and millers decreases by 62.06%–66.58%. The scheduling performance of the smart scheduler improved considerably during the training period when considering order urgency and machine efficiency.
Operating time of machines considering order urgency and machine efficiency.
The production objectives of companies include not only improving operating efficiency but also saving costs to increase profits. The energy consumption factor

Scheduling results considering profits and energy consumption: (a) profits contributed by machines and (b) scheduling actions for saving energy.
As shown in Table 7, the total profit of machines increases from 2190.76 to 2392.20, by 9.19%, during the training period of the scheduler. Profit factors contributed by Lathe 3 and Miller 3 increase from 504.25 and 515.15 to 678.65 and 713.59, respectively, by 34.59%–38.52%. Meanwhile, profits from the other machines decrease to some extent. Apparently, the trained scheduler considering energy saving rates is capable of autonomously scheduling operations on machines with lower energy consumption rates.
Machine profits considering resource consumption rates.
Scheduling performances of the scheduler considering machining efficiency and resource consumption have been verified separately in the previous parts of this section. In real manufacturing processes, efficiency and profits should be taken into consideration at the same time. In this part, these two criteria are considered simultaneously by including all the components of composite reward

Scheduling results considering order urgency and resource consumption.
Comparison of different scheduling methods
Most production scheduling methods aim at time-related objectives that concern the order sequence and completion time. RL-based scheduling with composite rewards (RL-C) is compared with some common online and offline scheduling methods. About 100 orders with 152 operations are generated stochastically to evaluate the scheduling performances of the Genetic Algorithm (GA), Shortest Processing Time First (SPTF), First Come First Serve (FCFS), and RL-based methods. GA is a heuristic method that can be used to obtain optimal schedules offline. Another offline scheduling method, SPTF, is adopted to obtain a schedule with the shortest total waiting time. FCFS is a common rule-based online method for dynamic manufacturing scheduling that schedules jobs according to their initialization time. A basic RL-based scheduling method (RL-B), which aims to minimize the idle time of machines, is also adopted for comparison. The experimental algorithms belong to three typical categories of scheduling methods: iterative optimization (i.e. GA), rule-based simulation (i.e. SPTF and FCFS), and AI-based self-learning (i.e. RL-B and RL-C).
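For reference, the two rule-based baselines reduce to simple priority functions over the set of ready operations; the field names below are illustrative, not taken from the paper.

```python
def fcfs(ready_ops):
    """First Come First Serve: pick the operation initialized earliest."""
    return min(ready_ops, key=lambda op: op["init_time"])

def sptf(ready_ops):
    """Shortest Processing Time First: pick the shortest operation."""
    return min(ready_ops, key=lambda op: op["proc_time"])
```

The fixed priority key is what makes such rules fast but inflexible: unlike the RL-based methods, they cannot shift their preference when machine states or objectives change.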
The learning performances of the three online scheduling methods (i.e. FCFS, RL-B, and RL-C) are shown in Figure 18, where the performance metric is the total waiting time of 100 orders in each episode. The waiting time of an order runs from the scheduling completion moment to the machining start moment. FCFS schedules orders based on fixed rules and is not capable of improving its decision-making abilities during scheduling processes. In contrast, the RL-based scheduling methods (i.e. RL-B and RL-C) learn to minimize the waiting time of orders in real time. The learning curves of RL-B and RL-C converge near episode 270 and episode 170, respectively. RL-C converges faster than RL-B and needs 37.0% fewer training episodes. Therefore, the proposed RL-C scheduling method shows better learning performance than the traditional rule-based method and the basic RL scheduling method.

The learning performance of three online scheduling methods.
GA and SPTF are offline scheduling methods that reschedule the jobs when unexpected events (e.g. urgent orders and machine failures) happen in the manufacturing environment. FCFS, RL-B, and RL-C are online scheduling methods that schedule jobs and handle unexpected events in real time. The calculation time and makespan of each scheduling method are recorded in Table 8. GA can give the best schedules, but it takes a long time to get the result through a complex iteration process, and some unexpected events may cause a substantial increase in computing time. When 10 urgent orders are generated stochastically, the online methods handle the events well with a negligible increase in computing time and makespan. When a machine breaks down, all the scheduling methods can reschedule the operations on the normal machines with an increase of makespan. As shown in Figure 19, the curves of RL-C nearly coincide with the results of GA and are more robust than those of the other scheduling methods. Composite rewards improve the online decision-making abilities of the smart scheduler.
Comparison of different scheduling methods (values: computing time/makespan; units: s).

Comparison of the makespan of different scheduling methods.
Conclusions and future work
This paper presents a smart scheduler for online scheduling low-volume-high-mix orders in a smart manufacturing factory. RL algorithms equip the smart scheduler with the abilities of self-organization, self-learning, and self-adaptation. A new composite reward function enables the smart scheduler for online optimization of multiple scheduling objectives. A series of experiments are performed to evaluate how the reward functions (i.e.
Low-volume-high-mix orders and unexpected events bring a large number of uncertainties to real-world production environments. RL and composite reward functions help the smart scheduler learn autonomously to schedule manufacturing resources with dynamic features in real time. The smart scheduler can adapt to different real-case situations by optimizing the components of the composite reward function. Therefore, engineers do not need to spend much time rescheduling resources when the attributes of orders or machines vary in the dynamic manufacturing environment, which reduces human costs, computing resources, and lead time.
Real production processes are more complex than experimental environments, and various factors (e.g. transition time and transportation optimization) existing in real working conditions should be considered in future research. To share the computing tasks of the centralized scheduler, research on improving the decision-making abilities of each machine is being conducted by the authors of this paper. In addition, we will delve into how to optimize the framework and parameters of DQN or other RL-based scheduling algorithms.
Footnotes
Handling Editor: Chenhui Liang
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The National Key Research and Development Program of China (No. 2018YFE0117000), the National Natural Science Foundation of China (No. 52075257), the Fundamental Research Funds for the Central Universities (No. NS2021070), and the Key Research and Development Program of Jiangsu Province (No. BE2021091).
