
Abstract

In the cut-tobacco drier production line, precise temperature control of the thin-plate roaster is essential for ensuring product quality and operational safety. However, existing temperature prediction models exhibit limited prediction accuracy, resulting in unstable product quality and potential safety risks. To address this challenge, a temperature prediction model for the thin-plate roaster is proposed, integrating a back-propagation (BP) neural network optimized by the parrot optimization algorithm (PO). Production data from a tobacco factory were collected, focusing on seven key operational parameters: IP4 steam film valve opening, process gas velocity, process gas temperature, process gas pressure, condensate water temperature, dehumidification air volume, and hood pressure. A BP neural network was constructed using these features as input variables. To enhance model performance, the PO algorithm was employed to optimize the initial weights and thresholds of the BP network. The proposed PO-BP model was systematically compared with traditional optimization algorithms, including particle swarm optimization (PSO), grey wolf optimization (GWO), and genetic algorithm (GA). Experimental results demonstrate that the PO-BP model achieves superior generalization ability and prediction accuracy. Specifically, the coefficient of determination ( $R^{2}$ ) reaches 0.998, and the root mean square error (RMSE) is reduced to 0.001. These metrics indicate that the PO-BP model effectively mitigates the BP network’s tendency to converge to local minima, while simultaneously improving global search capability and convergence efficiency. Consequently, the proposed model provides a robust and accurate solution for thin-plate temperature prediction in industrial roasting processes.

Keywords

cut-tobacco drier, thin plate temperature prediction, BP neural network, prediction accuracy, parrot optimization algorithm

Introduction

With the continuous advancement of artificial intelligence in recent years, manufacturing enterprises have progressively adopted intelligent management and predictive maintenance for production-line machinery and equipment.^1,2 As the backbone of the tobacco industry, tobacco machinery is pivotal in ensuring the safe and stable operation of the entire production process.³

Tobacco machinery encompasses a variety of specialized units, among which dryers, leaf-processing machines, and cutting machines are indispensable for cigarette manufacturing.⁴ The dryer is a critical component for tobacco conditioning, and its operating temperature is a decisive parameter. Heat is transferred to the tobacco via thermal conduction: steam flowing between the thin plates releases heat, which evaporates moisture from the tobacco. The generated vapor is subsequently condensed and expelled through the exhaust port, completing the dehydration cycle, as shown in Figure 1.⁵ Any deviation in thin-plate temperature disturbs the final moisture content of the cut tobacco, leaving it either too dry or too moist, which compromises flavor quality and lowers the market value of the finished cigarettes. Such deviations can impose substantial losses on the entire production line. Hence, an accurate method for predicting the baking temperature of cut tobacco under varying operational conditions is urgently required to safeguard product quality.^6–8

Figure 1.

Drying machine processing flow diagram.

State-of-the-art research on condition prediction can be broadly classified into three categories: model-based, experimental, and data-driven approaches. Model-based methods demand an in-depth understanding of drying-machine mechanisms and entail substantial manual effort, whereas experimental methods incur high costs. In contrast, data-driven techniques require minimal domain expertise and are more cost-effective. With the rapid advancement of artificial intelligence, these data-driven approaches have been widely adopted for machinery condition prediction.⁹ For health-state prediction of proton-exchange-membrane fuel cells, Hong et al.¹⁰ proposed a hybrid framework that couples a semi-empirical model with machine learning. Stochastic operating scenarios were first generated via the semi-empirical model to produce voltage outputs, which were then employed as health indicators and fed into a neural network for training the final prediction model. Luo et al.¹¹ introduced an RBF-neural-network-based state-of-charge (SOC) predictor for batteries and further enhanced its accuracy by integrating an ensemble learning algorithm (ELA). Wang et al.¹² developed a short-term multi-state forecasting scheme for wind turbines using Graph Attention Networks (GAT) combined with Graph Learning (GL) modules. By constructing a time-series graph and exploiting GL modules for automated feature extraction, the model effectively predicts multi-variable turbine states. Lan et al.¹³ presented a WOA-RF-Adaboost-based hydroturbine condition-prediction model. VMD was first applied to decompose the pressure-pulsation signal; permutation entropy, kurtosis of IMFs, and the mean of the original signal were then calculated to form a feature vector for the RF classifier. 
Wang et al.¹⁴ introduced an MFE-GRU-TCA hybrid model for lithium-ion battery health-state and remaining-useful-life (RUL) prediction, integrating multi-feature extraction with a temporal-convolution attention mechanism to improve SOH and RUL accuracy. Ding et al.¹⁵ developed a friction-system operating-condition predictor that fuses chaotic characteristics with a BP neural network, aiming to forecast the break-in state of shafts and gears. Zheng et al.¹⁶ proposed an HVCB mechanical-state prediction strategy combining LSTM and SVM. LSTM first predicts breaker signal (BS), contact travel (CT), and coil current (CC); key mechanical feature parameters are then derived from these predictions and diagnosed by an SVM model for final state classification. Pei et al.¹⁷ designed an adaptive data-fusion framework integrating XGBoost and Dempster-Shafer theory to feed a multi-task neural network for real-time baking-quality prediction and parameter optimization in industrial thermal systems. The scheme analyzes multi-source feature variation patterns to identify tobacco-baking states and determine heating times accurately. Chen et al.¹⁸ examined deep-learning-based post-disaster damage-state assessment of nonlinear structures by combining Transformer and Informer networks with a customized classifier. Leveraging attention mechanisms for long-range dependencies, the model delivers accurate long-term predictions for seismic response and structural damage states.

The above studies collectively demonstrate that machine-learning and deep-learning models exhibit strong generalization capability and powerful nonlinear representation, making them particularly suitable for complex data-driven industrial applications. They provide an efficient and practical solution for accurately predicting the thin-plate temperature of drying machines.¹⁹ Martínez-Martínez et al.²⁰ developed an artificial neural network (ANN) approach to predict temperature and relative humidity during tobacco drying. A wireless sensor network (WSN) deployed on an industrial dryer continuously acquired spatial temperature–humidity profiles, which were then fed into a trained ANN to forecast values at different locations and future time steps. Wu et al.²¹ fused tobacco images with environmental curing data to construct a three-layer back-propagation (BP) neural network capable of high-accuracy temperature and humidity prediction. Wu²² employed a genetic algorithm (GA) to optimize support vector machines (SVM) for curing-state recognition, achieving an accuracy of 96.5%. Building on this, Wu and Yang²³ proposed a multi-sensor data-fusion framework that combines least-squares SVM with an adaptive neuro-fuzzy inference system (ANFIS). Odor, image, and moisture features were fused as inputs to enhance prediction reliability further. Leveraging a “multi-scale convolution–attention–lightweight physical regularisation” architecture, Mu et al.²⁴ realized high-precision, 2-h-ahead moisture-content forecasting on real tobacco-curing big data, delivering a practical soft-sensing solution for smart curing. Guo et al.²⁵ introduced a “COMSOL–Neural Network–Poisson Fusion” framework that synergizes mechanism-driven (COMSOL) and data-driven (neural network) modelling. 
COMSOL provides spatial gradient fields, neural networks refine numerical accuracy, and a Poisson fusion algorithm removes boundary discontinuities, enabling real-time, high-resolution temperature–humidity field prediction across the curing chamber with minimal sensor deployment. Wang and Qin²⁶ designed a lightweight feature set combining RGB-HSV color and mass characteristics for densely packed tobacco leaves. A two-layer stacking ensemble (SPFM) that couples LSTM and XGBoost achieved 97.4% accuracy in real-time curing-state prediction and has been successfully embedded in intelligent curing platforms. Zheng et al.²⁷ applied a Quantum Genetic Algorithm (QGA) to tune XGBoost hyperparameters, yielding a QGA-XGB moisture-deficit predictor automatically. The system provides real-time humidification compensation, reducing the average inlet moisture deviation of shredded tobacco to 0.079% and improving the process capability index (CPK) to 2.02, significantly outperforming manual and empirical controls. Wen et al. tackled pronounced moisture variability and strong variable coupling at the outlet of cylinder-and-tube-plate shredders by installing smart sensors for ambient temperature, humidity, and steam quality.²⁸ A “Random-Forest Incremental-Learning Prediction + Multi-level Collaborative Feedback” control strategy stabilized moisture: CPK increased by 45%, standard deviation decreased by 52%, and non-steady-state duration was shortened by 15%.

Although numerous temperature-prediction strategies have been proposed for tobacco-shred drying, prevailing models still exhibit shortcomings in generalization, architectural complexity, computational overhead, and real-time responsiveness. To overcome these limitations, this paper introduces a PO-BP neural-network model specifically devised for thin-plate temperature prediction in tobacco dryers. The Parrot Optimization algorithm is employed to search for globally optimal weights and thresholds, thereby establishing the PO-BP predictor. This approach strengthens parameter identification and global search during model construction, mitigates convergence to local minima, and markedly enhances prediction accuracy. Comparative experiments conducted on real production data from a cigarette factory demonstrate that the PO-BP model not only achieves higher accuracy but also captures the thin-plate temperature dynamics more effectively than alternative BP variants.

Neural network model construction

The basic idea of BP algorithm

The back-propagation (BP) neural network, trained with the error back-propagation algorithm, comprises an input layer, one or more hidden layers, and an output layer. It is widely applied to classification, regression, pattern recognition, and data mining tasks.^29–34 The architecture of a BP neural network is illustrated in Figure 2. Training begins by propagating the input signal forward through the hidden layers to the output layer. The predicted values are then compared with the target values; if they differ, the error gradient is propagated backward from the output layer toward the input layer, and gradient descent iteratively updates the network weights. Nevertheless, standard BP networks are susceptible to local minima because of their gradient-based nature. To mitigate this limitation, this paper adopts a Parrot Optimization (PO) enhanced BP neural network.

Figure 2.

BP neural network architecture illustration.

In the BP algorithm, the core mathematical tool is the chain rule of calculus. If $z$ is a differentiable function of $y$ , and $y$ is a differentiable function of $x$ , then:

\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}

(1)

For a given training set $D = \{(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{m}, y_{m})\}$ , $x_{k} \in R^{d}$ , $y_{k} \in R^{l}$ , the steps for implementing the BP algorithm are as follows.

Step 1: Define the loss function

For the sample $(x_{k}, y_{k})$ , let the output of the neural network be ${\hat{y}}^{k} = ({\hat{y}}_{1}^{k}, {\hat{y}}_{2}^{k}, \dots, {\hat{y}}_{l}^{k})$ , that is,

{\hat{y}}_{j}^{k} = f (β_{j} - θ_{j})

(2)

In equation (2), $β_{j}$ represents the weighted input sum for the j-th neuron in the output layer, and $θ_{j}$ represents the threshold for the j-th neuron in the output layer.

The mean square error of the neural network on the samples is given by³⁵:

E_{k} = \frac{1}{2} \sum_{j = 1}^{l} {({\hat{y}}_{j}^{k} - y_{j}^{k})}^{2}

(3)
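As a minimal sketch of equations (2) and (3), the forward output and the per-sample error can be written as below. It assumes the sigmoid activation that the derivation adopts in Step 3; the function names are illustrative and not taken from the paper.

```python
import math

def sigmoid(x):
    """Logistic activation f(x) = 1 / (1 + exp(-x)) (assumed activation)."""
    return 1.0 / (1.0 + math.exp(-x))

def output_layer(beta, theta):
    """Equation (2): y_hat_j = f(beta_j - theta_j) for each output neuron j."""
    return [sigmoid(b - t) for b, t in zip(beta, theta)]

def sample_error(y_hat, y):
    """Equation (3): E_k = 1/2 * sum_j (y_hat_j - y_j)^2."""
    return 0.5 * sum((p - q) ** 2 for p, q in zip(y_hat, y))
```

The factor 1/2 in `sample_error` cancels the exponent when differentiating, which is why equation (7) has no leading constant.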

Step 2: Define the parameter adjustment strategy

The BP neural network algorithm is based on the gradient descent strategy, adjusting the parameters in the direction of the negative gradient of the target. This can be expressed mathematically as³⁶:

v = v_{0} + Δ v

(4)

Δ v = - η \frac{\partial E_{k}}{\partial v}

(5)
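The update rule in equations (4) and (5) can be illustrated on a toy problem. The sketch below assumes a one-dimensional quadratic objective, which is chosen only for illustration and does not appear in the paper; $η$ is the learning rate.

```python
def gradient_step(v, grad, eta):
    """Equations (4)-(5): v <- v + delta_v with delta_v = -eta * dE/dv."""
    return v - eta * grad

def minimize_quadratic(v0=0.0, eta=0.1, steps=100):
    """Toy objective E(v) = (v - 3)^2, whose gradient is 2 * (v - 3)."""
    v = v0
    for _ in range(steps):
        v = gradient_step(v, 2.0 * (v - 3.0), eta)
    return v
```

With these settings each step shrinks the distance to the minimizer by a constant factor, so the iterate approaches 3 geometrically.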

Step 3: Calculate the gradient of the output layer threshold $\frac{\partial E_{k}}{\partial θ_{j}}$ .

Since ${\hat{y}}_{j}^{k}$ directly influences $E_{k}$ , and $θ_{j}$ directly affects ${\hat{y}}_{j}^{k}$ , according to the chain rule in equation (1),

\frac{\partial E_{k}}{\partial θ_{j}} = \frac{\partial E_{k}}{\partial {\hat{y}}_{j}^{k}} \cdot \frac{\partial {\hat{y}}_{j}^{k}}{\partial θ_{j}}

(6)

This can be obtained from equation (3):

\frac{\partial E_{k}}{\partial {\hat{y}}_{j}^{k}} = {\hat{y}}_{j}^{k} - y_{j}^{k}

(7)

Choose the sigmoid function as the activation function, and according to equation (2), we can obtain:

\frac{\partial {\hat{y}}_{j}^{k}}{\partial θ_{j}} = - {\hat{y}}_{j}^{k} (1 - {\hat{y}}_{j}^{k})

(8)

So there is:

\frac{\partial E_{k}}{\partial θ_{j}} = \frac{\partial E_{k}}{\partial {\hat{y}}_{j}^{k}} \cdot \frac{\partial {\hat{y}}_{j}^{k}}{\partial θ_{j}} = {\hat{y}}_{j}^{k} (1 - {\hat{y}}_{j}^{k}) (y_{j}^{k} - {\hat{y}}_{j}^{k})

(9)

Denote it as $g_{j}$ , that is,

g_{j} = \frac{\partial E_{k}}{\partial θ_{j}} = {\hat{y}}_{j}^{k} (1 - {\hat{y}}_{j}^{k}) (y_{j}^{k} - {\hat{y}}_{j}^{k})

(10)
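Equation (10) can be checked numerically. The sketch below, for a single output neuron with a sigmoid activation, compares the closed form of $g_{j}$ with a central finite difference of $E_{k}$ with respect to $θ_{j}$ ; the helper names and the step size are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def error(beta, theta, y):
    """E_k for a single output neuron: 1/2 * (f(beta - theta) - y)^2."""
    return 0.5 * (sigmoid(beta - theta) - y) ** 2

def g(beta, theta, y):
    """Equation (10): g_j = y_hat * (1 - y_hat) * (y - y_hat)."""
    y_hat = sigmoid(beta - theta)
    return y_hat * (1.0 - y_hat) * (y - y_hat)

def numeric_dE_dtheta(beta, theta, y, h=1e-6):
    """Central finite difference of E_k with respect to theta."""
    return (error(beta, theta + h, y) - error(beta, theta - h, y)) / (2.0 * h)
```

For any inputs, `g(beta, theta, y)` and `numeric_dE_dtheta(beta, theta, y)` should agree up to finite-difference error, which confirms the sign in equation (9).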

Step 4: Calculate the gradient of the connection weight $w_{hj}$ from the hidden layer to the output layer, $\frac{\partial E_{k}}{\partial w_{hj}}$ .

Since ${\hat{y}}_{j}^{k}$ directly affects $E_{k}$ , $β_{j}$ directly affects ${\hat{y}}_{j}^{k}$ , and the weight $w_{hj}$ directly affects $β_{j}$ , the chain rule gives:

\frac{\partial E_{k}}{\partial w_{hj}} = \frac{\partial E_{k}}{\partial {\hat{y}}_{j}^{k}} \cdot \frac{\partial {\hat{y}}_{j}^{k}}{\partial β_{j}} \cdot \frac{\partial β_{j}}{\partial w_{hj}}

(11)

According to Step 3:

\frac{\partial {\hat{y}}_{j}^{k}}{\partial β_{j}} = {\hat{y}}_{j}^{k} (1 - {\hat{y}}_{j}^{k})

(12)

Since $β_{j} = \sum_{h} w_{hj} b_{h}$ is the weighted input sum of the j-th output neuron, it follows that:

\frac{\partial β_{j}}{\partial w_{hj}} = b_{h}

(13)

where $b_{h}$ is the output activation value of the h-th neuron in the hidden layer.

By combining with equation (10), we can express the gradient of the connection weights from the hidden layer to the output layer as:

\frac{\partial E_{k}}{\partial w_{hj}} = - g_{j} b_{h}

(14)

Step 5: Calculate the gradient of the hidden layer threshold $γ_{h}$ , $\frac{\partial E_{k}}{\partial γ_{h}}$ .

Since $b_{h}$ affects $E_{k}$ , and $γ_{h}$ affects $b_{h}$ , according to the chain rule,

\frac{\partial E_{k}}{\partial γ_{h}} = \frac{\partial E_{k}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial γ_{h}}

(15)

In addition,

\left\{ \begin{matrix} \frac{\partial E_{k}}{\partial b_{h}} = \sum_{j = 1}^{l} \frac{\partial E_{k}}{\partial {\hat{y}}_{j}^{k}} \cdot \frac{\partial {\hat{y}}_{j}^{k}}{\partial β_{j}} \cdot \frac{\partial β_{j}}{\partial b_{h}} = - \sum_{j = 1}^{l} g_{j} w_{hj} \\ \frac{\partial b_{h}}{\partial γ_{h}} = \frac{\partial f (α_{h} - γ_{h})}{\partial γ_{h}} = - f' (α_{h} - γ_{h}) = - b_{h} (1 - b_{h}) \end{matrix} \right.

(16)

where $α_{h}$ is the weighted input sum of the h-th hidden neuron, so that $b_{h} = f (α_{h} - γ_{h})$ . Therefore:

\frac{\partial E_{k}}{\partial γ_{h}} = b_{h} (1 - b_{h}) \sum_{j = 1}^{l} g_{j} w_{hj}

(17)

From equation (17), we can conclude that the gradient of the hidden layer threshold depends on the output of the hidden layer neurons, the gradient of the output layer threshold, and the connection weights between the hidden layer and the output layer. In a multilayer feedforward network, the gradient of the hidden layer threshold is expressed as the threshold gradient of layer $m$ , $g_{h}^{(m)}$ ; the output of the hidden layer neurons is represented as the output of the layer $m$ neurons, $b_{h}^{(m)}$ ; the connection weights between the hidden layer and the output layer are represented as the weights of layer $m + 1$ , $w_{hj}^{(m + 1)}$ ; and the gradient of the output layer threshold is expressed as the threshold gradient of layer $m + 1$ , $g_{j}^{(m + 1)}$ . Then, the above expression can be transformed to:

g_{h}^{(m)} = b_{h}^{(m)} (1 - b_{h}^{(m)}) \sum_{j = 1}^{l} w_{hj}^{(m + 1)} g_{j}^{(m + 1)}

(18)

It can be seen that in the threshold adjustment process, the gradient of the current layer’s threshold depends on the gradient of the next layer’s threshold.

Similarly, the gradient of the connection weights from the hidden layer to the output layer $\frac{\partial E_{k}}{\partial w_{hj}}$ can be expressed as the gradient of the connection weights $p_{hj}^{(m)}$ at layer $m$ . In a similar manner, $g_{j}$ is expressed as the threshold gradient $g_{j}^{(m)}$ at layer $m$ , and $b_{h}$ is expressed as the output of the neurons at layer $m - 1$ , $b_{h}^{(m - 1)}$ . Thus, equation (14) can be transformed into:

p_{hj}^{(m)} = - g_{j}^{(m)} b_{h}^{(m - 1)}

(19)

In other words, the gradient of the connection weights for the current layer depends on the gradient of the current layer's neuron thresholds and the output of the previous layer's neurons. As long as the threshold gradient of the next layer (the layer closer to the output) is known, the gradients of the current layer's neuron thresholds and connection weights can be calculated. Starting from equation (10), which gives the gradient of the output layer neuron thresholds, the gradients of the neuron thresholds and connection weights for the entire network can therefore be computed layer by layer, thus achieving the goal of training the network.
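The derivation above can be sketched for a network with a single hidden layer. The code below is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: sigmoid activations in both layers, per-sample updates, `v[i][h]` as input-to-hidden weights, and a toy single-sample problem in `train_demo` that is hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, v, gamma, w, theta):
    """Return hidden outputs b and network outputs y_hat."""
    q, l = len(gamma), len(theta)
    b = [sigmoid(sum(v[i][h] * x[i] for i in range(len(x))) - gamma[h])
         for h in range(q)]
    y_hat = [sigmoid(sum(w[h][j] * b[h] for h in range(q)) - theta[j])
             for j in range(l)]
    return b, y_hat

def bp_step(x, y, v, gamma, w, theta, eta):
    """One in-place gradient-descent update on one sample.

    Returns E_k evaluated before the update.
    """
    b, y_hat = forward(x, v, gamma, w, theta)
    q, l = len(gamma), len(theta)
    # Equation (10): output-layer term g_j, equal to dE/dtheta_j.
    g = [y_hat[j] * (1 - y_hat[j]) * (y[j] - y_hat[j]) for j in range(l)]
    # Equation (17): hidden-layer term e_h, equal to dE/dgamma_h (uses old w).
    e = [b[h] * (1 - b[h]) * sum(w[h][j] * g[j] for j in range(l))
         for h in range(q)]
    for h in range(q):
        for j in range(l):
            w[h][j] += eta * g[j] * b[h]   # dE/dw_hj = -g_j * b_h, equation (14)
    for j in range(l):
        theta[j] -= eta * g[j]             # dE/dtheta_j = g_j
    for i in range(len(x)):
        for h in range(q):
            v[i][h] += eta * e[h] * x[i]   # dE/dv_ih = -e_h * x_i
    for h in range(q):
        gamma[h] -= eta * e[h]             # dE/dgamma_h = e_h
    return 0.5 * sum((y_hat[j] - y[j]) ** 2 for j in range(l))

def train_demo(epochs=2000, eta=0.5, seed=0):
    """Fit a toy single-sample problem; returns (first error, last error)."""
    rng = random.Random(seed)
    d, q, l = 2, 3, 1
    v = [[rng.random() for _ in range(q)] for _ in range(d)]
    w = [[rng.random() for _ in range(l)] for _ in range(q)]
    gamma = [rng.random() for _ in range(q)]
    theta = [rng.random() for _ in range(l)]
    x, y = [0.2, 0.8], [0.9]
    errors = [bp_step(x, y, v, gamma, w, theta, eta) for _ in range(epochs)]
    return errors[0], errors[-1]
```

Both gradient terms are computed before any parameter changes, so the hidden-layer term uses the same weights as the forward pass, as the derivation requires.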

The pseudo-code of Algorithm 1 is as follows:

Algorithm 1 BP neural network algorithm

Require: Training set $T = \{(x_{i}, y_{i})\}_{i = 1}^{n}$ , learning rate $η$
Ensure: Connection weights $w^{k}$
Function $BP (T, η)$
    Randomly initialize the connection weights and thresholds in the network within the range (0,1)
    repeat
        for all $(x_{i}, y_{i}) \in T$ do
            Compute the output for the present sample ${\hat{y}}_{j}^{i}$
            Compute the gradient for neurons in the output layer $g_{j}$
            Compute the gradient of neurons in the hidden layer $\frac{\partial E_{k}}{\partial γ_{h}}$
            Update the weights $Δ w_{hj}$ and $Δ v_{kh}$
            Update the thresholds $Δ θ_{j}$ and $Δ γ_{h}$
        end for
    until the stop condition is reached
End Function

PO-BP neural network

Basic idea of the parrot algorithm

In 2024, Lian et al. introduced the PO algorithm—an effective meta-heuristic inspired by the interaction between parrots and their owners.³⁵ The method begins by randomly generating an initial population of candidate solutions. In each iteration, PO navigates the search space by stochastically exhibiting one of several parrot-inspired behaviors, guiding the population toward the vicinity of the global optimum. Throughout the optimization, every individual is continuously influenced by the current best solution, dynamically adjusting its position until the predefined termination criteria are satisfied. Compared with traditional meta-heuristics such as Genetic Algorithms (GA) and Particle Swarm Optimization (PSO), PO’s probabilistic behavioral selection markedly enhances population diversity. This mechanism enables the PO-BP model to escape local minima while consistently maintaining solution quality.

The CEC2005 test functions are a standard function set used to evaluate and compare the performance of optimization algorithms. To assess the PO algorithm, four representative functions were selected: unimodal F1, multimodal F6, hybrid F12, and composite F13. As illustrated in Figure 3, PO consistently outperforms the compared algorithms, exhibiting both faster convergence and lower fitness values across all four functions, thereby evidencing its superior optimization capability.

Figure 3.

Optimization performance comparison of various algorithms on different functions: (a) unimodal function F1, (b) multimodal function F6, (c) hybrid function F12, and (d) composite function F13.

Parrot Optimization algorithm implementation steps:

Initialize the population: In the initialization phase, a swarm of parrots is generated, with each agent (parrot) encoding a potential solution within the search space;

Defining the fitness function: The fitness function is used to evaluate the solution of each parrot in the Parrot Optimization algorithm;

Simulating foraging behavior: Parrots move within the search space, seeking better solutions;

Simulating staying behavior: In certain situations, a parrot may choose to stay in its current position, conducting a more detailed search in the current area;

Simulating communication behavior: Information exchange between individual parrots to help find the global optimal solution;

Simulating fearful behavior toward strangers: When a parrot encounters a stranger, it exhibits fear and flees, which can be seen as an escape strategy to help the algorithm break out of local optimal solutions;

Update positions: Based on the behaviors described earlier, update the position of each parrot;

Iteration and convergence: Repeat the steps above until the set number of iterations is reached or the convergence criteria are fulfilled;

Outputting the optimal solution: The algorithm will yield the best solution found or an approximate optimal solution.

Parrot algorithm mathematical model

Population initialization

The initialization formula for PO is:

X_{i}^{0} = a + rand () \cdot (b - a)

(20)

In equation (20), $X_{i}^{0}$ denotes the parrot’s starting position; $a$ and $b$ are the lower and upper limits of the search area, respectively; $rand ()$ signifies a random value between 0 and 1.³⁷
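A minimal sketch of equation (20) follows. It assumes that $rand ()$ is drawn independently for every coordinate of every parrot, which the equation does not state explicitly; the function name and the `rng` parameter are illustrative.

```python
import random

def init_population(n, dim, a, b, rng=random):
    """Equation (20): X_i^0 = a + rand() * (b - a), one draw per coordinate.

    a is the lower bound and b the upper bound of the search area.
    """
    return [[a + rng.random() * (b - a) for _ in range(dim)] for _ in range(n)]
```

Passing a seeded `random.Random` instance as `rng` makes the initial population reproducible.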

Foraging behavior

In foraging behavior, parrots determine the location of food by observing either the food’s position or the owner’s position, and then fly to their respective locations.³⁸ The position update formula is as follows:

X_{i}^{t + 1} = (X_{i}^{t} - X_{best}) \cdot L (d) + rand () \cdot {(1 - \frac{t}{Max_{iter}})}^{\frac{2 t}{Max_{iter}}} \cdot X_{mean}^{t}

(21)

In equation (21), $X_{i}^{t}$ denotes the parrot’s current location, while $X_{i}^{t + 1}$ indicates its location after the foraging update. $t$ is the current iteration number, $X_{best}$ is the best position found so far, and $X_{mean}^{t}$ denotes the mean position of the current population. $L (d)$ denotes the Lévy distribution, which characterizes the intermittent long jumps observed in parrot flight trajectories. $(X_{i}^{t} - X_{best}) \cdot L (d)$ signifies the individual parrot’s displacement relative to the best position, while $rand () \cdot {(1 - \frac{t}{Max_{iter}})}^{\frac{2 t}{Max_{iter}}} \cdot X_{mean}^{t}$ denotes the stochastic adjustment of the population’s positional centroid during iteration t, where:

$Max_{iter}$ : Maximum number of iterations;

$X_{mean}^{t}$ : Mean position of the population at iteration t.

The expression for $X_{mean}^{t}$ is as shown in equation (22).

X_{mean}^{t} = \frac{1}{N} \sum_{k = 1}^{N} X_{k}^{t}

(22)

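The Lévy distribution $L (d)$ is obtained from the rules in equation (23), where $Γ$ denotes the gamma function and $β$ is the Lévy exponent (distinct from the weighted input sum $β_{j}$ used in the BP derivation).

\left\{ \begin{matrix} L (d) = \frac{ε \cdot ζ}{{| v |}^{\frac{1}{β}}} \\ ε \sim N (0, d) \\ v \sim N (0, d) \\ ζ = {(\frac{Γ (1 + β) \cdot \sin (\frac{π β}{2})}{Γ (\frac{1 + β}{2}) \cdot β \cdot 2^{\frac{1 + β}{2}}})}^{β + 1} \end{matrix} \right.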

(23)
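Equations (21) to (23) can be sketched as follows. Several readings are assumptions rather than facts from the paper: the leading factors in $ζ$ are taken to be the gamma function, $N (0, d)$ is read as one standard-normal draw per dimension, $rand ()$ in equation (21) is treated as a single scalar, and the exponent of $ζ$ is kept as printed in equation (23) rather than replaced by another Lévy-step formulation.

```python
import math
import random

def mean_position(population):
    """Equation (22): coordinate-wise mean of all parrot positions."""
    n = len(population)
    return [sum(p[k] for p in population) / n for k in range(len(population[0]))]

def levy(dim, beta=1.5, rng=random):
    """Levy-flight step per equation (23), one value per dimension."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((1 + beta) / 2)
    zeta = (num / den) ** (beta + 1)        # exponent as printed in equation (23)
    steps = []
    for _ in range(dim):
        eps = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)
        while v == 0.0:                     # guard against division by zero
            v = rng.gauss(0.0, 1.0)
        steps.append(eps * zeta / abs(v) ** (1 / beta))
    return steps

def forage(x, x_best, x_mean, t, max_iter, rng=random):
    """Equation (21): foraging position update for one parrot."""
    L = levy(len(x), rng=rng)
    shrink = rng.random() * (1 - t / max_iter) ** (2 * t / max_iter)
    return [(x[k] - x_best[k]) * L[k] + shrink * x_mean[k] for k in range(len(x))]
```

The heavy tail of the step comes from dividing by a power of $| v |$ : small draws of $v$ occasionally produce very large jumps.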

Stationary behavior

The stationary behavior of a parrot can be represented by the equation:

X_{i}^{t + 1} = X_{i}^{t} + X_{best} \cdot L (d) + rand () \cdot ones (1, d)

(24)

In equation (24), $ones (1, d)$ denotes a d-dimensional all-ones vector (i.e., a vector of length d with all elements equal to 1), $X_{best} \cdot L (d)$ denotes the behavior of parrots flying to where their owner is located, and $rand () \cdot ones (1, d)$ denotes the parrot stopping at a random location on the owner’s body.³⁹
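A minimal sketch of equation (24) is given below. The Lévy step is passed in as a precomputed list `levy_step` (a hypothetical parameter) so that the example stays self-contained, and the random offset is a single scalar shared by all coordinates, as the product with $ones (1, d)$ suggests.

```python
import random

def stay(x, x_best, levy_step, rng=random):
    """Equation (24): X_i^{t+1} = X_i^t + X_best * L(d) + rand() * ones(1, d)."""
    r = rng.random()  # one scalar offset applied to every coordinate
    return [x[k] + x_best[k] * levy_step[k] + r for k in range(len(x))]
```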

Communication behavior

In PO, parrots communicate either by flying toward the population or by avoiding it. The two behaviors are assumed to occur with equal probability, selected by a random number $P$ in [0, 1], and the population’s centroid is represented by its mean position. This process can be mathematically expressed as:

X_{i}^{t + 1} = \left\{ \begin{matrix} 0.2 rand () \cdot (1 - \frac{t}{Max_{iter}}) \cdot (X_{i}^{t} - X_{mean}^{t}), P \leq 0.5 \\ 0.2 rand () \cdot \exp (- \frac{t}{rand () \cdot Max_{iter}}), P > 0.5 \end{matrix} \right.

(25)

In equation (25), $0.2 rand () \cdot (1 - \frac{t}{Max_{iter}}) \cdot (X_{i}^{t} - X_{mean}^{t})$ represents the process of an individual parrot joining the population for communication, while $0.2 rand () \cdot \exp (- \frac{t}{rand () \cdot Max_{iter}})$ signifies the parrot flying away immediately after communication.⁴⁰
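Equation (25) can be sketched as a two-branch update. The sketch assumes that the branch is chosen by a uniform random draw, that the two $rand ()$ calls in the second branch are independent, and that the scalar second branch is broadcast to every coordinate; none of these details is spelled out in the text.

```python
import math
import random

def communicate(x, x_mean, t, max_iter, rng=random):
    """Equation (25): join the flock (P <= 0.5) or fly away (P > 0.5)."""
    if rng.random() <= 0.5:
        scale = 0.2 * rng.random() * (1 - t / max_iter)
        return [scale * (x[k] - x_mean[k]) for k in range(len(x))]
    r = max(rng.random(), 1e-12)            # guard against division by zero
    value = 0.2 * rng.random() * math.exp(-t / (r * max_iter))
    return [value] * len(x)                 # scalar broadcast (assumption)
```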

Fear behavior toward strangers

The instinct of a parrot to show fear and move away from strangers can be represented by equation (26):

\begin{matrix} X_{i}^{t + 1} = X_{i}^{t} + \cos (0.5 π \cdot \frac{t}{Max_{iter}}) \cdot rand () \cdot (X_{best} - X_{i}^{t}) \\ - \cos (π \cdot rand ()) \cdot {(\frac{t}{Max_{iter}})}^{\frac{2}{Max_{iter}}} \cdot (X_{i}^{t} - X_{best}) \end{matrix}

(26)

Here, $\cos (0.5 π \cdot \frac{t}{Max_{iter}}) \cdot rand () \cdot (X_{best} - X_{i}^{t})$ signifies the adjustment to fly in the direction of the owner, and $\cos (π \cdot rand ()) \cdot {(\frac{t}{Max_{iter}})}^{\frac{2}{Max_{iter}}} \cdot (X_{i}^{t} - X_{best})$ represents the process of moving away from strangers.⁴¹
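A minimal sketch of equation (26) follows, assuming that each $rand ()$ is a single scalar draw shared by all coordinates; the function name is illustrative.

```python
import math
import random

def fear(x, x_best, t, max_iter, rng=random):
    """Equation (26): fly toward the owner and away from strangers."""
    toward = math.cos(0.5 * math.pi * t / max_iter) * rng.random()
    away = math.cos(math.pi * rng.random()) * (t / max_iter) ** (2 / max_iter)
    return [x[k] + toward * (x_best[k] - x[k]) - away * (x[k] - x_best[k])
            for k in range(len(x))]
```

A parrot that already sits at $X_{best}$ is left unchanged, because both terms are multiplied by the difference between the two positions.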

The pseudo-code of Algorithm 2 is as follows:

Algorithm 2 PO algorithm

Initialize the PO parameters
Initialize the solutions’ positions randomly
for $i = 1 : Max_{iter}$ do
    Compute the objective function
    Find the best position
    for $j = 1 : N$ do
        $Step = randi ([1, 4])$
        if $Step == 1$ then
            Update position by equation (21)
        else if $Step == 2$ then
            Update position by equation (24)
        else if $Step == 3$ then
            Update position by equation (25)
        else if $Step == 4$ then
            Update position by equation (26)
        end if
    end for
end for
Return the best solution
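Algorithm 2 can be sketched end to end as below. This is an illustrative reading, not the reference implementation: the best-so-far solution is tracked across iterations, updated positions are clipped to the bounds, the leading factors of the Lévy constant are read as the gamma function, and scalar random draws are shared across coordinates. All of these are assumptions.

```python
import math
import random

def levy(dim, rng, beta=1.5):
    """Levy step as printed in equation (23), one value per dimension."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((1 + beta) / 2)
    zeta = (num / den) ** (beta + 1)
    return [rng.gauss(0, 1) * zeta / max(abs(rng.gauss(0, 1)), 1e-12) ** (1 / beta)
            for _ in range(dim)]

def po_minimize(f, dim, a, b, n=10, max_iter=30, seed=0):
    """Minimize f over [a, b]^dim; returns (best, best_f, history)."""
    rng = random.Random(seed)
    X = [[a + rng.random() * (b - a) for _ in range(dim)] for _ in range(n)]  # eq. (20)
    best, best_f, history = None, math.inf, []
    for t in range(1, max_iter + 1):
        for x in X:                                   # evaluate and track the best
            fx = f(x)
            if fx < best_f:
                best, best_f = x[:], fx
        history.append(best_f)
        mean = [sum(x[k] for x in X) / n for k in range(dim)]                # eq. (22)
        for i, x in enumerate(X):
            step = rng.randint(1, 4)
            if step == 1:                             # foraging, eq. (21)
                L = levy(dim, rng)
                s = rng.random() * (1 - t / max_iter) ** (2 * t / max_iter)
                new = [(x[k] - best[k]) * L[k] + s * mean[k] for k in range(dim)]
            elif step == 2:                           # staying, eq. (24)
                L = levy(dim, rng)
                r = rng.random()
                new = [x[k] + best[k] * L[k] + r for k in range(dim)]
            elif step == 3:                           # communication, eq. (25)
                if rng.random() <= 0.5:
                    s = 0.2 * rng.random() * (1 - t / max_iter)
                    new = [s * (x[k] - mean[k]) for k in range(dim)]
                else:
                    r = max(rng.random(), 1e-12)
                    new = [0.2 * rng.random() * math.exp(-t / (r * max_iter))] * dim
            else:                                     # fear of strangers, eq. (26)
                c1 = math.cos(0.5 * math.pi * t / max_iter) * rng.random()
                c2 = math.cos(math.pi * rng.random()) * (t / max_iter) ** (2 / max_iter)
                new = [x[k] + c1 * (best[k] - x[k]) - c2 * (x[k] - best[k])
                       for k in range(dim)]
            X[i] = [min(max(val, a), b) for val in new]   # clip to bounds (assumption)
    return best, best_f, history
```

For example, `po_minimize(lambda x: sum(v * v for v in x), 3, -5.0, 5.0)` runs the sketch on a sphere function; `history` is non-increasing because only improvements replace the stored best.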

PO-BP prediction model

Randomly initialized connection weights can lead to inaccurate predictions when building a BP neural network. Furthermore, gradient-descent training is slow and struggles to escape local minima, which makes global optimization of the network difficult. Therefore, the Parrot Optimization algorithm is used to optimize the weights and thresholds of the BP neural network by treating them as parrot positions and searching for the best solution. In the initialization phase, the PO algorithm randomly generates a set of candidate solutions that cover the entire search space. This randomness enables the algorithm to start searching from multiple different starting points, increasing the chances of finding the global optimum. Moreover, the Lévy flight strategy of PO allows the algorithm to make large-scale jumps, effectively exploring the search space and avoiding local optima. Throughout the search process, individuals dynamically adjust their search strategies by randomly selecting different behaviors, preventing premature convergence. Through the synergistic effect of the four behaviors, the algorithm can simultaneously perform global and local searches, enabling better exploration of the solution space and consequently demonstrating higher accuracy and stability in practical applications.
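Treating the weights and thresholds as a parrot position amounts to flattening them into one vector and scoring that vector by the network's prediction error. The sketch below shows one possible encoding; the parameter order, the sigmoid activations, and the use of mean squared error as the fitness are assumptions made for illustration.

```python
import math

def n_params(d, q, l):
    """Number of weights and thresholds in a d-q-l network."""
    return d * q + q + q * l + l

def decode(position, d, q, l):
    """Split a flat parrot position into (v, gamma, w, theta)."""
    it = iter(position)
    v = [[next(it) for _ in range(q)] for _ in range(d)]
    gamma = [next(it) for _ in range(q)]
    w = [[next(it) for _ in range(l)] for _ in range(q)]
    theta = [next(it) for _ in range(l)]
    return v, gamma, w, theta

def fitness(position, data, d, q, l):
    """Mean squared error over data = [(x, y), ...]: the value PO would minimize."""
    v, gamma, w, theta = decode(position, d, q, l)
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    total = 0.0
    for x, y in data:
        b = [sig(sum(v[i][h] * x[i] for i in range(d)) - gamma[h]) for h in range(q)]
        y_hat = [sig(sum(w[h][j] * b[h] for h in range(q)) - theta[j]) for j in range(l)]
        total += sum((y_hat[j] - y[j]) ** 2 for j in range(l)) / l
    return total / len(data)
```

With seven inputs, nine hidden nodes, and one output, `n_params(7, 9, 1)` gives the dimension of the search space that each parrot position would span under this encoding.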

The algorithm flowchart for the PO-BP network is illustrated in Figure 4.

Figure 4.

Algorithm flowchart of the PO-BP network.

Experiments and results

Experimental environment

To confirm the feasibility of the proposed method, the model was validated through simulations in MATLAB on a 64-bit Windows OS. The MATLAB version used is R2023b, and the system is configured with an Intel Core i7-9750H 2.60GHz processor, 16.0GB of RAM, and an NVIDIA GeForce GTX 1660Ti GPU.

Data sources

The experimental dataset was derived from the thin-plate dryer of a tobacco factory, aiming to construct a data-driven predictive model for real-time monitoring of plate temperature dynamics during the drying process. To ensure the validity of model training and the reliability of predictive results, this study implemented a systematic data cleaning process on the original industrial data. The following procedures were executed to address potential missing values and anomalies originating from sensor malfunctions or recording processes:

Missing Value Handling: Discard all sample records containing NaN or Inf values.

Anomaly Removal: Use the $3 σ$ rule to remove samples lying outside the interval [mean − 3·SD, mean + 3·SD]. For non-normally distributed features, apply IQR-based robust outlier detection: discard any data points below Q1 − 1.5·IQR or above Q3 + 1.5·IQR.

Normalization Pre-Check: Before applying Min-Max scaling, verify the extreme value range to prevent distortion caused by outliers.
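The cleaning rules above can be sketched with the Python standard library. The sketch assumes the population standard deviation for the $3 σ$ rule and the default quartile method of `statistics.quantiles`, neither of which is specified in the text; the function names are illustrative.

```python
import math
import statistics

def drop_non_finite(rows):
    """Discard every record that contains a NaN or Inf value."""
    return [r for r in rows if all(math.isfinite(v) for v in r)]

def three_sigma_mask(values):
    """True for values inside [mean - 3*SD, mean + 3*SD]."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [abs(v - mean) <= 3 * sd for v in values]

def iqr_mask(values):
    """True for values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [q1 - 1.5 * iqr <= v <= q3 + 1.5 * iqr for v in values]
```

Each mask is computed per feature column; a sample would be kept only if it passes the mask chosen for every feature.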

After data cleaning, the dataset shrank from 6000 to 5642 samples, guaranteeing both data quality and model stability. From this cleaned set, 2000 samples were selected for the experiment and split into training and test sets in an 8:2 ratio. The inputs include seven parameters: IP4 steam film valve opening, process gas velocity, process gas temperature, process gas pressure, condensate water temperature, dehumidification air volume, and hood pressure. The output data is the thin plate temperature. The specific parameters are listed in Table 1.

Table 1.

Partial sample data.

Types of parameters	X1	X2	…	X1999	X2000
IP4 steam film valve opening (%)	64.89186	64.85509	…	52.52792	52.59764
Process gas velocity (m/s)	0.108664	0.108878	…	0.119593	0.119395
Process gas temperature (°C)	109.9383	110.0441	…	110.912	110.901
Process gas pressure (bar)	0.489674	0.489678	…	0.501156	0.501169
Condensate water temperature (°C)	28.32314	28.23215	…	113.8404	113.8401
Dehumidification air volume ( $m^{3} / h$ )	3533.723	3500.661	…	4027.105	4040.268
Hood pressure ( $μ$ bar)	−40.97	−40.94	…	−39.67	−39.72
Thin plate temperature (°C)	100.7646	100.7731	…	115.0911	115.0873

As Table 1 shows, the variables in the tobacco-shred drying process differ considerably in units and magnitude, which could slow the convergence of the prediction model. Therefore, the parameters are normalized to reduce the impact of differences in scale and distribution on the model. Min-max normalization is applied to the input parameters.

X_{normalized} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}

(27)

In equation (27), $X_{normalized}$ represents the normalized value, $X$ is the raw data, $X_{\min}$ is the lowest value in the corresponding data column, and $X_{\max}$ is the highest value. The selected original data are rescaled to the interval [0, 1], ensuring the stability of the model when processing input data.
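
As a small worked instance of equation (27), the sketch below rescales the four valve-opening values shown in Table 1. The function names are illustrative, and computing the minimum and maximum once and reusing them for later data is an assumption; the paper does not state how the test set was scaled.

```python
def min_max_fit(column):
    """Return (X_min, X_max) for one feature column."""
    return min(column), max(column)


def min_max_transform(column, lo, hi):
    """Equation (27): (X - X_min) / (X_max - X_min)."""
    span = hi - lo
    if span == 0:
        # A constant column carries no information; map it to 0 to avoid dividing by zero.
        return [0.0 for _ in column]
    return [(x - lo) / span for x in column]


# IP4 steam film valve opening (%) values visible in Table 1, used as a toy column.
valve = [64.89186, 64.85509, 52.52792, 52.59764]
lo, hi = min_max_fit(valve)
scaled = min_max_transform(valve, lo, hi)
```

The largest value maps to 1 and the smallest to 0, so the rescaled data cover the closed interval [0, 1].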

Parameter settings

The training iterations for the PO-BP prediction model were set to 1000, with a learning rate of 0.01, a training target minimum error of 1e-6, and a display frequency of once every 25 training iterations. The momentum factor was set to 0.01. The hidden layer’s node count was derived from the empirical formula proposed by Jadid and Fairbairn.⁴²

hiddennum = \sqrt{m + n} + a

(28)

In equation (28), $m$ is the number of nodes in the input layer, $n$ is the number of nodes in the output layer, and $a$ is usually an integer ranging from 1 to 10. The BP model was trained and tested with different numbers of hidden layer nodes. Figure 5 presents the training-set MSE for each setting, with the x-axis showing the number of hidden layer nodes and the y-axis showing the MSE. The training-set MSE was 6.0961e-05 with 7 hidden nodes, 4.7834e-05 with 8, 2.7104e-05 with 9, 4.9347e-05 with 10, and 5.7034e-05 with 11. The MSE decreases as the number of hidden nodes grows from 7 to 9, reaches its lowest value at 9 nodes, and rises again at 10 and 11 nodes. Therefore, the number of hidden layer nodes was set to 9. For the PO algorithm, the value of $γ$ in the $Levy$ distribution was set to 1.5, the initial population size to 10, and the number of iterations to 30.
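
The selection of the hidden layer size can be reproduced with a few lines of arithmetic. The sketch below evaluates equation (28) for $m$ = 7 and $n$ = 1 and picks the node count with the smallest reported training-set MSE; rounding the formula to the nearest integer is an assumption, since the paper does not say how non-integer values were handled.

```python
import math

m, n = 7, 1  # seven process parameters in, one plate temperature out
base = math.sqrt(m + n)  # about 2.83

# Candidate hidden sizes from equation (28) for a = 1..10 (rounded, by assumption).
candidates = sorted({round(base + a) for a in range(1, 11)})

# Training-set MSE values reported in the text for 7 to 11 hidden nodes.
reported_mse = {
    7: 6.0961e-05,
    8: 4.7834e-05,
    9: 2.7104e-05,
    10: 4.9347e-05,
    11: 5.7034e-05,
}

best_hidden = min(reported_mse, key=reported_mse.get)
```

All five tested sizes fall inside the range suggested by the empirical formula, and the minimum is attained at nine nodes, which gives the 7-9-1 topology.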

Figure 5.

Comparison of training set MSE under different numbers of hidden layer nodes.

Predictive performance evaluation metrics

The paper employed $R^{2}$ and RMSE as the key evaluation metrics to compare the prediction accuracy of different neural network models.⁴³ $R^{2}$ represents the degree of fit between the model’s predictions and the actual data; the closer it is to 1, the better the model fits. RMSE assesses the accuracy of the model: a lower RMSE means that the root of the mean squared difference between predicted and actual values is smaller, that is, the predictions are more accurate.

R^{2} = \frac{\sum_{i = 1}^{N} {(\hat{y_{i}} - \bar{y_{i}})}^{2}}{\sum_{i = 1}^{N} {(y_{i} - \bar{y_{i}})}^{2}}

(29)

MSE = \frac{1}{N} \sum_{i = 1}^{N} {(\hat{y_{i}} - y_{i})}^{2}

(30)

RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(\hat{y_{i}} - y_{i})}^{2}}

(31)

MAE = \frac{1}{N} \sum_{i = 1}^{N} | \hat{y_{i}} - y_{i} |

(32)

MAPE = \frac{100\%}{N} \sum_{i = 1}^{N} \left| \frac{\hat{y_{i}} - y_{i}}{y_{i}} \right|

(33)

Equations (29) to (33) give the calculation formulas for the evaluation metrics, where N is the total number of samples, $y_{i}$ is the actual value of the $i$-th experimental sample, $\bar{y_{i}}$ is the mean of all actual values from the experimental samples, and $\hat{y_{i}}$ is the value predicted by the model. In addition to the two primary indicators $R^{2}$ and RMSE, the mean absolute percentage error (MAPE), mean absolute error (MAE), and MSE were incorporated as supplementary indicators to evaluate model performance comprehensively.
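
The five metrics can be written directly from equations (29) to (33). The following is a plain-Python sketch with illustrative function names. Note that equation (29) is the explained-variance form of $R^{2}$; many libraries instead report 1 − SSE/SST, which coincides with it for least-squares linear fits with an intercept but can differ for other models.

```python
import math


def r2_explained(y_true, y_pred):
    """Equation (29): explained sum of squares over total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_reg = sum((p - mean) ** 2 for p in y_pred)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return ss_reg / ss_tot


def mse(y_true, y_pred):
    """Equation (30): mean squared error."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Equation (31): square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Equation (32): mean absolute error."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Equation (33): mean absolute percentage error, in percent."""
    return 100.0 / len(y_true) * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred))
```

For example, with actual values [100, 110, 120] and predictions [101, 110, 119], the MSE and MAE are both 2/3 and the explained-variance $R^{2}$ is 162/200 = 0.81.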

Results and analysis

To compare the PO-BP network established here with the other BP neural networks, each model was trained on the training set and then used to predict the plate temperature. Figure 6 compares the predicted values with the expected values for each network.

Figure 6.

Comparison of prediction results.

As the predictions in Figure 6 show, the traditional BP neural network deviates significantly from the actual values and fails to follow the trend of actual changes, resulting in poor prediction accuracy (purple line). After optimization by the traditional metaheuristic algorithms (GWO, PSO, and GA), the model’s accuracy improved to some extent, but the predictions deviated increasingly from the actual values in the latter half of the samples (yellow, red, and blue lines). Owing to its search mechanism, the PO algorithm holds an advantage over these algorithms. The PO-BP model fits the trend of the actual values better than the other models, which indicates that it mitigates the tendency of BP neural networks to become trapped in local optima and significantly improves prediction accuracy.

The fundamental reason why the PO-BP model outperforms the traditional BP neural network is that the Parrot Optimization algorithm addresses two inherent weaknesses of BP, namely proneness to local minima and slow or oscillatory convergence, from both the “search mechanism” and the “weight initialization” perspectives. In traditional BP neural networks, if the randomly initialized starting point falls within an “error valley” or on a “flat plateau,” gradient descent is highly prone to becoming stuck or making little progress. In contrast, the PO algorithm flattens all network weights and thresholds into a D-dimensional vector, which serves as a “parrot individual” in the PO process. The PO algorithm first performs several dozen population-wide large jumps ($Levy$ flights combined with alternation among four distinct behaviors) across the entire parameter space and returns the vector with the smallest error as the initial values for the BP network. In each iteration, PO randomly selects one of its four behaviors to execute, which embeds the capability to escape local minima into the algorithm’s core logic. In the weight space of the BP neural network, the foraging behavior enables individuals to explore new regions with large steps, allowing them to traverse error valleys rapidly; the stationary behavior facilitates precise local search within a confined area, making it possible to locate the valley bottom accurately; the communication behavior helps maintain population diversity, preventing premature convergence; and the fear behavior toward strangers ensures that the algorithm can still escape from suboptimal valleys in later stages.
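
The encoding of a network as a “parrot individual” can be illustrated as follows. This sketch assumes the 7-9-1 topology reported in the paper; the ordering of the segments inside the vector, the search bounds of [-1, 1], and the function names are assumptions made for the example.

```python
import random

N_IN, N_HID, N_OUT = 7, 9, 1  # 7-9-1 topology from the paper
# Input-to-hidden weights, hidden thresholds, hidden-to-output weights, output threshold.
D = N_IN * N_HID + N_HID + N_HID * N_OUT + N_OUT


def decode(vector):
    """Split a flat D-dimensional vector back into BP weights and thresholds."""
    if len(vector) != D:
        raise ValueError("expected a vector of length %d" % D)
    i = 0
    w1 = [vector[i + r * N_IN: i + (r + 1) * N_IN] for r in range(N_HID)]
    i += N_IN * N_HID
    b1 = vector[i: i + N_HID]
    i += N_HID
    w2 = [vector[i + r * N_HID: i + (r + 1) * N_HID] for r in range(N_OUT)]
    i += N_HID * N_OUT
    b2 = vector[i: i + N_OUT]
    return w1, b1, w2, b2


def init_population(size, lb, ub, rng):
    """Uniform random individuals inside [lb, ub] in every dimension."""
    return [[rng.uniform(lb, ub) for _ in range(D)] for _ in range(size)]


rng = random.Random(0)
population = init_population(10, -1.0, 1.0, rng)  # population size 10 as in the paper
w1, b1, w2, b2 = decode(population[0])
```

Under these assumptions each individual has D = 82 components, so the optimizer searches an 82-dimensional space before BP training begins.
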
As shown in the convergence curves of Figure 3, PO approaches the global optimum within 200 generations, while algorithms such as GA and PSO still oscillate in the “plateau phase.” This capability is seamlessly transferred to weight space search, effectively equipping the BP neural network with a “global escape mechanism.” Furthermore, $Levy$ flights generate a step length sequence characterized by “a mix of fewer large steps and numerous small steps.” In the weight space, these $Levy$ steps enable individuals to escape saddle points or shallow minima in a single move, while the Gaussian perturbation used in traditional PSO often results only in local jitter, causing particles to oscillate near their original positions. The mechanism of dynamic boundary control combined with real-time ranking ensures that the search process “progressively improves” while maintaining numerical stability. In each generation, all individuals are sorted and replaced—the worst-positioned individual is immediately superseded by the best one—resulting in monotonically decreasing error. Boundary control promptly pulls any component that exceeds the parameter range back into the interval [lb, ub], thereby preventing the “exploding weights” issue common in BP networks, which often leads to gradient NaN errors. In contrast to the “fixed learning rate + manual early stopping” approach used in traditional BP, PO’s “adaptive boundary + ranking” mechanism results in a smoother and more monotonic training curve, as evidenced by the transition from the purple to the green curve in Figure 6, where oscillations are effectively eliminated.
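
The step-length behavior and the boundary control described above can be sketched as follows. Generating the steps with Mantegna's algorithm is an assumption (the paper only states that the $Levy$ parameter is 1.5), and the function names are illustrative.

```python
import math
import random


def mantegna_sigma(exponent):
    """Scale of the numerator Gaussian in Mantegna's algorithm."""
    num = math.gamma(1 + exponent) * math.sin(math.pi * exponent / 2)
    den = math.gamma((1 + exponent) / 2) * exponent * 2 ** ((exponent - 1) / 2)
    return (num / den) ** (1 / exponent)


def levy_step(exponent, rng):
    """One heavy-tailed step: mostly small values, occasionally a very large one."""
    u = rng.gauss(0.0, mantegna_sigma(exponent))
    v = rng.gauss(0.0, 1.0)
    while v == 0.0:  # guard against division by zero
        v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / exponent)


def clip(vector, lb, ub):
    """Boundary control: pull every component back into [lb, ub]."""
    return [min(max(x, lb), ub) for x in vector]


rng = random.Random(1)
steps = [levy_step(1.5, rng) for _ in range(1000)]
bounded = clip([-2.0, 0.3, 5.0], -1.0, 1.0)
```

Clipping after every move keeps each weight and threshold inside the search interval, which is the numerical safeguard the paragraph attributes to boundary control.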

The evaluation metrics of each neural network were calculated using equations (29) to (33), and the results are listed in Table 2. The coefficients of determination $R^{2}$ for the BP, GA-BP, GWO-BP, PSO-BP, and PO-BP neural networks were 0.885, 0.924, 0.902, 0.922, and 0.998, respectively. The corresponding RMSEs were 0.010, 0.008, 0.008, 0.007, and 0.001. Among these, the PO-BP model exhibited the highest $R^{2}$ value, which is very close to 1, significantly outperforming the traditional BP model and surpassing the other optimized models. The RMSE of the PO-BP model was also the lowest of all models. Beyond these two primary indicators, the PO-BP model demonstrated superior performance across all other metrics, indicating significant improvements in data fitting and prediction accuracy.

Table 2.

Comparison of prediction performance evaluation metrics.

Method	MAE	MAPE	MSE	RMSE	$R^{2}$
BP	0.0095126	8.2716e-05	0.00010743	0.010365	0.88549
PSO-BP	0.0059689	5.1896e-05	5.4768e-05	0.0074005	0.92232
GA-BP	0.0069337	6.0286e-05	7.0665e-05	0.0084062	0.92498
GWO-BP	0.0078321	6.8093e-05	8.0153e-05	0.0089528	0.90219
PO-BP	0.0008974	7.8022e-06	1.2085e-06	0.0010993	0.99855
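
As a quick consistency check on Table 2, the sketch below confirms that each reported RMSE equals the square root of the reported MSE (up to rounding) and ranks the models. The values are copied from the table; the variable names are illustrative.

```python
import math

# (MAE, MAPE, MSE, RMSE, R2) copied from Table 2.
table2 = {
    "BP": (0.0095126, 8.2716e-05, 0.00010743, 0.010365, 0.88549),
    "PSO-BP": (0.0059689, 5.1896e-05, 5.4768e-05, 0.0074005, 0.92232),
    "GA-BP": (0.0069337, 6.0286e-05, 7.0665e-05, 0.0084062, 0.92498),
    "GWO-BP": (0.0078321, 6.8093e-05, 8.0153e-05, 0.0089528, 0.90219),
    "PO-BP": (0.0008974, 7.8022e-06, 1.2085e-06, 0.0010993, 0.99855),
}

# Gap between sqrt(MSE) and the reported RMSE for every model.
rmse_gap = {name: abs(math.sqrt(row[2]) - row[3]) for name, row in table2.items()}

best_by_rmse = min(table2, key=lambda name: table2[name][3])
best_by_r2 = max(table2, key=lambda name: table2[name][4])

# How many times smaller the PO-BP RMSE is than the plain BP RMSE.
rmse_ratio = table2["BP"][3] / table2["PO-BP"][3]
```

The two columns agree to within rounding, PO-BP ranks first on both RMSE and $R^{2}$, and its RMSE is between nine and ten times smaller than that of the plain BP network.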

To visually illustrate the differences between the standard BP and other BP models, the above indicators are plotted for comparison.

When comparing different models, a model with a high $R^{2}$ and low MAE is typically considered superior. Therefore, these two performance metrics are combined to create an $R^{2}$ -MAE two-dimensional plot, with MAE on the x-axis and $R^{2}$ on the y-axis, as shown in Figure 7(a). In the comparison of the five models shown in the figure, the standard BP model’s point is close to the bottom right corner, indicating a lower $R^{2}$ and a higher MAE, which signifies poor performance. In contrast, the PO-BP model’s point is located in the top left corner, demonstrating its superiority and indicating that it performs the best among the five compared models.

Figure 7.

Performance comparison of different algorithms: (a) $R^{2}$ -MAE two-dimensional comparison, (b) radar chart of performance evaluation, (c) bar chart of MAE and RMSE, and (d) polar coordinate comparison of MAE and MAPE.

The five performance indicators in Table 2 were normalized to a single dimension, and the $R^{2}$ value was converted to its inverse, so that all indicators expanded outward. This means that the smaller the polygon formed by the five indicators, the better the model performs in these metrics. A radar chart was created using the normalized data, as shown in Figure 7(b). The polygon formed by the five indicators of the BP model has the largest area. In contrast, the polygon formed by the PO-BP model’s indicators has a significantly smaller area compared to other models, indicating that the PO-BP model’s indicators are far superior to the other models.

Selecting MAE and RMSE indicators with relatively similar dimensions, a bar chart was created for comparison, as shown in Figure 7(c). The height differences in the bar chart more intuitively reflect the differences between the PO-BP prediction model and the other four models in the selected indicators.

A polar plot was created based on the values of MAE and MAPE, and converted into a vector in Cartesian coordinates to generate the polar coordinate graph, as shown in Figure 7(d).

The above visualizations show that the BP models tuned by optimization algorithms significantly outperform the traditional BP model. Among them, the PO-BP prediction model achieves the best prediction accuracy, with results far superior to those of the other models.

Generalization experiment

To validate the generalization capability of the proposed PO-BP model across diverse tobacco production scenarios, this study augmented the initial single-product experiments with a second product batch for comparative analysis. Both datasets were collected from the identical thin-plate heat exchange drying line while maintaining consistent process parameter frameworks, though differences in raw material blends, target moisture contents, and drying temperature settings authentically replicate operational variations during product changeovers. Key process differences are detailed in Table 3.

Table 3.

Comparison of key parameters between two experiments.

Parameter	Experiment 1	Experiment 2
IP4 steam film valve opening (%)	52–64	50–52
Target moisture content (%)	11.8 ± 0.4	12.5 ± 0.5
Thin-plate temperature (°C)	105–115	140–150

Initially, 7500 records were collected. After applying the data-cleaning procedure described in Section “Data sources” to remove missing values and outliers, 7186 valid samples remained. Consistent with the previous experiment, 2000 samples were randomly selected for this study and split into training and test sets at an 8:2 ratio. The network architecture continued to use the 7-9-1 topology determined in Section “Parameter settings”, and all hyperparameters were kept identical. The same evaluation metrics, $R^{2}$ and RMSE, were employed to enable a direct side-by-side comparison with the first experiment.
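
The sampling and splitting step can be sketched as below. The seed and the function name are hypothetical; the paper does not report how the random selection was seeded or whether the split preserved time order.

```python
import random


def sample_and_split(n_available, n_selected, train_fraction, seed):
    """Draw n_selected distinct indices at random, then split them train/test."""
    rng = random.Random(seed)
    chosen = rng.sample(range(n_available), n_selected)  # random order, no repeats
    n_train = round(n_selected * train_fraction)
    return chosen[:n_train], chosen[n_train:]


# 7186 cleaned samples, 2000 selected, 8:2 split.
train_idx, test_idx = sample_and_split(7186, 2000, 0.8, seed=42)
```

With these numbers the training set holds 1600 samples and the test set 400, and no sample appears in both.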

As shown in Figure 8, the PO-BP-predicted curve almost perfectly overlaps the actual thin-plate temperature across the entire test range; only minute deviations are observed at a few extreme points. The calculated metrics are $R^{2}$ = 0.99899, indicating that the model explains about 99.9% of the temperature variance, and RMSE = 0.00697 °C, meaning that the root-mean-square prediction error is less than 0.01 °C, well below the industrial tolerance of ±0.5 °C.

Figure 8.

Generalization experiment prediction result graph.

In the single-product scenario, PO-BP already outperforms the standard BP, PSO-BP, GA-BP, and GWO-BP networks in terms of accuracy and stability. The cross-product experiment on the second dataset further demonstrates that the PO-BP model retains its superiority: R² remains at approximately 0.999, and RMSE stays below 0.007 °C. These consistent results confirm that the PO algorithm’s optimization of BP weights and thresholds not only elevates prediction accuracy but also ensures robust generalization across different tobacco products on the same production line.

Comparison with related studies

To evaluate the performance of PO-BP, we compared its metrics with those of representative studies published in recent years on tobacco drying or similar thermal processes, as summarized in Table 4. As shown, traditional mechanistic models rely on prior assumptions and achieve only 0.86–0.88 in R², while also lacking online update capability. Conventional GA-BP and PSO-BP approaches typically yield RMSE values greater than 0.01°C—roughly an order of magnitude higher than the 0.00110°C obtained in our Experiment 1. The latest QGA-XGBoost method performs well in moisture prediction (R² = 0.991), yet it requires 23 input dimensions, resulting in a computational load 3.4 times that of our study. By contrast, our model attains high accuracy with only seven low-redundancy features, demonstrating PO-BP’s practical value in “small-sample–high-dimension” scenarios. However, the current dataset is still confined to a single production line; our next step will involve collecting cross-line and cross-season data to validate model robustness. Future work will integrate transfer learning and incremental updating strategies to enable “plug-and-play” deployment.

Table 4.

Comparison with recent representative studies.

Reference	Method	Scenario	Inputs	$R^{2}$	RMSE
Martínez-Martínez et al.²⁰	ANN	Tobacco drying	5	0.87	0.56°C
Wu and Yang²³	LS-SVM+ANFIS	Curing barn	9	0.92	0.42°C
Zheng et al.²⁷	QGA-XGBoost	Moisture	23	0.991	0.079%
Exp-1 (this work)	PO-BP	Tobacco drying	7	0.9986	0.00110°C
Exp-2 (this work)	PO-BP	Tobacco drying	7	0.9990	0.00697°C

Applicability boundaries of alternative model architectures

Deep neural network

The multi-layer nonlinear mapping of deep neural networks (DNNs) enables them to approximate any continuous function, with representational capacity scaling linearly with depth. However, the hyperparameter space expands dramatically, making tuning costs significantly higher than for shallow networks. Should the factory subsequently incorporate high-dimensional nonlinear features such as high-frequency vibration (≥10 Hz), infrared thermal images, or steam pressure pulsation, the representational capacity of a shallow 9-node BP neural network would rapidly saturate. In such cases, replacing it with a DNN hybrid model—combined with PO-based layer-wise pre-training—could significantly enhance its fitting performance. When building joint models for multiple types of tobacco shreds, DNNs can automatically learn the nonlinear mapping between “raw material characteristics” and “operating conditions,” thereby reducing the need for manual feature engineering. However, the number of parameters tends to grow quadratically, requiring over 10,000 cross-seasonal data samples to prevent overfitting. The currently available dataset remains insufficient to adequately support DNN training. Furthermore, programmable logic controllers (PLCs) deployed on factory floors typically have less than 50 MB of memory, and 8-bit MCUs possess limited floating-point computation capabilities. As a result, the inference time of neural networks with more than three layers may exceed the hard real-time requirement of a 1-s control cycle.

To further evaluate the applicability of DNNs to the tobacco drying temperature prediction task studied here, we plotted the DNN predictions on the training and test sets (Figure 9). Although the prediction curve fits the training samples almost without error, it performs very poorly on the unseen test set. A gap of this kind has a relatively minor impact in shallow BP networks, but it becomes severe overfitting in DNNs because of their larger parameter capacity and stronger fitting ability, especially while the dataset remains limited in size. Until more data are available and both sampling frequency and feature dimensionality are expanded, introducing a DNN architecture is therefore unlikely to yield significant performance improvements; on the contrary, it may compromise practical deployment through training instability and reduced generalization capability. Replacing the BP architecture with a DNN would be worthwhile only once sufficient data are available and the hardware is upgraded to ARM Cortex-A processors or edge GPUs.

Figure 9.

The prediction result graph of the DNN model: (a) training set prediction results, (b) test set prediction results.

LSTM

The gating mechanism of LSTM automatically determines the effective length of the time window, enabling it to capture both minute-level long-lag dependencies and second-level short-term dynamics. However, its parameter count is approximately 4 × (input_size + hidden_size) × hidden_size, which is 5–8 times that of a shallow BP network. Additionally, hardware memory requirements increase linearly with sequence length. If the factory reduces the sensor sampling interval to 1–5 s, the thin-plate temperature exhibits a mixed dynamic of “second-level inertia and minute-level lag.” In such cases, a fixed sliding window would need to be extended to over 30 steps. LSTM, by contrast, can adaptively extract the optimal lag order, significantly reducing the peak dynamic error. However, the training duration would extend from minutes to hours, rendering daily online retraining impractical. If the PO algorithm were applied directly to optimize all weights of the LSTM, the parameter vector would expand to over a thousand dimensions, increasing the convergence cost by one to two orders of magnitude compared to PO-BP.
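
The parameter-count comparison can be checked with a short calculation. The sketch below assumes an LSTM layer with the same seven inputs and a hidden size of nine, chosen only to mirror the 7-9-1 BP network; the paper does not state the hidden size of its LSTM, and the output layer of the LSTM model is left out.

```python
def bp_params(n_in, n_hidden, n_out):
    """Weights and thresholds of a single-hidden-layer BP network."""
    return n_in * n_hidden + n_hidden + n_hidden * n_out + n_out


def lstm_params(input_size, hidden_size, with_bias=True):
    """Four gates, each with input weights, recurrent weights and an optional bias."""
    per_gate = (input_size + hidden_size) * hidden_size
    if with_bias:
        per_gate += hidden_size
    return 4 * per_gate


bp = bp_params(7, 9, 1)
lstm_core = lstm_params(7, 9, with_bias=False)  # the approximation quoted in the text
lstm_full = lstm_params(7, 9, with_bias=True)
ratio = lstm_core / bp
```

Under these assumptions the BP network has 82 parameters and the LSTM layer 576 (612 with biases), a ratio of about 7, which falls inside the 5–8 times range quoted in the text.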

Figure 10 presents the prediction results of the Long Short-Term Memory (LSTM) model on the test set. The strong fitting capacity of LSTM enables it not only to capture the underlying physical patterns but also, in all likelihood, to overfit the stochastic noise and measurement errors present in the training data. As illustrated in the figure, the LSTM model has learned even the noise in the data as “signals,” reproducing them during prediction and thereby introducing unnecessary and intense high-frequency oscillations in the forecast curve. Therefore, replacing the BP neural network with an LSTM could introduce parameter redundancy and degrade real-time performance.

Figure 10.

Prediction results of the LSTM model.

Conclusion

This paper proposes a temperature prediction framework for the thin plate of a tobacco-sheet drying machine, leveraging a Backpropagation Neural Network optimized via the Parrot Optimization Algorithm. The PO adaptively adjusts the BPNN’s weights and biases to minimize prediction errors in real-time thermal dynamics.

The results are as follows:

Incorporating PO for parameter optimization in the BP neural network has addressed the problem of the algorithm being trapped in local optima, significantly improving prediction accuracy.

When the PO-BP prediction model is applied to predict the thin plate temperature of the roaster, the prediction results on the test set show that the coefficient of determination R² reaches 0.998.

Comparative experiments show that, relative to the traditional BP neural network, the PO-BP thin plate temperature prediction model proposed in this paper improves the coefficient of determination R² from 0.885 to 0.998 and reduces the root mean square error (RMSE) from 0.01 to 0.001. This effectively enhances the accuracy of thin plate temperature prediction.

The improved thin plate temperature prediction model can significantly enhance the efficiency and product quality of tobacco drying processes, while reducing energy consumption and production costs. Furthermore, these research findings can be applied to other industrial sectors, such as food processing and chemical production, by enabling more accurate temperature prediction and control. This optimization of production processes can improve equipment reliability and service life, thereby generating greater economic benefits for enterprises.

Due to the limited number of production data samples, the model may suffer from overfitting. Future research will focus on expanding the dataset’s scale and diversity to verify the model’s generalization ability. Additionally, further exploration will be conducted to apply the model to other types of industrial equipment and production processes, thereby further validating its effectiveness and applicability.

Footnotes

ORCID iD

Shiqi Chen

Ethical Considerations

This study did not involve human participants or animal subjects. All data used in this research were obtained from industrial equipment monitoring systems, and the analysis did not involve any personal or sensitive information. The research complied with all applicable ethical standards in industrial data analysis.

Informed Consent

As this study exclusively utilized anonymized operational data from industrial equipment with no human involvement, informed consent was not required. All data were provided by the manufacturing facility under appropriate data usage agreements.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Hebei Provincial Higher Education Scientific Research Project (Grant No. CXZX2025039).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Study design

This is a retrospective analysis/experimental study, which does not require trial registration.

Data availability statement

The datasets generated and analyzed during this study are proprietary operational data collected from a local manufacturing facility under confidentiality agreements. Due to commercial sensitivity and contractual obligations, the raw data are not publicly available. However, anonymized subsets or processed data supporting the findings may be provided by the corresponding author upon reasonable request, subject to approval by the data owner.

References

Jiang

Chen

Lan

, et al. Research on rolling bearing state prediction of tobacco packaging machine transmission system based on LSTM-ICNN. Mach Des Manuf Eng 2024; 53(3): 97–101.

Chen

Huang

, et al. Research on the information fault prediction and detection technology of modern tobacco industry’s equipment. Off Inform 2023; 28(3): 29–31.

Application of artificial intelligence online monitoring technology in tobacco machinery fault diagnosis. Sci Technol Inf 2022; 20(20): 9–12.

Jian

Application and prospects of fault diagnosis technology for tobacco equipment. Sci Technol Innov Her 2019; 16(24): 65+67.

Guo

, et al. Optimizing equipment parameters of SH626 thin plate cut-tobacco dryer to reduce over-dried cut tobacco. Acta Tabac Sin 2022; 28(1): 39–43.

Yan

Zhang

Lan

, et al. Design and optimization of tobacco coiler control system based on particle swarm optimization. Mach Electron 2023; 41(3): 27–29.

Yang

Wei

, et al. Application of AI online monitoring technology in fault diagnosis of tobacco machinery. Machinery 2020; 58(11): 85–87.

Yang

Gong

The operating technique of three-stage curing of flue-cured tobacco. Tob Sci Technol 2003; 7: 46–48

Liang

Zhang

Wang

, et al. A review on health state assessment and remaining useful life prediction of mechanical equipment under intelligent manufacturing. J Ordn Equip Eng 2022; 43(7): 67–77.

10.

Hong

Yang

Lian

, et al. State of health prediction for proton exchange membrane fuel cells combining semi-empirical model and machine learning. Energy 2024; 291: 130364.

11.

Luo

Construction of battery charge state prediction model for new energy electric vehicles. Comput Electr Eng 2024; 119: 109561.

12.

Wang

Liu

Zou

, et al. GWTSP: a multi-state prediction method for short-term wind turbines based on GAT and GL. Procedia Comput Sci 2023; 221: 963–970.

13.

Lan

Song

Zhang

, et al. State prediction of hydro-turbine based on WOA-RF-adaboost. Energy Rep 2022; 8:13129–13137.

14.

Wang

Dai

, et al. Lithium-ion battery health state and remaining useful life prediction based on hybrid model MFE-GRU-TCA. J Energy Storage 2024; 95: 112442.

15.

Ding

Feng

Qiao

, et al. Experimental prediction model for the running-in state of a friction system based on chaotic characteristics and BP neural network. Tribol Int 2023; 188: 108846.

16.

Zheng

Yang

, et al. Prediction method of mechanical state of high-voltage circuit breakers based on LSTM-SVM. Elec Power Syst Res 2023; 218: 109224.

17.

Pei

Zhou

Huang

, et al. State recognition and temperature rise time prediction of tobacco curing using multi-sensor data-fusion method based on feature impact factor. Expert Syst Appl 2024; 237: 121591.

18.

Chen

Sun

Zhang

, et al. Attention mechanism based neural networks for structural post-earthquake damage state prediction and rapid fragility analysis. Comput Struct 2023; 281: 107038.

19.

Zhang

, et al. Dynamic prediction of cylinder wall temperature for drum dryer based on DGRU network. J Light Ind 2022; 37(6): 85–91, 100.

20.

Martínez-Martínez

Baladrón

Gomez-Gil

, et al. Temperature and relative humidity estimation and prediction in the tobacco drying process using artificial neural networks. Sensors 2012; 12(10): 14004–14021.

21.

Yang

Tian

A novel intelligent control system for flue-curing barns based on real-time image features. Biosyst Eng 2014; 123: 77–90.

22.

Research on recognition of flue-cured tobacco curing stage based on image features and GA-SVM. J Southwest Norm Univ 2016; 41: 100–106.

23.

Yang

SX.

Intelligent control of bulk tobacco curing schedule using LS-SVM- and ANFIS-based multi-sensor data fusion approaches. Sensors 2019; 19(8): 1778.

24.

, et al. An intelligent moisture prediction method for tobacco drying process using a multi-hierarchical convolutional neural network. Dry Technol 2022; 40(9): 1791–1803.

25.

Guo

Wang

, et al. Neural network assisted temperature and humidity field simulation method in tobacco curing process. Ind Crops Prod 2025; 225: 120508.

26.

Wang

Qin

Research on state prediction method of tobacco curing process based on model fusion. J Ambient Intell Human Comput 2022; 13(6): 2951–2961.

27.

Zheng

Sun

, et al. Research on inlet moisture control of tobacco dryer based on QGA-XGB prediction model. Acta Tabac Sin. Epub ahead of print 13 March 2025. DOI: 10.16135/j.issn1002-0861.20250313.0941.002.

28.

Wen

Liu

, et al. Moisture control in cut tobacco output from cylinder dryer with corrugated heating plate by collaborative optimization of multi-parameters. Tob Sci Technol 2025; 58(6): 82–91.

29.

Bai

Zhang

, et al. Reliability prediction-based improved dynamic weight particle swarm optimization and back propagation neural network in engineering systems. Expert Syst Appl 2021; 177: 114952.

30.

Wang

Hou

Jiao

, et al. Fault diagnosis of silk drying machine based on FA-BP neural network model. China Instrum 2024; 1: 40–44.

31.

Yang

, et al. Performance prediction of gasification-integrated solid oxide fuel cell and gas turbine cogeneration system based on PSO-BP neural network. Renew Energy 2024; 237: 121711.

32.

Wang

Xin

, et al. Surface roughness prediction model and surface topography analysis of 2.5D-Cf/SiC in two-dimensional ultrasonic assisted grinding based on GA-BP neural network. Tribol Int 2025; 201: 110272.

33.

Zhang

Long

Chen

, et al. Temperature field prediction for a PC beam bridge with corrugated steel webs using BP neural network and measured data. Structures 2024; 68: 107232.

34.

Wang

Chen

Yang

, et al. BP neural network multi-module green roof thermal performance prediction model optimized based on sparrow search algorithm. J Build Eng 2024; 96: 110615.

35.

Zhang

Wang

, et al. Oil spill area prediction model of submarine pipeline based on BP neural network and convolutional neural network. Process Saf Environ Prot 2025; 199: 107264.

36.

Han

Fang

Wang

, et al. Multi-objective optimization of passive design parameters for rural residences based on GA-BP-NSGA-II: a case study of cold regions in China. J Build Eng 2025; 112: 113624.

37.

Lian

Hui

, et al. Parrot optimizer: algorithm and applications to medical problems. Comput Biol Med 2024; 172: 108064.

38. Huang, Wei, Yuan, et al. Parrot optimization algorithm for improved multi-strategy fusion for feature optimization of data in medical and industrial field. Swarm Evol Comput 2025; 95: 101908.

39. Abdel-Salam, Alomari SA, Yang, et al. Harnessing dynamic turbulent dynamics in parrot optimization algorithm for complex high-dimensional engineering problems. Comput Methods Appl Mech Eng 2025; 440: 117908.

40. Huang, Fang, et al. IPORF: a combined improved parrot optimizer algorithm and random forest for fault diagnosis in AUV. Ocean Eng 2024; 313: 119665.

41. Gao, Wang, et al. Research on millet origin identification model based on improved parrot optimizer optimized regularized extreme learning machine. J Food Compos Anal 2025; 141: 107354.

42. Jadid, Fairbairn DR. Neural-network applications in predicting moment-curvature parameters from experimental data. Eng Appl Artif Intell 1996; 9: 309–319.

43. Teng, Peng, Yan, et al. An uncertainty quantification and accuracy enhancement method for deep regression prediction scenarios. Mech Syst Signal Process 2025; 227: 112394.
