
Abstract

SO₂ emissions are known to harm both human health and the atmosphere, and the flue gas generated by coal-fired power plants is the prime source of sulfur dioxide. For this reason, flue gas desulfurization (FGD) technology is widely applied in coal-fired power stations. Correctly describing the dynamic behavior of an FGD process is a precondition for controlling it effectively. However, FGD process modeling is by no means an easy task, as the underlying process dynamics are highly nonlinear and the time-delay effect is significant. The long short-term memory (LSTM) network possesses a remarkable long-term memory capability and is therefore expected to have a powerful identification capability. In this paper, the connection between deep learning and system identification is established, and unidirectional and bidirectional deep LSTM networks are designed and employed to identify a real FGD process. Simulation results demonstrate the effectiveness of the deep learning-based identification approach and verify the superiority of deep LSTMs over conventional identification models.

Keywords

Flue gas desulfurization (FGD), process identification (SI), deep learning (DL), long short-term memory (LSTM), recurrent neural network (RNN)

Introduction

A large amount of sulfur dioxide (SO₂) is produced during coal combustion, which poses a great threat to both the ecological environment and human health. For this reason, the SO₂ emission standard for coal-fired power stations in China has been gradually tightened in recent years, and it now states that SO₂ emissions must not exceed 35 mg/Nm³.¹ Most thermal power plants must therefore be equipped with FGD installations for SO₂ removal. Among the various FGD techniques, limestone-based flue gas desulfurization (LFGD) is the most mature and has been extensively applied in power plants. For satisfactory control of the outlet SO₂ concentration, the FGD process dynamics must be adequately captured. FGD modeling is in fact a formidable problem, as the process has highly nonlinear dynamics together with a large time delay. Over the past few decades, a considerable number of studies have addressed FGD process modeling. Among mechanistic models, heat and mass transfer theory is widely employed to describe an FGD process; the reader is referred to representative works (e.g. Gage and Rochelle,² Brogren and Karlsson,³ and Eden and Luckas⁴) for more details. Research activity in this direction has remained high in recent years. In Wang et al.,⁵ a mechanistic model was developed for a micro vortex flow scrubber based on its heat and mass transfer characteristics, and experimental results show the validity of the proposed model under varying operating conditions. 
An analysis model that for the first time combines reaction kinetics and physical mass transfer theory was proposed in Zhang et al.,⁶ with which the characteristics of bubble formation in the liquid stage can be determined accurately. The work in reference 7 focuses on the SO₂ absorption process in the desulfurization tower and suggests a comprehensive model that combines the transfer process with chemical reactions through the Eulerian-Lagrangian method; the model was shown to improve the SO₂ removal efficiency of a 330 MW commercial-scale power unit. Owing to space limitations, interested readers are referred to Flagiello et al.,⁸ Zhao et al.,⁹ and Cui et al.¹⁰ for other first-principles modeling approaches. Given the high complexity of the FGD process, identifying it using only input-output observations is not a trivial task. As a consequence, black-box modeling is used less frequently for FGD processes than mechanistic modeling. In Hou et al.,¹¹ a regression model is established from the operating data of an FGD system, and multi-objective programming is then performed to find optimal parameters such that the system operates both safely and economically. The universal approximation capability of neural networks makes them a prime candidate for identifying the FGD process. A hybrid model is designed in Guo et al.¹² to predict the outlet SO₂ concentration of a 1000-MW coal-fired unit. The model contains a mechanism model and a multi-input network, and simulation results show its effectiveness compared with classical neural models. In Villanueva Perales et al.,¹³ the nonlinearity in the FGD process is described by a static neural model, based on which a predictive controller is designed to regulate the slurry pH. 
A BP neural network optimized by a genetic algorithm is applied in Ren et al.¹⁴ to model the FGD process, and both the SO₂ removal efficiency and the operating performance are improved with the proposed model. For other neural network-based FGD modeling approaches, the reader is referred to Yang et al.¹⁵ and Liu et al.¹⁶ Owing to the complexity of the chemical reactions in an FGD process, this research focuses on the data-driven identification approach. It is interesting to note that a feedforward network is chosen as the FGD model in most of the works mentioned above, whereas the time-sequential information contained in input-output observations cannot be captured by a static architecture. The highly nonlinear dynamics of the FGD process call for more advanced techniques to further improve identification performance. To fill this gap, DL is integrated into the SI field and applied to identify the FGD process; to the best of our knowledge, no earlier work has attempted this. This study makes several technical contributions. Firstly, unidirectional and bidirectional deep LSTM networks are designed and used, for the first time, in a flue gas desulfurization process identification problem, where they show outstanding identification performance and provide a promising solution for learning the dynamics of an FGD process. Secondly, the connection between nonlinear dynamic system identification and the deep learning approach is established and verified through the FGD case study. Finally, in a broader sense, the proposed deep LSTM models enrich the model selection available in the system identification field and are applicable to different process identification scenarios.

In the remainder of the paper, the FGD process is briefly described and the identification problem is formulated mathematically in the next section. Thereafter, in Section 3, the relationship between DL and system identification is first provided, and the applicability of deep learning to identifying nonlinear dynamics is then analyzed theoretically. Section 4 contains detailed information on the LSTM-based dynamic model. Simulation studies are given in Section 5, and finally relevant conclusions are drawn in Section 6.

FGD principles and problem formulation

To familiarize the reader with the FGD process and the identification problem under study, this section first gives a brief overview of the FGD system and then discusses the identification problem for the FGD process.

Principles of flue gas desulfurization

FGD units are installed to remove SO₂ emissions from large-scale electric utility boilers. According to Córdoba¹⁷ and Kumar and Jana,¹⁸ with limestone chosen as the absorbent, the available FGD techniques can be divided into three categories according to the state of the limestone: limestone-based dry, semi-dry, and wet FGD. Among these approaches, wet limestone FGD is the most extensively used because of its superior desulfurization performance and high reliability. The underlying principle of FGD lies in the interaction between the SO₂ contained in the flue gas and the limestone slurry. By sparging air into the reaction tank, gypsum is produced, and the overall reaction is

\mathrm{SO_{2} + CaCO_{3}(s) + 0.5\,O_{2} + 2\,H_{2}O \to CaSO_{4} \cdot 2H_{2}O(s) + CO_{2}}

(1)

The whole LFGD procedure is depicted in Figure 1, and the desulfurization process can be broadly divided into three parts. The first part is the absorbent slurry preparation, where finely ground limestone slurry is produced and pumped to the storage tank. The second part is the core of the FGD process, which serves to eliminate SO₂ from the flue gas; the whole desulfurization takes place in a spray tower. As shown in Figure 1, the untreated flue gas is fed into the spray tower from the bottom, while the reagent slurry is pumped from the reaction tank to the spray headers. The flue gas is thus brought into countercurrent contact with the limestone reagent, and SO₂ absorption takes place. Finally, the desulfurized flue gas passes through mist eliminators and is discharged into the air. The last part of FGD concerns gypsum dewatering: the gypsum slurry filtrate is dewatered through a hydrocyclone and the gypsum ($\mathrm{CaSO_{4} \cdot 2H_{2}O}$) is finally made available.

Figure 1.

A schematic illustration of FGD process.

Problem formulation

In most cases, chemical processes exhibit nonlinear and non-stationary behavior, and the FGD process under study is no exception. The distinct nonlinearity of the FGD process stems from the dynamics of thermodynamic relations, chemical reactions, etc. Furthermore, a large time delay often exists in industrial processes, which makes FGD modeling a truly challenging problem in the field of system identification. Considering the complicated chemistry involved in the desulfurization process, a data-driven modeling approach is more practical. Assume that the FGD process to be identified is governed by the following vector difference equation

\begin{matrix} Σ : \begin{matrix} x (k + 1) = f [x (k), u (k), θ] \\ y (k) = h [x (k), θ] \end{matrix} \end{matrix}

(2)

where the input $u(k)$, state $x(k)$, output $y(k)$, and parameter vector $θ$ belong respectively to $R^{r}$, $R^{n}$, $R^{m}$, and $R^{l}$; $f : R^{n} \times R^{r} \times R^{l} \to R^{n}$ and $h : R^{n} \times R^{l} \to R^{m}$ are general nonlinear smooth mappings. In this research, u is selected as the flow of feeding limestone slurry, while the output y is the slurry pH in the reaction tank. In this sense, the identified FGD process is a single-input/single-output (SISO) discrete-time system in the form of (2), where $r = m = 1$. In many real-world applications, only input/output (IO) measurements are accessible, which in turn calls for an effective IO representation.¹⁹ For simplicity of presentation, only the SISO case is considered, and the input-output model then takes the form

Σ_{IO} : y(k + 1) = Φ[φ(k), θ] = Φ[y(k), \dots, y(k - n_{y} + 1), u(k), \dots, u(k - n_{u} + 1), θ]

(3)

Let $θ \overset{Δ}{=} [θ_{1}, θ_{2}, \dots, θ_{l}]^{T}$. The process dynamics $Φ(\cdot) : R^{n_{y}} \times R^{n_{u}} \to R$ is an unknown nonlinear function defined on a compact set, and $φ(k)$ denotes a regression vector made up of past inputs and outputs of the system. The configuration of an identification system for the FGD process is shown schematically in Figure 2, in which a general IO model M is chosen to identify the FGD process of concern. For notational brevity, the input and output sequences of length n that start at time instant k are denoted by $U_{n}(k) \overset{Δ}{=} [u(k), u(k - 1), \dots, u(k - n + 1)]$ and $Y_{n}(k) \overset{Δ}{=} [y(k), y(k - 1), \dots, y(k - n + 1)]$, respectively.
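As an illustration of the IO form (3), the sketch below assembles the regression vector $φ(k)$ and the one-step-ahead target $y(k + 1)$ from recorded sequences. It is a minimal sketch, not the authors' implementation: the function name `build_regressors` and the toy numbers are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def build_regressors(u, y, n_u, n_y):
    """Stack phi(k) = [y(k),...,y(k-n_y+1), u(k),...,u(k-n_u+1)] with target y(k+1)."""
    u = np.asarray(u, dtype=float)
    y = np.asarray(y, dtype=float)
    start = max(n_u, n_y) - 1                  # first k with a complete history
    rows, targets = [], []
    for k in range(start, len(y) - 1):         # k + 1 must remain a valid index
        past_y = y[k - n_y + 1:k + 1][::-1]    # y(k), y(k-1), ...
        past_u = u[k - n_u + 1:k + 1][::-1]    # u(k), u(k-1), ...
        rows.append(np.concatenate([past_y, past_u]))
        targets.append(y[k + 1])
    return np.array(rows), np.array(targets)

# Toy sequences (not plant data) to show the layout of the regressors.
u = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [10.0, 11.0, 12.0, 13.0, 14.0]
Phi, target = build_regressors(u, y, n_u=2, n_y=3)
```

Each row of `Phi` is one regression vector, so any static model fitted on `(Phi, target)` plays the role of $Φ$ in (3).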

Figure 2.

Schematic description of the FGD process identification system. The shaded region represents the FGD process in the form of equation (2); f_s(k) and pH(k) denote the feeding flow of slurry and the slurry pH value at instant k, respectively.

Deep learning and nonlinear dynamic system identification

In this section, the relationship between DL and system identification is discussed, which sets the stage for applying deep learning techniques to the FGD process modeling problem.

Description of neural network and deep neural network

Basically, a neural network (NN) is an architecture that is comprised of a number of interconnected elementary units called neurons. The term “neuron” refers to an operator that maps $R^{n} \to R$ and can be mathematically represented as

y = Γ (\sum_{j = 1}^{n} w_{j} u_{j} + w_{0})

(4)

where $U^{T} = [u_{1}, u_{2}, \dots, u_{n}]$ and $W^{T} = [w_{1}, w_{2}, \dots, w_{n}]$ respectively denote the input vector and weight vector of a neuron, and $w_{0}$ is referred to as the bias. $Γ(\cdot)$ is a monotone continuous function called the activation function, whose common choices include the sigmoidal function, hyperbolic tangent function, rectified linear unit, and identity function. An NN is built from many neurons interconnected in a layered manner. Suppose the neurons are organized in layers l = 0, 1, …, L and the inputs of a neuron in the lth layer are the outputs of all neurons in the (l − 1)th layer, which can be formulated as in equation (5):

y_{i}^{l} = Γ (\sum_{j} w_{i, j}^{l} y_{j}^{l - 1} + w_{i, 0}^{l})

(5)

where $y_{i}^{l}$ is the output of the ith neuron in the lth layer, and $[W_{i}^{l}]^{T} = [w_{i, 1}^{l}, w_{i, 2}^{l}, \dots, w_{i, n_{l}}^{l}]$ and $w_{i, 0}^{l}$ are the corresponding weight vector and bias, respectively. By convention, l = 0 is referred to as the input layer and l = L as the output layer, and the remaining layers are known as hidden layers. The bias $w_{i, 0}^{l}$ can also be regarded as a weight from a neuron whose output is identically unity. From a mathematical point of view, an NN defined above is equivalent to a specific family of parameterized mappings: letting m and n be the numbers of input and output variables, respectively, the network defines a continuous mapping NN: $R^{m} \to R^{n}$. Being a universal function approximator, the NN has been widely leveraged to model different nonlinear phenomena during the past several decades. In recent years, with the emergence of the deep learning concept, different types of deep NN models have been introduced to address problems in many real-world applications. A deep NN is constructed by stacking multiple hidden layers on top of each other, which can lead to a significant improvement in the network's modeling performance. Compared with a shallow architecture, a deep-layered architecture can better approximate functions with a high level of complexity,²⁰,²¹ which is achieved through the multiple computational layers within the deep NN. More specifically, each processing layer is capable of automatically extracting discriminative features from its input data, and features in the original space can be transformed into a more informative feature space through layer-by-layer feature extraction, thereby making the modeling task much easier. From a mathematical perspective, a deep NN generates a rather complicated composite function that is made up of many sub-functions through the use of connection weights and multiple layers.
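The layer-by-layer composition described by equation (5) can be sketched in a few lines. This is an illustrative sketch only: the layer widths are hypothetical, the hyperbolic tangent is assumed for $Γ$ in the hidden layers, and the output layer is assumed to use the identity activation.

```python
import numpy as np

def forward(x, weights, biases, activation=np.tanh):
    """Apply equation (5) layer by layer; the last layer is kept linear (an assumption)."""
    y = np.asarray(x, dtype=float)
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        pre = W @ y + b                                   # weighted sum plus bias
        y = pre if l == len(weights) else activation(pre)
    return y

rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]                                      # hypothetical layer widths
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
out = forward(np.ones(3), weights, biases)                # a mapping R^3 -> R^2
```

Adding entries to `sizes` deepens the network without changing the code, which is the sense in which a deep NN is a composite of many sub-functions.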

Nonlinear system identification and network implementation

SI is a well-developed field in the control community, and there is an abundance of publications in this area (e.g. Prochazka et al.²² and Juang²³). Nonlinear system identification is more challenging than its linear counterpart. Although f and h in (2) can be realized in multiple ways, the neural network, as a universal approximator,²⁴ is typically the most favored method for this purpose. A feedforward network can be represented as

\begin{matrix} \hat{y} = z^{(N)} = φ^{(N)} (z^{(N - 1)}) \\ z^{(n)} = φ^{(n)} (z^{(n - 1)}), \quad n = 1, 2, \dots, N \\ z^{(0)} = x \end{matrix}

(6)

where x, $z^{(n)}$, and $\hat{y}$ represent the external input, the output of the nth hidden layer, and the network output, respectively. The mapping in a hidden layer takes the form $φ^{(n)} (z) = γ (W^{(n)} z + b^{(n)})$, which comprises a linear mapping $W^{(n)} z + b^{(n)}$ followed by a component-wise nonlinear mapping $γ$ ($γ (\cdot)$ is commonly a monotone continuous mapping, e.g. the hyperbolic tangent function). The outermost layer is referred to as the output layer, where a linear mapping is used in place of the nonlinearity, namely $φ^{(N)} (z) = σ (W^{(N)} z + b^{(N)})$ with $σ (\cdot)$ a linear mapping. In the above definitions, the parameters can be compactly expressed as $(W^{(n)}, b^{(n)})$ for $n = 1, 2, \dots, N$, where $W^{(n)}$ and $b^{(n)}$ denote the weights and biases of the nth layer, respectively. For system identification, feedback connections from the output to the input side are adopted to give the above network a dynamical property. In this case, with the regression vector $φ (k)$ in (3) taken as the input, the resulting model is a nonlinear auto-regressive exogenous (NARX) neural model. One major limitation of this architecture is that only finite past input/output information is used to predict the output at the next instant. A potential remedy is to introduce a “hidden state” by creating feedback connections among hidden units; a network architecture of this form is referred to as a recurrent neural network (RNN). Consider an input sequence $x = \{x_{1}, \dots, x_{T}\}$ with $x_{t} \in R^{n}$. An RNN with a single hidden layer computes the hidden-layer sequence $h = \{h_{1}, \dots, h_{T}\}$ with $h_{t} \in R^{s}$ and the output sequence $y = \{y_{1}, \dots, y_{T}\}$ with $y_{t} \in R^{m}$ (where n, s, and m denote the numbers of input, hidden, and output units, respectively) by recursively applying (7) from $t = 1$ to T:

\begin{matrix} h_{t} = H (W_{xh} x_{t} + W_{hh} h_{t - 1} + b_{h}), \\ y_{t} = W_{hy} h_{t} + b_{y}, \end{matrix} \quad t = 1, 2, \dots, T

(7)

where $W_{xh}$, $W_{hh}$, and $W_{hy}$ denote the input-hidden, hidden-hidden, and hidden-output weight matrices with appropriate dimensions, respectively. $b_{h}$ and $b_{y}$ denote the bias vectors of the hidden and output layers, respectively, while $H$ is a smooth nonlinearity, typically a sigmoidal function applied elementwise. Figure 3 is a simplified diagram of an RNN; it shows that the hidden state $h_{t}$ at instant t depends upon both the input $x_{t}$ and the fed-back hidden state $h_{t - 1}$.

Figure 3.

Schematic architecture of the recurrent multilayer perceptron. z⁻¹: one time lag.

The nonlinear mapping defined by (7) matches the true system (2) more closely in form, and thus theoretically has advantages over the NARX network in system modeling.
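The recursion for the single-hidden-layer recurrent network can be sketched as follows. This is a minimal illustration under stated assumptions: $H$ is taken to be the hyperbolic tangent, the initial hidden state is set to zero, and all sizes and random weights are hypothetical.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Run the hidden-state recursion over a sequence; H is taken to be tanh."""
    h = np.zeros(W_hh.shape[0])                  # h_0 = 0 is an assumption
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h) # state update
        outputs.append(W_hy @ h + b_y)           # linear read-out
    return np.array(outputs)

rng = np.random.default_rng(0)
n, s, m, T = 1, 4, 1, 6                          # hypothetical sizes
W_xh = rng.standard_normal((s, n))
W_hh = 0.5 * rng.standard_normal((s, s))
W_hy = rng.standard_normal((m, s))
b_h, b_y = np.zeros(s), np.zeros(m)
x_seq = rng.standard_normal((T, n))
y_seq = rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y)   # shape (T, m)
```

The variable `h` plays the role of the state $x(k)$ in (2): it is updated from the previous state and the current input, and the output is read from it.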

Deep learning based nonlinear system identification

For modeling processes with highly nonlinear dynamics, such as the FGD process studied in this paper, neither a shallow RNN nor a NARX network is a good choice because of their structural limitations. Such cases call for more advanced approaches. The field of DL has gained considerable attention over the past few years, and DL techniques are now applied in a wide span of areas (such as text classification²⁵ and visual recognition²⁶). Among the many branches of DL, the most relevant one for system identification is temporal-sequence learning (TSL), and the deep RNN, as the most frequently used TSL model, can also be applied in the system identification field. The word “deep” in deep learning means that multiple layers (e.g. convolutional layers, recurrent layers, etc.) are stacked together so that extremely complicated nonlinear dynamics can be captured. As an example, the RNN in (7) is modified to its deep version as follows. Denote the sequence in the nth hidden layer by $h^{n} = \{h_{1}^{n}, \dots, h_{T}^{n}\}$ with $h_{t}^{n} \in R^{s_{n}}$ ($s_{n}$ is the number of hidden units in the nth hidden layer); then $h^{n}$ is calculated by iterating from $n = 1$ to N and $t = 1$ to T:

h_{t}^{n} = H (W_{h^{n - 1} h^{n}} h_{t}^{n - 1} + W_{h^{n} h^{n}} h_{t - 1}^{n} + b_{h}^{n}), \quad t = 1, 2, \dots, T, \quad n = 1, 2, \dots, N

(8)

where $h^{0} = x$ , $W_{h^{n - 1} h^{n}}$ denotes weight matrix connecting the (n − 1)th hidden layer and the nth hidden layer, while the inner-loop weight matrix in the nth hidden layer is represented as $W_{h^{n} h^{n}}$ . $b_{h}^{n}$ is nth-hidden layer’s bias vector and the network response $y_{t}$ is calculated as

y_{t} = W_{h^{N} y} h_{t}^{N} + b_{y}, \quad t = 1, 2, \dots, T

(9)

$W_{h^{N} y}$ is the weight matrix on the connection from layer $h^{N}$ to the output layer, and $b_{y}$ is the bias vector of the output layer. A deep mapping is then obtained by choosing a large N; a DL-based NARX network can be constructed in the same manner. From an implementation viewpoint, the above two types of deep dynamic network suffer from the exploding/vanishing gradient problem, which is formalized in Bengio et al.²⁷ As an alternative, the LSTM network is considered below and its connection to system identification is also discussed.
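The stacked recursion in (8) and the read-out in (9) can be sketched as below. This is only an illustration: $H$ is assumed to be tanh, the initial states are assumed to be zero, and the layer widths and weight scales are hypothetical.

```python
import numpy as np

def deep_rnn_forward(x_seq, layers, W_out, b_out):
    """layers: list of (W_in, W_rec, b) tuples, one per hidden layer, as in equation (8)."""
    states = [np.zeros(W_rec.shape[0]) for _, W_rec, _ in layers]  # h_0^n = 0 (assumed)
    outputs = []
    for x_t in x_seq:
        below = x_t                                    # h_t^0 = x_t
        for n, (W_in, W_rec, b) in enumerate(layers):
            states[n] = np.tanh(W_in @ below + W_rec @ states[n] + b)
            below = states[n]                          # input to the next layer
        outputs.append(W_out @ below + b_out)          # equation (9)
    return np.array(outputs)

rng = np.random.default_rng(1)
sizes = [1, 8, 8, 8]                                   # input width, then three hidden layers
layers = [(0.3 * rng.standard_normal((s_n, s_prev)),
           0.3 * rng.standard_normal((s_n, s_n)),
           np.zeros(s_n))
          for s_prev, s_n in zip(sizes[:-1], sizes[1:])]
W_out, b_out = rng.standard_normal((1, sizes[-1])), np.zeros(1)
y_seq = deep_rnn_forward(rng.standard_normal((10, 1)), layers, W_out, b_out)
```

Because each layer applies a squashing nonlinearity at every step, gradients propagated back through many layers and time steps are repeatedly scaled, which is the practical source of the exploding/vanishing behavior mentioned above.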

LSTMs deep dynamic neural network

This section consists of two parts. The first part gives a brief overview of the LSTM network, and the second part presents the use of deep LSTMs for system identification.

Long short-term memory

The LSTM network, originally introduced in Hochreiter and Schmidhuber,²⁸ is a modified version of the standard RNN in which gate units are added to guarantee constant error flow inside the cell, thereby avoiding the exploding/vanishing gradient problem. LSTMs are appealing from a system identification viewpoint because of their superior ability to learn long-range dependencies compared with simple RNNs.²⁹ An LSTM unit is made up of a cell and three gates (i.e. the input gate, the output gate, and the forget gate). The forget gate determines how cell states from previous time steps are preserved, while the input and output gates control how information enters and leaves the LSTM unit at each time instant. The combined effect of the gating units regulates the information flow, which in turn enables the long-term memory of the LSTM network. For illustrative purposes, the internal and external architecture of an LSTM unit is shown below.

The top plot in Figure 4 shows the inner structure of an LSTM unit, where $x_{t}$ is the input vector at time t. According to the connections shown in the figure, the LSTM unit can be mathematically stated as

\begin{matrix} f_{t} = σ (W_{f} x_{t} + R_{f} h_{t - 1} + P_{f} \times c_{t - 1} + b_{f}) \\ i_{t} = σ (W_{i} x_{t} + R_{i} h_{t - 1} + P_{i} \times c_{t - 1} + b_{i}) \\ o_{t} = σ (W_{o} x_{t} + R_{o} h_{t - 1} + P_{o} \times c_{t} + b_{o}) \\ z_{t} = \tanh (W_{z} x_{t} + R_{z} h_{t - 1} + b_{z}) \end{matrix}

(10)

where $f_{t}$ , $i_{t}$ , and $o_{t}$ denote respectively the activation vector of the forget, input and output gate at time instant t, $z_{t}$ is the input modification. Here, the notation “*” is introduced to refer to one of the following four quantities: f, i, z, and o, then $W_{*}$ , $R_{*}$ , and $P_{*}$ (peephole weights) are weight matrices associated with the current input $x_{t}$ , unit output in the last time step $h_{t - 1}$ and cell output in the last time step $c_{t - 1}$ , respectively. Additionally, $b_{*}$ denotes the corresponding bias vector with respect to f, i, z, or o, and the operator “×” denotes the Hadamard product. The cell state $c_{t}$ and hidden state $h_{t}$ are defined as

\begin{matrix} c_{t} = z_{t} \times i_{t} + c_{t - 1} \times f_{t} \\ h_{t} = o_{t} \times \tanh (c_{t}) \end{matrix}

(11)

Figure 4.

Schematic architecture of an LSTM unit. Top: Internal structure of LSTM. Bottom: External structure of LSTM (Z⁻¹: one time lag).

The bottom plot in Figure 4 shows the external structure of an LSTM unit. From the system identification point of view, it is preferable to treat the unit as a black-box module with a recurrent structure. Specifically, two dimensions are introduced to describe the signal flow. In the time dimension, both the cell state $c_{t}$ and the hidden state $h_{t}$ are sent back to the input side after a time lag, and $h_{t}$ is also passed on as the unit output. In the layer dimension, an LSTM unit can simply be treated as an input-output model whose input is either the output of another unit or an external signal.
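One time step of the peephole LSTM unit in (10) and (11) can be sketched as follows. The sketch rests on assumptions that are not stated in the text: the peephole weights are stored as vectors so that the Hadamard product applies directly, the output gate uses the updated cell state $c_{t}$ as printed in (10), and the helper `init_params`, the sizes, and the random initialization are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of equations (10)-(11); p maps names such as 'W_f' to arrays."""
    f = sigmoid(p["W_f"] @ x_t + p["R_f"] @ h_prev + p["P_f"] * c_prev + p["b_f"])
    i = sigmoid(p["W_i"] @ x_t + p["R_i"] @ h_prev + p["P_i"] * c_prev + p["b_i"])
    z = np.tanh(p["W_z"] @ x_t + p["R_z"] @ h_prev + p["b_z"])
    c = z * i + c_prev * f                                   # cell state, equation (11)
    o = sigmoid(p["W_o"] @ x_t + p["R_o"] @ h_prev + p["P_o"] * c + p["b_o"])
    h = o * np.tanh(c)                                       # hidden state, equation (11)
    return h, c

def init_params(n_in, n_hid, rng):
    """Hypothetical helper: small random weights, zero biases."""
    p = {}
    for g in "fizo":
        p[f"W_{g}"] = 0.1 * rng.standard_normal((n_hid, n_in))
        p[f"R_{g}"] = 0.1 * rng.standard_normal((n_hid, n_hid))
        p[f"b_{g}"] = np.zeros(n_hid)
        if g != "z":
            p[f"P_{g}"] = 0.1 * rng.standard_normal(n_hid)   # peephole vector
    return p

rng = np.random.default_rng(0)
p = init_params(n_in=1, n_hid=4, rng=rng)
h, c = np.zeros(4), np.zeros(4)                              # zero initial states (assumed)
for x_t in rng.standard_normal((5, 1)):
    h, c = lstm_step(x_t, h, c, p)
```

The pair `(h, c)` is exactly what is fed back along the time dimension, while `h` alone is what a unit in the next layer would receive as its input.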

Bidirectional LSTM network

It is interesting to note that a common feature of all the above networks is that only the previous context is utilized to calculate the current output, which to some extent limits the learning capability of the network. To overcome this limitation, the bidirectional RNN (BRNN) was proposed in Schuster and Paliwal³⁰ so that not only past but also future sequence elements are taken into consideration. Unlike the standard RNN of the form (7), in a BRNN the single hidden layer is replaced by a forward and a backward hidden layer. The backward hidden sequence $\overset{\leftarrow}{h}$ is computed by iterating from $t = T$ to 1, while the forward sequence $\vec{h}$ is computed by iterating from $t = 1$ to T. The output sequence y is then updated by considering both $\vec{h}$ and $\overset{\leftarrow}{h}$:

{\vec{h}}_{t} = H (W_{x \vec{h}} x_{t} + W_{\vec{h} \vec{h}} {\vec{h}}_{t - 1} + b_{\vec{h}})

{\overset{\leftarrow}{h}}_{t} = H (W_{x \overset{\leftarrow}{h}} x_{t} + W_{\overset{\leftarrow}{h} \overset{\leftarrow}{h}} {\overset{\leftarrow}{h}}_{t + 1} + b_{\overset{\leftarrow}{h}})

y_{t} = W_{\vec{h} y} {\vec{h}}_{t} + W_{\overset{\leftarrow}{h} y} {\overset{\leftarrow}{h}}_{t} + b_{y}

(12)
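The two passes of a bidirectional recurrent network can be sketched as follows. Following the iteration order described in the text, the backward state at step t is computed from the state at step t + 1, since that pass runs from T down to 1. The sketch assumes tanh for $H$ and zero initial states at both ends of the sequence; the function and argument names are hypothetical.

```python
import numpy as np

def brnn_forward(x_seq, fwd, bwd, W_fy, W_by, b_y):
    """fwd and bwd are (W_x, W_h, b) tuples for the forward and backward hidden layers."""
    T = len(x_seq)
    n_f, n_b = fwd[1].shape[0], bwd[1].shape[0]
    h_f = np.zeros((T, n_f))
    h_b = np.zeros((T, n_b))
    state = np.zeros(n_f)
    for t in range(T):                       # forward pass: t = 1, ..., T
        state = np.tanh(fwd[0] @ x_seq[t] + fwd[1] @ state + fwd[2])
        h_f[t] = state
    state = np.zeros(n_b)
    for t in reversed(range(T)):             # backward pass: t = T, ..., 1
        state = np.tanh(bwd[0] @ x_seq[t] + bwd[1] @ state + bwd[2])
        h_b[t] = state
    return h_f @ W_fy.T + h_b @ W_by.T + b_y # output combines both directions

rng = np.random.default_rng(0)
make = lambda s, n: (rng.standard_normal((s, n)), 0.5 * rng.standard_normal((s, s)), np.zeros(s))
fwd, bwd = make(3, 1), make(3, 1)            # hypothetical sizes
y_seq = brnn_forward(rng.standard_normal((5, 1)), fwd, bwd,
                     rng.standard_normal((1, 3)), rng.standard_normal((1, 3)), np.zeros(1))
```

Because the backward pass needs the whole sequence before it can start, a bidirectional model of this kind is suited to off-line identification on recorded data rather than to sample-by-sample prediction.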

The bidirectional LSTM (BLSTM) is obtained by integrating the BRNN and the LSTM,³¹ and has the property of memorizing long-span relations within a sequence in both directions. Since the forward part of the BLSTM is identical to the LSTM unit introduced earlier, only the mathematical expression for the backward part is given below. With reference to (10)–(12), it is derived as

\begin{matrix} {\overset{\leftarrow}{f}}_{t} & = σ (W_{\overset{\leftarrow}{f}} x_{t} + R_{\overset{\leftarrow}{f}} {\overset{\leftarrow}{h}}_{t + 1} + P_{\overset{\leftarrow}{f}} \times {\overset{\leftarrow}{c}}_{t + 1} + b_{\overset{\leftarrow}{f}}) \\ {\overset{\leftarrow}{i}}_{t} & = σ (W_{\overset{\leftarrow}{i}} x_{t} + R_{\overset{\leftarrow}{i}} {\overset{\leftarrow}{h}}_{t + 1} + P_{\overset{\leftarrow}{i}} \times {\overset{\leftarrow}{c}}_{t + 1} + b_{\overset{\leftarrow}{i}}) \\ {\overset{\leftarrow}{o}}_{t} & = σ (W_{\overset{\leftarrow}{o}} x_{t} + R_{\overset{\leftarrow}{o}} {\overset{\leftarrow}{h}}_{t + 1} + P_{\overset{\leftarrow}{o}} \times {\overset{\leftarrow}{c}}_{t + 1} + b_{\overset{\leftarrow}{o}}) \\ {\overset{\leftarrow}{z}}_{t} & = \tanh (W_{\overset{\leftarrow}{z}} x_{t} + R_{\overset{\leftarrow}{z}} {\overset{\leftarrow}{h}}_{t + 1} + b_{\overset{\leftarrow}{z}}) \\ {\overset{\leftarrow}{c}}_{t} & = {\overset{\leftarrow}{z}}_{t} \times {\overset{\leftarrow}{i}}_{t} + {\overset{\leftarrow}{c}}_{t + 1} \times {\overset{\leftarrow}{f}}_{t} \\ {\overset{\leftarrow}{h}}_{t} & = {\overset{\leftarrow}{o}}_{t} \times \tanh ({\overset{\leftarrow}{c}}_{t}) \end{matrix}

(13)

where the parameters in (13) are defined in the same fashion as those in (10) and (11), the only difference being the notation. Next, deep versions of the two LSTMs above are given and used for system identification purposes.

Deep design of LSTM for dynamic system identification

In this part, the design of the deep layered vanilla LSTM and BLSTM is given first, and the validity of using LSTMs for system identification is then analyzed theoretically. As with the construction of the deep RNN/NARX network, deep LSTMs are realized by cascading multiple LSTM (BLSTM) units. From the system identification viewpoint, each unit can be considered as a forced dynamic system of the form (10) and (11), so a DL-based LSTM essentially corresponds to multiple forced dynamical systems connected in cascade and is thus expected to have a powerful dynamics approximation capability. The nth layer of a deep LSTM network at time t is formulated as

\begin{matrix} f_{t}^{n} = σ (W_{f}^{n} h_{t}^{n - 1} + R_{f}^{n} h_{t - 1}^{n} + P_{f}^{n} \times c_{t - 1}^{n} + b_{f}^{n}) \\ i_{t}^{n} = σ (W_{i}^{n} h_{t}^{n - 1} + R_{i}^{n} h_{t - 1}^{n} + P_{i}^{n} \times c_{t - 1}^{n} + b_{i}^{n}) \\ o_{t}^{n} = σ (W_{o}^{n} h_{t}^{n - 1} + R_{o}^{n} h_{t - 1}^{n} + P_{o}^{n} \times c_{t - 1}^{n} + b_{o}^{n}) \\ z_{t}^{n} = \tanh (W_{z}^{n} h_{t}^{n - 1} + R_{z}^{n} h_{t - 1}^{n} + b_{z}^{n}) \end{matrix}

(14)

where the subscript signifies the time instant, $t = 1, \dots, T$, while the superscript is the layer index, $n = 1, \dots, N$, with N the number of LSTM layers. In addition, $h_{t}^{0} \overset{Δ}{=} x_{t}$, where $x_{t}$ is the external input at time t. The unit output is calculated as

\begin{matrix} c_{t}^{n} = z_{t}^{n} \times i_{t}^{n} + c_{t - 1}^{n} \times f_{t}^{n} \\ h_{t}^{n} = o_{t}^{n} \times \tanh (c_{t}^{n}) \end{matrix}

(15)

Likewise, a DL-based BLSTM is constructed by stacking BLSTM units introduced earlier, wherein the forward hidden layer sequence ( ${\vec{h}}_{t}^{n}, {\vec{c}}_{t}^{n}$ ) is computed by iterating from $t = 1$ to T, while the sequence of backward hidden layer ( ${\overset{\leftarrow}{h}}_{t}^{n}, {\overset{\leftarrow}{c}}_{t}^{n}$ ) is derived by reversely iterating from $t = T$ to 1. As before, only the unit in backward layer is considered and for layer n and time t, we have

\begin{matrix} {\overset{\leftarrow}{f}}_{t}^{n} & = σ (W_{\overset{\leftarrow}{f}}^{n} {\overset{\leftarrow}{h}}_{t}^{n - 1} + R_{\overset{\leftarrow}{f}}^{n} {\overset{\leftarrow}{h}}_{t + 1}^{n} + P_{\overset{\leftarrow}{f}}^{n} \times {\overset{\leftarrow}{c}}_{t + 1}^{n} + b_{\overset{\leftarrow}{f}}^{n}) \\ {\overset{\leftarrow}{i}}_{t}^{n} & = σ (W_{\overset{\leftarrow}{i}}^{n} {\overset{\leftarrow}{h}}_{t}^{n - 1} + R_{\overset{\leftarrow}{i}}^{n} {\overset{\leftarrow}{h}}_{t + 1}^{n} + P_{\overset{\leftarrow}{i}}^{n} \times {\overset{\leftarrow}{c}}_{t + 1}^{n} + b_{\overset{\leftarrow}{i}}^{n}) \\ {\overset{\leftarrow}{o}}_{t}^{n} & = σ (W_{\overset{\leftarrow}{o}}^{n} {\overset{\leftarrow}{h}}_{t}^{n - 1} + R_{\overset{\leftarrow}{o}}^{n} {\overset{\leftarrow}{h}}_{t + 1}^{n} + P_{\overset{\leftarrow}{o}}^{n} \times {\overset{\leftarrow}{c}}_{t + 1}^{n} + b_{\overset{\leftarrow}{o}}^{n}) \\ {\overset{\leftarrow}{z}}_{t}^{n} & = σ (W_{\overset{\leftarrow}{z}}^{n} {\overset{\leftarrow}{h}}_{t}^{n - 1} + R_{\overset{\leftarrow}{z}}^{n} {\overset{\leftarrow}{h}}_{t + 1}^{n} + b_{\overset{\leftarrow}{z}}^{n}) \end{matrix}

(16a)

Two types of states are mathematically expressed as

\begin{matrix} {\overset{\leftarrow}{c}}_{t}^{n} & = {\overset{\leftarrow}{z}}_{t}^{n} \times {\overset{\leftarrow}{i}}_{t}^{n} + {\overset{\leftarrow}{c}}_{t + 1}^{n} \times {\overset{\leftarrow}{f}}_{t}^{n} \\ {\overset{\leftarrow}{h}}_{t}^{n} & = {\overset{\leftarrow}{o}}_{t}^{n} \times \tanh ({\overset{\leftarrow}{c}}_{t}^{n}) \end{matrix}

(16b)

and the network response is calculated as

y_{t} = W_{\vec{y}} {\vec{h}}_{t} + W_{\overset{\leftarrow}{y}} {\overset{\leftarrow}{h}}_{t} + b_{y}

(17)

where ${\vec{h}}_{t}$ and ${\overset{\leftarrow}{h}}_{t}$ are respectively the final responses of the forward and backward layers, $W_{\vec{y}}$ and $W_{\overset{\leftarrow}{y}}$ are the corresponding output weight matrices for the two types of layers, and $b_{y}$ is the bias vector. For illustrative purposes, the DL-based BLSTM is unfolded along two dimensions (the time dimension and the layer dimension) and depicted in Figure 5, where each forward/backward LSTM unit is represented by a rectangular block, and multiple memory cells may be incorporated in an LSTM unit.

Figure 5.

An unfolded deep learning-based BLSTM. The circular block in an LSTM unit denotes a memory cell.

Next, the identification capability of an LSTM-based network is analyzed theoretically. It follows from (14) and (15) that there exist two types of states in an LSTM unit, namely a cell state $c_{t}^{n}$ and a hidden state $h_{t}^{n}$. The state vector of the nth layer can then be represented as $(x_{t}^{n})^{T} \overset{Δ}{=} [(c_{t}^{n})^{T} \; (h_{t}^{n})^{T}]$. Aggregating all $x_{t}^{n}$ ($n = 1, \dots, N$, where N is the number of hidden layers) in an LSTM network gives the state vector $x^{T} (t) \overset{Δ}{=} [(x_{t}^{1})^{T}, \dots, (x_{t}^{N})^{T}]$. In addition, let $θ (NN)$ denote the parameter vector containing all weights and biases of the network NN, and let the network input be $x_{t}^{0} \overset{Δ}{=} u (t)$ and the output $y_{t} \overset{Δ}{=} y (t)$. From (14) and (15), the cell state $c_{t}^{n}$ and hidden state $h_{t}^{n}$ in an arbitrary layer depend upon (1) the state of the preceding layer ($c_{t}^{n - 1}$ and $h_{t}^{n - 1}$) and (2) the state of the current layer one time instant earlier ($c_{t - 1}^{n}$ and $h_{t - 1}^{n}$). By iteratively applying (14) and (15), it is eventually found that $x_{t}^{n}$ depends upon the state vectors of layer n and all preceding layers at the previous instant, namely $(x_{t - 1}^{n}, \dots, x_{t - 1}^{1})$. Together with the fact that the input to the first hidden layer is $x_{t}^{0}$, and using the notation introduced above, we conclude that the state equation defined by (14) and (15) has exactly the same form as that in (2). Furthermore, the network response $y_{t}$ depends directly upon the hidden state $h_{t}^{N}$; since $h_{t}^{N}$ is contained in $x (t)$, the output $y_{t}$ also matches the output equation defined in (2). From the above analysis, it follows that an LSTM-based network agrees structurally with the real system.

At this point, the focus is on the design of the identification system for the FGD process. As mentioned earlier, the FGD process to be identified is an SISO dynamical system, with the feeding flow of slurry $u_{t}$ being the input and the slurry pH $y_{t}$ treated as the output. The resulting identification system is shown in Figure 6, where one can observe that the LSTMs-based network is connected in parallel with the identified FGD process. The realization of an LSTM-based identifier is illustrated in the lower part of the figure, where it is found to comprise three key components: the external input sequence $\{u_{t}\}$ , a chain of LSTM units and a fully-connected layer. At each time step t, the element $u_{t}$ of the sequence is fed into the cascaded LSTM units, and the output of each layer is then successively computed until the output $h_{t}^{N}$ is generated. Instead of taking $h_{t}^{N}$ as the input to the fully-connected layer directly, the dropout technique is adopted to prevent overfitting to the training data. In the fully-connected layer, ReLU and linear functions are respectively utilized as the activations for the hidden layer and the output layer, and the final output ${\hat{y}}_{t}$ is thus calculated. By following the above-stated procedure, an output sequence of the same length as the input sequence is obtained. Furthermore, the approximation error $e_{t}$ $\overset{Δ}{=}$ $y_{t} - {\hat{y}}_{t}$ serves as a guide for training the LSTM-based identifier at each instant.
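The readout stage described above (dropout on $h_{t}^{N}$ , a ReLU hidden layer, a linear output layer, and the error $e_{t}$ ) can be sketched as follows. This is a minimal illustration under stated assumptions: inverted dropout is used, the layer sizes, weights and the values of `h_top` and `y_true` are placeholders chosen for readability, and the helper names are hypothetical.

```python
import random


def dropout(vec, rate, rng, training=True):
    """Inverted dropout: zero each entry with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(vec)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


def relu(z):
    return max(0.0, z)


def dense(vec, weights, biases, activation):
    """Fully-connected layer with one weight row per output unit."""
    return [
        activation(sum(w * v for w, v in zip(row, vec)) + b)
        for row, b in zip(weights, biases)
    ]


def readout(h_top, params, rate, rng, training):
    """Dropout -> ReLU hidden layer -> linear output layer."""
    dropped = dropout(h_top, rate, rng, training)
    hidden = dense(dropped, params["W1"], params["b1"], relu)
    (y_hat,) = dense(hidden, params["W2"], params["b2"], lambda z: z)
    return y_hat


rng = random.Random(1)
params = {                          # placeholder weights, not fitted values
    "W1": [[0.4, -0.3, 0.2], [0.1, 0.5, -0.6]],
    "b1": [0.0, 0.1],
    "W2": [[0.7, -0.2]],
    "b2": [5.0],
}
h_top = [0.2, -0.1, 0.3]            # stands in for h_t^N
y_true = 5.3                        # stands in for a measured slurry pH
y_hat = readout(h_top, params, rate=0.15, rng=rng, training=False)
e_t = y_true - y_hat                # approximation error used to guide training
```

Dropout is disabled at evaluation time (`training=False`), so the prediction is deterministic; during training the same function would be called with `training=True`.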

Figure 6.

Identification system for the FGD process, including a schematic of the deep learning–based LSTMs network.

Experimental results

In this section, all identification models described earlier are evaluated in an FGD context. For comparison purposes, representative models from other literature in this direction are also considered, and a quantitative assessment of each candidate model is performed.

Experimental condition

The simulation study carried out in this section is based on a real FGD process at a 600 MW coal-fired power plant located in the province of Hebei, China. Operation data (i.e. the flow of feeding limestone slurry and the slurry pH value in the reaction tank) are obtained from the on-line data logging system, which records measurements within a fixed time interval per day. With a sampling time of 1 min, observations from January 3, 2022 to January 15, 2022 are obtained. After the outlier/missing data removal procedure is performed, the resulting dataset is divided into two parts: 80% for training and 20% for testing. Here a distinction is made between the two data sets: the training set is utilized for model development, while the test set provides the final evaluation of the generalization performance of a given model.
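The data preparation step can be sketched as below. The paper does not state the outlier rule or whether the 80/20 split is chronological, so both are assumptions here: samples with a missing value or a pH outside a user-supplied range are dropped, and the first 80% of the remaining samples (in time order) form the training set. The function names and the synthetic records are hypothetical.

```python
import math


def clean(records, low, high):
    """Drop (flow, pH) samples with a missing value or a pH outside [low, high]."""
    kept = []
    for flow, ph in records:
        if flow is None or ph is None:
            continue
        if math.isnan(flow) or math.isnan(ph):
            continue
        if not (low <= ph <= high):
            continue
        kept.append((flow, ph))
    return kept


def chronological_split(records, train_fraction=0.8):
    """Keep time order: earliest samples for training, latest for testing."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


# Synthetic 1-min samples (slurry flow, slurry pH), plus three bad rows.
records = [(10.0 + 0.1 * k, 5.0 + 0.01 * k) for k in range(10)]
records += [(None, 5.2), (11.0, float("nan")), (11.0, 14.5)]

cleaned = clean(records, low=4.0, high=7.0)
train_set, test_set = chronological_split(cleaned)
```

A chronological split avoids evaluating on samples that precede the training data, which is a common precaution for dynamic system identification.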

Network architecture and experimental design

In our experiment, all identifiers described earlier are evaluated, including: (i) the NARX-based network in equations (3) and (4), (ii) the recurrent neural network (RNN) in equations (6) and (7), (iii) the LSTM network in equations (12) and (13), and (iv) the BLSTM network in equations (14) and (15). Furthermore, identifiers from past works on FGD process modeling are also involved, namely the ARX model in Li et al.,³² the cascade forward neural network (CFNN) in Di Capaci and Scali,³³ the NARX model in Li et al.³⁴ and the RNN in Wu et al.³⁵ In all cases, parameters are estimated by minimizing the loss $J (θ) = \frac{1}{T} \sum_{t = 1}^{T} {‖ y_{t} - {\hat{y}}_{t} ‖}^{2}$ , where $y_{t}$ and ${\hat{y}}_{t}$ are respectively the real value and the model output at time t, and T is the length of the training sequence. For the neural identifiers listed above, training is carried out using the Adam optimizer,³⁶ with mini-batch training and a batch size of 32. The initial learning rate is set to 10⁻³ and is decayed when a plateau in the performance on the validation set is detected (the patience is set to 6 epochs).
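The loss $J (θ)$ and the plateau-based learning-rate decay can be sketched as follows. The initial rate (10⁻³) and the patience (6 epochs) come from the text; the decay factor of 0.5 and the exact plateau rule (no improvement of the best validation loss for `patience` consecutive epochs) are assumptions, and `PlateauDecay` is a hypothetical name.

```python
def mse_loss(y_true, y_pred):
    """J = (1/T) * sum of squared errors over a scalar output sequence."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` stagnant epochs."""

    def __init__(self, lr=1e-3, patience=6, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor        # assumed value, not given in the text
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# Validation loss improves twice, then stalls for six epochs.
scheduler = PlateauDecay()
history = [scheduler.step(v) for v in [1.0, 0.5] + [0.5] * 6]
```

In this run the rate stays at 10⁻³ until the sixth stagnant epoch and is halved at that point; library schedulers may count the patience slightly differently.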

Hyper-parameters play an essential role in the performance of a model, and the best-performing model depends upon the optimal combination of hyper-parameters. Each experimental model discussed in this section comprises multiple hyper-parameters (e.g. the key hyper-parameters in a vanilla LSTM include the memory cell number, the dropout rate and the number of stacked LSTM blocks). As a consequence, it is essential to choose an appropriate method to optimize them. Here, random search combined with the five-fold cross-validation technique is adopted to optimize the hyper-parameters of each experimental model. Because multiple hyper-parameters are involved for each experimental model, the grid search and manual search methods become computationally infeasible. Additionally, random search retains the practical merits of grid search and is more efficient for hyper-parameter optimization in a high-dimensional search space.³⁷ Random search, which samples the hyper-parameter space randomly to reduce training time while maintaining performance, is therefore adopted. Besides, according to Browne,³⁸ the cross-validation technique has the benefits of making efficient use of data and preventing over-fitting. In five-fold cross-validation, the training set is partitioned into five subsets, four of which are used for training and one for evaluating the performance of the established model. This process is repeated five times so that each subset is chosen once for validation, and the results are then averaged to obtain the final cross-validation result. To facilitate the performance evaluation of each experimental model, two evaluation metrics, the mean squared error (MSE) and the mean absolute percentage error (MAPE), are introduced, which are written as

$\mathrm{MSE} = \frac{1}{N} \sum_{t = 1}^{N} {(y_{t} - {\hat{y}}_{t})}^{2}$

(18)

$\mathrm{MAPE} = \frac{1}{N} \sum_{t = 1}^{N} \left| \frac{y_{t} - {\hat{y}}_{t}}{y_{t}} \right|$

(19)

where $y_{t}$ and ${\hat{y}}_{t}$ are respectively the actual and predicted values at instant t, and N denotes the total number of samples. Of the two evaluation metrics, MSE reflects the degree of deviation between the predicted value of the model and the real value, and a lower MSE score implies a higher data approximation ability of the model. MAPE indicates the average of the relative errors defined in terms of actual and predicted values. It is one of the most extensively used measures of model performance and has the benefits of scale-independency and interpretability;³⁹ the lower the MAPE value, the better the model performance. In this research, the hyper-parameter optimization technique we used can be summarized as follows:

(1) Estimate the range of each hyper-parameter in an experimental model with several preliminary runs, and assign corresponding values within it to build the hyper-parameter space.

(2) Search the hyper-parameter space using the random search technique, with the search performed for a maximum of 50 iterations.

(3) Train the experimental model with the searched hyper-parameters using five-fold cross-validation, and evaluate its cross-validation performance using the metric MAPE defined in equation (19).

(4) Compare all obtained results and select the hyper-parameters of the best one (i.e. the one with the lowest cross-validation MAPE score), which are then used to train the model on the whole training set.

(5) Test the established model on the test set.
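Steps (2) to (4) above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the folds are taken as contiguous blocks (the paper does not say how the subsets are formed), the `fit`/`predict` callables and the toy "shrunk mean" model with its single hyper-parameter `shrink` are hypothetical stand-ins for training a neural identifier, and only the MAPE metric of equation (19) is used for selection.

```python
import random


def mape(y_true, y_pred):
    """Mean absolute percentage error, as in equation (19)."""
    return sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs with contiguous validation folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def cross_val_mape(fit, predict, inputs, targets, config, k=5):
    """Average validation MAPE over k folds for one hyper-parameter setting."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(targets), k):
        model = fit([inputs[i] for i in train_idx],
                    [targets[i] for i in train_idx], config)
        preds = predict(model, [inputs[i] for i in val_idx])
        scores.append(mape([targets[i] for i in val_idx], preds))
    return sum(scores) / k


def random_search(space, fit, predict, inputs, targets, n_iter=50, seed=0):
    """Sample settings at random; keep the lowest cross-validation MAPE."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_iter):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = cross_val_mape(fit, predict, inputs, targets, config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Toy stand-in model: predicts shrink * mean(training targets).
def fit(train_inputs, train_targets, config):
    return config["shrink"] * sum(train_targets) / len(train_targets)


def predict(model, val_inputs):
    return [model for _ in val_inputs]


inputs = list(range(20))
targets = [5.0 + 0.01 * k for k in range(20)]
space = {"shrink": [0.5, 0.8, 1.0, 1.2]}
best_config, best_score = random_search(space, fit, predict, inputs, targets)
# Step (4) would retrain with best_config on the whole training set.
```

With a real identifier, `space` would hold the tuning ranges of Table 1 and `fit` would train the network for the sampled setting.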

Figure 7 depicts the whole procedure of hyper-parameter optimization for our experimental models. The hyper-parameter information of the experimental models, namely the selection ranges of the hyper-parameters and the optimized values determined by the above procedure, is presented in Table 1.

Figure 7.

Diagram that illustrates the hyper-parameters optimization process.

Table 1.

Hyper-parameters selection ranges and optimized values for deep learning experimental models.

Model type	Hyper-parameter	Tuning range	Optimized value
ARX	Autoregressive order	{1, 2, 4, 6, 8}	6
	Input order	{1, 2, 4, 6, 8}	4
NARX	Hidden layers number	{1, 2, 3, 4}	4
	Hidden units number per layer	{8, 16, 32, 64}	64, 32, 16, 16
	Input/output order	{1, 2, 4, 6, 8}	4
LSTM	Dropout rate	{0.1, 0.15, 0.2, 0.25, 0.3}	0.15
	Memory cell number per layer	{16, 32, 64, 128}	64, 64, 32, 16
	Stacked LSTM layer number	{1, 2, 3, 4}	4
RNN	Hidden layers number	{1, 2, 3, 4}	3
	Hidden units number per layer	{8, 16, 32, 64}	64, 32, 32
CFNN	Hidden layers number	{1, 2, 3, 4}	4
	Hidden units number per layer	{8, 16, 32, 64}	32, 32, 16, 16
	Input/output order	{1, 2, 4, 6, 8}	4
BLSTM	Dropout rate	{0.1, 0.15, 0.2, 0.25, 0.3}	0.2
	Memory cell number per layer	{16, 32, 64, 128}	32, 64, 16
	Stacked LSTM layer number	{1, 2, 3, 4}	3

(1) ARX, CFNN, and NARX respectively denote the autoregressive model with exogenous input, the cascade forward neural network and the nonlinear autoregressive model with exogenous input. (2) For computational simplicity, the input order is set equal to the output order in CFNN and NARX.

Modeling results and discussion

In this part, with the experimental design stated above, all candidate models are tested sequentially. For notational brevity, suppose there exist $m_{0}$ input variables and $m_{L}$ outputs; a family of neural models with $m_{l}$ units at the lth layer is then denoted as $N_{m_{0}, m_{1}, \dots, m_{L}}^{L}$ in the cases below. In the following, a total of five cases are considered.

Case 1: ARX model

Case 2: NARX model

Case 3: RNN model

Case 4: CFNN model

Case 5: LSTM and BLSTM models

Case 1: ARX model

As the most commonly used linear dynamic model, ARX is adopted to identify the FGD process; it also provides a benchmark for comparing model performance in the other cases. According to the order range listed above, a total of 4 ARX models are constructed. It is found that when the autoregressive order and input order are respectively set to 6 and 4, the model has the lowest cross-validation MAPE score and hence is selected. The performance of the selected model on the test set is shown in Figure 8(a), and the resulting MSE and MAPE reach high values of 0.177 and 0.1827, respectively, implying that the ARX model is not a desirable choice.
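To make the roles of the autoregressive order and the input order concrete, the sketch below evaluates a one-step ARX prediction. It assumes the common form $\hat{y}_{t} = \sum_{i = 1}^{n_{a}} a_{i} y_{t - i} + \sum_{j = 1}^{n_{b}} b_{j} u_{t - j}$ with no extra input delay; the paper's exact ARX parameterization is not shown in this section, and the coefficients and signals below are made-up illustrative numbers rather than identified values.

```python
def arx_predict(y_hist, u_hist, a, b):
    """One-step ARX prediction.

    y_hist[-1] is y_{t-1} and u_hist[-1] is u_{t-1}; len(a) is the
    autoregressive order and len(b) is the input order.
    """
    if len(y_hist) < len(a) or len(u_hist) < len(b):
        raise ValueError("history is shorter than the model order")
    ar_part = sum(a_i * y_hist[-i] for i, a_i in enumerate(a, start=1))
    input_part = sum(b_j * u_hist[-j] for j, b_j in enumerate(b, start=1))
    return ar_part + input_part


# Illustrative second-order / first-order case:
# 0.5 * 5.2 + 0.3 * 5.0 + 0.1 * 10.0
y_next = arx_predict(y_hist=[5.0, 5.2], u_hist=[10.0], a=[0.5, 0.3], b=[0.1])
```

The selected model in this case corresponds to `len(a) == 6` and `len(b) == 4`; its coefficients would be obtained by least squares on the training set.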

Figure 8.

Identified results of different models. Red line: model output. Blue line: target. (a) ARX, (b) NARX, (c) RNN, (d) CFNN, (e) LSTM, (f) BLSTM.

Case 2: NARX model

In this case, the NARX model is utilized to perform the identification task, and networks with all possible hyper-parameter combinations are separately trained and evaluated on the validation set. Following the procedure stated in the experimental design part, the model belonging to $N_{8, 64, 32, 16, 16, 1}^{5}$ has the minimum MSE (0.0039) and MAPE (0.0087) scores and is thus selected; its performance on the test set is shown in Figure 8(b). In comparison with the ARX model, the MSE and MAPE scores drop significantly, from 0.177 to 0.0039 and from 0.1827 to 0.0087, respectively.

Case 3: RNN model

To evaluate the RNN model’s identification capability for the FGD process, the same procedure is followed to search for the optimal network architecture. The search result indicates that a high proportion of MSE values range between 6 × 10⁻³ and 7 × 10⁻³, and the optimal architecture corresponds to the minimum among all results (3.2473 × 10⁻³) and is determined as $N_{1, 64, 32, 32, 1}^{4}$ . The test result of the selected model is presented in Figure 8(c), and it is found that the RNN model does not show significant advantages over the NARX model in terms of FGD process identification.

Case 4: CFNN model

In this case, the CFNN model in Alkhasawneh and Tay⁴⁰ is considered. As before, networks with different parameter combinations are trained and evaluated on the validation set. By comparing all results, the model belonging to $N_{8, 32, 32, 16, 16, 1}^{5}$ is chosen, whose MSE value is the minimum and reaches 3.3951 × 10⁻³; Figure 8(d) shows its test result. It is found that the three best selected neural models (i.e. NARX, RNN, and CFNN) perform closely to each other. Since the FGD process is a complicated process with a large time delay, a more efficient method is needed to achieve better performance.

Case 5: LSTM-based models

In this case, the FGD process is identified by the LSTM model and its variant, the BLSTM model. Compared with the identifiers utilized above, LSTM and its variant have a strong capability of capturing long-range temporal dependencies between input and output sequences. With the parameter combinations defined above, a total of 120 candidates are trained and evaluated on the validation set, and eventually the one with 64, 64, 32, and 16 memory cells in its stacked units is chosen, while for the BLSTM model the number of stacked units is 3, with 32, 64, and 16 cells in the respective units. As is clearly seen from Figure 8(e) and (f), the real slurry pH and the model output are almost indistinguishable, suggesting the remarkable identification performance of the two LSTM-based identifiers.

To compare the performance of the shallow network (single-layered architecture) and the deep-layered one (the case where the hidden layer number is equal to or greater than three), Table 2 lists the best selected models in Cases 1–5, along with their MSE/MAPE scores on the training and test sets; average results on the validation set are also included. According to the results shown in Table 2, two conclusions can be drawn: (1) A deep learning-based architecture improves a model’s identification performance. Specifically, all deep learning models have lower MSE and MAPE scores on the test set than their shallow counterparts, and the largest reduction rates for MSE and MAPE reach 81.90% (NARX) and 84.15% (LSTM), respectively. (2) The deep-layered BLSTM model performs best among all experimental models, achieving the lowest MSE value of 4.352 × 10⁻⁴ and the lowest MAPE value of 0.0010. This implies that it has the minimum absolute and relative errors defined in terms of actual and predicted values, and consequently the deep-layered BLSTM is the most suitable model for identifying the studied flue gas desulfurization process.
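The quoted reduction rates can be reproduced from the test-set scores in Table 2, taking the reduction rate as (shallow − deep) / shallow. The short script below is only a worked check of that arithmetic; the dictionary and function names are illustrative.

```python
def reduction_rate(shallow, deep):
    """Relative reduction in percent when moving from shallow to deep."""
    return 100.0 * (shallow - deep) / shallow


# (shallow, deep) test-set scores copied from Table 2.
test_mse = {
    "NARX": (0.021, 0.0038),
    "RNN": (0.015, 0.0031),
    "CFNN": (0.018, 0.0039),
    "LSTM": (0.0031, 0.00078),
    "BLSTM": (0.00091, 0.00043),
}
test_mape = {
    "NARX": (0.024, 0.0087),
    "RNN": (0.019, 0.0079),
    "CFNN": (0.021, 0.0087),
    "LSTM": (0.0082, 0.0013),
    "BLSTM": (0.0043, 0.0010),
}

mse_rates = {name: reduction_rate(*pair) for name, pair in test_mse.items()}
mape_rates = {name: reduction_rate(*pair) for name, pair in test_mape.items()}

best_mse_model = max(mse_rates, key=mse_rates.get)     # NARX, about 81.90%
best_mape_model = max(mape_rates, key=mape_rates.get)  # LSTM, about 84.15%
```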

Table 2.

Comparison results for different shallow/deep models.

Model type	MSE (training)	MSE (testing)	MSE (average)	MAPE (training)	MAPE (testing)	MAPE (average)
Shallow models
NARX ( $N_{8, 64, 1}^{3}$ )	0.0061	0.021	0.058	0.0098	0.024	0.062
RNN ( $N_{1, 64, 1}^{3}$ )	0.0047	0.015	0.054	0.0079	0.019	0.057
CFNN ( $N_{6, 64, 1}^{2}$ )	0.0043	0.018	0.073	0.0065	0.021	0.077
LSTM ( $N_{1, 32, 1}^{2}$ )	0.00095	0.0031	0.0038	0.0014	0.0082	0.0093
BLSTM ( $N_{1, 32, 1}^{2}$ )	0.00039	0.00091	0.0012	0.00083	0.0043	0.0067
ARX	0.1365	0.1775	/	0.1391	0.1827	/
Deep models
NARX ( $N_{8, 64, 32, 16, 16, 1}^{5}$ )	0.00049	0.0038	0.0047	0.00095	0.0087	0.0094
RNN ( $N_{1, 64, 32, 32, 1}^{4}$ )	0.00043	0.0031	0.0042	0.00094	0.0079	0.0091
CFNN ( $N_{8, 32, 32, 16, 16, 1}^{5}$ )	0.00045	0.0039	0.0045	0.00094	0.0087	0.0093
LSTM ( $N_{1, 64, 64, 32, 16, 1}^{5}$ )	0.000049	0.00078	0.00098	0.00020	0.0013	0.0015
BLSTM ( $N_{1, 32, 64, 16, 1}^{4}$ )	0.000048	0.00043	0.00062	0.00019	0.0010	0.0012
ARX	0.1365	0.1775	/	0.1391	0.1827	/

/: there is no validation stage for an ARX model. “Average” denotes the cross-validation results for MSE and MAPE. All results in the table are rounded to two significant figures, except those for ARX.

Rows under “Shallow models”: shallow model case. Rows under “Deep models”: deep model case.

The training objective function (also referred to as the training loss) to be minimized is chosen as the mean squared error. The training loss versus the epoch number for each experimental model is shown in Figure 9, which is a convenient tool for observing how well a model learns during the training process.⁴¹ It is found that all losses decrease rapidly at the early stage of training, then vary little and converge to a small value after some epochs; the main differences among the models lie in the convergence epoch and the convergence value.

Figure 9.

Comparison of learning curves of different neural models.

Table 3 gives details of the above two convergence indicators for the different models. We can clearly observe that LSTM and BLSTM respectively have the minimum convergence epoch and the minimum convergence value. More specifically, it takes 65 epochs for LSTM to converge to 4.926 × 10⁻⁵, while 71 epochs are required for BLSTM to reach a loss of 4.849 × 10⁻⁵.

Table 3.

Comparison of convergence indicators for different models.

Model type	Convergence epoch	Convergence value (10⁻⁴)
NARX	134	4.9751
RNN	132	4.3287
CFNN	90	4.4791
LSTM	65	0.4926
BLSTM	71	0.4849
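The paper does not state how the convergence epoch and convergence value in Table 3 are read off the learning curves. One possible criterion, given here purely as an assumption, is the first epoch from which the loss stays within a small relative tolerance of its value for a fixed number of consecutive epochs. The function name, the window length, the tolerance and the synthetic loss curve below are all hypothetical.

```python
def convergence_epoch(losses, window=5, rel_tol=0.01):
    """Return (epoch, value) for the first 1-based epoch whose loss is followed
    by `window - 1` epochs that stay within `rel_tol` of it, or None."""
    for start in range(len(losses) - window + 1):
        segment = losses[start:start + window]
        ref = segment[0]
        if all(abs(v - ref) <= rel_tol * abs(ref) for v in segment):
            return start + 1, ref
    return None


# Synthetic curve: the loss halves for ten epochs, then stays flat.
losses = [0.5 ** k for k in range(10)] + [0.001] * 10
epoch, value = convergence_epoch(losses)
```

On this synthetic curve the criterion flags the first flat epoch; other reasonable criteria (for example a threshold on the moving average) would give somewhat different epochs on real learning curves.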

Conclusion

In this work we explore the feasibility of applying DL-based techniques to the system identification problem. The FGD process, as a representative non-linear dynamic system with a high level of complexity, is taken as the case study. We first analyze the relationship between DL and system identification, and present the deep forms of commonly used neural identifiers, particularly LSTMs. It is found that LSTMs have structural advantages when used for system identification. Systematic experiments are then performed to evaluate the identification performance of the selected identifiers in an FGD context. According to the experimental results, the MSE and MAPE of the best-performing deep BLSTM model reach 4.352 × 10⁻⁴ and 0.0010, respectively; in comparison with those obtained by the second-best deep LSTM model, the MSE and MAPE scores are decreased by 44.87% and 23.08%, respectively. Additionally, it is concluded that modeling performance is improved significantly when a deep-layered structure is introduced, and that LSTM/BLSTM is particularly suited to identifying the FGD process, since it benefits from rapid learning speed and high identification accuracy. The integration of advanced DL networks and system identification proves successful in this paper. In future investigations, a further step will be made by incorporating the controller design part.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study is supported by the National Natural Science Foundation of China (62373012 and 62303025).

ORCID iD

Quanbo Liu

Data availability statement

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

References

1. Zhang, Gao, Liu, et al. Dynamic prediction of in-situ SO2 emission and operation optimization of combined desulfurization system of 300 MW CFB boiler. Fuel 2022; 324: 124435.

2. Gage, Rochelle GT. Limestone dissolution in flue gas scrubbing: effect of sulfite. J Air Waste Manag Assoc 1992; 42(7): 926–935.

3. Brogren, Karlsson HT. A model for prediction of limestone dissolution in wet flue gas desulfurization applications. Ind Eng Chem Res 1997; 36(9): 3889–3897.

4. Eden, Luckas. A heat and mass transfer model for the simulation of the wet limestone flue gas scrubbing process. Chem Eng Technol 1998; 21(1): 56–60.

5. Wang, Zhao, et al. Wet flue gas desulfurization using micro vortex flow scrubber: characteristics, modeling and simulation. Sep Purif Technol 2020; 247: 116928.

6. Zhang, Lang, Wang, et al. Chemical mass transfer mechanism and characteristics of flue gas desulfurization of basic aluminum sulfate by bubbles. Energy Fuels 2017; 31: 11043–11052.

7. et al. Mass transfer process intensification for SO2 absorption in a commercial-scale wet flue gas desulfurization scrubber. Chem Eng Process Process Intensification 2021; 166: 108478–108491.

8. Flagiello, Erto, Lancia, et al. Experimental and modelling analysis of seawater scrubbers for sulphur dioxide removal from flue-gas. Fuel 2018; 214: 254–263.

9. Zhao, Zhang, Gao, et al. Simulation of SO2 absorption and performance enhancement of wet flue gas desulfurization system. Process Saf Environ Prot 2021; 150: 453–463.

10. Cui, Song, et al. Energy conservation and efficiency improvement by coupling wet flue gas desulfurization with condensation desulfurization. Fuel 2021; 285: 119221.

11. Hou, Bai, Yin. On-line monitoring and optimization of performance indexes for limestone wet desulfurization technology. Appl Mech Mater 2013; 295-298: 1020–1028.

12. Guo, Zheng, et al. Modeling and optimization of wet flue gas desulfurization system based on a hybrid modeling method. J Air Waste Manag Assoc 2019; 69: 565–575.

13. Villanueva Perales, Gutiérrez Ortiz, Vidal Barrero, et al. Using neural networks to address nonlinear pH control in wet limestone flue gas desulfurization plants. Ind Eng Chem Res 2010; 49: 2263–2272.

14. Ren, Sun, Deng. Modeling and optimization research of CFB-FGD based on improved genetic algorithms and BP neural network. Adv Mater Res 2012; 610–613: 1601–1614.

15. Yang, Zhong, Sun, et al. Dynamic optimization oriented modeling and nonlinear model predictive control of the wet limestone FGD system. Chin J Chem Eng 2020; 28: 832–845.

16. Liu, Yang, Sun. Multi-objective economic model predictive control of wet limestone flue gas desulfurisation system. Process Saf Environ Prot 2021; 150: 269–280.

17. Córdoba. Status of flue gas desulphurisation (FGD) systems from coal-fired power plants: overview of the physic-chemical control processes of wet limestone FGDs. Fuel 2015; 144: 274–286.

18. Kumar, Jana SK. Advances in absorbents and techniques used in wet and dry FGD: a critical review. Rev Chem Eng 2022; 38: 843–880.

19. Narendra, Mukhopadhyay. Adaptive control using neural networks and approximate models. IEEE Trans Neural Netw 1997; 8: 475–485.

20. Seng, Zhang, et al. Spatiotemporal prediction of air quality based on LSTM neural network. Alex Eng J 2021; 60(2): 2021–2032.

21. Mohammed, Kora. A comprehensive review on ensemble deep learning: opportunities and challenges. J King Saud Univ - Comput Inf Sci 2023; 35(2): 757–774.

22. Prochazka, Kingsbury, Payner, et al. Signal analysis and prediction. Berlin: Springer Science & Business Media, 2013.

23. Juang JN. Applied system identification. Hoboken, NJ: Prentice-Hall, Inc, 1994.

24. Hong, Mitchell, Chen, et al. Model selection approaches for non-linear system identification: a review. Int J Syst Sci 2008; 39: 925–946.

25. Kowsari, Jafari Meimandi, Heidarysafa, et al. Text classification algorithms: a survey. Information 2019; 10(4): 150.

26. Grill-Spector, Kanwisher. Visual recognition: as soon as you know it is there, you know what it is. Psychol Sci 2005; 16: 152–160.

27. Bengio, Simard, Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 1994; 5: 157–166.

28. Hochreiter, Schmidhuber. Long short-term memory. Neural Comput 1997; 9: 1735–1780.

29. Gers, Schmidhuber, Cummins. Learning to forget: continual prediction with LSTM. Neural Comput 2000; 12: 2451–2471.

30. Schuster, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Process 1997; 45: 2673–2681.

31. Graves, Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw 2005; 18: 602–610.

32. Miao, Jiang, et al. Intelligent control model and its simulation of flue temperature in coke oven. Discrete Contin Dyn Syst Ser S 2015; 8: 1223–1237.

33. Di Capaci, Scali. Performance improvements of cascade and feed-forward control schemes for industrial processes. Chem Eng Trans 2017; 57: 985–990.

34. Fan, et al. Soft sensing of SO2 emission for ultra-low emission coal-fired power plant with dynamic model and segmentation model. Fuel 2023; 332: 125921–125933.

35. Shen, Zhang. Study on nonlinear pH control strategy based on external recurrent neural network. Procedia Eng 2011; 15: 866–871.

36. Khaire, Dhanalakshmi. High-dimensional microarray dataset classification using an improved Adam optimizer (iAdam). J Ambient Intell Humaniz Comput 2020; 11: 5187–5204.

37. Mantovani, Rossi, Vanschoren, et al. Effectiveness of random search in SVM hyper-parameter tuning. In: International joint conference on neural networks (IJCNN), 2015. Killarney: IEEE.

38. Browne MW. Cross-validation methods. J Math Psychol 2000; 44: 108–132.

39. Kim. A new metric of absolute percentage error for intermittent demand forecasts. Int J Forecast 2016; 32(3): 669–679.

40. Alkhasawneh, Tay LT. A hybrid intelligent system integrating the cascade forward neural network with Elman neural network. Arab J Sci Eng 2018; 43: 6737–6749.

41. Lara-Benítez, Carranza-García, Luna-Romera, et al. Temporal convolutional networks applied to energy-related time series forecasting. Appl Sci 2020; 10: 2322–2338.

Dynamic learning of flue gas desulfurization process using deep LSTMs neural network
