Abstract
The application of deep reinforcement learning to control the emissions of a compression ignition diesel engine is investigated. The main purpose of this study is to reduce the engine-out nitrogen oxide (NOx) emissions and fuel consumption while maintaining the engine load.
Keywords
Introduction
Heavy-duty and medium-duty diesel engines are commonly used for public transportation and goods delivery. Their high combustion and fuel conversion efficiency (especially at full load), along with their long lifetime and durability, have made diesel engines widespread across a wide range of transportation applications.1,2 Despite these advantages, diesel engines contribute significantly to air pollution worldwide. Although hybridization and electrification are gaining market share for passenger vehicles, this shift is expected to occur more slowly for heavy-duty applications due to limited battery range, high battery costs, and increased total cost of ownership.3 Therefore, strategies to minimize the environmental impact of diesel engine emissions are still needed.
Traditionally, engine control units (ECUs) use a feedforward controller based on two-dimensional look-up tables, also known as calibration maps, that are generated on a test bench to ensure the engine performs optimally while meeting power demand and maintaining fuel economy and durability. Engine emissions are also evaluated during the creation of these calibration maps, making the calibration complicated and time-consuming. These tables must then be further tested in real driving conditions to meet new real driving emission legislation.1 The use of a feedback controller, especially a model-based optimal controller, is a promising way to reduce the ever-increasing calibration effort. Model-based methods such as the linear quadratic regulator (LQR),4 sliding mode control (SMC),5,6 adaptive control,7,8 and model predictive control (MPC)9–11 have been previously investigated for engine applications. The two main drawbacks of these model-based controllers are their sensitivity to model accuracy and the required runtime, especially for online optimization. There is often a trade-off between the two, as improving model accuracy requires increased model complexity, and these complex models usually exhibit nonlinear behavior requiring a more complicated control law such as a nonlinear model predictive controller (NMPC).12 The alternative to a model-based controller is a model-free controller. Reinforcement learning (RL)13 is a powerful method for generating optimal actions without requiring a model. Another well-known model-free controller in control theory is the iterative learning controller (ILC).14 As ILC is also a model-free, learning-based control strategy, it will be compared with both RL and a model-based, state-of-the-art deep learning–based MPC.
Machine learning (ML) is a powerful tool that has been used to address various engineering problems and has been shown to be particularly useful in control engineering, especially when deriving an exact system model is difficult.15 In general, supervised learning, unsupervised learning, and RL are the main categories of ML. Unlike supervised and unsupervised learning, which operate on a static data set, RL works with dynamic data.16 The main goal of RL is to generate the optimal outcome by finding the best sequence of actions. Unlike classical ML, RL uses an agent to explore, interact with, and learn from the defined system environment. The RL agent learns by receiving the environment observation and reward and generating a sequence of actions to reach a specific goal. RL has a structure similar to control theory: the goal is to determine the inputs to a system that generate the desired system behavior. In RL terms, the controller is the policy, the actuator commands are the actions, and the plant is the environment. Just as a controller is tuned using a tuning algorithm or adaptation law, the RL policy is updated based on the RL algorithm.17 The RL algorithm can be either model-free or model-based; because of the model requirement, model-free algorithms have been the main focus in engineering applications.16,17 One common model-free RL algorithm is Q-learning. In Q-learning, the value of an action for a particular state is learned, and the optimal policy is found by maximizing the expected value (Q-value) of the total reward.13
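To make the Q-learning idea above concrete, the following is a minimal tabular Q-learning sketch in Python. The toy environment, learning rate, and discount factor are illustrative assumptions and are unrelated to the engine problem studied here.

```python
# Minimal tabular Q-learning sketch: learn Q(s, a) and act greedily on the learned values.
# The 1-D toy environment, rates, and reward are illustrative assumptions only.
import numpy as np

n_states, n_actions = 5, 2            # toy chain environment: move left (0) or right (1)
alpha, gamma, eps = 0.1, 0.9, 0.2     # learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    """Toy dynamics: reaching the right end of the chain gives a reward of 1."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for episode in range(500):
    s = 0
    for _ in range(20):
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Q-learning update: move Q(s,a) toward the reward plus the discounted best next value.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))           # learned greedy policy: expected to move right everywhere
```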
When an agent always selects the action with the highest expected reward without further exploring the environment, it follows a greedy policy. In continuous action spaces, computing the greedy action at every time step is extremely slow, so Q-learning cannot easily be applied to continuous action systems. However, an actor-critic method based on the deterministic policy gradient (DPG) algorithm is a suitable choice for a system with a continuous action space.18 The DPG learning procedure is robust and stable because of off-policy network training: it draws samples from a replay buffer (a finite-size cache that stores previous samples from the environment), which reduces the correlation between samples.19 Off-policy learning is independent of the agent's actions and determines the optimal policy regardless of the agent's motivation. In contrast with on-policy learning, where the agent learns about the policy used to generate the data, off-policy learning estimates the reward of future actions and assigns value to the new state without following any greedy policy.13 The deep deterministic policy gradient (DDPG) agent is a model-free, off-policy RL algorithm in which an actor-critic RL agent calculates an optimal policy by maximizing the long-term reward. One of the differences between DDPG and DPG is that DDPG uses a deep neural network (DNN) as a function approximator to learn over large state and action spaces.19
Using DNNs in RL is referred to as deep reinforcement learning (deep RL) and has allowed a wide variety of complicated decision-making tasks that were previously infeasible to be solved.13 For example, deep RL is of interest in applications such as robotics and autonomous driving.20 Earlier RL algorithms had challenges in the design of feature selection. In contrast, deep RL has been able to successfully handle complicated tasks even when only a limited amount of prior information is available. This is possible because of the deep RL capability to learn various levels of abstraction from data.20–23 Deep RL has also been used in computer science for many applications.13 The use of deep RL in real-world applications, especially in engineering, has started to increase in recent years. Deep RL has been successfully used for control of an unmanned aerial vehicle,24 a quadrotor system,25 autonomous vehicles,26,27 wind farm control,28 torque distribution of electric vehicles,29 and robotic applications.30,31
RL has been used in automotive powertrain control, especially in the energy management of hybrid electric vehicles32–34 and for internal combustion engines.35–40 Q-learning has been used for idle speed control of a spark-ignition (SI) engine by controlling the spark timing and intake throttle valve position.41 Similar studies have been carried out for diesel engine idle speed control by controlling the fuel injection timing.36 RL has also been used for emission control of SI engines.37,38 Only a very limited number of studies have applied RL to internal combustion engine control, and most of the existing work has focused on SI engines. To the authors' knowledge, deep RL algorithms have not been previously implemented for diesel engine performance and emission control. Safety concerns and constraint violations of pure learning controllers in highly complex systems such as internal combustion engines have hindered the development of these learning controllers. Fortunately, recent studies have addressed output constraint enforcement in learning-based controllers using a safe learning filter. This method enforces the output constraints and provides a way to implement safe RL.42–45 To implement the safety filter, a simplified optimization-based method to enforce output constraints is used here. Instead of an MPC-based filter, a single-step online optimization is solved with a quadratic programming (QP) solver during training of the RL agent, in which the safe control action minimizes the deviation from the RL-generated control action subject to the constraints. The RL agent then learns the constraints through the RL algorithm and prior knowledge of the system constraints.
Although RL is now receiving attention from the control system community, a learning controller is not a new concept.14,46 One of the well-known learning-based controllers is ILC, which is used to improve the tracking performance of a system in the presence of repetitive inputs or disturbances.47–49 ILC was first introduced in 1984 by Arimoto et al.14 and has since been used for various control problems. ILC has a simple structure, is computationally efficient for real-time applications, and can have stability guarantees. Different types of ILC have been implemented for internal combustion engine control, including SI engine load control,50,51 dual-fuel control of a homogeneous charge compression ignition (HCCI) engine,52 SI engine speed and air-to-fuel ratio control,53 parameter optimization in a turbocharged SI engine,54 variable injection rate control for compression ignition (CI) engines,55 and diesel engine control.
Safe learning in the context of deep RL applied to diesel engine emission control is not available in the literature. Therefore, a deep RL controller with and without a safety filter is designed and compared to address this gap. To compare RL with ILC, an ILC and a safe ILC are also designed. Additionally, RL is compared with a deep recurrent neural network–based nonlinear model predictive controller developed in our previous study.59 The main contributions of this article are as follows:
Design of a deep RL controller for diesel engine performance and emission control;
Design of a safe filter that provides safe RL for diesel engine emission control;
Comparison with a classical learning-based controller, ILC, and a long short-term memory–based nonlinear model predictive controller (LSTM-NMPC59).
This article is organized into six sections. The first section provides an introduction, literature review, and the main contributions of this article. In the “Engine simulation model” section, the experimental setup and detailed physics-based modeling are explained. The main methodology of the safe deep RL is discussed in section “Deep RL.” Details regarding the development of the ILC are explained in section “ILC.” The “Results and discussions” section illustrates the performance of the designed controllers and provides a comparison between them. Finally, the main conclusions of this article are summarized in section “Summary and conclusion.”
Engine simulation model
This study uses a 4.5-L diesel engine manufactured by Cummins, located in the advanced internal combustion engine laboratory at the University of Alberta, Canada. The main specifications of this engine are presented in Table 1.
Engine specifications.
To train the deep RL used in this study and to compare it with NMPC and ILC, a detailed physical model (DPM) was developed in GT-Power software and validated using experimental data in our previous studies.60,61 The DPM includes several physical and chemical sub-models for simulating the combustion phenomena and gas exchange process of diesel combustion. The DPM was calibrated using experimental in-cylinder pressure, injection timing, and intake air mass flow and temperature over a wide engine operating range. Optimal parameter values were determined using a genetic algorithm (GA). Additional details of the DPM development and structure are presented in the authors' previous works.60,61
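The following is a minimal sketch of this kind of GA-style calibration: model parameters are optimized to minimize the error between simulated and measured in-cylinder pressure. The toy pressure model, parameter bounds, and the use of SciPy's differential evolution as a stand-in for the GT-Power GA are illustrative assumptions, not the actual calibration setup.

```python
# Sketch of GA-style calibration: fit combustion model parameters by minimizing the RMS
# error between simulated and measured in-cylinder pressure. The parametric pressure
# curve is a toy stand-in for the GT-Power model; all values are illustrative only.
import numpy as np
from scipy.optimize import differential_evolution

theta = np.linspace(-180.0, 180.0, 720)          # crank angle [deg CA]

def simulate_cycle(params):
    """Toy stand-in for the engine model: a parametric in-cylinder pressure curve [bar]."""
    amplitude, duration, phasing = params
    return 30.0 + amplitude * np.exp(-((theta - phasing) / duration) ** 2)

p_meas = simulate_cycle([60.0, 25.0, 8.0])        # surrogate "measured" pressure trace

def pressure_error(params):
    """RMS error between simulated and measured in-cylinder pressure."""
    return np.sqrt(np.mean((simulate_cycle(params) - p_meas) ** 2))

bounds = [(20.0, 100.0),   # peak-pressure-related parameter
          (10.0, 50.0),    # combustion-duration-related parameter [deg CA]
          (-10.0, 20.0)]   # combustion-phasing-related parameter [deg CA]

result = differential_evolution(pressure_error, bounds, maxiter=50, popsize=20, seed=0)
print("calibrated parameters:", np.round(result.x, 2))
```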
The developed DPM reproduces the experimental in-cylinder pressure over the entire engine cycle (see Figures 5 and 6 in the work by Shahpouri et al.60), with small maximum errors in in-cylinder pressure and intake manifold pressure.
Input, output, and states of engine simulation model (ESM) to simulate engine torque
Deep RL
DDPG agent algorithm
A DDPG agent is used to minimize the engine-out emissions and fuel consumption while maintaining the same load. DDPG is a model-free, off-policy RL algorithm in which an actor-critic RL agent calculates an optimal policy by maximizing the long-term reward. When a DNN is used as the function approximator, the DPG algorithm is referred to as deep DPG (DDPG). The pseudocode of DDPG is shown in Algorithm 1.19 During training, the actor and critic are updated by the DDPG algorithm at each sample time, and the agent stores past experiences in an experience buffer. The actor and critic are then updated using a mini-batch of experiences randomly sampled from the buffer. In addition, the action selected by the policy is perturbed with a stochastic noise model at each training step.17
In the DDPG algorithm (Algorithm 1), first, a copy of the actor and critic networks is created; these target networks are updated slowly toward the learned networks to stabilize training.
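As a concrete illustration of the ingredients in Algorithm 1 (target network copies, replay buffer, mini-batch updates, soft target updates, and exploration noise), the following is a minimal PyTorch sketch of the core DDPG update. The network sizes, toy environment, and hyperparameters are illustrative assumptions and do not reproduce the engine setup used in this study.

```python
# Minimal DDPG sketch: target copies of actor and critic, a replay buffer, mini-batch
# updates, soft target updates, and exploration noise. Dimensions, the toy environment,
# and hyperparameters are illustrative assumptions only.
import copy, random
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 4, 3, 0.99, 0.001

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 64), nn.ReLU(), nn.Linear(64, out))

actor, critic = mlp(obs_dim, act_dim), mlp(obs_dim + act_dim, 1)
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)      # target network copies
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-4)
buffer = []                                                           # replay (experience) buffer

def update(batch):
    s, a, r, s2 = (torch.stack(x) for x in zip(*batch))
    with torch.no_grad():                                             # bootstrapped target Q-value
        y = r + gamma * critic_t(torch.cat([s2, actor_t(s2)], dim=-1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=-1)), y)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()     # deterministic policy gradient
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    for net, net_t in ((actor, actor_t), (critic, critic_t)):         # soft (smoothed) target update
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

def env_step(s, a):
    """Toy environment: random-walk state, reward penalizing large actions."""
    return s + 0.01 * torch.randn(obs_dim), -(a ** 2).sum().reshape(1)

s = torch.zeros(obs_dim)
for _ in range(200):
    a = actor(s).detach() + 0.1 * torch.randn(act_dim)                # exploration noise on the action
    s2, r = env_step(s, a)
    buffer.append((s, a, r, s2))
    if len(buffer) >= 64:
        update(random.sample(buffer, 64))                             # mini-batch from the buffer
    s = s2
```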
Safe DDPG
Despite all the advantages of deep RL, it relies on exploratory interaction with the environment (here the ESM), and this exploration can violate output constraints during learning. To enforce output constraints, the following optimization-based filter is added to the DDPG algorithm.
The safety filter solves a single-step quadratic program at every time step: the safe control action minimizes its deviation from the RL-generated control action subject to the output constraints, which are expressed through the linearized system matrices and the constrained output. Substituting the system matrices (equation (4)) into equation (6) expresses the output constraint as an explicit bound on the control action.
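A minimal sketch of such a single-step QP safety filter is shown below, using cvxpy as the QP solver. The sensitivity row, offset, and actuator limits are illustrative assumptions rather than the identified engine model used in this work; the 500 ppm NOx bound matches the limit referenced later in the results.

```python
# Sketch of a single-step QP safety filter: the applied action stays as close as possible
# to the RL action while a linearized constrained output (e.g. NOx) respects its bound.
# The sensitivity row C, offset d, and actuator limits are illustrative assumptions.
import numpy as np
import cvxpy as cp

n_u = 3                                            # actions, e.g. fuel quantity, SOI, air-path actuator
C = np.array([[2.0, -1.0, 0.5]])                   # assumed output sensitivity to the action
d = np.array([400.0])                              # assumed output offset at the operating point [ppm]
y_max = np.array([500.0])                          # NOx upper bound [ppm]
u_lo = np.array([0.0, -10.0, 0.0])                 # assumed lower actuator limits
u_hi = np.array([100.0, 10.0, 1.0])                # assumed upper actuator limits

def safety_filter(u_rl):
    """Project the RL-proposed action onto the constraint set (single-step QP)."""
    u = cp.Variable(n_u)
    objective = cp.Minimize(cp.sum_squares(u - u_rl))    # minimal deviation from the RL action
    constraints = [C @ u + d <= y_max,                    # constrained output below its bound
                   u_lo <= u, u <= u_hi]                  # actuator limits
    cp.Problem(objective, constraints).solve()
    return u.value

u_rl = np.array([80.0, 5.0, 0.7])                   # action proposed by the RL agent
print("RL action:  ", u_rl)
print("safe action:", np.round(safety_filter(u_rl), 2))
```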
To simplify the control problem, the pre-injection is kept constant at 9 mg, injected 8 °CA before the main injection.
The upper bound of the constrained engine-out NOx is set to 500 ppm.
A schematic of safe DDPG for minimizing diesel engine emissions and fuel consumption while maintaining load is shown in Figure 2, which also illustrates the states of the system provided to the DDPG algorithm.
Safe deep deterministic policy gradient schematic to minimize diesel engine fuel consumption and reduce NOx emissions.
To achieve the control objective, the reward is constructed from the output torque error, its derivatives, the FQ, and the engine-out NOx emissions.
Figure 2 shows the network structure, where the actor has nine fully connected layers (FCs) with a layer size of 64. The critic has 12 FCs with the same layer size (64) as the actor in each layer. The activation function of both the critic and actor FCs is the rectified linear unit (ReLU), except for the output layers (FC12 in the critic and FC9 in the actor). A scaling layer is used in the output layers to standardize the output values. To train both the DDPG and safe DDPG, a mini-batch size of 64 and a smoothing factor of 0.001 are used. For training the RL, the Adam optimizer with a learning rate of 0.0001 is used. A noise model is implemented with variances of 5.66, 0.42, and 0.01 for
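The stated architecture can be sketched as follows in PyTorch (nine FC layers in the actor, 12 in the critic, width 64, ReLU activations, and a scaling layer at the actor output). The input and output dimensions, the use of tanh before scaling, and the action scaling values are illustrative assumptions.

```python
# Sketch of actor and critic networks with the layer counts stated above: nine FC layers
# in the actor, 12 in the critic, width 64, ReLU activations. Dimensions and scaling
# values are illustrative assumptions.
import torch
import torch.nn as nn

def fc_stack(in_dim, n_layers, width=64):
    """n_layers fully connected layers of the given width with ReLU activations."""
    layers, d = [], in_dim
    for _ in range(n_layers):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    return nn.Sequential(*layers), d

class Actor(nn.Module):
    def __init__(self, obs_dim=4, act_dim=3, act_scale=(50.0, 5.0, 0.5)):
        super().__init__()
        self.hidden, d = fc_stack(obs_dim, 8)          # FC1-FC8
        self.out = nn.Linear(d, act_dim)               # FC9 (output layer)
        self.register_buffer("scale", torch.tensor(act_scale))

    def forward(self, obs):
        # tanh followed by scaling (an assumption for how the output scaling layer works)
        return torch.tanh(self.out(self.hidden(obs))) * self.scale

class Critic(nn.Module):
    def __init__(self, obs_dim=4, act_dim=3):
        super().__init__()
        self.hidden, d = fc_stack(obs_dim + act_dim, 11)  # FC1-FC11
        self.out = nn.Linear(d, 1)                        # FC12 (Q-value output)

    def forward(self, obs, act):
        return self.out(self.hidden(torch.cat([obs, act], dim=-1)))

actor, critic = Actor(), Critic()
q = critic(torch.zeros(1, 4), actor(torch.zeros(1, 4)))   # single forward-pass check
```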
Safe RL versus RL
In this study, two agents have been developed: a traditional DDPG implementation, called RL, and a DDPG with a safety filter to constrain the output, called safe RL. In both agents, the structure of the actor and critic is kept the same. The episodic reward that the agent receives versus the episode number is shown in Figure 3. A 40 s simulation (500 engine cycles) with a random load request,

Episodic reward versus episode for safe RL and RL.
The comparison between the selected agents for both the safe RL and RL is presented in Figure 4. As shown, regardless of the training process, both agents are capable of maintaining the load while minimizing NOx emissions and fuel consumption.

Safe RL versus RL: comparison between the two agents that reach the maximum reward for safe RL (agent 3189 in Figure 3) and RL (agent 1571 in Figure 3) at an engine speed of 1500 r/min. (a) engine-out
As shown in Figures 4 and 5, the NOx emissions spike following each step. This is a result of the SOI oscillating at the beginning of each step and briefly advancing combustion for a couple of engine cycles. Therefore, in those engine cycles, an increased amount of NOx is formed.

RL during training: comparison between an agent in the middle of training (agent 947 in Figure 3) and the agent that reaches the maximum reward (agent 1571 in Figure 3) at an engine speed of 1500 r/min. (a) engine-out
The two final selected agents perform well; however, a more interesting comparison can be made during the training of the agents. Figure 5 shows the two agents of the RL during training. These agents are also presented in Figure 3. One agent is in the middle of the training process at episode 924 and the other is the final agent that has reached the maximum reward at episode 1571. The oscillation observed from the controller during the early stages of training (episode 924) is due to the white noise used to excite the system to allow for increased learning. When compared to agent 924, the fully trained agent 1571 is significantly better at observing all constraints. For the
ILC
One of the fast learning–based controllers that shares common elements with RL is ILC. ILC has a simpler structure than RL, as its control law update includes two main filters and can be written in the standard first-order form

$$u_{j+1}(k) = Q(q)\left[u_j(k) + L(q)\,e_j(k)\right]$$

where $u_j(k)$ is the control input at sample $k$ of iteration $j$, $e_j(k)$ is the tracking error, $Q(q)$ is the Q-filter, and $L(q)$ is the learning function.

Safe iterative learning control block diagram.
For implementation purposes, this is slightly different compared to RL. Because of the repetitive input requirement, a repetitive reference is implemented, and the error between the actual state and the reference is provided to the ILC. The error can be defined as

$$e_j(k) = r(k) - y_j(k)$$

where $r(k)$ is the repetitive reference and $y_j(k)$ is the measured output at sample $k$ of iteration $j$.
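A minimal sketch of one such ILC update in Python is shown below. The toy plant, scalar learning gain, and first-order Q-filter are illustrative stand-ins for the filters used in this work, not the actual engine controller.

```python
# Minimal sketch of the ILC iteration update u_{j+1} = Q(u_j + L * e_j), using a simple
# low-pass Q-filter and a proportional learning gain; plant, gains, and reference are
# illustrative assumptions only.
import numpy as np

N = 300                                   # samples per iteration (one repetition of the reference)
ref = 200 + 50 * (np.arange(N) > 150)     # repetitive torque reference [Nm], toy step profile
plant = lambda u: 0.9 * u + 5.0           # toy static plant mapping input to output torque

def q_filter(x, alpha=0.3):
    """First-order low-pass Q-filter for robustness against non-repeating disturbances."""
    y = np.zeros_like(x, dtype=float)
    for k in range(len(x)):
        y[k] = alpha * x[k] + (1 - alpha) * (y[k - 1] if k else x[0])
    return y

u = np.zeros(N)                           # initial control input
L_gain = 0.8                              # learning gain (the L-filter reduced to a scalar)
for j in range(20):                       # one ILC iteration per repetition of the reference
    y = plant(u)                          # apply the current input over the whole iteration
    e = ref - y                           # tracking error for this iteration
    u = q_filter(u + L_gain * e)          # ILC update with Q-filtering
    print(f"iteration {j}: RMS error = {np.sqrt(np.mean(e ** 2)):.2f}")
```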

Training ILC and safe ILC at engine speed of 1500 r/min: reference is repeated every 300 cycles and 46 ILC iterations are shown: (a) engine-out
Results and discussions
In this section, the two developed controllers, safe RL and safe ILC, are compared to a previously developed LSTM-NMPC. The NMPC controller previously developed in the work by Norouzi et al.59 is used to compare the RL controllers developed here with an MPC controller. All developed controllers are compared to a Cummins-calibrated ECU, which is modeled in the simulation environment with the DPM and denoted “Benchmark (BM).”
The comparison between the RL, LSTM-NMPC, and BM controllers is presented in Figure 8. Both controllers solve a similar optimization problem, but the RL reward function differs slightly from the LSTM-NMPC cost function. The LSTM-NMPC cost function penalizes the torque tracking error together with the NOx emissions and fuel consumption over the prediction horizon; its exact formulation and variable definitions are given in the work by Norouzi et al.59
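As a point of reference, a generic quadratic NMPC cost of this kind can be written as follows; the terms and weights are illustrative and are not necessarily the exact formulation used in the LSTM-NMPC of Norouzi et al.59

$$J = \sum_{k=1}^{N_p} \left[ w_T \left(T_{\mathrm{ref}}(k) - T(k)\right)^2 + w_{\mathrm{NO}_x}\,\mathrm{NO}_x(k)^2 + w_{\mathrm{FQ}}\,\mathrm{FQ}(k)^2 + w_{\Delta u}\,\lVert \Delta u(k) \rVert^2 \right]$$

where $N_p$ is the prediction horizon, $T$ is the engine output torque, $\Delta u(k)$ is the change in the control inputs, and the $w_{(\cdot)}$ are tuning weights.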

Safe reinforcement learning compared with the long short-term memory–based nonlinear model predictive controller (LSTM-NMPC)59 and the Cummins-calibrated ECU modeled in GT-Power at an engine speed of 1500 r/min: (a) engine-out
As shown in Figure 8, the safe RL is capable of accurately tracking the output torque with similar performance to the LSTM-NMPC. Both controllers outperform the BM feedforward production controller. Here the safe RL controller suffers from slightly increased overshoot when compared to the NMPC.
The controllers maintain
Comparing the controllers, the values of cumulative
Comparison between safe RL, Benchmark (BM), and nonlinear model predictive control. 59
FQ: fuel quantity; RL: reinforcement learning; LSTM-NMPC: long-short-term memory-based nonlinear model predictive controller.
Per engine cycle of simulation.
Average acados execution time.
As shown, RL has significantly lower
The safe RL controller performs comparably to the NMPC; however, it is also of interest to compare it with another learning control strategy such as ILC. The developed safe RL controller is compared to safe ILC and the BM in Figure 9. As shown, both learning controllers are capable of tracking the desired output torque with similar performance to the BM. ILC tracks the reference more closely than the safe RL controller. The ILC tracking performance is almost perfect with very little overshoot, which is one of the benefits of ILC, since the repetitive input requirement allows the ILC to learn by repetition. The RL controller suffers from slight torque overshoot, but its performance is still acceptable.

Safe reinforcement learning compared with safe ILC and the Cummins-calibrated ECU modeled in GT-Power at an engine speed of 1500 r/min: (a) engine-out
All the controllers tested were able to remain below the defined 500 ppm
The controller performance results and the values of cumulative
Comparison between safe RL, BM, and ILC.
FQ: fuel quantity; BM: Benchmark; RL: reinforcement learning; ILC: iterative learning controller.
Per engine cycle of simulation.
As shown, both RL and ILC are able to reduce
Summary and conclusion
A deep RL–based controller is developed to minimize the engine-out NOx emissions and fuel consumption of a diesel engine while maintaining the load.
The learning-based controllers with safety filter are compared to their standard versions to better understand the effect of adding a safety filter. It was found that for deep RL, both the safe and standard controllers result in almost the same controller performance once training is completed. Even the standard RL is able to learn to enforce the output constraints. However, during training, there are large violations of the constraints suggesting that using safe learning is crucial when working with a real engineering system for real-time learning. For ILC, the safety filter implementation shows a significant effect during both training and final controller performance. This suggests that ILC requires a safety filter to enforce output constraints.
The safe RL is then compared to safe ILC to evaluate which controller has better performance, as they share a similar learning-based control approach. This comparison shows that the deployment (execution) time of ILC is two orders of magnitude shorter than that of RL and that ILC has the ability to take advantage of online learning. Although ILC has 4% better torque tracking and a 16 ppm lower average
To compare the safe RL to a state-of-the-art controller, a comparison is made to a model-based LSTM-NMPC.59
This comparison shows that the deep RL is capable of reducing the average
Summary of comparison for developed controllers.
RL: reinforcement learning; ILC: iterative learning controller; LSTM-NMPC: long-short-term memory-based nonlinear model predictive controller.
Summary of comparison for developed controllers—controller performance compared to Benchmark.
RL: reinforcement learning; ILC: iterative learning controller; LSTM-NMPC: long-short-term memory-based nonlinear model predictive controller.
A range is reported for safe RL because it is compared with the BM twice, using both a repetitive and a random reference with different reference trajectories.
Reduction calculated relative to BM.
The application of safe learning–based control is demonstrated in simulation; however, next-generation AI-powered engine controllers based on these methods will require extensive real-time data. Implementing either of these model-free, learning-based controllers in real time requires detailed testing on real hardware. Future work includes testing these methods on the engine in real time.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by Natural Sciences and Engineering Research Council of Canada (NSERC) grant numbers 2016-04646, and 2020-04403, and Canada First Research Excellence Fund (CFREF) grant number T01-P04.
