
Abstract

In biopharmaceutical manufacturing, fermentation processes play a critical role in productivity and profit. A fermentation process uses living cells with complex biological mechanisms, leading to high variability in the process outputs, namely, the protein and impurity levels. By building on the biological mechanisms of protein and impurity growth, we introduce a stochastic model to characterize the accumulation of the protein and impurity levels in the fermentation process. However, a common challenge in the industry is the availability of only a very limited amount of data, especially in the development and early stages of production. This adds an additional layer of uncertainty, referred to as model risk, due to the difficulty of estimating the model parameters with limited data. In this article, we study the harvesting decision for a fermentation process (i.e., when to stop the fermentation and collect the production reward) under model risk. We adopt a Bayesian approach to update the unknown parameters of the growth-rate distributions, and use the resulting posterior distributions to characterize the impact of model risk on fermentation output variability. The harvesting problem is formulated as a Markov decision process model with knowledge states that summarize the posterior distributions and hence incorporate the model risk in decision-making. Our case studies at MSD Animal Health demonstrate that the proposed model and solution approach improve the harvesting decisions in real life by achieving substantially higher average output from a fermentation batch along with lower batch-to-batch variability.

Keywords

Biomanufacturing, model uncertainty, Bayesian reinforcement learning, optimal stopping problem, data-driven stochastic optimization

1. Introduction

The biomanufacturing industry has developed several innovative treatments for cancer, adult blindness, and COVID-19 among many other diseases. Despite its increasing success, biomanufacturing is a challenging production environment. Different from classical pharmaceutical manufacturing, biomanufacturing methods use living organisms (e.g., bacteria, viruses, or mammalian cells) during the production processes. These living organisms are custom-engineered to produce highly complex active ingredients for biopharmaceutical drugs. However, the use of living organisms also introduces several operational challenges related to batch-to-batch variability in the production outcomes.

The drug substance manufacturing can be broadly categorized into two main steps: fermentation and purification operations. During the fermentation process, the living organisms grow and produce the desired active ingredients. Specific characteristics of active ingredients (e.g., monoclonal antibodies, biomass, proteins, antigens, etc.) could vary across different drugs. In the remainder of this paper, we refer to the resulting target active ingredient as protein. After fermentation, the batch continues with a series of purification operations to comply with stringent regulatory requirements on safety and quality. Our main focus in this study is the fermentation process.

The fermentation process is typically carried out inside a stainless steel vessel called a bioreactor. Bioreactors are equipped with advanced sensors to achieve a highly controlled environment via monitoring of critical process parameters (e.g., cell growth rate, protein accumulation, impurity accumulation, etc.). Figure 1 uses industry data to illustrate the main dynamics of a batch fermentation process. As the fermentation continues, we observe from Figure 1(a) that the amount of protein produced during fermentation increases exponentially over time. Hence, this specific phase of fermentation is known as the exponential growth phase. However, the exponential growth phase continues only for a finite period of time (e.g., several hours or days depending on the application) because of the inherent limitations of biological processes (e.g., limitations in media, cell viability, and growth). After the exponential growth phase, the fermentation enters a stationary phase in which protein production stops and the batch needs to be harvested. In addition, we observe from Figure 1(b) that unwanted impurities accumulate inside the bioreactor along with the desired proteins. The specific nature of impurities varies across applications, but impurities often represent unwanted byproducts, such as ammonia, dead cells, etc. These impurities are subsequently filtered and eliminated through a series of purification operations.

Figure 1. Illustration of fermentation dynamics using industry data from MSD.

1.1. The Harvesting Problem: Trade-offs and Challenges

The simultaneous growth of desired proteins and unwanted impurities, as shown in Figure 1, is often known as the purity–yield trade-off in fermentation processes. From a practical perspective, the purity–yield trade-off presents a critical challenge in fermentation harvesting (stopping) decisions. In order to achieve a high protein content, the decision maker may be inclined to harvest the fermentation as late as possible. However, waiting too long to harvest can result in higher levels of impurity. As a result, the difficulty (cost) of subsequent purification operations may increase. Therefore, the purity-yield trade-off has financial implications (e.g., expected revenue increases as protein yield increases, but expected cost increases as impurity levels increase). These trade-offs motivate our main research question: (1) What is an optimal harvesting policy (i.e., when should we stop the fermentation) to maximize the expected profit obtained from a batch?

In addition to the purity–yield trade-off, process “uncertainty” imposes another critical challenge on harvesting decisions. In particular, two types of process uncertainty are commonly encountered in biomanufacturing practice: (i) inherent stochasticity and (ii) model risk. In our problem setting, inherent stochasticity represents the uncertainty in the amounts of protein and impurity produced throughout fermentation, and it is often caused by the inherent complexity of biological systems. Because living organisms are used during fermentation, the rate at which proteins and impurities accumulate is random (although fermentation is carried out under identical conditions). Therefore, the inherent stochasticity of biological processes motivates our second research question: (2) How can we develop an analytical model to learn the inherent stochasticity of fermentation processes and incorporate it into optimal harvesting decisions?

Most often, the inherent stochasticity cannot be controlled but can be predicted through historical process data. However, building a reliable prediction model can be challenging when there is only a limited amount of historical process data. We refer to the resulting uncertainty in the prediction model itself as the model risk. The problem of decision-making under limited data (i.e., under model risk) is a critical concern for both research and development (R&D) projects and industry-scale applications. In biopharmaceutical R&D projects, each protein is unique such that the scientists re-engineer and manufacture it for the first time. This implies that harvesting decisions are typically made under limited R&D data. In industry-scale applications, the problem of limited data becomes relevant every time a change occurs in equipment or raw materials. For example, the supplier of raw materials (i.e., medium or seed cells) might change their formulations, the management might purchase a new bioreactor, etc. Such changes have a substantial impact on the output of fermentation, which often makes the historical process data obsolete or unreliable. Thus, it is of practical importance to have a harvesting strategy that accounts for the model risk, leading to our third research question: (3) Given a limited amount of historical data, how can we develop a learning mechanism to simultaneously account for inherent stochasticity and model risk while making harvesting decisions? In addition, the harvesting strategies commonly used in the industry do not account for model risk and may lead to suboptimal decisions. The industry needs a better understanding of how to manage model risk under small data, and how to exploit the structural properties of an optimal policy to facilitate its implementation. These observations motivate our final research question: (4) What are the structural characteristics of optimal harvesting policies? How does their performance compare to the alternative harvesting policies used in practice?

1.2. Contributions

To address the aforementioned research questions, we build a solution framework based on a reinforcement learning model using the theory of Bayesian statistics and Markov decision processes. A key aspect of our work is that we build analytical models that combine the knowledge from life sciences and operations research (OR) to support biomanufacturing decisions under limited historical data and the inherent stochasticity of biological systems. In life sciences research, there are well-known mechanistic models to predict the evolution of fermentation (Doran, 2013). However, existing models do not mathematically capture both aspects of limited process data and the inherent stochasticity of fermentation. Our study is a first attempt to optimize fermentation harvesting decisions in biomanufacturing under limited data, and combines the knowledge from life sciences and stochastic modeling to derive guidelines that improve industry practices.

We characterize the control-limit structure of the optimal policy with respect to the impurity level. We also show that the myopic policy, which makes the harvesting decisions by looking only one period ahead, is optimal under a perfect-information setting and some practically relevant sufficient conditions. Furthermore, we study how the posterior predictive distributions of the growth rates affect the harvesting decisions under the myopic policy in the presence of model risk. Our framework enables decision makers to rigorously assess the impact of limited data on harvesting decisions, and provides managerial insights on the value of collecting additional data. This research is an outcome of a multi-year collaboration with MSD Animal Health in Boxmeer, Netherlands. The facility in Boxmeer is a leading biomanufacturing hub in Europe that conducts both biopharmaceutical R&D and large-scale production. Since September 2019, the developed framework has been used in daily operations to support harvesting decisions. The implementation has resulted in an improvement of around 50% in batch yield on average. The research outcomes have been recognized as a finalist for the 2022 INFORMS Franz Edelman Prize.

2. Literature Review

There is a wide range of papers in operations management with a focus on the pharmaceutical industry. For example, Plante et al. (1999) maximized the expected quality in a pharmaceutical production process by explicitly modeling the quality parameters of raw materials, the parameters of the production process, and the interactions between them. In real-world case studies, Martagan et al. (2016a) optimized the purification-related decisions for engineer-to-order proteins, and Sahling and Hahn (2019) determined a master production schedule for weekly demand for a multi-level biopharmaceutical manufacturing process. Subramanian et al. (2020) investigated how pharmaceutical manufacturers switch from early-stage drug discovery to late-stage drug development. At the supply chain level, Zhu et al. (2021) proposed a forecasting method to predict demand for pharmaceutical products, and Zhao (2023) provided an overview of complex pharmaceutical supply chains. Also at the supply chain level, Xu et al. (2023) investigated a pharmaceutical manufacturer’s usage of different types of distributors for speciality drugs. The operations management literature on food processing and agribusiness is also relevant to our work due to the biological nature of the products (Lowe and Preckel, 2004; Azoury and Miyaoka, 2013; Bansal and Nagarajan, 2017). For example, Rajaram and Karmarkar (2004) considered the scheduling of multiproduct batch operations in the food-processing industry to minimize setup and quality costs, and Jahandideh et al. (2020) considered learning across batches and decay in the performance of the catalysts used in the production process. Blackburn and Scudder (2009) studied supply chain design strategies for fresh produce after harvesting. More recently, Bansal et al. (2024) addressed a crop harvesting problem in the presence of a time-quality trade-off. In the remainder of our review, we restrict our scope to studies related to biomanufacturing applications. In particular, our work is most closely related to two streams of research: (1) modeling and control of fermentation based on a known model that describes the dynamics of the fermentation process, and (2) reinforcement learning approaches to predict and control fermentation processes.

A vast body of life sciences literature focuses on modeling the biological dynamics of fermentation processes. In particular, predictive models are built to estimate the evolution of fermentation, and then these models are used to guide the search for optimal control strategies. In this context, most studies develop deterministic or stochastic models to predict and control fermentation. Deterministic models typically build kinetic process models (i.e., differential equations) on cell growth and product formation (McNeil and Harvey, 2008; Doran, 2013; Putra and Abasaeed, 2018). These kinetic models are also integrated with optimization models. For example, Chang et al. (2016) constructed a dynamic flux balance model for a fermentation process. They developed a closed-loop control for feed rate and dissolved oxygen concentration profiles to maximize yield production.

Data-driven stochastic optimization is relatively understudied as a means to predict and control fermentation processes. Existing studies typically focus on the inherent stochasticity of fermentation. For example, Peroni et al. (2005) used approximate dynamic programming to maximize yield and minimize process time in fed-batch fermentation. Xing et al. (2010) adopted a Markov chain Monte Carlo approach to optimize the kinetics of a fermentation process. More recently, Martagan et al. (2016b) and Martagan et al. (2020) developed Markov decision process (MDP) models to optimize fermentation operating decisions. However, their optimization models are built based on sufficiently large historical data, and hence are not equipped to capture the impact of model risk (limited historical data) on biomanufacturing decisions. Martagan et al. (2023) developed a portfolio of decision support tools to reduce biomanufacturing costs. Koca et al. (2023) optimized the timing of so-called bleed-feed decisions, enabling pharmaceutical manufacturers to skip intermediary bioreactor setups in batch fermentation processes. However, they did not consider the model risk. Xie et al. (2022) considered model risk in an interpretable semantic bioprocess probabilistic knowledge graph for production stability control, but they did not optimize the harvesting decisions. To the best of our knowledge, this paper is the first to simultaneously capture the inherent stochasticity of biological systems and model risk to optimize harvesting decisions in biomanufacturing systems.

Reinforcement learning approaches have been recently developed for bioprocess control. For example, Treloar et al. (2020) and Nikita et al. (2021) developed model-free deep-Q-network-based reinforcement learning approaches to maintain the cells at target populations by controlling the feeding profiles and to maximize the yield by controlling the flow rate. Zheng et al. (2020) and Zheng et al. (2023) constructed a model-based reinforcement learning approach for biomanufacturing control, with a predictive distribution of system response. After that, Zheng et al. (2020) proposed a simulation-assisted policy gradient algorithm that can efficiently reuse the previous process outputs to facilitate the learning and the search for the optimal policy. Our paper is different from these studies in two ways. First, we study the harvesting problem in fermentation, while Zheng et al. (2020) focused on a chromatography problem. Second, we explicitly incorporate the posterior distributions of unknown fermentation-process parameters as knowledge states of the MDP model and study the structural properties of the optimal policy, which can enhance the interpretability and feasibility of the proposed approach for real manufacturing practice.

To be specific, we adopt model-based Bayesian reinforcement learning as our solution approach. Compared to model-free approaches, it allows us to interpret model risk by quantifying the uncertainty in the process parameters. Furthermore, although it is often less computationally efficient than a model-free approach, it can incorporate known process dynamics and any prior information about model risk into decision making. A comprehensive review of Bayesian reinforcement learning methodologies can be found in Ghavamzadeh et al. (2016). Since solving the original model-based Bayesian reinforcement learning problem is notoriously complex due to the potentially huge state space, various approximation algorithms have been developed, including offline value approximation (Poupart et al., 2006); online near-myopic value and tree search approximation that focus on realized knowledge states in planning (Ross et al., 2008; Osband et al., 2013; Fonteneau et al., 2013); and exploration-bonus-based methods where an agent acts according to an optimistic model of the MDP under uncertainty (Kolter and Ng, 2009; Asmuth and Littman, 2011; Asmuth et al., 2012). Motivated by those studies, we develop a model-based Bayesian reinforcement learning approach, which can account for model risk in guiding fermentation harvesting decisions.

3. Model

Section 3.1 introduces a stochastic mechanistic model to represent the protein and impurity accumulation in the fermentation process. Section 3.2 presents a Bayesian approach to capture the uncertainty in the unknown parameters of this model, and describes how this uncertainty can be updated with new protein and impurity observations collected during the fermentation process. Finally, Section 3.3 presents an MDP model formulation, accounting for both inherent stochasticity and model risk, to optimize the harvesting decision in the fermentation process. An overview of the mathematical notation is provided in Appendix EC.1.

3.1. Fermentation Process Modeling

The accumulation of protein and impurity amount in the exponential-growth phase of a fermentation process is commonly modeled with the so-called cell-growth kinetics mechanism (Doran, 2013). The cell-growth kinetics mechanism is often represented as an ordinary differential equation. To be specific, the protein amount at time $t$, denoted by $p_{t}$, is given by the functional form $\frac{d p_{t}}{d t} = ϕ p_{t}$, where $ϕ$ is referred to as the specific growth rate of protein. Therefore, it is common to assume that the protein amount at time $t$ follows the functional form $p_{t} = p_{0} e^{ϕ t}$, where $p_{0}$ is the starting amount of protein (seed). Similarly, the impurity amount at time $t$ follows the functional form $i_{t} = i_{0} e^{ψ t}$, where $ψ$ is the specific growth rate of impurity and $i_{0}$ is the starting amount of impurity. In our model, we consider that measurements are performed at discrete time points $\mathcal{T} = \{t : 0, 1, \dots, T\}$, where $T$ denotes the time point chosen by the decision maker to harvest the fermentation process. The time between two measurements is fixed as one time unit and it can be any finite amount of time (e.g., an hour, a day, or longer), depending on the process characteristics and practical constraints. This leads to a recursive representation $p_{t + 1} = p_{t} e^{ϕ}$ for the protein amount and $i_{t + 1} = i_{t} e^{ψ}$ for the impurity amount for $t \in \{0, 1, \dots, T - 1\}$.

Because living biological systems (e.g., cells) are used in the fermentation process, their specific growth rates are random; see, for example, Templeton et al. (2013) and Odenwelder et al. (2021). We let $Φ_{t}$, $t \in \{0, 1, \dots, T - 1\}$, denote the normally distributed independent random variables that represent the specific growth rates of protein in the time interval from time $t$ to $t + 1$. Let $μ_{c}^{(p)}$ and $σ_{c}^{(p) 2}$ denote the true (unknown) mean and the variance of $Φ_{t}$. Similarly, the random variables $Ψ_{t}$, $t \in \{0, 1, \dots, T - 1\}$, are independent and normally distributed with true mean $μ_{c}^{(i)}$ and variance $σ_{c}^{(i) 2}$. The modeling of growth rates as normally distributed random variables leads to the recursive representations of the protein and impurity amount given by

$$\begin{aligned} p_{t + 1} & = p_{t} \cdot e^{Φ_{t}}, \quad Φ_{t} \sim N (μ_{c}^{(p)}, σ_{c}^{(p) 2}), \\ i_{t + 1} & = i_{t} \cdot e^{Ψ_{t}}, \quad Ψ_{t} \sim N (μ_{c}^{(i)}, σ_{c}^{(i) 2}), \end{aligned} \tag{1}$$

for $t \in \{0, 1, \dots, T - 1\}$, accounting for the inherent stochasticity in the accumulation of protein and impurity in a fermentation process. In (1), we use the notation $\sim$ to mean “distributed as” and $N (a, b)$ to represent a normal distribution with mean $a$ and variance $b$.

The protein and impurity growth rates are assumed independent because there are many biological and chemical factors that randomly influence the “production speed” for impurity and protein, that is, the rate of generating metabolic wastes and antibody proteins (Tsao et al., 2005; Xing et al., 2010). The independent and normally distributed growth rates are commonly used in the literature to model their random variations over time (Wechselberger et al., 2013; Mockus et al., 2015; Möller et al., 2020). We also validated this assumption using historical data in our case study, as described in Section 5.1. The stationarity assumption for the growth rate distributions is linked to the fact that the fermentation process has a well-controlled cell culture condition, where the so-called metabolic quasi-steady state is achieved. Thus, the metabolic flux (and hence the corresponding distribution of the protein and impurity growth rates during the exponential growth phase) does not change over time. This assumption is also validated in our case study (Section 5.1).
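The recursion in (1) can be illustrated with a short simulation. The following is a minimal sketch, assuming hypothetical parameter values and a hypothetical helper name `simulate_batch`; none of these come from the case study, and the growth-rate parameters are treated as known purely for illustration.

```python
import math
import random


def simulate_batch(p0, i0, mu_p, sigma_p, mu_i, sigma_i, horizon, rng):
    """Simulate protein and impurity paths under the recursion in (1).

    sigma_p and sigma_i are standard deviations (square roots of the variances).
    """
    protein, impurity = [p0], [i0]
    for _ in range(horizon):
        phi = rng.gauss(mu_p, sigma_p)  # protein growth rate for this interval
        psi = rng.gauss(mu_i, sigma_i)  # impurity growth rate, drawn independently
        protein.append(protein[-1] * math.exp(phi))
        impurity.append(impurity[-1] * math.exp(psi))
    return protein, impurity


# Hypothetical values chosen only to make the example concrete.
rng = random.Random(0)
protein, impurity = simulate_batch(
    p0=1.0, i0=0.5, mu_p=0.30, sigma_p=0.05, mu_i=0.20, sigma_i=0.05,
    horizon=10, rng=rng,
)
for t, (p, i) in enumerate(zip(protein, impurity)):
    print(f"t={t:2d}  protein={p:8.3f}  impurity={i:8.3f}")
```

Because the growth rates enter through an exponential, every simulated level stays positive, and setting both standard deviations to zero recovers the deterministic form $p_{t} = p_{0} e^{ϕ t}$.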

3.2. Bayesian Learning for the Fermentation Process

The true parameters of the underlying stochastic model for the protein and impurity growth rates, denoted by $\boldsymbol{θ}^{c} = \{μ_{c}^{(p)}, σ_{c}^{(p) 2}, μ_{c}^{(i)}, σ_{c}^{(i) 2}\}$, are unknown and often need to be estimated from a very limited amount of real-world data (especially for new products that are not yet in production). We adopt a Bayesian approach and model the mean and variance of the protein and impurity growth rates as random variables, denoted by $\boldsymbol{θ} = \{μ^{(p)}, σ^{(p) 2}, μ^{(i)}, σ^{(i) 2}\}$. In the remainder of this section, we describe how we specify the prior distribution for $\boldsymbol{θ}$, obtain its posterior distribution by updating the prior with new data on protein and impurity accumulation, and characterize the posterior predictive distributions for the protein and impurity growth rates.

3.2.1 Specification of Prior Distribution

For the protein growth rate, we build the joint prior distribution of $(μ^{(p)}, σ^{(p) 2})$ in the following way. First, the marginal distribution of variance $σ^{(p) 2}$ is chosen as an inverse-gamma distribution with prior parameters $λ_{0}^{(p)}$ and $β_{0}^{(p)}$ , denoted as $σ^{(p) 2} \sim Inv Γ (λ_{0}^{(p)}, β_{0}^{(p)})$ . Next, given the value of $σ^{(p) 2}$ , the conditional distribution of mean $μ^{(p)}$ is assumed to be $N (α_{0}^{(p)}, σ^{(p) 2} / ν_{0}^{(p)})$ , where $α_{0}^{(p)}$ and $ν_{0}^{(p)}$ are also prior parameters. It then follows that the joint prior distribution of $(μ^{(p)}, σ^{(p) 2})$ has a normal-inverse-gamma distribution (Gelman et al., 2004), that is,

$$(μ^{(p)}, σ^{(p) 2}) \sim N (α_{0}^{(p)}, σ^{(p) 2} / ν_{0}^{(p)}) \cdot Inv Γ (λ_{0}^{(p)}, β_{0}^{(p)}) . \tag{2}$$

For the impurity growth rate, we obtain the joint prior distribution of $(μ^{(i)}, σ^{(i) 2})$ in a similar way by using the prior parameters $α_{0}^{(i)}$, $ν_{0}^{(i)}$, $λ_{0}^{(i)}$, and $β_{0}^{(i)}$; i.e.,

$$(μ^{(i)}, σ^{(i) 2}) \sim N (α_{0}^{(i)}, σ^{(i) 2} / ν_{0}^{(i)}) \cdot Inv Γ (λ_{0}^{(i)}, β_{0}^{(i)}) . \tag{3}$$

It is well known that normal-inverse-gamma distribution is conjugate when combined with normally distributed observations (Gelman et al., 2004; Powell and Ryzhov, 2012). This enables us to efficiently update the prior distributions with the arrival of new data to obtain the posterior distributions.

3.2.2 Characterization of the Posterior Distribution

Suppose that $(α_{t}^{(p)}, ν_{t}^{(p)}, λ_{t}^{(p)}, β_{t}^{(p)})$ represent our belief on the distribution of protein growth rate at time point $t$ , and we make an observation of the protein amount $p_{t + 1}$ at time point $t + 1$ . That is, we make an observation $ϕ_{t} = \ln (p_{t + 1} / p_{t})$ as the realization of the normally distributed protein growth rate $Φ_{t}$ . Then, the posterior distribution of $(μ^{(p)}, σ^{(p) 2})$ follows a normal-inverse-gamma distribution, that is,

$$(μ^{(p)}, σ^{(p) 2}) \sim N (α_{t + 1}^{(p)}, σ^{(p) 2} / ν_{t + 1}^{(p)}) \cdot Inv Γ (λ_{t + 1}^{(p)}, β_{t + 1}^{(p)}), \tag{4}$$

with updated parameters

$$\begin{aligned} α_{t + 1}^{(p)} & = α_{t}^{(p)} + \frac{ϕ_{t} - α_{t}^{(p)}}{ν_{t + 1}^{(p)}}, \quad ν_{t + 1}^{(p)} = ν_{t}^{(p)} + 1, \\ λ_{t + 1}^{(p)} & = λ_{t}^{(p)} + \frac{1}{2}, \quad β_{t + 1}^{(p)} = β_{t}^{(p)} + \frac{ν_{t}^{(p)} (ϕ_{t} - α_{t}^{(p)})^{2}}{2 ν_{t + 1}^{(p)}} . \end{aligned} \tag{5}$$

Given our belief $(α_{t}^{(i)}, ν_{t}^{(i)}, λ_{t}^{(i)}, β_{t}^{(i)})$ about the distribution of impurity growth rate at time point $t$, since the observation $ψ_{t} = \ln (i_{t + 1} / i_{t})$ is a random realization of the normally distributed impurity growth rate $Ψ_{t}$, the posterior distribution of $(μ^{(i)}, σ^{(i) 2})$ is also a normal-inverse-gamma; that is,

$$(μ^{(i)}, σ^{(i) 2}) \sim N (α_{t + 1}^{(i)}, σ^{(i) 2} / ν_{t + 1}^{(i)}) \cdot Inv Γ (λ_{t + 1}^{(i)}, β_{t + 1}^{(i)}), \tag{6}$$

with the updated parameters

$$\begin{aligned} α_{t + 1}^{(i)} & = α_{t}^{(i)} + \frac{ψ_{t} - α_{t}^{(i)}}{ν_{t + 1}^{(i)}}, \quad ν_{t + 1}^{(i)} = ν_{t}^{(i)} + 1, \\ λ_{t + 1}^{(i)} & = λ_{t}^{(i)} + \frac{1}{2}, \quad β_{t + 1}^{(i)} = β_{t}^{(i)} + \frac{ν_{t}^{(i)} (ψ_{t} - α_{t}^{(i)})^{2}}{2 ν_{t + 1}^{(i)}} . \end{aligned} \tag{7}$$

For further details on the Bayesian update procedure, we refer the reader to Gelman et al. (2004).
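As an illustration of the conjugate update described in this subsection, the sketch below applies the one-observation normal-inverse-gamma update to a short sequence of protein measurements. The class name `Belief`, the function name `update_belief`, the prior values, and the measurement values are all hypothetical and serve only to make the arithmetic concrete.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Belief:
    """Normal-inverse-gamma belief parameters (alpha, nu, lambda, beta)."""
    alpha: float
    nu: float
    lam: float
    beta: float


def update_belief(belief, growth_rate):
    """Return the posterior belief after observing one growth-rate sample."""
    nu_next = belief.nu + 1.0
    diff = growth_rate - belief.alpha
    return Belief(
        alpha=belief.alpha + diff / nu_next,
        nu=nu_next,
        lam=belief.lam + 0.5,
        beta=belief.beta + belief.nu * diff ** 2 / (2.0 * nu_next),
    )


# Hypothetical prior and hypothetical protein measurements at consecutive epochs.
prior = Belief(alpha=0.25, nu=1.0, lam=2.0, beta=0.01)
protein_levels = [1.0, 1.35, 1.80, 2.45]

belief = prior
for p_now, p_next in zip(protein_levels, protein_levels[1:]):
    observed_rate = math.log(p_next / p_now)  # growth-rate sample ln(p_{t+1} / p_t)
    belief = update_belief(belief, observed_rate)

print(belief)
```

The same function can be reused for the impurity growth rate by feeding it the samples $\ln (i_{t + 1} / i_{t})$ and a separate belief object, since the two growth rates are modeled independently.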

3.2.3 Posterior Predictive Distribution of the Growth Rates

Given the Bayesian model described above, the density of the protein growth rate at time $t$ , conditional on the historical protein data (i.e., summarized by the belief parameters $(α_{t}^{(p)}, ν_{t}^{(p)}, λ_{t}^{(p)}, β_{t}^{(p)})$ ), is given by

$$p (ϕ_{t} | α_{t}^{(p)}, ν_{t}^{(p)}, λ_{t}^{(p)}, β_{t}^{(p)}) = \int \int p (ϕ_{t} | μ^{(p)}, σ^{(p) 2}) \, p (μ^{(p)}, σ^{(p) 2} | α_{t}^{(p)}, ν_{t}^{(p)}, λ_{t}^{(p)}, β_{t}^{(p)}) \, d μ^{(p)} \, d σ^{(p) 2}, \tag{8}$$

where $p (ϕ_{t} | μ^{(p)}, σ^{(p) 2})$ is the density of the normally distributed protein growth rate $Φ_{t}$ and $p (μ^{(p)}, σ^{(p) 2} | α_{t}^{(p)}, ν_{t}^{(p)}, λ_{t}^{(p)}, β_{t}^{(p)})$ is the joint posterior density of the normal-inverse-gamma distributed $(μ^{(p)}, σ^{(p) 2})$. In (8), the integral marginalizes out the variables $μ^{(p)}$ and $σ^{(p) 2}$, leading to the predictive density of the future observation of the protein growth rate given the current belief parameters on the underlying protein growth model. We let ${\tilde{Φ}}_{t}$ denote the predictive protein growth-rate random variable at time point $t$, and it has the density given in (8), accounting for both the model risk and the inherent stochasticity in the protein accumulation. It can be shown that the random variable ${\tilde{Φ}}_{t}$ follows a generalized t-distribution (Murphy, 2007), that is,

$${\tilde{Φ}}_{t} \sim t_{2 λ_{t}^{(p)}} \left( α_{t}^{(p)}, \frac{β_{t}^{(p)} (1 + ν_{t}^{(p)})}{ν_{t}^{(p)} λ_{t}^{(p)}} \right), \tag{9}$$

where ${\tilde{Φ}}_{t} \sim t_{v} (a, b)$ means that $({\tilde{Φ}}_{t} - a) / \sqrt{b}$ follows a standard t-distribution with $v$ degrees of freedom. We refer to the term $β_{t}^{(p)} (1 + ν_{t}^{(p)}) / (ν_{t}^{(p)} λ_{t}^{(p)})$ in (9) as the predictive variance of the protein growth rate, and denote it with ${\tilde{σ}}_{t}^{(p) 2}$.

The same result holds for the predictive impurity growth-rate ${\tilde{Ψ}}_{t}$ at time point $t$ :

$${\tilde{Ψ}}_{t} \sim t_{2 λ_{t}^{(i)}} \left( α_{t}^{(i)}, \frac{β_{t}^{(i)} (1 + ν_{t}^{(i)})}{ν_{t}^{(i)} λ_{t}^{(i)}} \right), \tag{10}$$

where we refer to the term $β_{t}^{(i)} (1 + ν_{t}^{(i)}) / (ν_{t}^{(i)} λ_{t}^{(i)})$ in (10) as the predictive variance of the impurity growth rate and denote it with ${\tilde{σ}}_{t}^{(i) 2}$. In the remainder of the article, we use $f_{t}^{(p)} (\cdot)$ and $f_{t}^{(i)} (\cdot)$ to denote the posterior predictive density functions of the random variables ${\tilde{Φ}}_{t}$ and ${\tilde{Ψ}}_{t}$, respectively.
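To make the predictive distribution concrete, the following sketch computes the predictive variance term and draws samples from the generalized t-distribution by scaling and shifting a standard t variate. The function names and the belief values are hypothetical; the standard t variate is built from a normal draw and a chi-square draw, which is one common construction and not necessarily how the case study was implemented.

```python
import math
import random


def predictive_variance(nu, lam, beta):
    """Predictive variance term beta * (1 + nu) / (nu * lambda)."""
    return beta * (1.0 + nu) / (nu * lam)


def sample_predictive_rate(alpha, nu, lam, beta, rng):
    """Draw one growth rate from the t predictive with 2 * lambda degrees of freedom."""
    dof = 2.0 * lam
    chi_sq = rng.gammavariate(dof / 2.0, 2.0)  # chi-square with `dof` degrees of freedom
    standard_t = rng.gauss(0.0, 1.0) / math.sqrt(chi_sq / dof)
    return alpha + math.sqrt(predictive_variance(nu, lam, beta)) * standard_t


# Hypothetical belief parameters for the protein growth rate.
alpha, nu, lam, beta = 0.30, 4.0, 3.0, 0.06
rng = random.Random(0)
samples = [sample_predictive_rate(alpha, nu, lam, beta, rng) for _ in range(20000)]

print("predictive variance term:", predictive_variance(nu, lam, beta))
print("sample mean of predictive draws:", sum(samples) / len(samples))
```

With few observations (small $ν_{t}$ and $λ_{t}$), the predictive distribution has heavier tails and a larger scale, which is how model risk shows up in the sampled growth rates.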

3.3. Markov Decision Process Model

It is of practical importance to optimize when to harvest the fermentation process under limited historical data. We formulate this problem as a Markov decision process (MDP) model with Bayesian updates on the parameters of the protein and impurity growth-rate distributions.

3.3.1 Decision Epochs

We consider a finite-horizon discrete-time model with decision epochs $\mathcal{T} = \{t : 0, 1, \dots, \bar{T}\}$, representing the time points at which the protein and impurity amounts are measured.¹ The parameter $\bar{T}$ is the time point at which the fermentation must be harvested, if not done yet. We consider $\bar{T}$ as an upper bound on the time of harvest because it is often known when the growth stops (i.e., there are no incentives for continuing the fermentation beyond that point). Also, it gives some level of certainty in the planning of the bioreactor. Note that $T \in \mathcal{T}$, that is, the time point at which the fermentation is harvested must be a decision epoch and it is at most $\bar{T}$.

3.3.2 Physical States

The levels of protein and impurity during the fermentation process constitute the physical states. In practice, there is an upper limit on the cell density that can be accommodated by a bioreactor with a certain volume. Thus, it is undesired to continue fermentation beyond a certain level of protein accumulation. We let $\bar{P}$ represent this upper limit on the accumulated protein level at which the fermentation must be harvested. On the other hand, we let $\bar{I}$ denote the maximum impurity value at which the batch is considered as failed. If the accumulated impurity level reaches $\bar{I}$ , a predefined value in accordance with regulatory standards on batch quality, the fermentation process must be terminated. At decision epoch $t$ , the physical state $S_{t}$ is specified by the current protein amount $p_{t} \in [0, \bar{P}]$ and the current impurity amount $i_{t} \in [0, \bar{I}]$ in the fermentation process, that is, $S_{t} = (p_{t}, i_{t})$ .

3.3.3 Action Space

At a decision epoch before reaching the stationary phase, we can either continue the fermentation process one more time period (denoted by action $C$) or terminate the fermentation process by harvesting it (denoted by action $H$). The harvest action is the only possible action if: (1) the current protein amount reaches the harvesting limit $\bar{P}$; (2) there is a batch failure, caused by the impurity level reaching the threshold level $\bar{I}$; or (3) the fermentation process reaches the decision epoch $\bar{T}$. The action space can be formalized as $\{H\}$ if $p_{t} = \bar{P}$ or $i_{t} = \bar{I}$ or $t = \bar{T}$. On the other hand, the action space is given by $\{C, H\}$ if $p_{t} < \bar{P}$, $i_{t} < \bar{I}$, and $t < \bar{T}$.
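The action-space rule can be written as a small function. This is a minimal sketch with a hypothetical function name and hypothetical limits; it uses `>=` rather than strict equality as a defensive guard for numerical values that overshoot a limit, which is an assumption of this sketch rather than part of the formulation.

```python
def feasible_actions(p, i, t, p_max, i_max, t_max):
    """Return the feasible actions at epoch t: 'C' (continue) and/or 'H' (harvest)."""
    # Harvest is forced at the protein limit, at batch failure, or at the last epoch.
    if p >= p_max or i >= i_max or t >= t_max:
        return ("H",)
    return ("C", "H")


# Hypothetical limits chosen only for illustration.
P_MAX, I_MAX, T_MAX = 30.0, 50.0, 8

print(feasible_actions(p=12.0, i=20.0, t=3, p_max=P_MAX, i_max=I_MAX, t_max=T_MAX))
print(feasible_actions(p=12.0, i=50.0, t=3, p_max=P_MAX, i_max=I_MAX, t_max=T_MAX))
```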

3.3.4 Knowledge State

Since the true parameters $\boldsymbol{θ}^{c}$ of the underlying model are unknown and estimated from real-world data, we use the knowledge state, specified by the parameters of the posterior distribution of $\boldsymbol{θ}$, to quantify our current belief about $\boldsymbol{θ}^{c}$. That is, we specify the posterior-distribution parameters (from Section 3.2) as the knowledge state at decision epoch $t$, denoted by $I_{t} = \{α_{t}^{(p)}, ν_{t}^{(p)}, λ_{t}^{(p)}, β_{t}^{(p)}, α_{t}^{(i)}, ν_{t}^{(i)}, λ_{t}^{(i)}, β_{t}^{(i)}\}$.

3.3.5 Hyper States & Hyper State Transition

We introduce the hyper states $H_{t} \equiv (S_{t}, I_{t})$, including both the physical state $S_{t} = (p_{t}, i_{t})$ and the knowledge state $I_{t}$. If the action at decision epoch $t$ is to continue the fermentation, that is, $a_{t} = C$, the hyper state transition probability can be specified as

$Pr (S_{t + 1}, I_{t + 1} | S_{t}, I_{t}; C) = Pr (S_{t + 1} | S_{t}, I_{t}) Pr (I_{t + 1} | S_{t + 1}, S_{t}, I_{t}),$

(11)

where $Pr (S_{t + 1} | S_{t}, I_{t})$ represents the probability that the physical state transits to $S_{t + 1}$ (i.e., conditioned on the current physical state and knowledge state), and $Pr (I_{t + 1} | S_{t + 1}, S_{t}, I_{t})$ represents the probability that the knowledge state transits to $I_{t + 1}$ given the realization of $S_{t + 1}$ (i.e., conditioned on the current knowledge state as well as the realized physical-state transition).

In (11), the first term $Pr (S_{t + 1} | S_{t}, I_{t})$ can be determined by the protein and impurity transition equations $p_{t + 1} = p_{t} e^{{\tilde{ϕ}}_{t}}$ and $i_{t + 1} = i_{t} e^{{\tilde{ψ}}_{t}}$, where ${\tilde{ϕ}}_{t}$ and ${\tilde{ψ}}_{t}$ are the realizations of the random variables ${\tilde{Φ}}_{t}$ and ${\tilde{Ψ}}_{t}$ with distributions specified in (9) and (10), respectively. The second term $Pr (I_{t + 1} | S_{t + 1}, S_{t}, I_{t})$ follows the Bayesian updates for the knowledge states as specified in (5) and (7), given the realization of the physical states $(p_{t + 1}, i_{t + 1})$ or, equivalently, the growth-rate samples $(ϕ_{t}, ψ_{t}) = (\ln (p_{t + 1} / p_{t}), \ln (i_{t + 1} / i_{t}))$.
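The two factors in (11) can be illustrated with a short Python sketch of a single continue step. It rests on assumptions that should be checked against Section 3.2: the updates (5) and (7) are taken to be the textbook normal-gamma conjugate update, and the predictive distributions (9) and (10) are taken to be Student-t with $2 λ_{t}$ degrees of freedom, location $α_{t}$, and squared scale $β_{t} (ν_{t} + 1) / (λ_{t} ν_{t})$, which reproduces the predictive variance in Proposition 1(ii). Capping the next state at $\bar{P}$ and $\bar{I}$ and all function names are likewise illustrative choices.

```python
import math
import random


def sample_predictive(alpha, nu, lam, beta, rng):
    """Draw one growth rate from the assumed Student-t posterior predictive."""
    dof = 2.0 * lam
    scale = math.sqrt(beta * (nu + 1.0) / (lam * nu))
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(dof / 2.0, 2.0)  # chi-square with `dof` degrees of freedom
    return alpha + scale * z / math.sqrt(chi2 / dof)


def update_knowledge(alpha, nu, lam, beta, x):
    """Assumed normal-gamma conjugate update after observing growth rate x."""
    alpha_new = (nu * alpha + x) / (nu + 1.0)
    beta_new = beta + nu * (x - alpha) ** 2 / (2.0 * (nu + 1.0))
    return alpha_new, nu + 1.0, lam + 0.5, beta_new


def continue_step(p, i, know_p, know_i, p_max, i_max, rng):
    """One hyper-state transition under action C.

    First factor of (11): sample the next physical state.
    Second factor of (11): update the knowledge state with the sampled rates.
    """
    phi = sample_predictive(*know_p, rng)
    psi = sample_predictive(*know_i, rng)
    p_next = min(p * math.exp(phi), p_max)  # assumption: protein capped at its limit
    i_next = min(i * math.exp(psi), i_max)  # assumption: reaching i_max means failure
    return p_next, i_next, update_knowledge(*know_p, phi), update_knowledge(*know_i, psi)
```

Starting from the all-zero prior and applying `update_knowledge` to each historical growth rate in turn yields the closed-form knowledge states stated in Section 4.1 (sample mean, $J_{t}$, $J_{t} / 2$, and half the sum of squared deviations).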

At any decision epoch $t$ with physical states $(p_{t}, i_{t})$ , if the action is to harvest (i.e., $a_{t} = H$ ), the fermentation process ends. We model this situation by assuming that, if the harvest action is taken, the state of the MDP makes a transition to an absorbing stopping state $Δ$ . Thus, the counterpart of (11) for the harvest action can be written as $Pr (Δ | S_{t}, I_{t}; H) = 1$ .

3.3.6 Reward

At any decision epoch $t$, if the decision is to continue, the immediate cost $c_{u}$ is charged. The cost $c_{u}$ represents the cost of resources allocated to continue the fermentation process one more time step (e.g., a fixed energy cost for the bioreactor, operator cost, clean room charges). Since the time periods are of equal length, it is natural to assume a constant operating cost $c_{u}$ per time unit. On the other hand, if the decision is to harvest, the fermentation is terminated and a reward is collected. The reward of the harvest decision depends on the current physical state. Specifically, if the harvest decision is taken because of a failure (i.e., if $i_{t} = \bar{I}$), then the failure penalty $r_{f}$ is charged as the cost of losing the batch due to the failure. On the other hand, if there is no failure at the harvesting moment (i.e., if $i_{t} < \bar{I}$), the harvest reward

$r_{h} (p_{t}, i_{t}) = c_{0} + c_{1} p_{t} - c_{2} i_{t}$

(12)

is collected as an immediate reward. In (12), $c_{0} > 0$ represents the lump-sum reward collected per fermentation batch, while $c_{1} > 0$ and $c_{2} > 0$ represent the marginal reward collected per unit of protein and the marginal cost encountered per unit of impurity, respectively. We shall assume that $r_{f} > c_{2} \bar{I}$, reflecting the fact that a failure is costlier than even the worst harvesting outcome. To summarize, given the physical states $(p_{t}, i_{t})$ and the action $a_{t}$, the reward $R (p_{t}, i_{t}; a_{t})$ at decision epoch $t$ can be written as follows:

$R (p_{t}, i_{t}; a_{t}) = \begin{cases} - c_{u}, & a_{t} = C, \ i_{t} < \bar{I} \\ r_{h} (p_{t}, i_{t}), & a_{t} = H, \ i_{t} < \bar{I} \\ - r_{f}, & a_{t} = H, \ i_{t} = \bar{I} \end{cases}$

(13)
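The reward structure in (12) and (13) translates directly into code. The sketch below is illustrative only; the function names and the argument order are hypothetical.

```python
def harvest_reward(p, i, c0, c1, c2):
    """Harvest reward r_h(p, i) = c0 + c1 * p - c2 * i from (12)."""
    return c0 + c1 * p - c2 * i


def reward(p, i, action, i_max, c_u, r_f, c0, c1, c2):
    """Immediate reward R(p, i; a) from (13); action is "C" or "H"."""
    if i >= i_max:
        # Batch failure: harvesting is the only feasible action.
        if action != "H":
            raise ValueError("only harvest is feasible after a failure")
        return -r_f
    if action == "C":
        return -c_u
    if action == "H":
        return harvest_reward(p, i, c0, c1, c2)
    raise ValueError("unknown action: " + repr(action))
```

With the case-study values of Section 5.1 ($c_{0} = 0$, $c_{1} = 10$, $c_{2} = 1$, $c_{u} = 2$, $r_{f} = 880$, $\bar{I} = 50$), harvesting at $(p_{t}, i_{t}) = (20, 5)$ yields $10 \cdot 20 - 5 = 195$, whereas a failed batch yields $-880$.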

3.3.7 Policy

Let $π$ denote a nonstationary policy $\{π_{t} (\cdot); t = 0, 1, \dots, \bar{T}\}$, which is a mapping from any hyper state $H_{t}$ to an action $a_{t}$, that is, $a_{t} = π_{t} (H_{t})$. Given the policy $π$, the expected total discounted reward is

$ρ (π) = E [\sum_{t = 0}^{T} γ^{t} R (p_{t}, i_{t}; π_{t} (H_{t})) | H_{0}, π],$

(14)

where $γ \in (0, 1]$ is the discount factor. Notice that (14) represents the expected total discounted reward under the policy $π$ from time point $0$ until the termination of the fermentation process. The stopping time $T$ in (14) is the decision epoch at which the harvest action is taken; that is, if $π_{t} (H_{t}) = H$, then $T = t$. Our objective is to find the optimal policy $π^{*}$ that maximizes the expected total discounted reward, that is,

$π^{*} = \arg \max_{π} ρ (π).$

3.3.8 Value Function

The value function $V_{t} (H_{t})$ is defined as the expected total discounted reward starting from the decision epoch $t$ with hyper state $H_{t}$ under the optimal policy $π^{*}$ , that is,

$V_{t} (H_{t}) = E [\sum_{ℓ = t}^{T} γ^{ℓ} R (p_{ℓ}, i_{ℓ}; π_{ℓ}^{*} (H_{ℓ})) | H_{t}].$

The value function $V_{t} (H_{t})$, or equivalently $V_{t} (p_{t}, i_{t}, I_{t})$, represents the maximum expected total discounted reward starting from decision epoch $t$ with physical state $(p_{t}, i_{t})$ and knowledge state $I_{t}$, and it can be recursively written as follows:

$V_{t} (p_{t}, i_{t}, I_{t}) = \begin{cases} \max \{r_{h} (p_{t}, i_{t}), \ - c_{u} + γ E [V_{t + 1} (p_{t + 1}, i_{t + 1}, I_{t + 1})]\} & \text{if } p_{t} < \bar{P} \text{ and } i_{t} < \bar{I} \\ r_{h} (p_{t}, i_{t}) & \text{if } p_{t} = \bar{P} \text{ and } i_{t} < \bar{I} \\ - r_{f} & \text{if } i_{t} = \bar{I} \end{cases}$

(15)

for $t = 0, 1, \dots, \bar{T} - 1$. At the decision epoch $\bar{T}$, the value function is equal to

$V_{\bar{T}} (p_{\bar{T}}, i_{\bar{T}}, I_{\bar{T}}) = \begin{cases} r_{h} (p_{\bar{T}}, i_{\bar{T}}), & \text{if } i_{\bar{T}} < \bar{I} \\ - r_{f}, & \text{if } i_{\bar{T}} = \bar{I} \end{cases}$

(16)

because the only feasible action is to harvest if the time point $\bar{T}$ is reached, and either the harvesting reward or the failure cost is charged depending on the impurity amount at decision epoch $\bar{T}$. Recall that the hyper state transits to the absorbing stopping state $Δ$ after the harvest action, and the value function is equal to zero for a process already at the stopping state, that is, $V_{t} (H_{t}) = 0$ for $H_{t} = Δ$ at any $t$. So, the transition to the stopping state $Δ$ is omitted in (15) and (16).

4. Analysis

Section 4.1 presents a characterization of the variability in the posterior predictive distribution of the growth rates. Section 4.2 provides some analytical properties of the optimal policy. Note that our MDP model is a variant of the classical optimal stopping problem (Ferguson, 2000). A well-known class of policies for optimal-stopping problems is look-ahead policies. Motivated by its simplicity in practical use, we consider the one-step look-ahead policy (referred to as the myopic policy) in Section 4.3. Finally, Section 4.4 discusses our solution approach to obtain the optimal policy. All the proofs and the algorithmic procedure are provided in the Appendix in the E-Companion.

4.1. Growth-Rate Variability Under Model Risk

Recall that the uncertainty in the protein and impurity growth rates comes from two sources: the inherent stochasticity of the fermentation and the model risk. Conditional on the historical data collected until the decision epoch $t$ , the predictive protein growth rate ${\tilde{Φ}}_{t}$ and the predictive impurity growth rate ${\tilde{Ψ}}_{t}$ are the random variables that the decision maker uses to model the growth rates, and these random variables account for both sources of uncertainty (see Section 3.2). The objective of this section is to quantify the contribution of each source of uncertainty to the predictive variance of the random variables ${\tilde{Φ}}_{t}$ and ${\tilde{Ψ}}_{t}$ , denoted with ${\tilde{σ}}_{t}^{(p) 2}$ and ${\tilde{σ}}_{t}^{(i) 2}$ , respectively.

Let $D_{t} = \{(ϕ^{(0)}, ψ^{(0)}), (ϕ^{(1)}, ψ^{(1)}), \dots, (ϕ^{(J_{t})}, ψ^{(J_{t})})\}$ denote the historical data on past realizations of the growth rates available at the $t$-th decision epoch of the fermentation process. The data size $J_{t}$ can be greater than $t$, as the historical data $D_{t}$ may also include the growth-rate realizations from previous fermentation processes. Recall from Section 3.2 that the knowledge states can be recursively written as a function of the historical data $D_{t}$; see equations (5) and (7). By applying the commonly used improper prior that assumes the initial belief states $α_{0}^{(p)}, ν_{0}^{(p)}, λ_{0}^{(p)}, β_{0}^{(p)}, α_{0}^{(i)}, ν_{0}^{(i)}, λ_{0}^{(i)}, β_{0}^{(i)}$ are all equal to 0, the knowledge states can be obtained as

$\begin{aligned} α_{t}^{(p)} & = \bar{ϕ}, \quad ν_{t}^{(p)} = J_{t}, \quad λ_{t}^{(p)} = \frac{J_{t}}{2}, \quad β_{t}^{(p)} = \frac{1}{2} \sum_{j = 1}^{J_{t}} (ϕ^{(j)} - \bar{ϕ})^{2}, \\ α_{t}^{(i)} & = \bar{ψ}, \quad ν_{t}^{(i)} = J_{t}, \quad λ_{t}^{(i)} = \frac{J_{t}}{2}, \quad β_{t}^{(i)} = \frac{1}{2} \sum_{j = 1}^{J_{t}} (ψ^{(j)} - \bar{ψ})^{2}, \end{aligned}$

where $\bar{ϕ} = \sum_{j = 1}^{J_{t}} ϕ^{(j)} / J_{t}$ and $\bar{ψ} = \sum_{j = 1}^{J_{t}} ψ^{(j)} / J_{t}$. The posterior predictive variances are then given by

$\begin{aligned} {\tilde{σ}}_{t}^{(p) 2} & = \frac{J_{t} + 1}{(J_{t} - 2) J_{t}} \sum_{j = 1}^{J_{t}} (ϕ^{(j)} - \bar{ϕ})^{2}, \\ {\tilde{σ}}_{t}^{(i) 2} & = \frac{J_{t} + 1}{(J_{t} - 2) J_{t}} \sum_{j = 1}^{J_{t}} (ψ^{(j)} - \bar{ψ})^{2} \end{aligned}$

(17)

for $J_{t} > 2$. Next, by considering the randomness in the historical data $D_{t}$, Proposition 1(i) establishes the expectation and variance of ${\tilde{σ}}_{t}^{(p) 2}$ (i.e., similar to the characterization of the expectation and variance for the sample variance of a set of realizations from a specific population). For a particular realization of the historical data set $D_{t}$, Proposition 1(ii) characterizes the predictive variance ${\tilde{σ}}_{t}^{(p) 2}$ as the sum of two closed-form terms that represent the variability in the growth rate due to the inherent stochasticity of the fermentation process and the model risk, respectively.

Proposition 1

(i) $E [{\tilde{σ}}_{t}^{(p) 2}] = σ_{c}^{(p) 2} + \frac{(2 J_{t} - 1) σ_{c}^{(p) 2}}{(J_{t}^{2} - 2 J_{t})}$ and $Var [{\tilde{σ}}_{t}^{(p) 2}] = \frac{2 (J_{t}^{3} + J_{t}^{2} - J_{t} - 1) σ_{c}^{(p) 4}}{J_{t}^{4} - 4 J_{t}^{3} + 4 J_{t}^{2}}$.

(ii) Conditional on the historical data $D_{t}$, the predictive variance ${\tilde{σ}}_{t}^{(p) 2}$ for the protein growth rate can be decomposed into two components ${\hat{σ}}_{t}^{(p) 2} = \frac{β_{t}^{(p)}}{λ_{t}^{(p)} - 1}$ and ${\check{σ}}_{t}^{(p) 2} = \frac{β_{t}^{(p)}}{(λ_{t}^{(p)} - 1) ν_{t}^{(p)}}$, representing the variability of the protein growth rate due to inherent stochasticity and the model risk, respectively; that is, ${\tilde{σ}}_{t}^{(p) 2} = {\hat{σ}}_{t}^{(p) 2} + {\check{σ}}_{t}^{(p) 2}$ with ${\hat{σ}}_{t}^{(p) 2} = \frac{\sum_{j = 1}^{J_{t}} (ϕ^{(j)} - \bar{ϕ})^{2}}{J_{t} - 2}$ and ${\check{σ}}_{t}^{(p) 2} = \frac{\sum_{j = 1}^{J_{t}} (ϕ^{(j)} - \bar{ϕ})^{2}}{J_{t}^{2} - 2 J_{t}}$.

We notice from Proposition 1(i) that the bias $E [{\tilde{σ}}_{t}^{(p) 2} - σ_{c}^{(p) 2}] = \frac{(2 J_{t} - 1) σ_{c}^{(p) 2}}{J_{t}^{2} - 2 J_{t}} > 0$ for $J_{t} > 2$. Therefore, under model risk, the predictive variance ${\tilde{σ}}_{t}^{(p) 2}$ is on average greater than the true variance $σ_{c}^{(p) 2}$. Furthermore, as the amount of historical data $J_{t}$ increases, both the bias and $Var [{\tilde{σ}}_{t}^{(p) 2}]$ converge to zero, so the predictive variance ${\tilde{σ}}_{t}^{(p) 2}$ converges to $σ_{c}^{(p) 2}$, which represents the protein growth-rate variance under perfect information. Given a particular realization of the historical data set $D_{t}$, Proposition 1(ii) is useful in practice as it allows a judgment on how much of the overall uncertainty in the protein growth rate is due to the inherent stochasticity of the fermentation process and how much is due to the model risk. Notice that the ratio of model risk to the inherent stochasticity, that is, ${\check{σ}}_{t}^{(p) 2} / {\hat{σ}}_{t}^{(p) 2} = 1 / ν_{t}^{(p)}$, depends only on the parameter $ν_{t}^{(p)}$, which is equal to $J_{t}$. This intuitively shows that the model risk becomes smaller (relative to the inherent stochasticity of the process) as the size of the historical data increases.
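The decomposition in Proposition 1(ii) is easy to evaluate for a given data set. The following Python sketch computes both components from a list of historical growth rates under the improper prior used above; the function name is hypothetical.

```python
import statistics


def predictive_variance_split(samples):
    """Split the predictive variance (17) into its two parts.

    Returns (inherent, model_risk), where
      inherent   = sum of squared deviations / (J - 2)
      model_risk = sum of squared deviations / (J^2 - 2J)
    as in Proposition 1(ii). Requires more than two samples.
    """
    J = len(samples)
    if J <= 2:
        raise ValueError("the predictive variance requires J > 2")
    mean = statistics.fmean(samples)
    ss = sum((x - mean) ** 2 for x in samples)
    inherent = ss / (J - 2)
    model_risk = ss / (J * J - 2 * J)
    return inherent, model_risk
```

The two returned values sum to the predictive variance in (17), and their ratio `model_risk / inherent` equals $1 / J_{t}$; with ten observations, for instance, model risk adds 10% on top of the inherent component.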

Notice that Proposition 1 also applies to the impurity growth model. To be specific, the same results hold for the predictive variance ${\tilde{σ}}_{t}^{(i) 2}$ of the impurity growth rate, as the functional form and the underlying modeling assumptions are the same as in the protein growth model. For brevity, we do not repeat those results in the paper.

4.2. Analytical Properties of the Optimal Policy

We start our analysis by first showing the monotonicity of the value function, and then present sufficient conditions for the existence of a control-limit policy with respect to the impurity level.

Proposition 2
Given the knowledge state $I_{t}$ , the value function $V_{t} (p_{t}, i_{t}, I_{t})$ is a non-increasing function of the impurity level $i_{t}$ and a non-decreasing function of the protein level $p_{t}$ .

Based on the monotonicity properties presented in Proposition 2, we can derive sufficient conditions for the existence of a control-limit policy with respect to the impurity level as follows.

Proposition 3
At any decision epoch $t$ with a given protein level and knowledge state, there exists a critical threshold $i_{t}^{⋆}$ such that the optimal decision is to harvest for the impurity level $i_{t} \geq i_{t}^{⋆}$ if the following condition holds for all $i_{t}^{+} > i_{t}^{-} \geq 0$ :
$\begin{aligned} c_{2} (i_{t}^{+} - i_{t}^{-}) & \leq γ r_{f} [Pr (i_{t + 1} \leq \bar{I} | i_{t}^{-}) - Pr (i_{t + 1} \leq \bar{I} | i_{t}^{+})] \\ - γ c_{1} \bar{P} Pr (i_{t + 1} \leq \bar{I} | i_{t}^{-}) \end{aligned}$
(18)
where $Pr (i_{t + 1} < \bar{I} | i_{t}) = Pr (i_{t} e^{{\tilde{Ψ}}_{t}} < \bar{I}) = \int_{- \infty}^{\ln \bar{I} - \ln i_{t}} f_{t}^{(i)} (Ψ_{t}) d Ψ_{t}$ is the probability that the process failure does not occur in the time period that starts with the impurity level $i_{t}$ at decision epoch $t$ .

Proposition 3 presents the existence of a critical threshold $i_{t}^{⋆}$ with respect to the impurity level: for a given protein level and knowledge state, if we harvest at a certain impurity level, we will also harvest at any higher level of impurity. That is, at a given knowledge state, the physical-state space can be split into a harvest zone and a continue zone indicating the optimal action. The notations $i_{t}^{+}$ and $i_{t}^{-}$ in Proposition 3 represent any two distinct values of the impurity state that must satisfy the sufficient condition (18) to assure the optimality of a control-limit policy with respect to the impurity level. If the condition (18) is satisfied for all $i_{t}^{+} > i_{t}^{-} \geq 0$, the optimal policy is guaranteed to be a control-limit policy with a critical threshold on the impurity level.

Note that the condition (18) is more likely to be satisfied as the failure penalty $r_{f}$ increases relative to $c_{1}$ and $c_{2}$. In practice, it is common that $r_{f}$ is much larger than $c_{1}$ and $c_{2}$, reflecting the fact that failures are undesired because of strict safety concerns, loss of reputation, and extra rework. In fact, the condition (18) holds for the realistic instances we studied in our case study, where the optimal policy indeed follows a threshold-type policy with respect to the impurity level (see Figure 4). However, we observe that the optimal policy does not follow a threshold-type structure with respect to the protein level in realistic instances (see Section 5.3 for an additional discussion on optimal policies based on an industry case study).

4.3. Myopic Policy

In this section, we consider a one-step look-ahead policy as a practically appealing alternative policy. We refer to it as the myopic policy because it makes the harvesting decisions by only comparing the reward of harvesting at the current decision epoch with the expected reward of harvesting at the next decision epoch. We study the myopic policy because it can be implemented by only maintaining a posterior distribution of the unknown growth-rate distribution parameters. It can be relevant to many small-sized biomanufacturing companies that may not have the necessary infrastructure or expertise to compute the optimal policy.

4.3.1. Myopic Policy Under Perfect Information

We first consider the case where the true parameters of the underlying stochastic model are known, referred to as the perfect-information case. That is, the parameters $μ_{c}^{(p)}$ and $σ_{c}^{(p) 2}$ for the protein growth rate and the parameters $μ_{c}^{(i)}$ and $σ_{c}^{(i) 2}$ for the impurity growth rate are known by the decision maker. The analysis in this section assumes that the probability of a negative growth rate is negligible (i.e., the realizations of the growth rate random variables $Φ_{t}$ and $Ψ_{t}$ are always non-negative), which is often the case in practice with a standard deviation of the normally distributed growth rates expected to be much smaller than their mean values.

Let $A$ denote the set of physical states at which harvesting is at least as good as continuing for exactly one more time period and then harvesting:

$A = \{(p, i) : r_{h} (p, i) \geq - c_{u} + γ E [R (p^{'}, i^{'}; H) | p, i]\}.$

(19)

The expectation in (19) is with respect to the underlying true growth model (1), and $(p^{'}, i^{'})$ denotes the protein and impurity levels at the next decision epoch after the continue action is taken in the current decision epoch at state $(p, i)$.

Definition 1 (Myopic Policy Under Perfect Information)

The policy that takes the harvest decision the first time the protein and impurity levels enter a state in $A$ is defined as the myopic policy under perfect information.
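To make membership in $A$ concrete, the expectation in (19) can be evaluated in closed form under a few assumptions that are ours rather than statements from the text: the protein and impurity growth rates are independent normal random variables, the next protein level is capped at $\bar{P}$, and a failure occurs when the next impurity level reaches $\bar{I}$. The Python sketch below uses standard lognormal partial expectations; all function and argument names are hypothetical.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal CDF


def expected_next_harvest_reward(p, i, mu_p, sd_p, mu_i, sd_i,
                                 p_max, i_max, r_f, c0, c1, c2):
    """E[R(p', i'; H) | p, i] for 0 < p < p_max and 0 < i < i_max."""
    li = math.log(i_max / i)
    lp = math.log(p_max / p)
    q = _PHI((li - mu_i) / sd_i)  # probability of no failure in the next period
    # E[min(p * exp(Phi), p_max)]: uncapped part plus capped part.
    exp_p = (p * math.exp(mu_p + sd_p ** 2 / 2) * _PHI((lp - mu_p - sd_p ** 2) / sd_p)
             + p_max * (1.0 - _PHI((lp - mu_p) / sd_p)))
    # E[i * exp(Psi) * 1{no failure}].
    exp_i = i * math.exp(mu_i + sd_i ** 2 / 2) * _PHI((li - mu_i - sd_i ** 2) / sd_i)
    return q * (c0 + c1 * exp_p) - c2 * exp_i - (1.0 - q) * r_f


def myopic_gap(p, i, c_u, gamma, **model):
    """r_h(p, i) + c_u - gamma * E[R(p', i'; H)]; harvest when this is >= 0."""
    r_now = model["c0"] + model["c1"] * p - model["c2"] * i
    return r_now + c_u - gamma * expected_next_harvest_reward(p, i, **model)
```

A state $(p, i)$ belongs to $A$ exactly when `myopic_gap` is non-negative. Under these assumptions and the case-study parameters of Section 5.1, the gap is negative at the starting state $(1.5, 2.0)$ (continue) and positive at $(20, 45)$, where a failure in the next period is almost certain (harvest).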

Our objective is to establish when the myopic policy is optimal in the perfect-information setting. Proposition 4 establishes the optimality of the myopic policy under some specific conditions.

Proposition 4
Consider a fermentation process starting with protein level $p_{0}$ and impurity level $i_{0}$ . The myopic policy is the optimal policy under perfect information, if $p_{0} \in [\underline{p}, \bar{P}]$ with
$\underline{p} = \exp \left\{ \ln \bar{P} - μ_{c}^{(p)} - σ_{c}^{(p) 2} - σ_{c}^{(p)} Φ^{- 1} \left[ \frac{e^{- μ_{c}^{(p)} - σ_{c}^{(p) 2} / 2}}{γ Φ \left( \frac{\ln \bar{I} - \ln i_{0} - μ_{c}^{(i)}}{σ_{c}^{(i)}} \right)} \right] \right\},$
(20)
and
$\begin{aligned} \frac{c_{2}}{γ} (i^{+} - i) & \leq r_{f} [Φ (\frac{\ln \bar{I} - \ln i - μ_{c}^{(i)}}{σ_{c}^{(i)}}) - Φ (\frac{\ln \bar{I} - \ln i^{+} - μ_{c}^{(i)}}{σ_{c}^{(i)}})] \\ - c_{2} \bar{I} e^{μ_{c}^{(i)} + σ_{c}^{(i) 2} / 2} [Φ (\frac{\ln \bar{I} - \ln i - μ_{c}^{(i)} - σ_{c}^{(i) 2}}{σ_{c}^{(i)}}) - Φ (\frac{\ln \bar{I} - \ln i^{+} - μ_{c}^{(i)} - σ_{c}^{(i) 2}}{σ_{c}^{(i)}})], \end{aligned}$
(21)
for all $i, i^{+} \in [i_{0}, \bar{I}]$ with $i < i^{+}$ .

The conditions in Proposition 4 assure that the physical state space $A$ is a closed set, meaning that when the process enters into set $A$ it never leaves the set $A$ . Therefore, the myopic policy turns out to be optimal. The conditions in Proposition 4 hold when the starting protein level and the failure cost are sufficiently large. Intuitively, this result can be interpreted as follows: It is wise to harvest now and not to take a costly failure risk in future periods if a sufficient amount of protein is already accumulated.

While equation (20) is the closed-form characterization of a starting protein-level lower bound for the optimality of the myopic policy, the condition (21) is not immediately intuitive. To make it easier to interpret, we simplify (21) by applying a Taylor series approximation. To be specific, by applying first-order Taylor series approximation of the nonlinear terms on both sides of inequality (21), it can be approximated as
$r_{f} \geq c_{2} \bar{I} [\frac{\sqrt{2 π} σ_{c}^{(i)}}{γ} + e^{μ_{c}^{(i)} + σ_{c}^{(i) 2} / 2}] .$
(22)
Note that the inequality (22) is more likely to hold as the failure cost $r_{f}$ increases or the maximum purification cost $c_{2} \bar{I}$ decreases. We provide the details of this approximation in Appendix B in the E-Companion.

4.3.2. Myopic Policy Under Model Risk

In this section, the true parameters of the underlying stochastic model for the growth rates are no longer known; that is, there is model risk. Our objective is to investigate how the model risk affects the harvesting decisions of the myopic policy. When there is model risk, the expected reward of harvesting at the next decision epoch is calculated by using the posterior predictive distributions of the growth rates (i.e., the distributions characterized in (9) and (10)). Let

$\begin{aligned} & \tilde{h} (α^{(p)}, {\tilde{σ}}^{(p)}, α^{(i)}, {\tilde{σ}}^{(i)}; p, i) \\ & \quad = r_{h} (p, i) + c_{u} - γ E [R (p^{'}, i^{'}; H) | p, i, α^{(p)}, {\tilde{σ}}^{(p)}, α^{(i)}, {\tilde{σ}}^{(i)}], \end{aligned}$

(23)

where $α^{(p)}$ and $α^{(i)}$ are the means and ${\tilde{σ}}^{(p)}$ and ${\tilde{σ}}^{(i)}$ are the standard deviations of the posterior predictive distributions for the protein and impurity growth rates, respectively. We first define the myopic policy under model risk.

Definition 2 (Myopic Policy Under Model Risk)

The policy that takes the harvest decision the first time the inequality $\tilde{h} (α^{(p)}, {\tilde{σ}}^{(p)}, α^{(i)}, {\tilde{σ}}^{(i)}; p, i) \geq 0$ holds is defined as the myopic policy under model risk.

We will study the effect of model risk on the harvesting decisions of the myopic policy by investigating how the so-called harvest boundary, which is given by

$\{(p, i) : \tilde{h} (α^{(p)}, {\tilde{σ}}^{(p)}, α^{(i)}, {\tilde{σ}}^{(i)}; p, i) = 0\},$

is influenced by the parameters of the posterior predictive distributions. Recall that each posterior predictive distribution is a t-distribution. To generate further insights in the remainder of this section, we use normal approximations of these posterior predictive distributions (we later numerically investigate the effect of the normal approximation with fewer than 30 data points, and find that its impact on the performance of the myopic policy is negligible). When the posterior predictive distributions are normal, since the true growth rates also follow normal distributions, we can focus on the difference between the posterior predictive distribution parameters and their true counterparts to generate insights into the effect of this difference on the shape of the harvest boundary. We present Figure 2 as an illustration.

Figure 2 plots the harvest boundary for various levels of model risk. To be specific, Figure 2 introduces a scaling factor $k$ that links each parameter to its true counterpart, and plots the harvest boundaries for some relevant values of $k$. For example, $k = 1$ represents the case where each predictive-distribution parameter reduces to its true counterpart; for instance, $α_{t}^{(p)}$, the mean of the predictive distribution of the protein growth rate, becomes equal to $μ_{c}^{(p)}$, the true mean of the protein growth rate. Therefore, the resulting harvest boundary becomes equivalent to the harvest boundary of the myopic policy under perfect information. Recall from Proposition 4 that the myopic policy is optimal under perfect information when certain conditions hold. For the illustrative example in Figure 2, $\underline{p}$ from (20) is equal to $17.34$, and condition (21) is satisfied. Thus, for protein levels above $17.34$, Figure 2 already gives an insight into how close the harvest boundary of the myopic policy under model risk is to the harvest boundary of the optimal policy under perfect information.

Figure 2. Illustration of how the harvest boundary is affected by the model risk (based on the case study parameters presented in Section 5.1).

Figure 2 (top, left) shows that as the predictive mean of protein $α_{t}^{(p)}$ increases, the harvest boundary moves up, indicating that it becomes less beneficial to harvest immediately when the potential future gain in protein amount is high. Also, the difference on the left part (i.e., small amount of protein $p_{t}$ ) is larger than on the right (i.e., large amount of protein $p_{t}$ ), since as protein approaches the harvesting limit, an increase in $α_{t}^{(p)}$ becomes less influential on the harvesting decision. Figure 2 (top, right) shows that as the predictive mean of impurity $α_{t}^{(i)}$ increases, the harvest boundary moves down. This is intuitive because a higher $α_{t}^{(i)}$ implies a higher probability of a fermentation failure, so a smaller amount of current impurity level is sufficient to trigger a harvest action. According to Figure 2 (bottom, left), as the parameter ${\tilde{σ}}_{t}^{(p)}$ increases, the decision boundary shifts to the left. That is, with higher uncertainty of protein growth, we tend to harvest with a smaller amount of protein especially when it is close to the maximum protein level. On the other hand, as the parameter ${\tilde{σ}}_{t}^{(i)}$ increases, Figure 2 (bottom, right) shows that the decision boundary moves down. A larger ${\tilde{σ}}_{t}^{(i)}$ implies higher uncertainty in the impurity growth rate, and we tend to harvest with a smaller amount of impurity to avoid a batch failure.

Proposition 5 formalizes our observations from Figure 2 by establishing the monotonicity of the function $\tilde{h} (α^{(p)}, {\tilde{σ}}^{(p)}, α^{(i)}, {\tilde{σ}}^{(i)}; p, i)$ that characterizes the harvest boundary with respect to the parameters of the predictive growth-rate distributions.

Proposition 5

The function $\tilde{h} (α^{(p)}, {\tilde{σ}}^{(p)}, α^{(i)}, {\tilde{σ}}^{(i)}; p, i)$ is

(i) increasing in $α^{(i)}$;

(ii) decreasing in $α^{(p)}$;

(iii) increasing in ${\tilde{σ}}^{(i)}$, if $\ln \bar{I} - \ln i > α^{(i)}$ and

$\begin{aligned} & {\tilde{σ}}^{(i)} Φ \left( \frac{\ln \bar{I} - \ln i - α^{(i)} - {\tilde{σ}}^{(i) 2}}{{\tilde{σ}}^{(i)}} \right) \\ & \quad - ϕ \left( \frac{\ln \bar{I} - \ln i - α^{(i)} - {\tilde{σ}}^{(i) 2}}{{\tilde{σ}}^{(i)}} \right) > 0; \end{aligned}$

(24)

(iv) decreasing in ${\tilde{σ}}^{(p)}$, if and only if

$\begin{aligned} & {\tilde{σ}}^{(p)} Φ \left( \frac{\ln \bar{P} - \ln p - α^{(p)} - {\tilde{σ}}^{(p) 2}}{{\tilde{σ}}^{(p)}} \right) \\ & \quad - ϕ \left( \frac{\ln \bar{P} - \ln p - α^{(p)} - {\tilde{σ}}^{(p) 2}}{{\tilde{σ}}^{(p)}} \right) > 0. \end{aligned}$

(25)

Proposition 5 can be useful in understanding how the model risk would influence the harvest boundary of the myopic policy. For example, if the predictive mean $α^{(i)}$ of the impurity growth rate increases, since $\tilde{h} (α^{(p)}, {\tilde{σ}}^{(p)}, α^{(i)}, {\tilde{σ}}^{(i)}; p, i)$ is increasing by Proposition 5(i), the harvest zone of the physical-state space, denoted by $\{(p, i) : \tilde{h} (α^{(p)}, {\tilde{σ}}^{(p)}, α^{(i)}, {\tilde{σ}}^{(i)}; p, i) \geq 0\}$, will enlarge, while the continue zone $\{(p, i) : \tilde{h} (α^{(p)}, {\tilde{σ}}^{(p)}, α^{(i)}, {\tilde{σ}}^{(i)}; p, i) < 0\}$ will shrink. In other words, the decision maker tends to be more willing to take the harvest decisions for larger $α^{(i)}$. Similar insights can be derived for all other predictive parameters. Notice that the harvest boundary can behave differently as a function of the predictive standard deviations (i.e., ${\tilde{σ}}^{(i)}$ and ${\tilde{σ}}^{(p)}$) at different parts of the physical-state space, since the monotonicity properties with respect to ${\tilde{σ}}^{(i)}$ and ${\tilde{σ}}^{(p)}$ depend on whether the conditions (24) and (25) hold for particular values of $(p, i)$ in the physical-state space.
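Whether conditions (24) and (25) hold at a particular state can be checked numerically. The Python sketch below assumes that $Φ$ and $ϕ$ in (24) and (25) denote the standard normal distribution function and density, respectively; the function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def sd_condition_holds(limit, level, alpha, sd):
    """Evaluate sd * Phi(z) - pdf(z) > 0 with
    z = (ln(limit) - ln(level) - alpha - sd^2) / sd, as in (24) and (25)."""
    z = (math.log(limit) - math.log(level) - alpha - sd ** 2) / sd
    return sd * _STD.cdf(z) - _STD.pdf(z) > 0.0


def prop5_iii_sufficient(i, i_max, alpha_i, sd_i):
    """Sufficient condition of Proposition 5(iii) at impurity level i."""
    return (math.log(i_max) - math.log(i) > alpha_i
            and sd_condition_holds(i_max, i, alpha_i, sd_i))


def prop5_iv_condition(p, p_max, alpha_p, sd_p):
    """Necessary and sufficient condition (25) of Proposition 5(iv) at protein level p."""
    return sd_condition_holds(p_max, p, alpha_p, sd_p)
```

With predictive parameters equal to the case-study values ($α = 0.488$, $\tilde{σ} = 0.144$, $\bar{I} = 50$, $\bar{P} = 30$), the sufficient condition in (iii) holds at $i = 10$ but not at $i = 30$, and condition (25) holds at $p = 5$ but not at $p = 20$, which illustrates how the monotonicity in the standard deviations can differ across the physical-state space.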

Proposition 6

As the size of the historical data $D_{t}$ approaches infinity (i.e., as $J_{t} \to \infty$ ), the myopic policy under model risk becomes equivalent to the myopic policy under perfect information.

Proposition 6 is useful in practice as it tells us that the myopic policy under model risk becomes similar to the myopic policy under perfect information, which we already know to be optimal under the conditions in Proposition 4. Intuitively, this represents the situation with a sufficient amount of historical data such that the underlying fermentation process is already learned accurately.

4.4. Reinforcement Learning Under Model Risk (RL with MR)

In this section, we introduce our solution approach to compute the optimal policy that maximizes the objective function in (14). Different from the myopic policy in Section 4.3, we now consider the effect of learning from future data on harvesting decisions (in a forward-looking manner). Recall that the harvest action at physical states $(p_{t}, i_{t})$ ends the fermentation with a deterministic (known) reward. Therefore, we only need to estimate the total reward associated with taking the continue action and following the optimal policy thereafter, denoted by the corresponding Q-function

$Q_{t} (p_{t}, i_{t}, I_{t}; C) ≜ - c_{u} + γ E [\max_{a_{t + 1} \in A} Q_{t + 1} (p_{t + 1}, i_{t + 1}, I_{t + 1}; a_{t + 1})],$

(26)

for $t = 0, 1, \dots, \bar{T} - 1$. On the other hand, the Q-function associated with the harvest action is denoted with

$Q_{t} (p_{t}, i_{t}, I_{t}; H) ≜ \begin{cases} r_{h} (p_{t}, i_{t}), & \text{if } i_{t} < \bar{I} \\ - r_{f}, & \text{if } i_{t} = \bar{I} \end{cases}$

(27)

for $t = 0, 1, \dots, \bar{T}$. Recall that harvesting is the only feasible action for $i_{t} = \bar{I}$, $p_{t} = \bar{P}$, or $t = \bar{T}$, and the harvest action takes the state of the system to a cost-free absorbing state. For a fermentation process that has not yet been harvested at decision epoch $t$, the value function is given by $V_{t} (p_{t}, i_{t}, I_{t}) = \max_{a_{t}} Q_{t} (p_{t}, i_{t}, I_{t}; a_{t})$. Theoretically, the optimal policy that maps any possible hyper state to an action can be obtained through backward dynamic programming (this is also referred to as offline planning).

However, solving this dynamic program is notoriously difficult and also not necessary in practice, given that the optimal policy is only needed starting from a specific physical state (which evolves by visiting certain states more likely than others) in the real-life execution of the fermentation process. Therefore, we adopt a solution approach that executes the policy in an online manner, which means that we focus on estimating the Q-function in (26) at a particular current state $(p_{t}, i_{t}, I_{t})$, and decide to continue or harvest the fermentation process by comparing it with the harvesting reward in (27). After the selected action is executed, the next decision epoch starts with a new hyper state at which the entire procedure is repeated. The details of the solution procedure are provided in Appendix EC.2.1.
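The online idea can be illustrated with a generic sparse-sampling look-ahead in Python. This is not the procedure of Appendix EC.2.1, which is not reproduced here; it is a sketch under stated assumptions: knowledge states are tuples $(α, ν, λ, β)$ updated by the textbook normal-gamma rule, growth rates are drawn from the corresponding Student-t predictive, states are capped at $\bar{P}$ and $\bar{I}$, and the look-ahead is truncated at a fixed depth by assuming a harvest there. All names, including the `PARAMS` dictionary layout, are hypothetical; the numerical values are the case-study values of Section 5.1.

```python
import math
import random

PARAMS = dict(p_max=30.0, i_max=50.0, t_max=8, c_u=2.0, r_f=880.0,
              c0=0.0, c1=10.0, c2=1.0, gamma=1.0)


def q_harvest(p, i, prm):
    """Q-function of the harvest action, as in (27)."""
    if i >= prm["i_max"]:
        return -prm["r_f"]
    return prm["c0"] + prm["c1"] * p - prm["c2"] * i


def draw(k, rng):
    """Sample a growth rate from the assumed Student-t predictive."""
    alpha, nu, lam, beta = k
    dof = 2.0 * lam
    scale = math.sqrt(beta * (nu + 1.0) / (lam * nu))
    chi2 = rng.gammavariate(dof / 2.0, 2.0)
    return alpha + scale * rng.gauss(0.0, 1.0) / math.sqrt(chi2 / dof)


def update(k, x):
    """Assumed normal-gamma update of a knowledge state with observation x."""
    alpha, nu, lam, beta = k
    return ((nu * alpha + x) / (nu + 1.0), nu + 1.0, lam + 0.5,
            beta + nu * (x - alpha) ** 2 / (2.0 * (nu + 1.0)))


def value(p, i, kp, ki, t, prm, rng, width, depth):
    """Estimated value: harvest if forced or at the depth limit, else take the max."""
    qh = q_harvest(p, i, prm)
    forced = p >= prm["p_max"] or i >= prm["i_max"] or t >= prm["t_max"]
    if forced or depth == 0:
        return qh
    return max(qh, q_continue(p, i, kp, ki, t, prm, rng, width, depth))


def q_continue(p, i, kp, ki, t, prm, rng, width, depth):
    """Sample-average estimate of the continue Q-function in (26)."""
    total = 0.0
    for _ in range(width):
        phi, psi = draw(kp, rng), draw(ki, rng)
        p2 = min(p * math.exp(phi), prm["p_max"])
        i2 = min(i * math.exp(psi), prm["i_max"])
        total += value(p2, i2, update(kp, phi), update(ki, psi),
                       t + 1, prm, rng, width, depth - 1)
    return -prm["c_u"] + prm["gamma"] * total / width
```

At each decision epoch, one would compare `q_continue(...)` with `q_harvest(...)` at the current hyper state and take the action with the larger value. With `depth=1` the estimate reduces to a sampled version of the one-step look-ahead of Section 4.3; larger depths let the sampled knowledge-state updates influence the decision, at a cost that grows as `width ** depth`.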

5. Numerical Analysis

We present a case study motivated by the implementation at MSD Animal Health. To protect confidentiality, we disguised MSD’s original data and used representative values to generate insights.

5.1. Experiment Setting and Analysis Overview

The starting protein and impurity amounts for each batch are $p_{0} = 1.5$ g and $i_{0} = 2.0$ g. The harvesting limit on protein is $\bar{P} = 30$ g, the batch failure impurity threshold is $\bar{I} = 50$ g, and the maximum number of time periods for a batch to reach the stationary phase is $\bar{T} = 8$. To represent general practice, we considered the following cost and reward structures: harvest reward $r_{h} (p, i) = 10 p - i$, one-step operation cost $c_{u} = 2$, and batch failure penalty $r_{f} = 880$, with no discounting. These cost structures are identified based on input received from our industry partners (see Appendix EC.3.1 for a sensitivity analysis of costs). Based on historical production data, we consider the following protein and impurity growth parameters $μ_{c}^{(p)} = 0.488, σ_{c}^{(p)} = 0.144$, $μ_{c}^{(i)} = 0.488, σ_{c}^{(i)} = 0.144$ as the underlying truth. The realized values of growth rates ranged between $0.2$ and $0.6$ in our production data. The protein and impurity generation model in (1) was validated with real-world fermentation data.² Consistent with practice, the length of a time period in our discrete-time model is equal to six hours (i.e., eight decision epochs during a maximum of 48 h of fermentation). Appendix EC.3.2 presents an additional analysis related to the frequency of decision epochs.
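For orientation, the experiment setting can be turned into a small simulator of uncontrolled batch trajectories. The Python sketch below draws independent normal growth rates with the stated true parameters and stops when a limit is reached; the independence of the two growth rates, the capping at the limits, and the function name are assumptions made for illustration.

```python
import math
import random


def simulate_batch(rng, p0=1.5, i0=2.0, p_max=30.0, i_max=50.0, t_max=8,
                   mu_p=0.488, sd_p=0.144, mu_i=0.488, sd_i=0.144):
    """Simulate (protein, impurity) levels per epoch if the batch is never harvested early.

    Stops at the first epoch where the protein limit or the impurity
    threshold is reached, or after t_max periods.
    """
    p, i = p0, i0
    path = [(p, i)]
    for _ in range(t_max):
        p = min(p * math.exp(rng.gauss(mu_p, sd_p)), p_max)
        i = min(i * math.exp(rng.gauss(mu_i, sd_i)), i_max)
        path.append((p, i))
        if p >= p_max or i >= i_max:
            break
    return path
```

At the mean growth rate of $0.488$ per period, protein would reach $\bar{P}$ after about $\ln (30 / 1.5) / 0.488 \approx 6.1$ periods and impurity would reach $\bar{I}$ after about $\ln (50 / 2) / 0.488 \approx 6.6$ periods, so both limits become relevant within the eight-period horizon.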

5.1.1 Analysis Overview

We use the case study to generate insights for practitioners. For this purpose, we consider various practically relevant strategies as a benchmark and compare their performance:

(1) Common practice (CP). This strategy represents a common rule of thumb used in the industry. It uses a “fixed threshold” approach, harvesting when the protein or impurity level exceeds a predetermined threshold value. In our case study, CP harvests when the impurity amount exceeds $60\%$ of the maximum permitted amount $\bar{I}$, when the protein amount reaches the limit $\bar{P}$, or when the fermentation process transitions to the stationary phase.
(2) Reinforcement learning with model risk (RL with MR). The decision-maker considers both the inherent stochasticity of fermentation processes and the model risk caused by limited historical data. RL with MR represents the optimal policy under our proposed Bayesian decision-making framework based on both model risk and inherent uncertainty (Section 4.4).
(3) Perfect information MDP (PI-MDP). To establish a benchmark, we consider the setting where the decision-maker has perfect information on the underlying true model of fermentation dynamics (i.e., no model risk, but only inherent stochasticity). Theoretically, this setting represents the best possible performance that can be achieved with an infinite amount of historical data. We obtain the PI-MDP policy by using the same approach as for RL with MR (Appendix EC.2.1) with one key difference: the true model parameters $μ_{c}^{(p)}$, $μ_{c}^{(i)}$, $σ_{c}^{(p)}$, and $σ_{c}^{(i)}$ are assumed known, and hence the state space only includes the physical states and there is no sampling from the information states (i.e., the growth rates are sampled from the true fermentation model).
(4) Reinforcement learning ignoring model risk (RL ignoring MR). This strategy represents the case where the decision-maker ignores the model risk by simply using the point estimates of the unknown true parameters $μ_{c}^{(p)}$, $μ_{c}^{(i)}$, $σ_{c}^{(p)}$, and $σ_{c}^{(i)}$ obtained from limited historical data. That is, the point estimates are used as if they were the true parameters. We obtain the “RL ignoring MR” policy in the same way as we solve the PI-MDP, but with the unknown true parameters replaced by their maximum likelihood estimates. We consider this policy because it represents a common approach that uses historical data for model calibration but ignores the effects of estimation errors and statistical learning on decisions.
(5) Myopic policy. The decision-maker considers both the inherent stochasticity and the model risk (as described in Section 4.3.2). However, the harvesting decision is made by only comparing the harvesting reward at the current decision epoch with the expected reward of harvesting in the next decision epoch. We propose the myopic policy as a relevant benchmark because it is easy to implement, especially for companies that do not have the infrastructure for Bayesian learning implementations.
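The CP rule in item (1) and the myopic comparison in item (5) are simple enough to state as code. The sketch below uses hypothetical function names. The stationary-phase indicator is passed in as a flag because the way the transition is detected is not specified here, and the caller is assumed to supply the expected next-epoch harvest reward (net of any operating cost) for the myopic rule.

```python
def cp_should_harvest(
    p: float,
    i: float,
    stationary: bool,
    P_bar: float = 30.0,
    I_bar: float = 50.0,
    impurity_fraction: float = 0.6,
) -> bool:
    """Fixed-threshold rule: harvest on high impurity, protein limit, or stationary phase."""
    return i > impurity_fraction * I_bar or p >= P_bar or stationary


def myopic_should_harvest(r_now: float, expected_r_next: float) -> bool:
    """Harvest if harvesting now is at least as good as the expected reward of harvesting next epoch."""
    return r_now >= expected_r_next


# Hypothetical states, for illustration only.
print(cp_should_harvest(p=10.0, i=20.0, stationary=False))
print(cp_should_harvest(p=10.0, i=31.0, stationary=False))
print(myopic_should_harvest(r_now=100.0, expected_r_next=120.0))
```

Neither rule looks more than one epoch ahead, which is what distinguishes them from the strategies that estimate the full continuation value.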

Table 1.
The mean and standard deviation of the total reward achieved by different strategies.

Strategy	${\hat{ρ}}^{c} (π)$	% of PI-MDP	${\hat{SD}}^{c} (π)$	% of PI-MDP
PI-MDP	177.23	100.0%	125.86	100.0%
CP	97.40	55.0%	318.19	252.8%
$J_{0} = 3$
RL ignoring MR	132.00	74.5%	221.68	176.1%
Myopic policy	151.22	85.3%	166.14	132.0%
RL with MR	159.16	89.8%	128.76	102.3%
$J_{0} = 10$
RL ignoring MR	143.47	81.0%	196.39	156.0%
Myopic policy	168.21	94.9%	166.31	132.1%
RL with MR	171.63	96.8%	126.68	100.7%
$J_{0} = 20$
RL ignoring MR	153.98	86.9%	165.79	131.7%
Myopic policy	175.34	98.9%	127.74	101.5%
RL with MR	176.30	99.5%	126.26	100.3%

PI-MDP = perfect information Markov decision processes; CP = common practice; RL = reinforcement learning; MR = model risk; SD = standard deviation.

5.2. Performance Comparison

We now evaluate the performance of benchmark strategies described in Section 5.1. In particular, we focus on the expected total reward ${\hat{ρ}}^{c} (π)$ , and the standard deviation ${\hat{SD}}^{c} (π)$ under different sizes of historical data (i.e., $J_{0} = 3, 10, 20$ ) and harvest policy $π$ , calculated with 100 simulation replications as described in Appendix EC.2.1. Table 1 reports the performance of the considered strategies with the starting state $p_{0} = 1.5$ and $i_{0} = 2$ . Recall that harvesting policies $π$ under both PI-MDP and CP are independent of data size $J_{0}$ . The column labeled “% of PI-MDP” in Table 1 uses the perfect information setting (PI-MDP) as a benchmark to assess the performance of the considered strategies.
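The two performance metrics and the “% of PI-MDP” columns can be reproduced with a few lines. In this sketch the function names are hypothetical, and the use of the sample standard deviation (divisor $n - 1$) is an assumption, since the estimator is defined in the appendix rather than here. The numerical check uses only values reported in Table 1.

```python
from statistics import fmean, stdev
from typing import Sequence, Tuple


def summarize(total_rewards: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and sample standard deviation of per-replication total rewards."""
    return fmean(total_rewards), stdev(total_rewards)


def percent_of_benchmark(value: float, benchmark: float) -> float:
    """Express a metric as a percentage of the PI-MDP value, rounded as in Table 1."""
    return round(100.0 * value / benchmark, 1)


PI_MDP_MEAN, PI_MDP_SD = 177.23, 125.86        # Table 1, PI-MDP row
RL_WITH_MR_J3 = (159.16, 128.76)               # Table 1, RL with MR, J_0 = 3

print(percent_of_benchmark(RL_WITH_MR_J3[0], PI_MDP_MEAN))  # mean reward column
print(percent_of_benchmark(RL_WITH_MR_J3[1], PI_MDP_SD))    # standard deviation column
```

A value above 100% in the standard-deviation column means more batch-to-batch variability than under perfect information, so lower is better there, whereas higher is better in the mean-reward column.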

We observe from Table 1 that the strategy RL with MR provides substantial benefits (in terms of the performance metrics ${\hat{ρ}}^{c}$ and ${\hat{SD}}^{c}$ ) compared to CP, RL ignoring MR, and the myopic policy. In addition, we notice that the average reward of the data-dependent strategies increases as the amount of historical data increases, while variability generally decreases (the standard deviation of the myopic policy is essentially unchanged between $J_{0} = 3$ and $J_{0} = 10$ ). In this specific case study, we also observe that CP does not perform well compared to PI-MDP. For practitioners, these results emphasize the business value (and the potential impact) of accounting for the model risk in harvesting decisions. In addition, we observe that the strategy RL with MR results in a lower standard deviation ${\hat{SD}}^{c}$ compared to the current practice CP, even when the amount of historical data is small $(J_{0} = 3)$ . This observation is interesting because the objective of the optimization model is to maximize the expected total reward (not to minimize variability). We obtained a similar result from the implementation at MSD (i.e., variability was reduced after the implementation, as discussed in Section 6.1). For managers, the results reported in Table 1 underscore the importance of considering both model risk and the inherent stochasticity of fermentation systems to achieve higher profits.

Figure 3 provides an analysis of the harvesting times under PI-MDP, CP, and RL with MR strategies to better understand the reason for the performance difference between these strategies. Specifically, we record the decision epoch at which the harvesting decision is realized in each of the 100 simulation replications (similar to obtaining the results in Table 1 as described in Appendix EC.2.2), and we plot the histogram of these realized harvesting decision epochs (recall that the time between two decision epochs is 6 h). As a managerial insight, Figure 3 indicates that the poor performance of the CP strategy in our case study is due to late harvesting decisions, as can be seen by comparing the harvesting decision epochs under CP with those under PI-MDP. This can be explained by the tendency of the CP policy to collect as much protein as possible by ignoring impurity-related costs and failure risk. However, our proposed “RL with MR” strategy takes into account the model risk, and the distribution of harvesting moments becomes similar to that of the PI-MDP, leading to the superior performance of “RL with MR” as observed in Table 1.

Figure 3.

The frequency of the number of decision epochs at which the harvesting decision is made for the case with $J_{0} = 3$ (out of 100 simulation replications).

5.3. Impact of Limited Historical Data on Harvesting Decisions

We illustrate the impact of model risk on harvesting decisions. For this purpose, our analysis considers two different sizes of historical data, $J_{0} = 3, 20$ . We use the strategy PI-MDP as a benchmark to represent the case with perfect information. Figure 4 represents the optimal harvesting policy under the strategy PI-MDP for the physical states $p_{t} \in [1.5, 30]$ and $i_{t} \in [2.0, 50]$ . Moreover, Figure 4 shows the fixed-threshold-based CP, and the mean harvesting threshold under the strategy RL with MR with its corresponding $95\%$ confidence band.

Figure 4.

Optimal harvesting thresholds (above curve denotes harvest region) with $J_{0} \in \{3, 20\}$ under the strategies PI-MDP, RL with MR, and CP. PI-MDP = perfect information Markov decision processes; CP = common practice; RL = reinforcement learning; MR = model risk.

Figure 4 indicates how the optimal harvesting threshold moves as the amount of historical data $J_{0}$ increases from $3$ to $20$ . As the amount of historical data increases, we see that the $95\%$ confidence band of the harvest boundary shrinks and moves closer to the boundary under the optimal harvest policy with perfect information. Finally, the structure of the RL with MR policy in Figure 4 verifies our analytical finding on the optimality of a threshold-type policy with respect to the impurity level, but shows that a similar structure does not hold with respect to the protein level. Intuitively, if the protein level is too close to zero, the impurity/protein ratio is already too high and it becomes beneficial to stop the process by harvesting. On the other hand, if the protein level is too close to its upper limit, harvesting becomes more advantageous again, as continuing the fermentation would only add more impurities with little gain in protein.

5.4. Impact of Prior Information

In industrial applications, contextual information may be available either through expert judgments or historical data from similar processes, and this information can provide an informed starting point to build prior information on unknown process parameters. Our results so far assumed non-informative priors when there is no historical data (see Appendix EC.2.1). Considering the availability of historical data of size $J_{0}$ (collected from the fermentation process of interest), we built the prior distribution by updating the non-informative prior with this historical data as described in Section 3.2. Thus, the change of values in Table 1 with respect to $J_{0}$ shows how the amount of correct prior information influences the total rewards. For example, we observe in Table 1 that the amount of prior information obtained from $J_{0} = 3$ data points leads to $89.8\%$ of the true optimal reward for the “RL with MR” policy, while this value reaches $99.5\%$ when the prior information is obtained from $J_{0} = 20$ data points from the fermentation process. Therefore, the analysis in $J_{0}$ can also help practitioners understand how much data is needed to achieve high performance.
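To illustrate how a prior can be built by updating a starting belief with $J_{0}$ historical observations, the sketch below applies the standard conjugate update for a normal distribution with unknown mean and variance (a normal-inverse-gamma belief). This is an assumption made for illustration: the exact update of Section 3.2 is not reproduced here, and the class name, the function name, the starting hyperparameters, and the three growth-rate values are all hypothetical.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Sequence


@dataclass(frozen=True)
class NormalInverseGamma:
    mu: float     # location of the belief about the mean growth rate
    kappa: float  # strength of that belief, in pseudo-observations
    alpha: float  # shape of the belief about the variance
    beta: float   # scale of the belief about the variance


def update(prior: NormalInverseGamma, growth_rates: Sequence[float]) -> NormalInverseGamma:
    """Conjugate update of a normal-inverse-gamma belief with observed growth rates."""
    n = len(growth_rates)
    if n == 0:
        return prior
    xbar = fmean(growth_rates)
    ss = sum((x - xbar) ** 2 for x in growth_rates)  # within-sample sum of squares
    kappa_n = prior.kappa + n
    mu_n = (prior.kappa * prior.mu + n * xbar) / kappa_n
    alpha_n = prior.alpha + n / 2.0
    beta_n = (
        prior.beta
        + 0.5 * ss
        + prior.kappa * n * (xbar - prior.mu) ** 2 / (2.0 * kappa_n)
    )
    return NormalInverseGamma(mu_n, kappa_n, alpha_n, beta_n)


# Hypothetical weak starting belief and J_0 = 3 made-up growth rates.
start = NormalInverseGamma(mu=0.0, kappa=1.0, alpha=1.0, beta=1.0)
print(update(start, [0.35, 0.50, 0.55]))
```

Under this kind of update, processing the batches one at a time gives the same belief as processing them together, which is what allows the knowledge state to be carried forward from batch to batch.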

The prior information may not always be an accurate representation of reality. For example, two fermentation processes that are thought to be similar (e.g., because they use similar seed cultures) may eventually behave very differently. In Table 2, we investigate how the total rewards and harvesting times would change if the prior information were obtained from historical data generated from another fermentation process with mean protein and impurity growth rates equal to $k μ_{c}^{(p)}$ and $k μ_{c}^{(i)}$ , respectively, for $k \in \{0.25, 0.5, 1, 2, 4\}$ . Here, $k$ represents a deviation factor of the historical data from the true fermentation process (notice that $k = 1$ means that the historical data has been collected from the true fermentation process and it leads to the results in Table 1).

Table 2.
The mean and standard deviation of the total rewards and harvesting decision epochs under the RL with MR policy starting with different prior information obtained from $J_{0} = 3$ data points.

$k$	${\hat{ρ}}^{c} (π)$	${\hat{SD}}^{c} (π)$	Avg. $T$	Std. dev. $T$
0.25	122.60	221.07	5.45	1.05
0.5	147.57	166.15	5.45	0.77
1	159.16	128.76	5.42	0.82
2	127.22	69.10	4.69	0.90
4	34.86	23.42	1.98	0.98

RL = reinforcement learning; MR = model risk.

Table 2 allows us to understand how the quality of the prior information affects the total rewards and the harvesting times. For example, building the prior by using a data set that comes from a process that has half of the mean growth rates of the current process (i.e., $k = 0.5$ ) leads to a smaller reduction in rewards than building the prior by using a data set from a process with twice the mean growth rates (i.e., $k = 2$ ). A similar observation can be made by comparing $k = 0.25$ with $k = 4$ . As a managerial insight, we observe that overestimating the mean growth rates seems to be worse than underestimating them for the problem instances considered in Table 2. This is because overestimating the mean growth rates leads to premature harvesting decisions in our problem configuration, as shown by the smaller average harvesting epochs for $k = 2$ and $k = 4$ in Table 2.
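The asymmetry between underestimating and overestimating the growth rates can be checked directly from the values reported in Table 2. The sketch below is only arithmetic on those reported numbers; the dictionary and function names are hypothetical.

```python
# Mean total reward and average harvesting epoch by deviation factor k (Table 2).
TABLE2_MEAN_REWARD = {0.25: 122.60, 0.5: 147.57, 1.0: 159.16, 2.0: 127.22, 4.0: 34.86}
TABLE2_AVG_EPOCH = {0.25: 5.45, 0.5: 5.45, 1.0: 5.42, 2.0: 4.69, 4.0: 1.98}


def reward_loss(k: float) -> float:
    """Reduction in mean total reward relative to a correctly specified prior (k = 1)."""
    return round(TABLE2_MEAN_REWARD[1.0] - TABLE2_MEAN_REWARD[k], 2)


# Compare each underestimation factor with its reciprocal overestimation factor.
for under, over in [(0.5, 2.0), (0.25, 4.0)]:
    print(f"k={under}: loss {reward_loss(under)}   k={over}: loss {reward_loss(over)}")

# Overestimated growth rates coincide with earlier average harvesting epochs.
print(TABLE2_AVG_EPOCH[2.0] < TABLE2_AVG_EPOCH[1.0], TABLE2_AVG_EPOCH[4.0] < TABLE2_AVG_EPOCH[1.0])
```

In both pairs the loss for the overestimating prior is larger than for its underestimating counterpart, in line with the managerial insight stated above.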

6. Implementation at MSD

We quantify the real-world impact obtained at MSD’s daily operations in Section 6.1 and elaborate on the implementation process in Section 6.2.

6.1. Impact

The optimization framework has been implemented at MSD since 2019. The project focused on various products manufactured in Boxmeer and had a significant impact on business metrics. As an example, Figure 5 shows the implementation results for a specific product. The x-axis represents time and the y-axis represents batch yield (the y-axis starts at 0, but its scale is not shown for confidentiality reasons). In this figure, “batch yield” represents $p - i$ , which is the total amount of protein minus the impurities present in the batch at the time of harvest. The black dots in Figure 5 indicate the batch yield before implementation, while the red dots correspond to the batch yield after implementation. The figure shows that the average batch yield for this product increased by approximately 50%, while the batch-to-batch variability decreased significantly after implementation.

Figure 5.

Performance of different batches of the same product produced over time: black dots represent the performance before implementation. Red dots represent the performance after implementation.

Prior to implementation, decisions were made based on domain knowledge and common industry guidelines (e.g., the so-called “common practice” strategy based on fixed thresholds and rules of thumb, as described in Section 5). However, these prior approaches did not systematically exploit any optimization framework. After implementing the OR-based framework, the company was able to make better use of the historical data and dynamically optimize operational decisions.

Recall that Figure 5 illustrates the results obtained for one particular product. We focused on this product due to the availability of a large amount of historical data, allowing us to see the benefit gained from the proposed optimization framework when the inherent fermentation uncertainty is more important relative to the model risk. We now present insights based on all products within the scope of the implementation during 2019–2021 (including products with limited data). In this setting, our objective was to quantify the impact of the learning-by-doing framework. For this purpose, we first collected information on the “expected” batch yield (ex-ante) and the “actual” batch yield (ex-post) obtained for each batch produced in 2019–2021. The expected batch yield represents our predicted value of the batch yield under the harvesting policy used for that batch, whereas the actual batch yield denotes the realized batch yield at the time of harvest. Then, we calculated the absolute percentage difference (APD) between the expected and actual values for each batch, with the expected yield in the denominator. We defined the measure APD to understand how our prediction capability changed over time as a result of the learning-by-doing framework. For ease of exposition, Figure 6 plots the mean APD values on a monthly basis (i.e., the average of APD values across all batches produced in a certain month). In this figure, the implementation started around January 2019. We see a clear downward trend in Figure 6, indicating that our predictive capability continuously improved after implementing the data-driven decision framework. We quantified the impact in terms of batch yield (rather than financial figures) for confidentiality reasons.
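The APD measure and its monthly aggregation can be sketched as follows. The function names, the record layout (month label, expected yield, actual yield), and the three example batches are hypothetical; only the definition of APD, with the expected yield in the denominator, is taken from the text.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, Iterable, List, Tuple


def apd(expected: float, actual: float) -> float:
    """Absolute percentage difference between expected and actual batch yield."""
    return 100.0 * abs(expected - actual) / abs(expected)


def monthly_mean_apd(batches: Iterable[Tuple[str, float, float]]) -> Dict[str, float]:
    """Average the APD over all batches produced in the same month."""
    by_month: Dict[str, List[float]] = defaultdict(list)
    for month, expected, actual in batches:
        by_month[month].append(apd(expected, actual))
    return {month: fmean(values) for month, values in sorted(by_month.items())}


# Made-up batches, for illustration only (not MSD data).
example = [("2019-01", 100.0, 80.0), ("2019-01", 100.0, 90.0), ("2019-02", 50.0, 55.0)]
print(monthly_mean_apd(example))
```

A falling monthly mean indicates that predicted and realized yields are converging, which is the sense in which the downward trend in Figure 6 reflects improved predictive capability.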

Figure 6.

Impact of the learning-by-doing framework across all products considered in the implementation process.

6.2. Overview of the Implementation Process

6.2.1 The Team and Project Timeline

The research project was conducted as a close collaboration between a university team and a team of practitioners from MSD in the Netherlands. The university team brought in expertise on operations research, whereas the MSD team provided expertise in life sciences. A multi-disciplinary team from MSD (e.g., bioreactor operators, chemical and biological engineers, and middle/upper management) contributed to the project.

The project went through three major phases: model development, validation, and implementation. The project started in early 2018 with data collection and the development of the optimization framework. Prior to the implementation, the optimization model and the corresponding policies were validated based on (i) discussions with practitioners, and (ii) small-scale test runs. The control-limit structure of optimal policies facilitated these discussions, as these policies were “explainable” and their intuition aligned with the current practice. Following these discussions, small-scale test runs were performed. The results obtained from both computational experiments and real-world test runs established a foundation for industry-scale implementation.

6.2.2 Scope and Aspects Related to Data

The implementation focused on industry-scale production orders with limited historical process data. As a common characteristic, these products were typically high-mix, low-volume batches produced only a few times a year. Moreover, we encountered challenges with limited data when new equipment or raw materials were used. Available data typically involved around $10$ batches. However, some products had as few as one or two data sets. In some cases, we had no data to build the prior distribution because of a change in raw materials or equipment. All batches exhibited inherent stochasticity and model risk.

6.2.3 Implementation Challenges

The major challenge in this project was related to data collection. In some cases, data was not available in a digital format. We also encountered a few special cases with no historical data. One of the strengths of the project is its multi-disciplinary approach which combines life sciences and operations research. However, multi-disciplinary projects have their own challenges. For example, the concept of Markov decision processes may be difficult for scientists who have no background in operations research. Similarly, developing a thorough understanding of the fermentation processes was challenging for the university team (with no background in biological and chemical engineering). Therefore, both the university and MSD teams had regular meetings to learn from each other and co-design the model.

It is also important to facilitate the knowledge transfer to other facilities. When the scope of the implementation expands to other facilities in the future, it can be challenging to identify the right products (and facilities) that would obtain the highest benefit from the optimization framework. For this purpose, MSD developed a dashboard that collects information from all batches produced globally. This dashboard reports the APD values of selected batches (as illustrated in Figure 6) from their global network, thereby identifying opportunities for future implementations.

7. Conclusions

Limitations in historical process data (model risk) are often perceived as a common industry challenge in biomanufacturing. Yet the implications of model risk on optimal costs and harvesting decisions have not been fully understood. Our work provides one of the first attempts at modeling and optimization of fermentation systems under model risk (caused by limited historical data) and inherent stochasticity (caused by the uncertain nature of biological systems).

We developed an MDP model to guide fermentation harvesting decisions under a learning-by-doing framework. In particular, we used a Bayesian approach where the decision-maker sequentially collects real-world data on fermentation dynamics and updates the beliefs on state transitions. As a salient feature, the MDP model combines the knowledge from life sciences and operations research, and is equipped to capture the complex dynamics of fermentation processes under limited data. We studied the analytical properties of the optimal policy and the myopic policy, and characterized the impact of model risk on biomanufacturing harvesting decisions. To illustrate the use of the optimization model, we present a case study from MSD Animal Health. The implementation at MSD has shown that linking operations research with life sciences drives substantial productivity improvements. We hope that our results will inspire the global biomanufacturing industry and stimulate new research at the intersection of operations research and life sciences.

The developed optimization framework is generic and addresses common industry challenges. Therefore, it can be easily implemented at other production lines and facilities. The long-term vision is to encourage the worldwide use of such optimization models at other facilities. Moreover, it will be interesting to explore the potential applications in other industries. For example, the food and agriculture industries face similar challenges and can benefit from the developed optimization model. In addition, future research could explore optimal sampling decisions based on the costs and marginal benefits of additional data collection efforts. Another research direction could extend our framework to continuous-time models to support real-time fermentation decisions.

Supplemental Material

sj-pdf-1-pao-10.1177_10591478241270130 - Supplemental material for Biomanufacturing Harvest Optimization With Small Data

Supplemental material, sj-pdf-1-pao-10.1177_10591478241270130 for Biomanufacturing Harvest Optimization With Small Data by Bo Wang, Wei Xie, Tugce Martagan, Alp Akcay and Bram van Ravenstein in Production and Operations Management

Footnotes

Acknowledgments

This research was funded by the Dutch Science Foundation (NWO-VENI Scheme) and the National Institute of Standards and Technology (Grant nos. 70NANB17H002 and 70NANB21H086). We would like to thank the Master’s students Thijs Diessen and Len Hermsen for their assistance during the project.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online.

How to cite this article

Wang B, Xie W, Martagan T, Akcay A and Ravenstein Bv (2024) Biomanufacturing Harvest Optimization With Small Data. Production and Operations Management 33(12): 2381–2400.
