Abstract
Stocks of some food products, such as whiskey, cheese, or port wine, ameliorate during storage, facilitating product differentiation according to age. This induces a trade-off between immediate revenues and further maturation. Inventory management decisions include purchasing volumes of agricultural produce and production volumes for age-differentiated products. Because products can be blended from stocks of different ages, issuance decisions offer operational flexibility. However, whereas some industries (port wine, sherry) only request that the product labels refer to the average age of issued stocks, others (whiskey, rum) have stricter blending regulations, requiring that the product labels represent the minimum age of all components. Further, producers must deal with multiple uncertainties. Purchase prices of agricultural commodities depend on volatile climate-dependent harvest seasons, stocks decay during maturation, and sales market conditions fluctuate. We solve this inventory management problem using a deep reinforcement learning algorithm with three key innovations: (i) A novel actor pipeline that decomposes the action space and flexibly partitions decision dimensions between a neural network and a lookahead optimization model, (ii) an algorithm explicitly maximizing average rewards, and (iii) reward-handling techniques that exploit structural problem insights. Our approach yields near-optimal policies that consistently outperform benchmark heuristics. Beyond the algorithmic contributions, our results offer new managerial insights into the value of blending under uncertainty. Minimum-age blending substantially enhances the profits of firms as compared to no blending because companies can adjust their purchasing policy in response to price fluctuations. The more flexible average-age regime further improves profits by
Introduction
Inventory systems of ameliorating food such as spirits (whiskey, rum, grappa), fortified wine, ripened cheese, and ham (Dimson et al., 2015) consist of stock volumes in different age classes, that is, different accumulated maturation times. This work considers ameliorating products characterized and marketed by their maturation time, such as three-year-old or seven-year-old rum. Unlike fine wines (e.g., Hekimoğlu et al., 2017), the value of such products is independent of the vintage and primarily driven by the aging process, with older products yielding larger sales revenues.
The management of such inventory systems needs to align purchasing, production, and issuance decisions. Purchasing decisions determine the acquired volume of agricultural raw materials, which undergo an initial processing step, transforming them into ameliorating stocks that start the maturation process. Production decisions specify the volumes of various age-differentiated products originating from the same inventory system, which are packaged and placed in the sales market. The age indication, that is, the target age, is typically printed on the packaging and serves as the primary characteristic feature of the product. Concurrently with production decisions, issuance decisions determine the allocation of stock volumes from different age classes to different products. This gives producers the flexibility to blend stocks from different age classes. An example from port wine inventory management illustrates the interplay of production and issuance decisions. In a given year, managers want to produce
This contrasts with perishables, for which issuing older stocks first (FIFO) is optimal in many problems (e.g., Chen et al., 2021). In ameliorating food inventory systems, however, simple issuance rules fail because they do not account for blending options, and producers must serve multiple age-differentiated products, whose value increases with age, from a shared inventory system. This creates specific trade-offs in production and issuance decisions. Stocks of young ages can either be used for production or be matured and used later for older products. By offering young products to the market, we obtain immediate revenues, but we may lose larger profits in later periods. Thus, a key management objective is to maintain a balanced structure of inventory volumes in different age classes. With more consumers turning towards luxury foods, such aging products gain importance (IWSR, 2024). In the following, we characterize ameliorating food inventory systems in more detail.
Food inventory managers face various sources of uncertainty. Globally, crop yields and quality are largely influenced by climate variability (Ray et al., 2015). As producers can only use stocks that meet the specific quality requirements for amelioration, this harvest variability is reflected in volatile purchase prices. For Protected Designation of Origin products, geographical restrictions on sourcing natural resources amplify the price variability. Examples are port wines, exclusively made from grapes from the Douro Valley in Portugal, and Parmigiano Reggiano cheese, produced and matured solely in selected Italian regions. In their purchasing decisions, managers must deal with regional price fluctuations. Directly after acquiring the agricultural commodities, the processing step transforms them into storable (not yet matured) products. Typically, producers have capacity limitations at this stage. For instance, whiskey (made from barley and water) is distilled using specific equipment before it is filled into wooden (preferably former port wine) casks for maturation. Further, the maturation process of stocks is subject to the risk of decay. When spirits decay during maturation, they are sold under a white label, generating substantially lower revenue than those used for brand products. In contrast, cheese is more stable but typically cannot be sold in case of decay. Apart from decay, the stored volume is also reduced due to evaporation. For spirits, the fixed proportion that evaporates each period is typically called the “angel’s share.” After turning stocks into final products, no further maturation is possible. Therefore, final products are placed in the market, generating stochastic revenues. However, because agricultural commodities need to be sourced locally, whereas final products are marketed worldwide, purchase prices fluctuate substantially more strongly than the revenues for final products.
To summarize, the decision problem is defined by purchasing decisions for the youngest age class and integrated production and issuance decisions for various age-differentiated products. To align these sequential decisions with the dynamics of aging, blending, and the involved uncertainties, we introduce a generic Markov decision process (MDP) formulation for the ameliorating food inventory management problem. Due to the long maturation times and the cross-generational perspective of predominantly family-owned producers, future profitability is equally important as short-term financial success. Hence, we use the average reward criterion in the value function. Due to the consideration of inventory age across the state, action, and transition spaces, the curse of dimensionality renders finding optimal solutions impracticable. Recently, several researchers have proposed deep reinforcement learning (DRL) algorithms for inventory management. Using an actor–critic architecture, these typically approximate both the value function and the policy through neural networks (NNs). However, state-of-the-art algorithms such as proximal policy optimization (PPO) all rely on discounting future rewards for convergence, which conflicts with our objective of maximizing average rewards. Further, finding good policies when dealing with large, multidimensional action spaces remains difficult for DRL approaches. Additionally, delays between actions and the associated rewards may impede the learning progress (Dulac-Arnold et al., 2021). These challenges are inherent in ameliorating inventory management. Due to the flexibility in issuance decisions, the number of action dimensions increases with the number of age classes in the inventory system. Further, due to the time required for maturation, revenues associated with purchased volumes are delayed by at least the products' target ages.
Our methodological contributions are twofold. First, we develop an actor–critic DRL approach with a new type of actor pipeline that integrates the actor NN with a lookahead optimization model. Action dimensions can be assigned to either actor pipeline component. This entails a trade-off in its design. On the one hand, coordinating too many action dimensions in the NN may impair policy learning. On the other hand, allocating action dimensions to the lookahead model introduces approximation errors. We achieve the best performance when the NN handles only a few key decisions with long-term impact, that is, purchasing volumes and production volumes for younger products, while the remaining production volumes for older products and complex issuance decisions are delegated to the lookahead model. Second, we transfer and adapt the average policy optimization (APO) algorithm, developed for average reward optimization in computer game environments, to the inventory management domain. To this end, we introduce the following reward-handling techniques. (i) We develop a new reward-shaping method that exploits domain knowledge about the inventory problem. (ii) We scale the rewards observed during training by an upper bound on the optimal average reward derived from an analysis of the average inventory age structure. Across a range of ameliorating inventory problems, our DRL approach consistently and significantly outperforms benchmark heuristics. Compared to a solution inspired by industry practice, our algorithm improves profits by
In addition, we provide managerial insights on ameliorating food inventory systems. We quantify the value of blending under different regulatory regimes. Compared with a setting that only allows issuance from the target ages, the profit increases by
The remainder of this paper is structured as follows. We discuss relevant literature in Section 2. Section 3 introduces the MDP formulation for ameliorating food inventory management. Additionally, we provide a linear program (LP) yielding an upper bound on the average reward per period. We present our solution approach, including the novel actor pipeline, the APO algorithm, and the new reward-handling techniques, in Section 4. Section 5 presents a port wine industry case. In Section 6, we evaluate our methodology, analyze the value of blending, and demonstrate our policy mining approach. Section 7 provides concluding remarks.
Literature Review
We structure the literature review as follows. Section 2.1 investigates research addressing the value of blending. Section 2.2 discusses extant research on multiage inventory systems, focusing on age-differentiated and ameliorating products. Section 2.3 reviews applications of DRL in inventory management. We summarize the identified research gap in Section 2.4.
Blending
Blending operations are predominantly used in the process industries, with various applications including crude oil refining (Mendez et al., 2006; Papageorgiou et al., 2012), chemical product formulation (Karmarkar and Rajaram, 2001), mining (Chen and Maravelias, 2022), and donor milk pooling (Chan et al., 2023). In these problems, companies blend raw materials with variable component mixes during production while blending constraints restrict the proportion of input components in the final products.
Although a large body of literature considers blending, only a few studies have quantified the value of the associated flexibility. Dong et al. (2014) address a crude oil refining problem where producers can convert heavy to light components before the blending stage, where final products are blended from these components according to fixed recipes. Their findings indicate that the added value of conversion flexibility is influenced by the variability of purchase prices and the processing capacity of producers. In a generalized capacity investment and production problem involving multiple raw materials and final products, Kulkarni and Francas (2017) examine the value of flexibility in designing blending recipes. They compare fully flexible blending networks with blending chains, illustrating that the relative advantage of full flexibility over chaining increases with blending costs.
Inventory Management for Age-Differentiated Products
Extant research on inventory problems involving age-differentiated products largely stems from the blood inventory management literature, as some medical treatments require fresher blood than others. Goh et al. (1993) evaluate different issuance rules under stochastic supply and age-differentiated demand. Deniz et al. (2010) jointly address replenishment and issuance rules. They analytically show that when substitution between the products is possible, different parameter settings favor different rules. Finally, Chen et al. (2019) consider the joint optimization of blood platelet collection, fulfillment, and issuance. They analytically show that FIFO issuance is optimal in their problem setting. In a successor paper, the authors generalize their results to perishable inventory models with age-differentiated demands and additionally show that younger products with higher shortage costs must be given priority in fulfillment (Chen et al., 2021). In this previous research, action spaces are kept small by focusing on individual decisions or are designed to represent straightforward decision rules.
Whereas the aforementioned works consider perishables, we have recently observed an increasing research interest in the effects of inventory amelioration. Early contributions are from the field of forest management. Lin and Buongiorno (1998) target the optimal stopping decision to determine the cutting time for trees under environmental and financial risks. Recently, Kouvelis et al. (2023) investigate inventory management at hog farms. Instead of putting underweight hogs on the market, farmers can feed them further to achieve higher revenues in later periods. For the inventory management of ameliorating food such as cheese and whiskey, Buisman and Rohmer (2022) propose a rolling-horizon optimization approach for dealing with demand uncertainty. Jahandideh et al. (2023) analyze the optimal allocation of a fixed production capacity to age-differentiated ameliorating products. They show that a stationary policy is optimal in their stylized problem setting. For ameliorating port wine inventory, Pahr et al. (2025) derive management policies for intractable practical problems by mining and scaling the optimal policies for aggregated problems. They rely on a state space discretization to solve the aggregated problems using value iteration.
DRL for Inventory Management with Multidimensional Action Spaces
Recently, many researchers have approached inventory problems using DRL algorithms. Boute et al. (2022) provide an early review and a comprehensive overview of different algorithms and the design choices relevant to different inventory management problems. Foundational works apply state-of-the-art algorithms to standard inventory problems to illustrate the general applicability of DRL. Oroojlooyjadid et al. (2022) provide a multiagent deep Q-network (DQN) approach for the well-known beer game. Gijsbrechts et al. (2022) investigate DRL as a general-purpose method and report promising results for well-researched lost-sales, dual-sourcing, and multiechelon inventory problems using the A3C algorithm.
However, the performance of DRL approaches deteriorates when dealing with multidimensional action spaces (Dulac-Arnold et al., 2021). Hence, our review focuses on inventory research that explicitly tackles this challenge. Park et al. (2023) propose an intuitive approach that reduces the dimensionality of the action space. They exploit the near-optimality of a base-stock policy structure for multi-item inventory systems and learn the base-stock parameters instead of the state-dependent order quantities. Naturally, this kind of approach only works for problems with simple near-optimal policy structures. Kouvelis et al. (2024) exclude state-dependent infeasible actions by adding a hidden constraint layer to the NN determining the policy. Other researchers use well-performing heuristics to guide their DRL algorithm. For a perishable inventory problem, De Moor et al. (2022) develop a reward-shaping approach for their DQN algorithm that penalizes deviations from a heuristic policy. Liu et al. (2025) pre-train the policy network to mimic the actions suggested by an established heuristic before initiating the actual PPO algorithm for multiechelon, multiproduct inventory management. A few recent works enhance DRL algorithms for inventory management with optimization techniques. Harsha et al. (2025) develop a novel algorithm that estimates state values using an NN and uses sample average approximation and mixed-integer linear programming (MILP) to optimize the action selection based on the predicted state values. However, their approach does not scale well if the transition space is complex. An alternative approach is the implementation of an actor pipeline, which decomposes the action space dimensions. Only a subset of action dimensions is handled by an actor NN, while the others are determined using other optimization techniques. Akkerman et al. (2025) propose an actor pipeline for the multi-item inventory record inaccuracy problem, where the actor NN decides whether to reorder and a MILP model determines the inventory inspection route.
Research Gap
Our research addresses open questions in all of the outlined fields. We are the first to analyze the value of age-based blending. In this setting, the operational flexibility lies in consolidating a variable inventory age structure. Further, we contribute to the emerging ameliorating inventory management literature by analyzing different blending regimes in issuance decisions. Lastly, we provide several contributions to the literature on DRL for inventory management. For handling the multidimensional action space in the ameliorating food problem, we develop a novel actor pipeline that integrates an actor NN and a lookahead optimization model. In contrast to existing actor pipeline applications, our approach enables control over the allocation of action dimensions to either of the two components. Hence, we analyze the trade-off between a (suboptimal) policy approximation using the lookahead model and the coordination of many action dimensions in the actor NN, which also potentially impairs performance. Moreover, existing state-of-the-art DRL algorithms rely on discounting future rewards. Because the ameliorating food inventory problem is naturally modeled as an average-reward MDP, we transfer an algorithm that specifically optimizes the long-run average reward to the inventory domain and thereby substantially outperform state-of-the-art algorithms. Finally, Boute et al. (2022) point out that integrating structural problem insights into DRL approaches remains an open avenue for future research. To that end, we suggest two novel reward-handling techniques that exploit the problem structure.
Modeling Ameliorating Food Inventory Management
This section develops a generic model for ameliorating food inventory systems. We first provide the MDP formulation in Section 3.1. We represent the specific characteristics of amelioration, evaporation, blending flexibility, and the various uncertainties in the MDP constituents, that is, decision epochs, states, actions, transitions, and rewards. In Section 3.2, we develop an analytical upper bound on the average reward, used for benchmarking and for enhancing our solution approach.
Model Formulation
Ameliorating food inventory systems are composed of age classes
Because ameliorating food stocks can be processed in any volume, we use a continuous-value MDP formulation. Aging spirits are stored in wooden casks, whereas cheese and ham age in designated ripening chambers. For all aging products, a small proportion of
For a problem with

Figure 1. Exemplary state transition in the ameliorating food inventory problem.
Production actions
Note that the example in Figure 1 shows the average-age blending regime, as both products’ blends include stocks from age classes younger than the target ages. We characterize a complete action
We consider two types of uncertainty affecting the state transition (third column of Figure 1). First, a new purchase price
Typically, demand for food products cannot be backlogged. Unmet demand is therefore lost. Further, bottled or packaged stocks have lost their maturation potential and cannot be returned to inventory. However, leftover products may be salvaged, with
We exclude the option to store finished products. With regard to cheese and ham, this is due to the limited shelf life after packaging. With regard to spirits and fortified wines produced from harvested crops, the interval between decision epochs is one year. Due to the large storage costs of bottled products, producers and retailers avoid shelving finished products for such an extended period.
Decayed units from each age class
The value function concludes the infinite-horizon MDP formulation. Because predominantly family-owned producers assume a cross-generational perspective and stocks gain rather than lose value over time, the average reward criterion naturally applies to the ameliorating food inventory problem.
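To make the transition dynamics concrete, the following minimal Python sketch simulates one period of the inventory system described above: stocks are issued, a stochastic decay draw and the deterministic evaporation loss are applied, and all surviving volumes advance one age class while the purchased volume enters the youngest class. All parameter values, the decay distribution, and the order of operations are illustrative assumptions, not the paper's calibration.

```python
import numpy as np

rng = np.random.default_rng(0)

def transition(inventory, purchase, issuance, evap=0.02):
    """One-period transition sketch: issue stocks, apply stochastic decay and
    the deterministic evaporation loss (the angel's share), then shift all
    surviving volumes one age class up."""
    remaining = inventory - issuance                      # stocks left after production
    decay = rng.uniform(0.0, 0.05, size=remaining.shape)  # illustrative decay draw
    matured = remaining * (1.0 - decay) * (1.0 - evap)
    next_inventory = np.empty_like(inventory)
    next_inventory[1:] = matured[:-1]                     # age every class by one period
    next_inventory[0] = purchase                          # new stocks enter the youngest class
    # matured[-1] leaves the system: the oldest class must be issued or outdates
    return next_inventory

inventory = np.array([100.0, 90.0, 80.0, 70.0])
issuance = np.array([0.0, 0.0, 10.0, 60.0])
print(transition(inventory, purchase=110.0, issuance=issuance))
```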
An Upper Bound on the Average Reward
We develop an upper bound on the optimal average reward of the MDP described in Section 3.1 based on an analysis of the average inventory structure. This bound provides a conservative yet informative benchmark for evaluating the performance of inventory policies. Further, we use this analytical result to scale rewards in our solution approach (see Section 4.3). For the average-reward infinite-horizon MDP, we denote
To obtain a tractable approximation of the non-linear expected revenue function in Equation (4), we employ a piece-wise affine upper approximation function
We also discretize the continuous probability distribution of purchase prices by introducing a finite set of indices
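As an illustration of this discretization step, the sketch below partitions a continuous price distribution into equiprobable intervals and computes each interval's probability mass and conditional mean price. The lognormal distribution and the number of intervals are assumptions for the example only.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Discretize a continuous purchase-price distribution into K equiprobable
# intervals, each represented by its probability and conditional mean price.
dist = stats.lognorm(s=0.4, scale=50.0)          # illustrative price distribution
K = 5
edges = dist.ppf(np.linspace(0.0, 1.0, K + 1))   # interval boundaries
probs = np.full(K, 1.0 / K)

cond_means = [
    quad(lambda x: x * dist.pdf(x), lo, hi)[0] / p
    for lo, hi, p in zip(edges[:-1], edges[1:], probs)
]
print(list(zip(probs.round(2), np.round(cond_means, 2))))
```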
Equation (10) represents the objective function that maximizes the average reward per period based on Equation (5). Constraints (11) link the production volumes to the issuance volumes for individual products. Constraints (12) ensure that production proportions add up. Constraints (13) relate the total average purchasing volumes used in Constraints (7) to the price-level dependent purchasing volumes. Constraints (14) implement the target age adherence in issuance decisions. Constraints (15) implement the problem-specific blending regime characterized by
For identical parameter settings, the optimal objective value for the optimization model defined by Equations (7) to (19) is at least as large as the average reward
We provide the proof in Electronic Companion EC.1. The generation of an upper bound through an LP problem relaxation is also found in the fluid approach proposed by Bertsimas and Mišić (2016). In contrast with their approach, however, we cannot decompose the state space dimensions, as inventory levels in different age classes are interdependent. Hence, we directly model the average system structure, including the interdependencies between average purchasing and issuance volumes and the resulting average inventory in different age classes. Note that the resulting upper bound is conservative, as it does not account for the sequential MDP dynamics. Consequently, average purchasing volumes can, e.g., be allocated entirely to low purchase price level intervals in the upper bound LP. In contrast, the optimal purchasing policy in the MDP also depends on the current inventory volumes, which result from the historical sequence of actions, purchase prices, and decay volumes.
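The following toy instance illustrates the flavor of the average-structure relaxation and its conservativeness: with one product, a linear revenue term, and three discretized price levels, the LP allocates all purchasing to the lowest price level up to the assumed capacity. This is a heavily simplified stand-in for the model defined by Equations (7) to (19); it omits blending, the piecewise-affine revenue approximation, and the age-class balances, and all numbers are invented for the example.

```python
import pulp

probs  = {"low": 0.3, "mid": 0.4, "high": 0.3}     # discretized price levels
prices = {"low": 40.0, "mid": 50.0, "high": 65.0}
survival = (1 - 0.02 - 0.01) ** 3                  # mean evaporation and decay over 3 periods
revenue, capacity = 50.0, 100.0

m = pulp.LpProblem("avg_structure_bound", pulp.LpMaximize)
x = {k: pulp.LpVariable(f"buy_{k}", lowBound=0) for k in probs}  # purchase when level k occurs
q = pulp.LpVariable("production", lowBound=0)

m += revenue * q - pulp.lpSum(probs[k] * prices[k] * x[k] for k in probs)
m += q <= survival * pulp.lpSum(probs[k] * x[k] for k in probs)  # average flow balance
for k in probs:
    m += x[k] <= capacity
m.solve(pulp.PULP_CBC_CMD(msg=0))
print({k: x[k].value() for k in probs}, q.value())  # all purchasing at the low price level
```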
Solution Approach
The curse of dimensionality and the continuous-value MDP formulation rule out the use of exact solution algorithms for the ameliorating food inventory problem introduced in Section 3.1. Therefore, we develop a DRL solution approach. We present the algorithm design, including a novel actor pipeline, in Section 4.1. Moreover, we employ an algorithm that specifically maximizes average rewards and, therefore, conforms with our value function formulation in Equation (6). Section 4.2 discusses average-reward DRL algorithms and justifies our choice of APO. Section 4.3 introduces new reward-handling methods to further improve the algorithm's performance.
Actor Pipeline
Recently, actor–critic DRL algorithms have been considered the state-of-the-art in inventory research (e.g., Boute et al., 2022; Gijsbrechts et al., 2022; Liu et al., 2025). Figure 2 provides an overview of their general architecture.

Figure 2. Actor–critic architecture.
The actor implements the MDP policy using an NN. In each step of the algorithm, it maps the current state variables to an action. The action is handed to the environment, which implements the dynamics of the MDP: it outputs a reward based on the provided state–action pair and derives the transition state using Monte-Carlo sampling from the uncertainty distributions.
The critic also includes an NN, which likewise receives the current state as input but outputs an estimate of the state value as formulated in the value function in Equation (6). In each algorithm step, the tuple of state, action, reward, and state evaluation is stored in a buffer. The transition state is passed on to the actor and the critic to initiate the next algorithm step. Once a batch of
Actor–critic DRL approaches have successfully been applied to inventory problems with low-dimensional action spaces. For instance, Gijsbrechts et al. (2022) consider up to two action dimensions. Because of the integration of purchasing, production, and flexible issuance decisions, the action space in the ameliorating food inventory problem entails large dimensionality and high combinatorial complexity. Both represent major challenges for practical reinforcement learning, possibly leading to compromised quality of the trained policy (Dulac-Arnold et al., 2021). Contrary to DRL, we can easily integrate multiple decision dimensions when solving inventory problems with lookahead optimization models (e.g., Buisman and Rohmer, 2022). However, such models cannot represent the possibility of reacting to uncertainty in sequential decision-making. Therefore, we develop an actor pipeline that combines the strengths of both approaches by decomposing the action space, allowing the flexible allocation of action dimensions to either an NN or a lookahead optimization model. The pipeline is illustrated in Figure 3, which zooms into the actor in Figure 2, with state variables as input and action variables as output.

Figure 3. Actor pipeline.
We formalize the generic design as follows. Let
For each action dimension, the actor NN outputs a mean and a standard deviation, which together form a normal distribution from which action values are sampled (see left side of Figure 3). In the initial training iterations, this sampling ensures the exploration of the solution space. However, as the training progresses, the actor ideally decreases the standard deviation, exploiting the benefits of near-optimal actions.
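The sketch below shows a minimal PyTorch version of such a Gaussian actor head. The layer sizes and the state-independent log-standard-deviation parameterization are common design choices we assume here, not necessarily those of the paper.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """For each NN-controlled action dimension, output a mean and a standard
    deviation and sample the action from the resulting normal distribution."""
    def __init__(self, state_dim, nn_action_dims, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, nn_action_dims)
        self.log_std = nn.Parameter(torch.zeros(nn_action_dims))  # ideally shrinks during training

    def forward(self, state):
        dist = torch.distributions.Normal(self.mu(self.body(state)),
                                          self.log_std.exp())
        action = dist.sample()                  # sampling drives early exploration
        return action, dist.log_prob(action).sum(-1)

actor = GaussianActor(state_dim=12, nn_action_dims=2)
action, log_prob = actor(torch.randn(12))
print(action, log_prob)
```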
The lookahead model derives actions for those dimensions
For the ameliorating inventory management problem introduced in Section 3.1, Table 1 summarizes the action dimensions that must be partitioned between the actor NN and the lookahead model. To guarantee fast computation, we use an LP formulation for the lookahead model. For each period
Table 1. Action dimensions in ameliorating food inventory management.
In the following, we illustrate how the LP interacts with the DRL environment and the actor NN. Constraints (20) initialize the inventory decision variables in Period 1 with the values
We use the mean values of the underlying probability distributions when modeling state transitions across the lookahead horizon to formulate a deterministic model (e.g.,
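To illustrate how such a deterministic lookahead LP can be embedded in the actor pipeline, the sketch below fixes the current purchase volume to the value proposed by the actor NN, initializes the Period-1 inventory from the current state (in the spirit of Constraints (20)), and replaces stochastic losses and future purchases with mean values. A single product with a minimum target age stands in for the full product portfolio; all names and numbers are illustrative.

```python
import pulp

H, A = 3, 4                                  # lookahead horizon, age classes
loss, revenue, target = 0.03, 100.0, 2       # mean per-period loss, product revenue, target age
init_inv = [50.0, 40.0, 30.0, 20.0]          # current inventory per age class (from the state)
nn_purchase, mean_purchase = 55.0, 45.0      # NN-fixed current purchase; mean value thereafter

m = pulp.LpProblem("lookahead", pulp.LpMaximize)
inv = pulp.LpVariable.dicts("inv", [(t, a) for t in range(H + 1) for a in range(A)], lowBound=0)
iss = pulp.LpVariable.dicts("iss", [(t, a) for t in range(H) for a in range(A)], lowBound=0)

for a in range(A):
    m += inv[0, a] == init_inv[a]            # initialize Period-1 inventory from the state
for t in range(H):
    m += inv[t + 1, 0] == (nn_purchase if t == 0 else mean_purchase)
    for a in range(1, A):                    # deterministic aging with mean losses
        m += inv[t + 1, a] == (1 - loss) * (inv[t, a - 1] - iss[t, a - 1])
    for a in range(A):
        m += iss[t, a] <= inv[t, a]
m += pulp.lpSum(revenue * iss[t, a] for t in range(H) for a in range(target, A))
m.solve(pulp.PULP_CBC_CMD(msg=0))
print([iss[0, a].value() for a in range(A)])  # only first-period decisions are executed
```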
Average Reward Optimization
State-of-the-art actor–critic algorithms such as PPO (Schulman et al., 2017), soft actor–critic (Haarnoja et al., 2019), and twin-delayed deep deterministic policy gradient (TD3) (Fujimoto et al., 2018) rely on discounting future rewards. The discount factor is frequently treated as a hyperparameter and tuned to achieve better performance. However, in Operations Research problems, only the problem environment can justify discounting future rewards. Furthermore, if the discount factor is close to 1, state-of-the-art algorithms tend to perform worse (Zhang and Ross, 2021). As our value function in Equation (6) maximizes the average reward per period, we employ an algorithm specifically designed for this objective.
Recently, several actor–critic algorithms for the average reward criterion have been introduced, which all build on existing approaches for discounted rewards and are evaluated on computer game environments. These algorithms typically calculate state values relative to an estimate for the average reward
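As a sketch of this idea, the following computes temporal-difference errors relative to a running average-reward estimate; no discount factor appears. The variable names are ours rather than APO's.

```python
import numpy as np

def relative_td_errors(rewards, values, next_values, rho):
    """delta_t = r_t - rho + V(s_{t+1}) - V(s_t): state values are measured
    relative to the average-reward estimate rho instead of being discounted."""
    return rewards - rho + next_values - values

rewards = np.array([4.0, 6.0, 5.0])
values = np.array([10.0, 11.0, 10.5])        # critic estimates V(s_t)
next_values = np.array([11.0, 10.5, 10.0])   # critic estimates V(s_{t+1})
rho = 5.0                                    # running average-reward estimate
deltas = relative_td_errors(rewards, values, next_values, rho)
print(deltas)   # rho itself is typically updated from these errors,
                # e.g., rho += step_size * deltas.mean()
```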
Reward-Handling Techniques
To enhance the learning performance of our algorithm, we introduce two novel reward-handling techniques that build on specific problem insights.
It is common in DRL research to normalize the state variables, action variables, and rewards to a pre-defined range. This practice prevents the network's weights from having to compensate for differences in numerical scale between the NN's input and output layers during training (van Hasselt et al., 2016). We normalize state and action variables in the interval
For rewards, however, strict clipping in the interval
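A minimal sketch of these two steps, with invented bounds: states and actions are mapped to [-1, 1] from known problem bounds, while rewards are divided by the analytical upper bound from Section 3.2 instead of being clipped.

```python
import numpy as np

def normalize(x, lower, upper):
    """Map a bounded quantity to [-1, 1]."""
    return 2.0 * (x - lower) / (upper - lower) - 1.0

def scale_reward(reward, upper_bound):
    """Divide by the LP upper bound on the average reward (Section 3.2),
    keeping the reward scale small without discarding information."""
    return reward / upper_bound

state = np.array([120.0, 80.0])                      # e.g., inventory volumes
print(normalize(state, lower=0.0, upper=200.0))      # [0.2, -0.2]
print(scale_reward(90.0, upper_bound=150.0))         # 0.6
```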
Apart from reward scaling, reward shaping, that is, modifying the observed reward by adding a term
For a given policy property
Note that including the history in the state may also support the NNs in learning the structure of reward delays incurred by maturation times (Hester and Stone, 2012). We provide a more detailed description and a theoretical analysis of the history-based reward-shaping approach in Electronic Companion EC.4. Moreover, we show the performance impact of both reward-handling techniques in Electronic Companion EC.5.
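For orientation, the sketch below shows the generic potential-based reward-shaping template (in the style of Ng et al., 1999), whose telescoping differences leave the long-run average reward unchanged for bounded potentials. The paper's history-based method in Electronic Companion EC.4 chooses the shaping term from domain knowledge and differs from this template; the balance-seeking potential here is purely hypothetical.

```python
import numpy as np

def shaped_reward(reward, state, next_state, potential):
    """Potential-based shaping: F = Phi(s') - Phi(s). The differences
    telescope, so the long-run average reward is unaffected."""
    return reward + potential(next_state) - potential(state)

def balance_potential(inventory):
    # Hypothetical potential rewarding a balanced inventory age structure
    return -float(np.std(inventory))

before = np.array([50.0, 10.0, 90.0])
after = np.array([50.0, 45.0, 55.0])
print(shaped_reward(10.0, before, after, balance_potential))  # > 10: structure improved
```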
Port Wine Industry Case
We apply our solution approach to a large-scale port wine industry case. The 10-year-old and 20-year-old Graham's Tawny are high-volume trademark products of the Symington Family Estates (Symington Family Estates, 2025). The company uses stocks matured up to 25 years in issuance decisions, exploiting flexible blending under the average-age regime. We summarize further problem parameters derived from company data and the used algorithmic settings in Electronic Companion EC.6.
Figure 4 shows the training progress of the APO algorithm. The left subfigure shows that the explained variance of the value targets

Figure 4. Training progress of the average policy optimization (APO) algorithm for the port wine industry case.
We extract the policy from the set of training iterations that achieves the largest average profit on an evaluation data set. In a large-scale simulation of 50,000 periods, this policy's gap to the upper bound is only

Figure 5. Average analysis of port wine inventory system. (a) Purchasing and production actions. (b) Inventory levels and issuance actions.
The average purchasing volume is substantially larger than the accumulated production volumes. This difference is entirely attributed to evaporation and decay losses. The policy prevents outdating in the last age class by issuing all remaining stocks. Further, the distribution of issuance volumes provides insights into the blending strategy. Despite the large flexibility under the average-age regime, issuance volumes are centered around the products’ target ages. Blending from distant age classes leads to increased evaporation losses, even if the average age equals the target age. We analytically characterize this relationship between blending and evaporation in Electronic Companion EC.7.
Note that the production volumes have a very low standard deviation, demonstrating that our policy allows Symington to offer their products reliably on the sales market (see Figure 5(a)). In contrast, the purchasing volumes and, consequently, also the inventory volumes (see Figure 5(b)) are highly volatile. Enabled by the blending flexibility, Symington can purchase less in high-price periods and more in low-price periods. To summarize, average-age blending helps address Symington's two main challenges: (i) adapting the purchase volume to the price level while maintaining stable production volumes and (ii) blending on-target with a narrow spread of age classes while dealing with volatile inventory volumes across age classes. The following section comprehensively characterizes the value of blending in ameliorating inventory management. Further, because blending permits flexible inventory management, we provide managers with explanations of and insights into the volatile policy resulting from NN training.
Numerical Experiments
Our numerical experiments provide insights into the performance of our actor pipeline, the value of blending in issuance decisions, and important decision drivers in ameliorating inventory management. We structure our analyses as follows. Section 6.1 presents the experimental design. We summarize benchmark approaches in Section 6.2. In Section 6.3, we evaluate our DRL approach and different configurations of the actor pipeline introduced in Section 4.1. Section 6.4 investigates the impact of different blending regimes on the average reward and the resulting policies. We mine the trained policies to elicit key factors behind specific decisions and generic managerial insights in Section 6.5.
Design of Experiments and Algorithm Configuration
We assess configurations of the novel actor pipeline as well as different blending regimes in a generic ameliorating food inventory system comprising ten age classes (
Table 2. Factor levels in numerical experiments.
We implement the APO algorithm in Python within the Ray RLlib ecosystem (release 2.3.0) (Liang et al., 2017). It is noteworthy that, unlike other DRL approaches, where hyperparameter sensitivity can lead to substantial performance variations (e.g., Gijsbrechts et al., 2022), our APO implementation yields robust performance across problem instances with limited tuning effort. Nevertheless, we carefully tune the hyperparameters within ranges conventionally used for inventory management. Because we observe stable training performance and algorithmic convergence, we use the same APO hyperparameter settings throughout our experiments. We provide these settings as well as details of our hyperparameter tuning approach in Electronic Companion EC.3. Our source code and all experiment data are available in the following repository: https://github.com/amelioratinginventory/ameliorating_inventory.
Benchmark Heuristics
Unlike for the conventional lost-sales model studied by Gijsbrechts et al. (2022), well-established heuristics for benchmarking our DRL approach have not yet been developed for the ameliorating food inventory problem introduced in Section 3.1. Therefore, we develop heuristics based on classic inventory models to demonstrate our algorithm's effectiveness. The details of their implementation are provided in Electronic Companion EC.9.
For the purchasing decisions, we propose two heuristics. First, we adapt the popular newsvendor model for each product individually, accounting for the expected accumulated costs during maturation. This is also inspired by industry practice, where products are managed separately in spreadsheet-based planning tools. We label this newsvendor-based approach to purchasing
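The sketch below shows the newsvendor logic for a single product under invented parameters: the critical-ratio quantile uses a unit cost grossed up by costs accumulated over the maturation time, and the resulting volume is inflated for evaporation losses. It is a simplified stand-in for the benchmark detailed in Electronic Companion EC.9.

```python
from scipy import stats

demand = stats.norm(loc=100.0, scale=20.0)            # product demand per period
price, purchase_cost, holding, target_age, evap = 12.0, 4.0, 0.3, 5, 0.02

unit_cost = purchase_cost + holding * target_age      # cost accumulated during maturation
underage = price - unit_cost                          # margin lost per unit short
overage = unit_cost                                   # cost sunk per leftover unit
volume = demand.ppf(underage / (underage + overage))  # critical-ratio quantile
purchase_volume = volume / (1 - evap) ** target_age   # gross up for evaporation losses
print(round(volume, 1), round(purchase_volume, 1))
```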
For production and issuance decisions, the classic FIFO policies used in perishable inventory management fail in our problem as they neglect amelioration effects as well as blending flexibility. We hence develop two new benchmarks. The first approach again resembles spreadsheet-based planning in industry. We define each product’s overall production volume target through the newsvendor model. Then, in any given period, we use a simple LP to issue the stocks, maximizing adherence to the production volume targets while minimizing target age excess in blending. We label this approach
A further intuitive benchmark approach is derived from the actor pipeline presented in Section 4.1. When delegating all action dimensions to the lookahead model (
For all problem instances discussed in Section 6.1, we implement all possible combinations of
Table 3. Average profit for benchmark heuristics (mean over problem instances, relative to the respective upper bound in %).
NVP = newsvendor-based approach to purchasing; NV = newsvendor; RLP = rolling-horizon linear program; VOL = volume.
Evaluation of the Actor Pipeline
We utilize the benchmarks developed in the previous section to assess the performance of our DRL approach across different configurations of our novel actor pipeline. In our analyses, we focus on the average-age blending regime because it involves the largest problem complexity. Since purchasing decisions predefine the feasible space for future production and issuance decisions, we expect the most significant impact on the average reward when handing this action dimension to the NN. Similarly, production decisions for younger products predefine the feasible space for future production decisions for older products. Therefore, we investigate the following configurations of
Table 4 reports the average training time per DRL iteration on a local 16-core Intel Xeon Platinum 8280L CPU (2.7 GHz, 32 GB RAM), using the algorithmic setup described in Section 6.1 and Electronic Companion EC.3. For the hybrid actor pipeline configurations, which combine an NN and a lookahead LP model, we observe comparable total training times (Columns 2–5 in Table 4). In all these configurations, the LP solution time is the main bottleneck, consistently accounting for more than half of the total iteration time. This underlines the importance of a computationally efficient LP formulation within the actor pipeline. By contrast, the configuration
Table 4. Average computational complexity of different actor pipeline configurations.
DRL = deep reinforcement learning; NN = neural network; LP = linear program.
We evaluate the performance of each configuration with 30 independent simulation runs of 2,000 decision epochs. To allow for rigorous statistical analysis, we use equal initial state variables across all runs and common random numbers across all configurations, including the benchmark cases. From each training run, we extract the policy from the specific training iteration, for which we have obtained the best performance on a separate test dataset. Table 5 illustrates the percentage change of the gap between the best benchmark (
Table 5. Percentage change of the gap to the upper bound compared to the best benchmark algorithm
PPO = proximal policy optimization; NN = neural network; RLP = rolling-horizon linear program.
The rightmost column of Table 5 reports results for the actor pipeline configuration
For all ensuing analyses, we use the policy derived from the best-performing actor pipeline configuration in each instance. On average, our actor pipeline DRL approach outperforms the benchmark inspired by industry practice (

Figure 6. Main effects of factors on the percentage profit gain compared to the heuristic inspired by industry practice.
The Value of Blending
In this section, we analyze the value of blending using the same experimental design with common random numbers as in Section 6.3. In the first part of our analysis, we investigate the degree of blending flexibility under the average-age regime expressed by the parameter

Figure 7. Average profits under the average-age regime for different levels of blending flexibility, expressed through
In the second part of our analysis of the value of blending, we compare the average-age blending regime (

Figure 8. Average profits under different blending regimes.
Figure 8 summarizes the average profits from the different blending regimes across problem instances, again expressed as the percentage gap to the upper bound. For reference, we also include the setting in which blending is prohibited. Minimum-age blending substantially improves the average profits, especially in those problem instances with the largest gaps under the no blending regime. On average, the gap to the upper bound decreases by

Figure 9. Issuance decisions across different blending regimes. (a) Usage of target age in issuance. (b) Target age excess in issuance.
To identify further drivers of the profit enhancements through blending, Figure 10 illustrates the policy variability for the remaining action dimensions during simulation. We observe that with increasing blending flexibility, the volatility in the purchasing policy increases substantially. On the other hand, despite the more variable inventory structure resulting from the more variable purchasing policy, the production volumes fluctuate less across all products. This entails two effects on average profits. First, we decrease the average purchasing costs by adjusting the purchasing policy to price fluctuations. Second, due to the concavity of Equation (4), more stable production volumes lead to larger average revenues (see Electronic Companion EC.1). Note that across all instances, the production volumes of older products (

Figure 10. Policy variability across different problem instances.
Finally, we investigate the effect of the factors

Figure 11. Main effects of factor levels on average profits under different blending regimes.
Policy Mining
We develop a two-step policy mining approach to analyze the drivers behind near-optimal decisions. In the first step, we map the inventory management policy using machine learning, similar to Bravo and Shaposhnik (2020), who mine optimal policies to obtain structural insights, and Kouvelis et al. (2024), who use classification trees to interpret near-optimal DRL policies. We train regression trees for each action dimension that use state variables as features and the corresponding policy output as the label. We provide the full details of this first step in Electronic Companion EC.10.
Trees are interpretable by following the feature evaluations from the root node to the leaf node. However, this path analysis does not quantify the impact of state variables on the eventual action in a given state. Therefore, we use Shapley values to derive such local explanations. Shapley values originate from game theory and determine how much each feature contributes to the deviation of the model output for the current observation from the mean model output. For the details of the computation of exact Shapley values for regression trees, we refer to Lundberg et al. (2020). When applied to MDP policies, these values can be used to explain the factors behind actions in specific states (Beechey et al., 2023).
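A compact sketch of both steps on synthetic data, using scikit-learn and the shap package: in the paper, the features and labels would be states and policy actions logged from simulation, whereas the data-generating process below is invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import shap

rng = np.random.default_rng(1)
states = rng.uniform(0.0, 1.0, size=(5000, 4))     # e.g., price and inventory features
actions = 100 * (1 - states[:, 0]) + 10 * states[:, 1] + rng.normal(0, 2, 5000)

tree = DecisionTreeRegressor(max_depth=5).fit(states, actions)  # step 1: map the policy
explainer = shap.TreeExplainer(tree)                            # step 2: local explanations
shap_values = explainer.shap_values(states[:10])
print(shap_values[0])   # per-feature contributions to the action in the first state
```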
Figure 12 illustrates local explanations of the purchasing action (

Figure 12. Exemplary Shapley values (problem instance:
Generally, policy mining, in combination with Shapley values, helps managers better understand why certain actions are taken in specific states, building trust in the solution approach. We further generate global policy insights by aggregating the local explanations for a set of recurrent states. Figure 13 shows the distribution of Shapley values for action dimensions

Figure 13. Global analysis of Shapley values for different blending regimes (problem instance:
Naturally, there is a negative correlation between the purchasing actions and the purchase prices (see Figures 13(a–c)). When blending is prohibited (see Figure 13(a)), the price is the only relevant feature due to the lack of interaction between age classes in issuance decisions. The absolute impact of the purchase price
For the production volume of the youngest product, we also observe strong differences between the blending regimes (see Figures 13(d–f)). If blending is prohibited, the inventory volume in the target age (
Overall, Figure 13 shows that blending complicates the inventory problem, as decision-makers must consider various influencing factors to arrive at near-optimal decisions. For companies, this implies that fully reaping the profit improvement potential from more flexible blending strategies requires investing in sophisticated algorithmic support. To enhance the practical usability for inventory managers, a subsequent step could translate the policy mining insights into a set of “if-then” rules (Pahr et al., 2025).
Conclusion
This paper tackles the inventory management of ameliorating food products such as whiskey, rum, and port wine using a DRL approach. Our methodology introduces several innovations, including a novel actor pipeline, an average reward algorithm, and new reward-handling techniques that utilize the underlying MDP structure. Our managerial results demonstrate that blending substantially enhances profitability, suggesting that sectors such as whiskey and rum may benefit from re-evaluating their stringent blending regulations. Through policy mining, we identify the key drivers behind the value of blending and foster trust in the derived policy. Increased blending flexibility requires the consideration of a larger number of relevant decision factors, leading to more complexity in inventory management. We see several possibilities for future research. First, innovative actor pipelines and the exploitation of structural problem insights are promising avenues to enhance the applicability of DRL in the Operations Management domain. Future work could investigate adaptive actor pipelining that dynamically allocates action dimensions across actor components. Second, while this paper provides general insights into the specific characteristics of ameliorating inventory management, the modeling framework can be readily extended to integrate industry case-specific modifications, such as non-stationary environments or correlated supply-side and demand-side uncertainties. Moreover, subsequent research could investigate how endogenous sales price dynamics through pricing decisions or the market influence of large producers interact with the value of blending.
Supplemental Material
sj-pdf-1-pao-10.1177_10591478251387795 - Supplemental material for The Value of Blending—Managing Ameliorating Inventory Using Deep Reinforcement Learning by Alexander Pahr and Martin Grunow in Production and Operations Management
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
How to cite this article
Pahr A and Grunow M (2025) The Value of Blending–Managing Ameliorating Inventory Using Deep Reinforcement Learning. Production and Operations Management XX(XX): 1–22.
