Abstract
When a platform has limited inventory, it is important to offer a variety of products to each customer while managing the remaining stock. To maximize revenue over the long term, the assortment policy needs to take into account the complex purchasing behavior of customers whose arrival orders and preferences may be unknown. We propose a data-driven approach for dynamic assortment planning that utilizes historical customer arrivals and transaction data. To address the challenge of online assortment customization, we formulate the problem in a Markov decision process framework and, because of its computational intractability, employ a model-free deep reinforcement learning (DRL) approach to learn the online assortment policy. Our method uses a specially designed deep neural network (DNN) model to create assortments while observing the inventory constraints, and an advantage actor-critic algorithm to update the parameters of the DNN model, with the help of a simulator built from the historical transaction data. To evaluate the effectiveness of our approach, we conduct simulations using both a synthetic data set, generated with a pre-determined customer type distribution and a ground-truth choice model, and a real-world data set. Our extensive experiments demonstrate that our approach produces significantly higher long-term revenue than several existing methods and remains robust under various practical conditions. We also demonstrate that our approach can be easily adapted to a more general problem that includes reusable products, where customers may return purchased items. In this setting, we find that our approach performs well under various usage time distributions.
Introduction
In online assortment customization, the platform develops a policy to offer products to diverse customers who arrive over time, with the aim of maximizing total profit over a finite time horizon (Bernstein et al., 2015; Chen et al., 2024; Golrezaei et al., 2014). Many online retail scenarios, including hotel bookings, show-ticket sales, and the sale of short-life-cycle products, involve presenting customers with various assortments from a limited inventory within a finite selling period. Our problem setting can be illustrated with the example of hotel booking, as shown in Figure 1. Different customers, represented by different colors, arrive one by one over time. When a customer arrives, the customer's type is immediately revealed to the platform. The platform then offers an assortment subject to a cardinality constraint, and the customer can choose to purchase from the offered set or leave without buying anything. If a hotel is chosen, the platform receives income from the sale and reduces the corresponding inventory count by one. If the customer chooses not to purchase, the platform earns zero revenue. During the selling horizon, the platform cannot display sold-out products, which is a standard requirement in retailing applications (Gallego et al., 2015). The objective of the platform is to maximize total revenue over the selling horizon by generating assortments sequentially.

Figure 1. Online assortment customization with hotel booking as an example.
The input of our problem consists of historical arrival sequences and transaction data. However, future customer arrival orders are unknown. To address this online optimization problem, assuming an adversarial arrival pattern can be overly conservative and disregards the historical arrival data. Conversely, fitting a stationary arrival pattern from the data and assuming that different customer types arrive independently over time can be unrealistic, as customer arrivals are typically random and follow a non-stationary pattern over time (Deng et al., 2022). How to effectively utilize historical arrival data in the online assortment optimization problem remains an open question. The second challenge is to predict future customer choice behavior based on historical data. While researchers have put forth a variety of discrete choice models for such predictions, not all of them lead to a tractable assortment optimization problem. Previous works on the online assortment customization problem usually assume a specific class of parametric choice models (Bernstein et al., 2015; Golrezaei et al., 2014; Gong et al., 2022; Rusmevichientong et al., 2020). However, the platform may find it difficult to select the appropriate choice model. A tractable parametric model may result in model misspecification, as real-world customer choice patterns are typically very complex; the misspecification can propagate to the assortment decision and cause significant revenue losses. On the other hand, a comprehensive nonparametric choice model may lead to computational challenges. The final challenge concerns the limited inventory and the fact that assortment policies should evolve over the selling period. Previous research has demonstrated that straightforward reservation strategies can enhance total revenue under certain assumptions on the underlying choice model (Golrezaei et al., 2014). However, a more tailored assortment strategy aimed at maximizing long-term revenue remains to be developed.
Considering these challenges, we propose a novel deep reinforcement learning (DRL) method to learn an assortment policy from historical data. DRL, through its iterative interaction with the environment and progressive policy improvement via trial and error, has proven to be an effective strategy for tackling real-world revenue management problems with complex environment dynamics. Recent research in this area, including Oroojlooyjadid et al. (2022), Gijsbrechts et al. (2022), and Liu et al. (2024), demonstrates the power of DRL in outperforming state-of-the-art heuristics. To train our DRL agent, we construct a simulated environment in which customers arrive according to the historical arrival data and their choices are generated by a choice model, referred to as the simulator, which is fitted on the historical transaction data and exhibits high out-of-sample predictive accuracy. We then interact with each simulated customer by presenting assortments and observing their choices, iteratively improving the assortment policy. Through this interaction, our DRL agent continuously poses the counterfactual question: what would the revenue have been if we had implemented an alternative assortment policy instead of the recorded one? By leveraging responses from the simulated environment, the agent learns to refine the current policy, ultimately aiming to maximize the total expected revenue. In particular, our approach belongs to the class of model-free reinforcement learning (RL), as the policy and value functions are approximated and optimized without learning the environment dynamics explicitly. Our contributions can be summarized as follows.
First, to the best of our knowledge, this is the first study to solve the online assortment customization problem using DRL under general customer arrival processes and discrete choice models. Although DRL has been successful in various other applications, applying it to our problem is still challenging due to the unknown choice probabilities and the high-dimensional state and action spaces. This study not only provides a proof of concept that DRL, when properly adapted using the problem structure, can be a powerful tool for the online assortment customization problem, but also demonstrates its impressive performance against benchmarks from the literature.
Second, we propose several novel designs in the implementation of the RL algorithm. We develop a special architecture for the deep neural network (DNN) that uses recurrent neural network (RNN) layers to generate assortment decisions under two practical constraints: a cardinality constraint and an inventory constraint. The RNN selects products sequentially, naturally enforcing the cardinality constraint through a fixed number of selection steps and an "end-of-sequence" option, while the inventory constraint is handled by dynamically masking unavailable items. We also combine real-world sales data with a simulated environment to address the issue that RL training demands a large, sometimes unrealistic, amount of transaction data. We adopt the advantage actor-critic (A2C) algorithm to update the parameters of the DNN that approximates the optimal assortment policy. Our framework is flexible enough to incorporate customer attributes for personalization and can be extended to settings with reusable products by augmenting the state with filtered past sales information.
Third, we conduct comprehensive numerical experiments to compare our approach with existing methods. We show that our method consistently matches or outperforms benchmark policies across various settings. The results offer several key insights. First, model misspecification, which occurs when the fitted choice model does not reflect actual customer behavior, can significantly degrade the performance of the learned assortment policy. Second, incorporating rich customer features yields significant revenue improvements. Overall, fitting the Markov chain (MC) choice model to the data, combined with the A2C algorithm, performs particularly well, suggesting that the MC choice model may strike a balance between complexity and tractability in RL algorithms. Finally, to support reproducibility and future research, we publicly release our code at https://github.com/Anonymous-Manuscript/DRL-assortment.
Section 2 provides a review of the related literature. Section 3 describes our problem setting in detail. Section 4 presents the MDP formulation of our problem and describes our model together with its training algorithm. We present the numerical experiments in Section 5, including simulations based on a synthetic data set and a real-world transaction data set. More numerical experiments are included in the E-Companion, where we demonstrate the robustness and scalability of our approach. Section 6 concludes our article and provides several directions for future research.
Literature Review
Online Assortment Problem
Assortment optimization, which aims to choose the best assortment by solving a revenue maximization problem, is a well-studied topic in revenue management. Our work is closely related to several existing works about optimizing assortments with limited inventory for heterogeneous online arriving customers. Bernstein et al. (2015) consider a stochastic arrival order where the distribution of customer types is known, and the revenue maximization problem can be formulated as a dynamic program (DP). However, this formulation suffers from the notorious “curse of dimensionality” problem since we need high-dimensional state variables to record the remaining inventory of each product. They propose heuristics called
There are also several works related to our extension in which products are reusable. Under stochastic customer arrivals, the extra requirement of tracking products that are currently in use makes the DP formulation even more intractable (Rusmevichientong et al., 2020). Rusmevichientong et al. (2020) propose a 1/2-approximation algorithm based on approximate dynamic programming. Under the adversarial setting, the best algorithmic result is that the myopic policy is 1/2-competitive against the offline clairvoyant when the usage time of a product depends only on that particular product (Gong et al., 2022). In addition, Feng et al. (2024) prove that the aforementioned inventory-balancing algorithm is
The aforementioned works on online assortment customization are a special case of choice-based revenue management. A significant issue in this line of work is how to describe the choice probability of an alternative when a specific assortment is offered, and this is often captured by a specific choice model in the literature. Under the random utility maximization (RUM) principle, several choice models have been proposed, including the multinomial logit (MNL) choice model (McFadden et al., 1973), the nested logit choice model (Talluri et al., 2004), and the MC choice model (Blanchet et al., 2016). Although they all provide explanations of customer behavior, they cannot capture irrational choices. Nonparametric models such as the rank list-based model (Farias et al., 2009) and the tree-based choice model (Chen et al., 2019) have been proposed on this account. Additionally, several studies have sought to improve empirical predictive performance by integrating neural networks with choice modeling, treating the problem as a multi-class classification task. Bentz and Merunka (2000) start this line of work: they propose a partially connected neural network with shared weights and Softmax outputs for predicting choice probabilities, which tends to possess stronger predictive power than the widely used MNL choice model in their experiments. In recent years, a number of works have followed up, for example, Han et al. (2022), Aouad and Désir (2022), Gabel and Timoshenko (2022), and Cai et al. (2022), extending the previous work to deep learning with different architectures in different application scenarios. Specifically, Gabel and Timoshenko (2022) propose a scalable deep-learning model in the coupon distribution context, and Cai et al. (2022) propose both feature-based and feature-free deep learning-based choice models. Although these neural choice models benefit from strong predictive performance, optimizing assortments based on them is hard due to their complex structure. Different from the literature on choice-based revenue management that assumes a specific choice model or a class of choice models, our approach is model-free. We fill the gap in the choice-based revenue management literature by learning an assortment policy under an arbitrary choice model. By learning a policy from a simulated environment built upon historical data, our work provides a new data-driven approach in revenue management (Chen and Hu, 2023).
Deep RL
RL describes the process in which a goal-driven decision-maker interacts with the environment sequentially (Sutton and Barto, 2018). Traditional RL methods such as Q-learning (Watkins and Dayan, 1992) are adopted for problems where the environment can be modeled simply (Rana and Oliveira, 2014). In highly complex environments, DNNs are needed to act as function approximators, leading to DRL methods. DRL has achieved great success in recent years in fields such as games (Mnih et al., 2015) and traffic planning (Xie et al., 2023).
Thanks to cross-disciplinary research, many innovative technologies have been applied in the field of OR/OM (Choi et al., 2022). This includes the application of RL to classical OR/OM problems. By modeling these problems as Markov decision processes (MDPs), RL targets long-term reward maximization. For MDPs with large state spaces, exact solutions are intractable, and DRL can be applied to learn a near-optimal policy. DRL has been shown to outperform conventional heuristics in many OR/OM applications. Bello et al. (2016) and Delarue et al. (2020) apply DRL to traditional combinatorial optimization problems, such as the traveling salesman problem, the knapsack problem, and the vehicle routing problem. More recently, Kokkodis and Ipeirotis (2021) adopt DRL in online labor markets; Green and Plunkett (2022) train a DRL agent to bargain optimally for both platforms and buyers on eBay; Alomrani et al. (2022) leverage DRL to solve the online bipartite matching problem, where the DRL agent is trained to choose an offline node to match each arriving online node, with the goal of maximizing the total number of matches. In the field of revenue management (RM), Yang et al. (2022) leverage DRL to learn how to price fresh produce and disclose its true information based on the current inventory and the remaining selling time; Liu (2023) develops DRL to learn a coupon-targeting strategy; Oroojlooyjadid et al. (2022), Gijsbrechts et al. (2022), and Liu et al. (2024) train DRL agents for different inventory management tasks. Different from the above problems in RM, the assortment problem involves multiple products, and the substitution effects between products must be captured. While the actions in the above OR/OM applications are either to choose a particular node or to determine a specific value, the action in our problem is to choose a subset of products to offer to each arriving customer, which is more difficult because of its combinatorial nature.
DRL has also been adopted in recommender systems (RS); however, our work differs in both the problem it addresses and the methodology it employs. The RS literature typically studies repeated interactions with a single user (Xue et al., 2025) and uses RL to learn a policy that maps past recommendations and feedback to new recommendations to maximize engagement metrics. The major component to learn is the preference and internal state of a customer from past interactions. In contrast, the system dynamics in our application stem from the stochastic arrival of different customer types and the evolution of inventory as products are purchased over time. RL is used to learn a complex policy that depends on the combinatorial space of remaining inventory. From the perspective of methodology, Ie et al. (2019) develop the SlateQ algorithm based on DQN to propose recommendation sets. They learn a Q-value for each item and solve a static combinatorial optimization problem in which the Q-values are treated as the prices of the items. However, this optimization problem is only tractable under specific choice models such as the MNL choice model (Rusmevichientong et al., 2010). The dependence of SlateQ on the MNL choice model introduces a model misspecification problem, leading to poor performance in contexts where the MNL choice model cannot capture real choice behavior. We compare our method with other methods in the literature in Table 1.
Table 1. Comparison of our method with the ones in the literature.
We consider an online assortment optimization problem in a finite selling horizon. Each horizon consists of
Let
The customer’s choice probability of type
We solve this problem based on historical data, which provides two key types of information. First, we can infer the nonstationary arrival patterns. Since we have recorded customer arrivals by their types, one can fit the proportion of all the types and assume stationary arriving customer types, which leads to more tractable policies (Bernstein et al., 2015). We, on the other hand, allow for nonstationary arrivals of customer types that are directly extracted from the historical data. Second, historical transaction data captures customer preferences, including offered assortments and corresponding choices. Using this data, we can fit a choice model for each customer type
In this section, we describe our neural network architecture and the RL framework. First, we formalize the online assortment customization problem as an MDP. We then introduce our neural network architecture. Lastly, we describe the advantage actor-critic (A2C) training algorithm based on a simulator.
MDP Formulation
The online assortment customization problem can be formulated as a finite-horizon MDP, represented by
A policy
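To make the formulation concrete, the following sketch illustrates the MDP dynamics described in this subsection: the state records the remaining inventory, the elapsed time, and the type of the arriving customer; the action is an assortment of in-stock products satisfying the cardinality constraint; and the reward is the price of the purchased product, or zero for no purchase. This is a minimal illustration under assumed interfaces (e.g., `choice_fn`, `prices`, `K`), not the implementation released with the paper.

```python
import numpy as np

class AssortmentMDP:
    """Minimal sketch of the finite-horizon MDP for online assortment customization.
    State: remaining inventory, elapsed time, and the arriving customer's type.
    Action: an assortment (subset of in-stock products) of size at most K.
    Reward: the price of the purchased product, or 0 for no purchase."""

    def __init__(self, prices, init_inventory, horizon, choice_fn, K):
        self.prices = np.asarray(prices, dtype=float)
        self.init_inventory = np.asarray(init_inventory, dtype=int)
        self.horizon = horizon
        self.choice_fn = choice_fn   # maps (customer_type, assortment) -> chosen index or None
        self.K = K                   # cardinality constraint

    def reset(self, arrival_sequence):
        self.inventory = self.init_inventory.copy()
        self.arrivals = arrival_sequence   # list of customer types, one per period
        self.t = 0
        return self._state()

    def _state(self):
        return {"inventory": self.inventory.copy(),
                "t": self.t,
                "customer_type": self.arrivals[self.t]}

    def step(self, assortment):
        # only in-stock products may be offered, and at most K of them
        assert len(assortment) <= self.K
        assert all(self.inventory[i] > 0 for i in assortment)
        choice = self.choice_fn(self.arrivals[self.t], assortment)
        reward = 0.0
        if choice is not None:          # a purchase was made
            reward = self.prices[choice]
            self.inventory[choice] -= 1
        self.t += 1
        done = self.t >= self.horizon or self.inventory.sum() == 0
        return (None if done else self._state()), reward, done
```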

Figure 2. Our deep neural network (DNN) architecture.
We develop a DNN model to generate assortments and train it using the A2C algorithm (Konda and Tsitsiklis, 1999). There are three main streams of DRL methods for policy learning. The value-based method, such as deep Q-learning (Mnih et al., 2015), keeps track of the state-action value function
Figure 2 illustrates the architecture of our model. The input is the state vector, which contains information about both the products and the arriving customer. The inventory vector
The processed state vector is then passed through a multi-layer fully connected neural network, which models customer preferences across segments and accounts for the complex relationships among products. These relationships, including substitution and complementarity effects, are challenging to capture with parametric choice models. The output of these shared hidden layers, referred to as the “learned state”
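As an illustration of this shared encoding step, the sketch below embeds the customer type, concatenates the embedding with the inventory vector and the remaining time, and passes the result through fully connected layers to produce the learned state. Layer sizes and variable names are assumptions for illustration, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Sketch of the shared layers in Figure 2: a customer-type embedding is
    concatenated with the inventory vector and the remaining time, then passed
    through fully connected layers to produce the "learned state"."""

    def __init__(self, n_products, n_types, embed_dim=8, hidden_dim=64):
        super().__init__()
        self.type_embedding = nn.Embedding(n_types, embed_dim)
        self.shared = nn.Sequential(
            nn.Linear(n_products + 1 + embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, inventory, remaining_time, customer_type):
        # inventory: (batch, n_products); remaining_time: (batch, 1); customer_type: (batch,)
        emb = self.type_embedding(customer_type)
        x = torch.cat([inventory, remaining_time, emb], dim=-1)
        return self.shared(x)   # the "learned state" fed to the policy and value heads
```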

Figure 3. Policy network.
Our framework can be extended to incorporate consumer features instead of relying on categorical customer types, to offer personalized assortments. To achieve this, the embedding matrix is replaced with a dense layer that processes a continuous consumer feature vector to output an embedded representation. This dense layer is trained jointly with the main neural network. Experiments on real-world data, where customer features are available, are presented in Section 5.2 to demonstrate the framework’s flexibility in capturing contextual information.
We conduct ablation studies in E-Companion EC.8, to evaluate the value of key components in our model. Specifically, we evaluate the value of the customer-type embedding matrix by analyzing the performance of an architecture that excludes it. The role of the value network is assessed similarly. Furthermore, we examine the effectiveness of the RNN-based policy network by comparing it to a policy network implemented with a standard multilayer perceptron (MLP), highlighting its advantages in capturing product correlations during assortment generation.
The policy network is designed to map the learned state
The first component of the policy network is an RNN, which sequentially maps the learned state
Each
Applying the Softmax function to masked output vectors
With distribution vectors
The RNN-based policy network has three key advantages. First, it effectively handles the high-dimensional action space, where the large number of possible assortments makes it impractical to directly output a one-hot vector indicating the assortment action. Second, the sequential selection process dynamically prioritizes products by their state-aware value (e.g., Product 2 in Figure 3 is selected first due to its highest inferred value) while implicitly capturing product correlations through the passing hidden states. Last, the cardinality constraint is naturally considered by constructing the “eos” decision and the
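The sketch below illustrates this sequential generation mechanism: a recurrent cell produces up to K selection steps, sold-out or already-selected products are masked before the Softmax, and an extra "eos" option lets the network stop early. The specific recurrent cell, layer sizes, and variable names are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class RNNPolicy(nn.Module):
    """Sketch of the RNN policy head (Figure 3): starting from the learned state,
    a GRU cell runs up to K selection steps; at each step a Softmax over the
    masked product scores plus an "eos" option either picks one product or stops."""

    def __init__(self, n_products, hidden_dim=64):
        super().__init__()
        self.n_products = n_products
        self.cell = nn.GRUCell(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, n_products + 1)  # last index is "eos"

    def forward(self, learned_state, inventory, K):
        # learned_state: (1, hidden_dim); inventory: (n_products,) remaining stock
        h, x = learned_state, learned_state
        selected = torch.zeros(self.n_products, dtype=torch.bool)
        total_log_prob = torch.zeros(())
        for _ in range(K):                               # at most K selection steps
            h = self.cell(x, h)
            scores = self.score(h).squeeze(0)
            mask = torch.cat([(torch.as_tensor(inventory) <= 0) | selected,
                              torch.tensor([False])])    # "eos" is never masked
            scores = scores.masked_fill(mask, float("-inf"))
            dist = torch.distributions.Categorical(logits=scores)
            choice = dist.sample()
            total_log_prob = total_log_prob + dist.log_prob(choice)
            if choice.item() == self.n_products:         # "eos": stop adding products
                break
            selected[choice] = True
            x = h                                        # feed the hidden state forward
        return selected, total_log_prob   # assortment mask and its log-probability
```

Because the "eos" option is never masked, the masked Softmax remains well defined even when only a few products are still in stock.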
The DRL agent is data-hungry, requiring substantial data to learn accurate policy and value networks. However, in real-world applications, transaction data is often limited. To train the A2C algorithm, we construct a simulator using historical transaction data
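Conceptually, the simulator wraps the fitted choice model and samples a purchase for any assortment the agent proposes, as in the minimal sketch below; the `predict_proba` interface is an assumed placeholder, not the actual API of our released code.

```python
import numpy as np

class ChoiceSimulator:
    """Sketch of the simulator: a choice model fitted on historical transaction
    data (e.g., the MC model in our experiments) supplies choice probabilities,
    and the simulator samples a purchase for any assortment the agent offers."""

    def __init__(self, fitted_model, rng=None):
        self.model = fitted_model
        self.rng = np.random.default_rng(rng)

    def sample_choice(self, customer_type, assortment):
        # probabilities over the offered products plus the no-purchase option
        probs = self.model.predict_proba(customer_type, assortment)
        options = list(assortment) + [None]          # None = leave without buying
        return self.rng.choice(len(options), p=probs), options

    def simulate_purchase(self, customer_type, assortment):
        idx, options = self.sample_choice(customer_type, assortment)
        return options[idx]                          # product index or None
```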
Training Algorithm
We adopt the A2C algorithm to update our model by interacting with the constructed simulated environment. For a selling horizon with
Value loss measures how well the value network estimates expected future reward at each time period. At time period
Policy loss measures how well the generated action performs. For the “good” actions, we want to increase the probability of being chosen. The norm advantage
Entropy loss is used to ensure that we can explore more actions and avoid A2C converging to local optima. The entropy of time period
The total loss function is a weighted sum of the value loss, the policy loss, and the entropy loss.
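A minimal sketch of this combined objective, computed from one simulated selling horizon, is given below; the discount factor and the weighting coefficients (`value_coef`, `entropy_coef`) are illustrative placeholders rather than the values used in our experiments.

```python
import torch

def a2c_loss(log_probs, values, rewards, entropies, gamma=1.0,
             value_coef=0.5, entropy_coef=0.01):
    """Sketch of the combined A2C objective for one selling horizon.
    log_probs, values, rewards, entropies are per-period tensors collected
    while rolling out the policy in the simulator."""
    T = rewards.shape[0]
    returns = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):                 # return-to-go from period t onward
        running = rewards[t] + gamma * running
        returns[t] = running
    advantages = returns - values.detach()       # advantage estimate
    value_loss = torch.mean((returns - values) ** 2)
    policy_loss = -(advantages * log_probs).mean()
    entropy_loss = -entropies.mean()             # encourage exploration
    return policy_loss + value_coef * value_loss + entropy_coef * entropy_loss
```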
The training process is detailed in Algorithm 1. Specifically, the first two inputs are based on historical data. The recorded historical arrival sequences,
In this section, we evaluate the performance of the proposed DRL approach through semi-synthetic numerical experiments. Section 5.1 presents simulations conducted on a synthetic data set, which is constructed based on a predefined customer arrival pattern and ground-truth choice model. First, we describe the construction of the simulation environment and the generation of synthetic transaction data in Section 5.1.1. The training and testing procedures for the DRL model are detailed in Section 5.1.2, while the criterion for selecting the simulator for training is outlined in Section 5.1.3. The main results of these simulations are provided in Section 5.1.4, while the interpretation of the results and more robustness checks are shown in E-Companion EC.6 and E-Companion EC.7. Section 5.2 presents a simulation based on a real-world hotel-booking data set, where we show our framework can be extended to capture customer attributes. The simulation results are presented in Section 5.2.1, with data preprocessing steps and simulator fitting deferred to E-Companion EC.9. Finally, numerical experiments extending our framework to a reusable product setting are discussed in E-Companion EC.11.
Below we introduce several existing approaches for the online assortment customization problem to compare with our approach.
The implementation details of Myopic, Sub

Figure 4. Simulation flowchart.
This LP upper bound assumes that the customer arrival order within a selling horizon is already known, which is why it is referred to as offline. The overall feasible set of assortment sets is denoted by
In the testing results presented below, the performance of our proposed method and all benchmarks is evaluated as a fraction of the offline LP upper bound, referred to as the approximation ratio ("App Ratio" for short), providing a measure of how close each approach is to achieving optimality.
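For reference, a standard formulation of the offline LP upper bound referenced above is sketched below, under assumed notation: $T$ periods, $N$ products, $r_i$ the price and $c_i$ the initial inventory of product $i$, $g_t$ the type of the customer arriving in period $t$, $\phi_g(i,S)$ the probability that a type-$g$ customer chooses product $i$ from assortment $S$, and $\mathcal{S}$ the set of feasible assortments. The decision variable $x_{t,S}$ can be read as the probability of offering assortment $S$ in period $t$.

```latex
\begin{align*}
\max_{x \ge 0} \quad & \sum_{t=1}^{T} \sum_{S \in \mathcal{S}} x_{t,S}
                       \sum_{i \in S} r_i \,\phi_{g_t}(i, S) \\
\text{s.t.} \quad     & \sum_{S \in \mathcal{S}} x_{t,S} = 1,
                       \qquad t = 1, \dots, T, \\
                      & \sum_{t=1}^{T} \sum_{S \in \mathcal{S}:\, i \in S} x_{t,S}\, \phi_{g_t}(i, S) \le c_i,
                       \qquad i = 1, \dots, N.
\end{align*}
```

The optimal objective value of this LP, which exploits the known arrival order, upper-bounds the expected revenue of any online policy.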
In this section, we simulate an environment where customers arrive following a predefined arrival pattern and provide feedback on offered assortments based on a realistic ground-truth choice model. The overall simulation process, including synthetic data generation, simulator estimation and training, and testing, is illustrated in Figure 4, with a detailed explanation for each part provided below.
Environment Construction and Transaction Data Generation
First, we describe the basic setup of the ground-truth environment. The environment consists of 10 products indexed by
Next, we introduce the ground-truth choice model, referred to as environment feedback function
Lastly, we describe the data generation process, which includes customer arrival sequences and transaction data. Based on the defined customer arrival pattern, we simulate synthetic arrival sequences for 500 selling horizons. The first 400 sequences, denoted as
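A minimal sketch of this data generation step is given below: for each selling horizon we sample an arrival sequence, offer an assortment to every arriving customer under some logging policy, and record the choice returned by the ground-truth choice model. The function names, horizon length, and logging policy are placeholders, not the exact setup of Section 5.1.1.

```python
import numpy as np

def generate_transactions(arrival_sampler, ground_truth_choice, offer_policy,
                          n_horizons=500, horizon_length=100, seed=0):
    """Sketch of synthetic data generation: sample an arrival sequence for each
    selling horizon, offer a logged assortment to every customer, and record
    the choice drawn from the ground-truth choice model."""
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(n_horizons):
        arrivals = arrival_sampler(horizon_length, rng)        # customer types over time
        horizon_records = []
        for t, customer_type in enumerate(arrivals):
            assortment = offer_policy(customer_type, t, rng)   # logged assortment
            choice = ground_truth_choice(customer_type, assortment, rng)
            horizon_records.append((customer_type, assortment, choice))
        data.append(horizon_records)
    return data[:400], data[400:]     # training horizons, testing horizons
```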
The following sections detail the processes of training, testing, and performance evaluation. In E-Companion EC.7, we conduct robustness checks by varying initial inventory levels and arrival patterns, as well as scalability tests that extend the basic scenario to more products, more diverse customer types, and a longer selling period. Additionally, we assess the robustness of our method by changing the ground-truth choice model to the latent class MNL (LC-MNL).
DRL Training and Testing
In this section, we outline our approach to preparing inputs for Algorithm 1, implementing benchmarks, and evaluating their performance throughout the testing sequences.
We first describe how to construct the simulated environment for training our DRL agent. The simulated environment serves two purposes: simulating type-specific customer arrivals in a sequence and modeling their choice behaviors. Specifically, customer arrival sequences are defined as
Once the simulated environment is built, Algorithm 1 is executed 40 times, each denoted as one training epoch, with the model update step set as
The non-DRL benchmarks (Myopic-MNL, Sub
All approaches are evaluated using the 100 testing arrival sequences
Table 2. Out-of-Sample Log-Likelihood of Choice Models.
Notes: Numbers in the brackets indicate in-sample log-likelihood on the training data.
MNL = multinomial logit; LC-MNL = latent class multinomial logit; RCS = random consideration set; MC = Markov chain.
In this section, we compare the prediction ability of different choice models on the training transaction data, examine the impact of the simulator on the performance of our A2C method, and provide a criterion for selecting the simulator within our framework.
Different simulators exhibit varying abilities to fit the historical transaction data, leading to different levels of deviation from the ground-truth choice model. To evaluate these differences, we split the training transaction data into two parts: a training part (80%) and a testing part (20%). We fit all choice models on the training part and use their log-likelihood on the testing part, named the out-of-sample log-likelihood, to compare their fitting performance, as suggested by Berbeglia et al. (2022). The results are summarized in Table 2. Each column in the table corresponds to a specific choice model, including MNL, LC-MNL, RCS, MC, and Gated-Assort-Net. Each row represents the fitting result for a particular customer type, with the proportion of each customer type displayed in the leftmost column. The last row provides the average performance of each choice model across all customer types. For the Gated-Assort-Net model, we save the version with the best in-sample log-likelihood during the 10-epoch training process. To assess potential overfitting, we also report the best in-sample log-likelihood in brackets in the last two columns. The results show that MC achieves the best out-of-sample predictive performance overall, followed closely by Gated-Assort-Net, while LC-MNL, RCS, and MNL perform significantly worse. Moreover, for customer types 3 and 4, all choice models show lower log-likelihood values. This is probably a result of the larger consideration sets of these customer types, which makes their behavior patterns harder to capture. Additionally, both the MC and Gated-Assort-Net models exhibit overfitting for customer types 3 and 4, evidenced by their superior in-sample performance relative to their out-of-sample results, although the problem is less severe for MC.
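The evaluation procedure can be summarized by the short sketch below, which fits a model on 80% of the transactions and scores the held-out 20%; the record format and the `choice_prob` interface are assumptions for illustration.

```python
import numpy as np

def split_80_20(transactions, seed=0):
    """Randomly split transactions into an 80% training part and a 20% testing part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(transactions))
    cut = int(0.8 * len(transactions))
    return [transactions[i] for i in idx[:cut]], [transactions[i] for i in idx[cut:]]

def out_of_sample_log_likelihood(model, test_data):
    """Average log-likelihood of a fitted choice model on held-out transactions.
    Each record is assumed to be (customer_type, assortment, chosen_item)."""
    ll = 0.0
    for customer_type, assortment, chosen in test_data:
        p = model.choice_prob(customer_type, assortment, chosen)
        ll += np.log(max(p, 1e-12))     # guard against zero probabilities
    return ll / len(test_data)
```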
Better out-of-sample performance indicates stronger predictive ability in unseen scenarios. Since accurate predictions for out-of-data choice scenarios are essential for effective learning in A2C through simulations, we recommend selecting the choice model with the best out-of-sample predictive performance as the simulator. To test this hypothesis, we train A2C agents using different choice models as simulators. Additionally, since the ground-truth choice model is available, we also train an A2C agent using the ground-truth choice model as the simulator. This allows us to quantify the optimality gap of our approach while isolating the impact of model misspecification in the simulator.
By comparing the performance of A2C agents trained with ground-truth and fitted choice models as simulators, as shown in Figure 5, we verify our criterion for simulator selection. The results show that A2C agents trained with MC and Gated-Assort-Net as simulators achieve the best performance, while those trained with LC-MNL, RCS, and MNL perform worse. This ranking aligns with the out-of-sample fitting performance of the choice models discussed earlier, validating our recommendation to use the choice model with the best out-of-sample predictive performance as the simulator. Furthermore, no significant performance difference is observed between A2C agents trained with MC and those trained with the ground-truth choice model as simulators, indicating that the MC model approximates the ground-truth choice model effectively and can reliably simulate choice scenarios. Finally, we observe that A2C agents trained with the ground-truth, MC, and Gated-Assort-Net simulators achieve near-optimal performance, with an approximation ratio larger than 0.97. In conclusion, we suggest selecting the choice model with the best out-of-sample performance in our framework to provide reliable feedback during the A2C training process.

Figure 5. Testing results of advantage actor-critic (A2C) trained with six different simulators.
This section reports the testing results of different approaches under the basic setting.
Figure 6 presents the training curve of A2C and a comparison of the testing results across all approaches. In the left panel, the solid curve shows the mean validation result during the training process, with vertical lines indicating standard error across three training runs. The training stabilizes after around 15 epochs, with reduced variance across different initializations, suggesting the robustness of our method. In the right panel, each bar represents the mean approximation ratio of an approach, with black lines showing deviation across 10 testing runs. The results reveal several key insights. First, our A2C algorithm achieves a testing approximation ratio over 0.97 (right panel), outperforming all benchmarks. Note that this may differ from the validation results in the left panel due to discrepancies between the simulator used for validation and the ground-truth choice model used for testing. Among the benchmarks, DP-GR, DP-RO, and SlateQ achieve approximation ratios around 0.95, representing the best within this group. In contrast, the Random policy performs worst, with a mean ratio of 0.86, as expected due to its lack of learning from historical data. The Myopic policy improves upon Random by leveraging customer preferences, while EIB and Sub

Figure 6. Training and testing results with the setting defined in Section 5.1.1.
Here we conduct simulations based on a real-world data set from Expedia, which contains hotel transactions, including customer search sessions, the recommended assortments, the attributes of the products in those assortments, and the choices made in each session. We choose this data set from a practical point of view: the total number of rooms is fixed over a time horizon. We extract a stream of customer arrival sequences and the transaction record of each arriving customer from this data set and manually set the initial inventory level of each hotel to construct the environment. The data pre-processing procedure is detailed in E-Companion EC.9.
The processed data set used in this section consists of 30 unique products and 4,272 search queries spanning 211 days. In E-Companion EC.7.2, we expand the data set to a case with 100 unique products to test the scalability of our approach. Each search query comprises the recommended products and the customer's final choice. For each search query, we also observe six customer features, such as a weekend indicator, and six product features, such as the star rating of a hotel. Detailed descriptions of the product and customer features are provided in E-Companion Tables EC.8 and EC.9.
While our framework can adapt to a contextual setting by directly incorporating customer features, the benchmark methods require predefined customer types. Consequently, we apply K-means clustering to group customers into four types based on their features. Since the underlying choice model is unknown, our first step is to identify a ground-truth choice model by fitting various choice models to the transaction data and selecting the one with the best predictive performance. There are two classes of candidate choice models. The first class consists of feature-based choice models, including MNL-feature and Gated-Assort-Net-feature, which take feature vectors as input and output the choice probability of each product; the second class consists of choice models that take product indexes, without features, as input and output probabilities. For the first class, we concatenate the six-dimensional feature vector of a product and the six-dimensional customer feature vector of a search query into a 12-dimensional feature vector, which serves as the input. We fit six choice models, namely MNL-feature, Gated-Assort-Net-feature, MNL, LC-MNL, MC, and Gated-Assort-Net, on the overall data set, which includes transaction data of all customer types. We also fit MNL, LC-MNL, MC, and Gated-Assort-Net models on the transaction data of each clustered customer type. For each choice model, we split the data into 80% training data for estimation and 20% testing data. The log-likelihood on the testing data, named the out-of-sample log-likelihood, is used to evaluate the predictive power of the models, with results summarized in Tables 3 and 4.
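The two preprocessing steps described above, building the 12-dimensional input of the feature-based models and clustering customers into four types for the benchmarks, can be sketched as follows (function names are illustrative; scikit-learn's KMeans is used here only as an example implementation).

```python
import numpy as np
from sklearn.cluster import KMeans

def build_feature_input(product_features, customer_features):
    """Concatenate the 6-dim product feature vector of a product and the 6-dim
    customer feature vector of a search query into a 12-dim model input."""
    return np.concatenate([product_features, customer_features])  # shape (12,)

def cluster_customers(customer_feature_matrix, n_types=4, seed=0):
    """Group customers into the four types required by the non-contextual benchmarks."""
    km = KMeans(n_clusters=n_types, random_state=seed, n_init=10)
    return km.fit_predict(customer_feature_matrix)   # one type label per customer
```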
Table 3. Log-likelihood of choice models fitted on overall data with all customer types.
Notes: (1) In-sample and (2) out-of-sample.
MNL = multinomial logit; LC-MNL = latent class multinomial logit; MC = Markov chain.
Table 4. Out-of-sample log-likelihood of choice models fitted on segmented data.
Notes: Numbers in the brackets indicate in-sample log-likelihood on the training segmented data.
MNL = multinomial logit; LC-MNL = latent class multinomial logit; MC = Markov chain.
We have the following observations. First, the Gated-Assort-Net-feature model achieves the best performance, illustrating the value of contextual information in predicting customer choices. In contrast, the poor performance of the MNL-feature model is likely caused by model misspecification, that is, the underlying relationship between utility and features is nonlinear. Second, fitting choice models separately for each customer type performs worse than fitting a single model on the entire data set, suggesting that market segmentation is not always beneficial. With only 4,000 training samples, dividing them into smaller groups reduces the data available per segment, weakening predictive power and causing overfitting, as seen in the in-sample versus out-of-sample performance gap for customer types 3 and 4 in Table 4. This highlights a trade-off between model accuracy and customization under limited data. Our A2C approach mitigates this by learning directly from a feature-based choice model fitted on the full data set, achieving both accurate modeling and effective personalization. Based on these results, we adopt the Gated-Assort-Net-feature model as the ground-truth choice model and use it as the simulator when training our A2C and SlateQ agents, demonstrating our framework's ability to handle contextual scenarios by directly incorporating customer features.
To evaluate the value of customer features, we compare two types of A2C agents. The first, referred to as A2C-F, takes a customer feature vector as input, followed by a fully connected embedding layer. The second, called A2C-T, uses a customer-type vector, which is pre-classified by K-means, as input, followed by an embedding matrix. Other benchmarks are implemented using MNL models fitted to the clustered customer types.
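The only architectural difference between the two agents is the input head, as sketched below: A2C-F embeds the continuous customer feature vector through a dense layer, whereas A2C-T looks up a pre-classified customer type in an embedding matrix. The dimensions shown (six features, four types) follow this section; other names are illustrative.

```python
import torch.nn as nn

class FeatureHead(nn.Module):
    """A2C-F sketch: a dense layer embeds the continuous customer feature vector."""
    def __init__(self, feature_dim=6, embed_dim=8):
        super().__init__()
        self.embed = nn.Linear(feature_dim, embed_dim)

    def forward(self, customer_features):      # (batch, feature_dim) float tensor
        return self.embed(customer_features)

class TypeHead(nn.Module):
    """A2C-T sketch: an embedding matrix maps the pre-classified customer type."""
    def __init__(self, n_types=4, embed_dim=8):
        super().__init__()
        self.embed = nn.Embedding(n_types, embed_dim)

    def forward(self, customer_type):          # (batch,) long tensor of type indices
        return self.embed(customer_type)
```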
For training, we allocate customer sequences from the first 160 days and use the remaining 51 days for testing. The initial inventory level of each hotel is manually set to 5. We train our model for 40 epochs, with each epoch covering all 160 training customer sequences. After each epoch, we validate the current parameters of the model. The training process is repeated 3 times independently. Testing is conducted 10 times for each approach using the testing customer sequences to account for stochastic customer choice behavior.
The training and testing results are summarized in Figure 7. The performance of both A2C-F and A2C-T improves during training, demonstrating that our framework is effective in policy learning with either continuous customer features or categorical customer-type information. The deviation across the three training runs with different initial model parameters diminishes over time, indicating stable convergence for both agents. Interestingly, while A2C-T keeps improving in the first 20 epochs and stays stable afterwards, A2C-F improves consistently throughout training and surpasses A2C-T in the last 20 training epochs, suggesting that incorporating finer-grained customer feature vectors enhances the agent's ability to learn an effective policy.

Figure 7. Training curve and testing result.
In testing, A2C-F achieves the highest revenue followed by A2C-T. Note that the only difference between these two agents lies in the granularity of the customer input: A2C-F utilizes detailed customer feature vectors, whereas A2C-T relies on pre-classified customer-type vectors. This comparison highlights the generality and flexibility of our framework in accommodating various levels of customer heterogeneity. The superior performance of A2C-F underscores its advantage in leveraging rich customer feature information, whereas A2C-T’s competitive results demonstrate the framework’s effectiveness in handling predefined customer types.
Among the benchmarks, SlateQ ranks third but significantly underperforms compared to A2C-F and A2C-T. Despite using the same Gated-Assort-Net-feature simulator, SlateQ’s reliance on the MNL choice model to guide actions introduces model misspecification issues, which hinder its performance. Of the remaining benchmarks, DP-RO achieves the best results but still lags behind SlateQ and the A2C agents. These findings demonstrate the challenges faced by MNL-based benchmarks, particularly as the number of products increases and customer behavior becomes more complex (e.g., context-dependent decision-making).
These findings confirm the efficacy of our A2C framework in tackling the challenges of modeling customer behavior and solving the assortment optimization problem. The ability to accommodate both granular customer features and predefined customer types further underscores the flexibility and robustness of our approach. Similar trends are observed in the 100-product case, detailed in E-Companion EC.7.2, where the advantage of A2C-F over A2C-T is even more evident.
We address the online assortment customization problem under inventory and cardinality constraints using a novel data-driven approach based on DRL. Departing from traditional methods that rely on specific choice models or their variants, we propose a generalizable framework that learns customized assortment policies through interactions with a simulated environment. This environment, constructed from historical transaction data, enables the DRL agent to try out assortment actions and observe purchasing outcomes through a feedback mechanism. Leveraging a specially designed DNN and the A2C training algorithm, our model learns an approximately optimal policy and extends to settings involving customer features and reusable products.
Our study contributes to the application of DRL in revenue management and demonstrates its potential through comprehensive numerical experiments. The key insights from our work are as follows:
We also acknowledge that our work has limitations. One major limitation is its reliance on historical data for simulator construction: the effectiveness and adaptability of our approach depend on the quality, quantity, and representativeness of the data. A theoretical investigation of the relationship between data quality and model performance is an important direction for future work.
This study opens up several avenues for further research. First, future work can explore settings where customer choice depends on product positions within the assortment or across multiple pages due to space constraints (Abeliuk et al., 2016; Feldman and Segev, 2022). Integrating these behavioral effects into our framework can provide a more comprehensive solution for online platforms. Second, pricing is a critical lever to enhance revenue during the selling horizon. While prior research has addressed this issue under specific models like the MNL (Lei et al., 2022), extending our framework to jointly optimize assortment and pricing strategies is a promising direction. Finally, real-world settings often involve the introduction of new products and the discontinuation of existing ones. Addressing this dynamic through transfer learning, as explored by Oroojlooyjadid et al. (2022), could significantly enhance the adaptability of our approach.
Acknowledgments
We thank the department editor, the senior editor, and two anonymous referees whose comments substantially improved this article. Yao Wang is grateful for the support of the National Natural Science Foundation of China under Grant 12371513, and the Major Program of National Fund of Philosophy and Social Science of China under Grant 23&ZD135. This work was conducted while Tao Li was affiliated with Xi’an Jiaotong University, and the revisions were completed during his PhD at the Hong Kong University of Science and Technology.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
How to cite this article
Li T, Wang C, Wang Y, Tang S, Chen N (2026) Deep Reinforcement Learning for Online Assortment Customization: A Data-Driven Approach. Production and Operations Management 35(2): 665–684.