Diamonds in the Rough: Leveraging Click Data to Spotlight Underrated Products

Abstract

We study the click and purchase behavior of customers in an online retail setting by employing a structural estimation approach. In particular, we aim to understand the impact of the information available to the customer before and after the click on the customer’s search and purchase behavior. We propose a sequential discrete choice framework to model the customer’s search strategy, where the customer repeatedly decides between continuing her search by clicking on a product, or stopping her search and making a purchase/no-purchase decision. By combining the click and order data, our proposed structural framework allows us to disentangle and separately estimate the attractiveness of a product before and after the click. This, in turn, allows us to identify underrated products which we call diamonds in the rough: these are products that have low pre-click but high post-click attractiveness; thus, even though such products have a low chance of being clicked, they have a high chance of being purchased, if clicked. The proposed framework provides an online retailer with new tools and insights to better manage the product assortment based on customer click and purchase behavior. We estimate our model on a data set from the Chinese retailer JD.com. Through simulation studies, we illustrate how our model can be operationalized and used for improving assortment decisions by accounting for the customers’ search behavior. In particular, we focus on a subset of 126 substitutable products as a representative sample of the data and find that the optimal assortments under our model significantly increase the expected revenue compared to the actual assortments displayed by JD.com, and two multinomial logit (MNL) benchmarks.

Keywords

Click Data, Consumer Search Model, Structural Estimation, Online Retailing

1. Introduction

Motivation and Objective. E-commerce is one of the fastest-growing sectors in the digital economy, with sales of $766.77 billion in the United States in 2019 and compound growth of 20.8% during 2014–2019 (IBISWorld 2019). This boom in e-commerce does not show any signs of slowing down. According to eMarketer, worldwide e-commerce sales reached $6 trillion in 2024. The Coronavirus pandemic has also contributed significantly to the growth of online shopping. Customers who rarely shopped online for some of their purchases (such as groceries) are now shopping purely online. Such rapid growth in online shopping provides retailers with an abundance of data to study customer shopping behavior. For example, online retailers can collect and leverage customers’ click data to study their search process. In addition, they can complement click data with customers’ order data to study the connection between customers’ click and purchase decisions. By studying customer shopping behavior and better understanding what triggers customers’ clicks and purchases, online retailers can improve operational decisions such as assortment planning.

Figure 1.

JD.com search page for earbuds (left), and the product page of a specific earbud (right).

The goal of this article is to study the click and purchase behavior of customers in an online retail setting by employing a structural estimation approach that combines customers’ click and order data. In particular, our proposed framework disentangles and separately estimates the attractiveness of products before and after the click. This, in turn, allows us to identify underrated and overrated products on the search page, and use the insights for improving assortment decisions.

Our work is motivated by the data sets that were provided by the MSOM society (Shen et al. 2020). The data sets provide transaction level data from JD.com which is one of the two largest Chinese online retailers. These data sets capture a “full customer experience cycle” from the time of click to purchase to receiving the product. The data sets provide information on close to 400,000 customers and 30,000 SKUs in one specific (unnamed) product category during the month of March in 2018. In this article, we focus on the click and order data. In particular, each entry in the click data set (which contains close to 6 million customer-tied click records) includes the user and SKU IDs, so we can track a user’s browsing history. We also have the time-stamps associated with each click, which allow us to construct a “click sequence,” that is, the sequence in which a customer browses different SKUs.

Upon visiting the online retailer’s website, the customer is presented an assortment of products. In order to make her purchase decision, the customer might decide to click on a subset (or all) of the available products in order to collect further information about them. Specifically, the customer can observe certain characteristics of each product on the search page. Such observed characteristics might include general product attributes such as product price, brand, and average customer reviews which are readily available to customers on the website (search page); the left panel of Figure 1 shows an example for earbuds on JD.com search page. Based on such observed characteristics, the customer might find a subset of products attractive to explore, in which case she clicks on these products and visits their description (product) pages. On each product page, the customer can obtain a much more detailed description of the product. Specifically, there are some characteristics of the product which are not observable to the customer on the initial search page and are revealed only after the customer clicks on that product. Such pre-click unobserved characteristics might include product description, product specifications, detailed customer reviews, and even an instruction video as illustrated in the right panel of Figure 1. Thus, we define the observed product utility as the utility the customer gets from the product based on the product characteristics that are visible on the search page before the click. Similarly, we define the unobserved product utility as the utility the customer gets from the product based on the product characteristics that are only visible on the product page after a customer clicks on that product.

Even though the customer makes her click decision based on the observed product utilities, the customer’s final purchase decision is based on all the information acquired about the products consisting of both the observed and unobserved utilities. Hence, customers’ click decisions are impacted only by products’ observed utilities while their purchase decisions are impacted by both the observed and unobserved utilities (i.e., products’ total utilities which are the sum of observed and unobserved utilities). Thus, there is an important distinction between the observed and unobserved product utilities and how they impact the chance of a product being clicked or purchased. The retailer can benefit from disentangling the observed and unobserved parts of the product utility as this allows the retailer to classify products into different groups based on their pre- and post-click attractiveness (utilities), as shown in Table 1.

Table 1.

Product classification based on the observed and total utilities.

Observed utility	Total utility	Product
before click	after click	type
High	High	Popular
High	Low	Overrated
Low	High	Diamond in the rough
Low	Low	Unpopular

As can be seen in Table 1, products can be divided into four categories based on their observed and total utilities. The first and last categories of products have either consistently high or low utilities both before and after the click and as a result, have simultaneously high or low chances of being clicked and purchased—we refer to these as “popular” and “unpopular” products, respectively. The second group of products, which we call “overrated,” are those with high observed but low total utilities (due to their low unobserved utilities). These are products that have a high chance of being clicked by customers because of their high “face” values (i.e., observed utilities), but their chance of being purchased is low due to their low total utilities. In other words, obtaining more information on the product page reduces the attractiveness of these products for customers. Hence, these products may discourage customers from continuing their search (and perhaps making a purchase), especially those with higher search costs. On the contrary, the third group of products, which we call “diamonds in the rough”, are those with low observed utilities but high total utilities (due to their high unobserved utilities). Thus, even though such products have a low chance of being clicked, they have a high chance of being purchased, if clicked.
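The classification in Table 1 can be written as a small decision rule. The sketch below is illustrative only: the class name `ProductUtility`, the function `classify`, and the two cutoffs that separate "high" from "low" are hypothetical, since the text does not fix a threshold at this point.

```python
from dataclasses import dataclass


@dataclass
class ProductUtility:
    sku: str
    observed: float    # utility visible on the search page (pre-click)
    unobserved: float  # utility revealed on the product page (post-click)

    @property
    def total(self) -> float:
        # Total utility is the sum of the observed and unobserved parts.
        return self.observed + self.unobserved


def classify(product: ProductUtility,
             observed_cutoff: float,
             total_cutoff: float) -> str:
    """Map a product to one of the four types in Table 1.

    Both cutoffs are hypothetical: how to split "high" from "low"
    is a modelling choice that this sketch leaves to the caller.
    """
    high_observed = product.observed >= observed_cutoff
    high_total = product.total >= total_cutoff
    if high_observed and high_total:
        return "popular"
    if high_observed:
        return "overrated"
    if high_total:
        return "diamond in the rough"
    return "unpopular"
```

For instance, with both cutoffs set to zero, a product with observed utility -0.8 and unobserved utility 2.0 (made-up values) has total utility 1.2 and is labeled a diamond in the rough.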

To see the value of this categorization, suppose that a product has low sales. This product belongs to category 2, 3, or 4 with at least one of its observed or unobserved utilities being low. If the product belongs to the diamond in the rough category, the retailer may decide to promote it on the search page to compensate for its low observed utility and increase its chance of being clicked and consequently purchased. If the product belongs to the overrated category, the retailer may decide to demote it on the search page or even remove it from the assortment so that customers do not spend time browsing but not purchasing it. Finally, if the product belongs to the unpopular category, the retailer may decide to keep or remove it based on other assortment considerations (e.g., assortment depth or relationships with vendors) as the product is somewhat “innocuous.”

Model. Motivated by the discussion above, we propose a structural framework to disentangle the observed and unobserved utilities of products using customers’ click and purchase (order) data. The structural model that we propose allows us to control for several factors that impact customers’ click and purchase decisions such as customer heterogeneity, products’ observed and unobserved utilities, click sequences, and product substitution (by controlling for the offered assortments). Controlling for all such factors is not possible using reduced-form models or other approaches that use aggregate sales and click data. For example, given that the substitution pattern between two products depends on all products in an offered assortment, reduced-form models cannot effectively capture the substitution patterns as these models need to control for each product and each possible assortment using separate sets of coefficients. This makes the estimation procedure computationally intractable given that the number of possible assortments grows exponentially in the number of products. Moreover, if a particular assortment or click sequence has not been observed in the data, making predictions or performing counterfactuals is not possible using reduced-form models.

In our structural framework, we model the customer click and purchase behavior using a sequential discrete choice model. We assume that customers are utility maximizers and choose the option that maximizes their utility. Specifically, at any point during the search process, the customer must decide whether to continue her search by clicking on an unexplored product, or stop her search and make a purchase/no-purchase decision among the clicked products. The search, however, is costly for customers. Therefore, customers should balance the trade-off between the search cost and the benefit of finding a more desirable product during the search process. We assume that customers follow a one-step lookahead policy to decide their search strategy. Specifically, we assume that customers anticipate their expected maximum utility from a subsequent purchase/no-purchase decision, if they were to make one additional click. We note that our model incorporates heterogeneity across customers and products by assuming that the utilities are customer and product dependent. We also allow the search cost to vary across customers.
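To make the continue-or-stop trade-off concrete, the following sketch implements one possible one-step lookahead rule. It is not the specification of Section 4. It assumes i.i.d. standard Gumbel shocks (so that the expected maximum utility has a log-sum-exp form), a no-purchase option with utility 0, and a customer who evaluates an unclicked product at its observed utility. The names `inclusive_value` and `next_click` are hypothetical.

```python
import math
from typing import Dict, Iterable, Optional


def inclusive_value(utilities: Iterable[float]) -> float:
    """Expected maximum utility, up to an additive constant, over the given
    options plus a no-purchase option with utility 0.

    The log-sum-exp form assumes i.i.d. standard Gumbel shocks, which is
    an assumption of this sketch.
    """
    return math.log(1.0 + sum(math.exp(u) for u in utilities))


def next_click(known_total: Dict[str, float],
               observed: Dict[str, float],
               search_cost: float) -> Optional[str]:
    """Return the SKU to click next, or None to stop searching.

    known_total: total utilities of the products already clicked.
    observed:    observed (pre-click) utilities of all displayed products.
    An unclicked product is valued at its observed utility when the
    customer anticipates one more click (assumption).
    """
    stop_value = inclusive_value(known_total.values())
    best_sku, best_gain = None, 0.0
    for sku, u_obs in observed.items():
        if sku in known_total:
            continue  # already explored
        continue_value = inclusive_value(list(known_total.values()) + [u_obs])
        gain = continue_value - stop_value - search_cost
        if gain > best_gain:
            best_sku, best_gain = sku, gain
    return best_sku
```

Under this rule a higher search cost shortens the search: a click is made only when the anticipated gain in expected maximum utility exceeds the cost of making it.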

The model that we propose has two distinct features. First, in our model the customer’s utility from each product has two main parts (in addition to a random shock), one of which is observed and the other unobserved by the customer prior to the click. This characterization allows us to disentangle the attractiveness (utility) of a product to a customer before and after it is clicked. Second, our model considers not only the clicks of a customer but also the sequence (order) in which the products are clicked. This is inspired by an interesting observation that we make from the data: the product sales are decreasing in the rank (order) of products in the click sequences. This shows that the first couple of clicked products seem to have the highest chance of being purchased. Therefore, the sequence of products that a customer browses (clicks) might indicate the order of attractiveness of those products to that customer.

We take a data-driven approach to identify a subset of products that are substitutable. Specifically, we consider the customers’ click histories and find a subset of products that have been frequently clicked together. This ensures that these products have been considered by customers in the same shopping event, and hence, have a high degree of substitution. We then take a representative sample of 126 products and 30,000 random customers. We estimate our model on this sampled data using maximum likelihood estimation. Our main findings and contributions are discussed next.

Main Contributions. The contributions of this article are as follows:

Methodological Contribution: Our key contribution to the existing literature is methodological: we propose a structural framework that uses customers’ click and purchase data to disentangle the observed and unobserved utilities of products. To the best of our knowledge, this is the first work that separately estimates the products’ attractiveness before and after the click.

Novel Empirical and Managerial Insights: Our estimation results reveal novel insights about customer click and purchase behavior. Such insights cannot be achieved using the structural or reduced-form models in the extant literature due to their shortcoming in controlling for several contributing factors that affect customer decisions. These insights are outlined below:

Value of Click: The estimation results show that the unobserved utilities vary significantly across products. Interestingly, our results show that the correlation between the observed part of product utility and the product-specific unobserved part is very small. As a result, a product with a high observed utility may no longer be attractive to a customer after it is clicked, and vice versa. Our empirical results therefore show that clicking on products is quite valuable for customers, as it could lead to finding a more attractive product or discarding a less attractive one by learning the (pre-click) unobserved utilities of products.

Diamonds in the Rough: Our structural framework allows a retailer to identify different types of products. In particular, our framework indicates that there might exist underrated products that can be thought of as diamonds in the rough. These are products that have low observed but high total utilities (due to their high unobserved utilities). Hence, such a product has a low chance of being discovered (clicked) by customers, but a high chance of being purchased, if clicked. On the contrary, there might exist overrated products with high observed but low total utilities. Identifying such products can help the retailer better manage the product assortment based on the customer click and purchase behavior.

Insights for Assortment Planning: Through simulation studies, we illustrate how our model can be operationalized and used for improving assortment decisions. Specifically, we show that the optimal assortments under our model can significantly increase the expected revenue. In particular, we find that the optimal assortments suggested by our model improve the expected revenue by more than 24% on average over the actual assortments displayed by JD.com, and by more than 35% over two multinomial logit (MNL) models that use either the observed or the total product utilities to find the optimal assortments. Moreover, we find that the revenue improvement seems to have an inverted-U shape as a function of the assortment size. That is, as the assortment size increases given a full set of available products, the revenue improvement generally increases up to a point and then starts decreasing. The intuition for this is that as the number of possible assortments that can be displayed to the customer increases, there is more potential for revenue gain by offering a different (optimal) assortment. Therefore, the inverted-U shape is the result of increasing the assortment capacity size (i.e., choosing a larger subset of products), which first increases and then decreases the number of feasible assortments that can be offered. All in all, our results show that in settings where customers explore different products to find their ideal option, accounting for the customer’s search behavior and specifically the unobserved product utilities, which will be realized only after the click, can change the composition of the optimal assortment and significantly increase the revenue.
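The counting argument behind the inverted-U shape can be checked directly. The sketch below assumes that an assortment of capacity k means a subset of exactly k products out of n, so that the number of feasible assortments is the binomial coefficient C(n, k). The function names are hypothetical.

```python
from math import comb


def feasible_assortments(n_products: int, capacity: int) -> int:
    """Number of distinct assortments with exactly `capacity` products,
    assuming any subset of that size may be offered."""
    return comb(n_products, capacity)


def peak_capacity(n_products: int) -> int:
    """Smallest capacity that maximizes the number of feasible assortments."""
    counts = [feasible_assortments(n_products, k)
              for k in range(n_products + 1)]
    return counts.index(max(counts))


def total_assortments(n_products: int) -> int:
    """Number of assortments of any size, which is 2 ** n_products."""
    return sum(feasible_assortments(n_products, k)
               for k in range(n_products + 1))
```

With 126 products, the count C(126, k) rises until k = 63 and falls afterwards, while the total over all sizes equals 2^126, which illustrates why enumerating assortments quickly becomes impractical.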

Organization of the Article. Section 2 provides a review of the relevant literature. Section 3 describes the data and the motivation for using a structural model. Section 4 presents the structural model of customer click and purchase behavior. Section 5 discusses the estimation approach and results, while Section 6 provides the managerial implications of our findings. Section 7 provides concluding remarks. All proofs and complementary materials are relegated to the E-Companion in the appendix.

2. Literature Review

Our work is mainly related to two streams of literature: consumer search models, and choice models with consideration sets. In what follows, we review the most relevant literature in more detail.

The first stream of research related to our work studies consumer search models. In marketing there is an extensive and growing body of empirical literature on this topic. For instance, Hortaçsu and Syverson (2004) and Hong and Shum (2006) develop structural approaches to estimate the distribution of consumer search costs using aggregate data. In contrast, De los Santos et al. (2012) leverage individual-level data on web-browsing to study which classical search model is more consistent with observed data patterns. See also Koulayev (2014) and Honka and Chintagunta (2017) for similar studies. In contrast to our work, Koulayev (2014) observes click-stream data on search but not purchase activities; Honka and Chintagunta (2017) identify consumers’ search methods (sequential vs. simultaneous) by using a data set that contains individual-level information on consumers’ consideration sets and final purchases but not search sequences. We, however, study the click and purchase decisions of customers together.

There is also a great body of work in marketing which examines the effect of rankings (positions) on consumer choices. For example, De los Santos and Koulayev (2017), Ghose et al. (2012), Ghose et al. (2014), Koulayev (2014), and Ursu (2018) study the effect of rankings on consumer online choices in the hotel industry. We, however, disentangle the drivers of customer click versus purchase decisions, and use the obtained insights for assortment planning.

Closest to our work is that of Chen and Yao (2017) and Ursu (2018). Chen and Yao (2017) propose a structural model of consumer sequential search under uncertainty about attribute levels of products. Their model integrates consumers’ decisions of search and refinement (sorting and filtering) on an online platform. They use consumer click-stream data of an online hotel booking website and show that refinement tools have significant effects on consumer behavior and market structure. Our model is different from Chen and Yao (2017) in a number of ways: first, Chen and Yao (2017) focus on studying the impact of online refinement tools such as sorting and filtering on consumer search; our goal, however, is to disentangle the product attractiveness before and after the click and use the obtained insights for assortment planning; second, in Chen and Yao (2017) the customer’s optimal search strategy is characterized through the reservation utility model of Weitzman (1979), while we propose a one-step lookahead search policy as described in Section 4.3—a more detailed comparison of our search model with the reservation utility model of Weitzman (1979) is presented in Section 4.5. Ursu (2018) studies the impact of product rankings (positions) on online consumer search and purchase decisions. The author shows that rankings affect the consumer search by lowering the search costs, but conditional on search, do not affect purchases. 
There are a few major differences between our work and Ursu (2018): first, Ursu (2018) focuses on studying the impact of product positions on consumer search, while we aim at disentangling the drivers of click versus purchase decisions and use the obtained insights for assortment planning; second, our models are different in that although we both consider unobserved product utilities which are revealed to customers after the click, Ursu (2018) assumes that the unobserved utilities follow a standard normal distribution (i.e., product utilities do not change on average after the click), while we estimate the unobserved utilities for each product separately by allowing them to change after the click, which in turn allows us to identify different types of products based on their attractiveness before and after the click; finally, Ursu (2018) uses the reservation utility model of Weitzman (1979) while we propose a one-step lookahead search policy.

In the operations management literature several papers study the impact of consumer search cost on pricing, assortment planning, and market expansion (Cachon et al. 2005, 2008; Wang and Sahin 2018). In particular, Wang and Sahin (2018) propose a “consider-then-choose” model with search costs and study the assortment and pricing optimization problems. Our paper mainly differs from the aforementioned work in two ways: first, ours is a sequential search model, while that in Wang and Sahin (2018) is not a search model and only focuses on the consideration set formation (size) which is impacted by search costs; second, they study the assortment and pricing optimization problems analytically, while our contribution is largely empirical by developing a structural model for estimating the consumer search and purchase behavior. Another related work is Derakhshan et al. (2022) which proposes a polynomial-time approximation scheme for the online platform’s rank optimization problem in a setting with consumer sequential search. While Derakhshan et al. (2022) focus on optimizing the permutation of products on the platform, we focus on identifying different types of products by disentangling their attractiveness before and after the click. Moreover, Derakhshan et al. (2022) assume that all customers search the product positions in the exact same order (from top to bottom), while we do not limit the customer search pattern in our model.

The second stream of research related to our work studies consumer choice models with consideration sets. There is an extensive empirical literature in marketing and economics on consideration set formation. For example, Mehta et al. (2003), Kim et al. (2010), Kim et al. (2017), Honka (2014), and Chan and Park (2015) propose structural models for the formation of consideration sets resulting from consumer search, and model consumer purchase conditional on the consideration sets. The main difference between such works and ours is that the search sequences or the purchase decisions are not studied in the aforementioned papers. In contrast, we study both these decisions.

In recent years there has been a growing interest in the operations management literature in studying the “consider-then-choose” choice models where customers first form their consideration set and then make a purchase decision from that set. Aouad et al. (2024) study the assortment optimization problem under the click-based MNL choice model where given an assortment of products, each product has an exogenous chance of being clicked and thus considered by customers. Aouad et al. (2021) develop a dynamic programming framework to study the computational aspects of assortment optimization under a consider-then-choose model. Wang (2022) and Bai et al. (2024) study price and assortment optimization in settings where products with utilities above a certain threshold will comprise the customer’s consideration set. Farzaneh et al. (2023) study joint assortment and pricing under a feature-based consideration set choice model. Gallego and Li (2024) and Jagabathula et al. (2024) study demand estimation under uncertain (random) consideration sets. Our work mainly differs from such papers in that they focus on consideration set formation while we model the customer click and purchase behavior in a setting with search costs.

3. Data

The data sets available to us were provided by the MSOM society (Shen et al. 2020). These data sets provide transaction level data from JD.com which is one of the two largest Chinese online retailers. These data sets capture a “full customer experience cycle” that begins at the moment a customer browses through the products available on the platform before placing her order and ends at the moment the customer receives the product at her designated location. The data sets provide information on close to 400,000 customers and 30,000 SKUs during the month of March in 2018.

In this paper we focus on the click and order data. In particular, we study the click browsing history of customers and investigate the connection between such clicks and the purchase decisions of customers. We make several observations about the customers’ click behavior which inform the model that we develop in Section 4. We also have information on customer and product attributes (features) which we use in our model to capture the heterogeneity of customers and products.

The click data set contains close to 6 million customer-tied click records. Each entry in the click data set represents a user’s “click event” on a specific SKU’s product page. By visiting a product page, customers can review the detailed description of that product. Each click entry includes the user and SKU IDs, so we can track a user’s browsing history. We also have the time-stamps associated with each click, which allow us to construct a “click sequence”; that is, the sequence in which a customer browses different SKUs. The click data contains clicks made not only by users who have made a purchase but also by users who did not end up purchasing any products. In fact, only 15% of users in the click data have placed an order. The order data set contains more than half a million order records that are associated with orders of close to half a million customers and more than 9,000 SKUs. The key information in each order entry includes the order, user, and SKU IDs, the order time, and the SKU’s price and discount (if any).
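Constructing click sequences from time-stamped click records amounts to grouping by user and sorting by time. The sketch below assumes each record is a tuple of user ID, SKU ID, and an ISO-formatted time-stamp string. This record layout and the function name `build_click_sequences` are hypothetical and do not reflect the column names of the actual data set.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (user_id, sku_id, ISO-formatted time-stamp).
ClickRecord = Tuple[str, str, str]


def build_click_sequences(records: Iterable[ClickRecord]) -> Dict[str, List[str]]:
    """Group click records by user and order each user's clicks by time."""
    events_by_user = defaultdict(list)
    for user_id, sku_id, timestamp in records:
        events_by_user[user_id].append(
            (datetime.fromisoformat(timestamp), sku_id))

    sequences = {}
    for user_id, events in events_by_user.items():
        # Stable sort on the time-stamp only, so ties keep their input order.
        events.sort(key=lambda event: event[0])
        sequences[user_id] = [sku_id for _, sku_id in events]
    return sequences
```

Repeated clicks on the same SKU are kept in the sequence, so the result records the full browsing order rather than the set of unique products.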

The available data also contains information on customer and product attributes. The key information on products includes the SKU ID, product’s type (1 for JD-owned products or 2 for third-party merchants), and two numerical attributes (the specifics of these attributes are not disclosed in the data set). The first attribute takes a value in $\{1, 2, 3, 4\}$ and the second attribute takes a value in $\{30, 40, 50, \dots, 100\}$. In our analysis, we use dummy indicator variables to capture the information from these two product attributes. We also extracted the product prices from the order data. The key information on customers includes the user ID, and demographic information including gender, education, age group, marital status, and location (labeled as “city level” in the data). In addition, we can observe the plus membership status of customers (which is similar to Prime membership on Amazon), as well as two customer features created by JD.com which measure the purchase power and past purchase value of the customer (labeled as “user level” in the data).
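The dummy indicator variables for the two product attributes can be built as follows. This is a sketch under one assumption that the text does not state: the first level of each attribute is dropped as the baseline category. The names `dummy_encode` and `encode_product` are hypothetical.

```python
ATTRIBUTE_1_LEVELS = [1, 2, 3, 4]
ATTRIBUTE_2_LEVELS = list(range(30, 101, 10))  # 30, 40, ..., 100


def dummy_encode(value, levels):
    """Return one 0/1 indicator per level, skipping the first level,
    which serves as the baseline (an assumption of this sketch)."""
    if value not in levels:
        raise ValueError(f"unexpected attribute value: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]


def encode_product(attribute_1, attribute_2):
    """Concatenate the indicators of the two product attributes."""
    return (dummy_encode(attribute_1, ATTRIBUTE_1_LEVELS)
            + dummy_encode(attribute_2, ATTRIBUTE_2_LEVELS))
```

With these levels, each product is described by 3 + 7 = 10 indicators, and a product at the baseline level of both attributes is encoded as all zeros.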

Table 2 provides a few summary statistics for the distribution of the number of unique SKUs clicked by each user, after removing the outliers. As can be noted from Table 2, the average number of clicked products is 2.75, while the median is 2 and the maximum is 17. This shows that the distribution of the number of clicks is positively skewed: while the majority of customers do not click on more than a few products, some users engage in a lengthier search process. The average number of clicks is small in comparison to the average number of products in the assortments displayed to customers (e.g., 25–30), which suggests that search is costly for customers. Thus, the model we develop in Section 4 incorporates customer search costs.

As mentioned before, we are interested in studying the click sequence of users to investigate the connection between the click events and the purchase decisions of users. In order to establish such connections, we need to first define the click-rank of products.

Table 2.

Summary statistics for the click data across users.

	Average	Std. deviation	Median	90th percentile	Max
Number of unique SKUs clicked	2.75	2.82	2.00	6.00	17

Definition 1 (Click-Rank)

Given a customer’s click history, the “click-rank” of a product is the index of the first time that the product appears in the customer’s click sequence.
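As a minimal sketch of Definition 1, the function below reads "index" as the 1-based position in the raw click sequence, with repeated clicks included. An alternative reading that ranks products among unique SKUs only would give different values whenever a product is revisited. The function name `click_ranks` is hypothetical.

```python
from typing import Dict, List


def click_ranks(click_sequence: List[str]) -> Dict[str, int]:
    """Map each SKU to the 1-based position of its first appearance
    in the click sequence."""
    ranks: Dict[str, int] = {}
    for position, sku in enumerate(click_sequence, start=1):
        # setdefault keeps the first position and ignores later revisits.
        ranks.setdefault(sku, position)
    return ranks
```

For the sequence A, B, A, C, this reading assigns click-ranks 1, 2, and 4 to A, B, and C, respectively.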

Following this definition, in Figure 2 we illustrate the product sales as a function of the click-rank of SKUs among users who have clicked on 2 or more SKUs. For example, the first bar on the left side of Figure 2 shows the number of sold items that were the first product clicked by customers (i.e., click-rank=1). As can be seen in Figure 2, the product sales are decreasing in the click-rank of products; in other words, products with lower click-ranks seem to have a higher chance of being purchased. This observation indicates that the sequence of products that a customer browses (clicks) might indicate the order of attractiveness of those products to that customer. Therefore, the model that we develop in Section 4 should also account for the click-rank of products during the customer search process.
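A tally of the kind plotted in Figure 2 can be sketched as follows. The input format (a click sequence paired with the purchased SKU, or `None` when no purchase was made) and the function name `sales_by_click_rank` are hypothetical. The click-rank is taken as the 1-based position of the product's first appearance in the sequence.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical session layout: (click sequence, purchased SKU or None).
Session = Tuple[List[str], Optional[str]]


def sales_by_click_rank(sessions: Iterable[Session],
                        min_unique_skus: int = 2) -> Dict[int, int]:
    """Count purchases by the click-rank of the purchased product,
    restricted to users who clicked at least `min_unique_skus` SKUs."""
    sales = Counter()
    for click_sequence, purchased_sku in sessions:
        if len(set(click_sequence)) < min_unique_skus:
            continue  # user browsed too few distinct products
        if purchased_sku is None or purchased_sku not in click_sequence:
            continue  # no purchase, or purchase not tied to a click
        # list.index returns the first occurrence, i.e. the click-rank minus 1.
        sales[click_sequence.index(purchased_sku) + 1] += 1
    return dict(sorted(sales.items()))
```

Plotting the resulting counts against the click-rank gives a bar chart with the same axes as Figure 2.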

Figure 2.

Plot of the number of purchases (sales) as a function of the click-rank of SKUs for click-ranks 1 through 20 among users with 2 or more clicked SKUs.

3.1. Motivation for the Structural Estimation Approach

In what follows we motivate the need for a structural model in our setting. Any model for customer search (click) and purchase behavior needs to control for the offered assortments to capture the substitution patterns between products. For example, low sales of a product could be due to it being offered with a superior product in the assortment rather than the intrinsic unattractiveness of that product. The model also needs to control for the clicked products to capture the customers’ search behavior and cost. For example, a customer may not make a purchase from the offered assortment, but click on one of the products. Accounting for such a click is important as it could indicate the relative preference of the customer for that product over the other products in the assortment. Controlling for the substitution patterns and clicks efficiently is not possible using a reduced-form model for the following reasons:

Given that the substitution pattern between products depends on all products in an offered assortment, reduced-form models cannot effectively capture the substitution patterns as these models need to control for each product and each possible assortment using separate sets of coefficients. This makes the estimation procedure computationally intractable given that the number of possible assortments grows exponentially in the number of products.

If a particular assortment or click sequence has not been observed in the data, reduced-form models cannot make predictions or support counterfactual analyses for it.

The key idea of our paper is to disentangle the observed and unobserved product utilities to estimate the product attractiveness before and after the click. Such disentanglement, however, is not possible using a reduced-form approach that does not model the customer click and purchase behavior separately, and only tries to map product features to sales.

Thus, we propose a structural approach to address the issues above. As motivated by Figure 2, the click-rank of products should also be accounted for when estimating the customer search model. Before presenting our structural model in Section 4, we here provide reduced-form evidence to further justify why the sequence of clicks, which is a novel feature of our model, matters in understanding the customer click and purchase behavior. For this reduced-form analysis, we focus on the set of 126 substitutable products that we identify in Appendix EC.2 using a data-driven approach (please refer to that appendix for more details). We consider customers who have exclusively clicked on these products and run a logistic regression on their click and order data. Each row of this data indicates a customer’s click on a particular product. We use the binary purchase outcome as the response variable and the click-rank and product features as the independent variables. The logistic regression results are shown in Table 3. Except for the click-rank and price, the other product features are indicator variables capturing the categorical information from the two available product attributes.

Table 3.
Logistic regression results for predicting the purchase outcome as a function of the click-rank and features of a product.

	Coefficient
Click-Rank	−0.0895^***
Price	−0.0077^***
Att1_2	0.2671^*
Att1_3	0.3796^***
Att1_4	0.1409
Att2_40	−0.5998^***
Att2_50	−0.1364^***
Att2_60	−0.2540^***
Att2_70	−0.5129^**
Att2_80	−0.5127^***
Att2_90	−0.3231^***
Att2_100	−0.5457^***
Constant	−1.8140^***
Log-Likelihood	−127,060
Observations	493,094

$^{*}\, p < 0.10$, $^{**}\, p < 0.05$, $^{***}\, p < 0.01$.
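To make the magnitude of the click-rank coefficient in Table 3 concrete, the short sketch below converts it from the log-odds scale to an odds ratio. It is only an illustration of how a logistic regression coefficient is read; the helper name `odds_ratio` is hypothetical, and the only number taken from the paper is the coefficient itself.

```python
import math

# Click-rank coefficient reported in Table 3 (log-odds scale).
CLICK_RANK_COEF = -0.0895


def odds_ratio(coef: float, delta: float = 1.0) -> float:
    """Multiplicative change in the purchase odds when a regressor rises by `delta`.

    Hypothetical helper: in a logistic regression, a coefficient `coef`
    multiplies the odds by exp(coef * delta), all else held equal.
    """
    return math.exp(coef * delta)


per_step = odds_ratio(CLICK_RANK_COEF)        # one position later in the click sequence
five_steps = odds_ratio(CLICK_RANK_COEF, 5)   # five positions later

print(f"odds ratio per click-rank step: {per_step:.3f}")
print(f"odds ratio over five steps:     {five_steps:.3f}")
```

Holding the other regressors fixed, each one-position increase in click-rank multiplies the purchase odds by roughly 0.914, that is, it lowers them by about 8.6%.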

As can be seen from Table 3, the click-rank of a product is statistically significant in predicting the purchase outcome for that product. Moreover, its coefficient has a negative sign, indicating that a lower click-rank implies a higher purchase probability. These results further emphasize that our structural model should account for the click-rank of products during the search process; we discuss the role of click-rank in the customer click and purchase behavior in more detail at the end of Section 4.4 and in Proposition 1. The next section presents our customer click and purchase model.

4. Customer Click and Purchase Model

In this section we present a structural model for customer click and purchase behavior which is informed by our observations from the data, as discussed in Section 3. Our model is a sequential discrete choice model where customers repeatedly decide between purchasing, clicking, or leaving the platform without a purchase. We first discuss the model preliminaries and basic notation in Section 4.1. We formally introduce the notion of observed and unobserved utilities in Section 4.2. We then introduce the customer choice model in Section 4.3, and discuss the choice probabilities in Section 4.4. We provide a discussion about some key aspects of the model in Section 4.5.

4.1. Model Preliminaries

We denote by $i \in \{1, \dots, N\}$ the index of customers arriving at the website, where $N$ denotes the total number of customers in the data. Suppose that customer $i$ is presented an assortment $A_{i}$ from the full set of $M$ available substitutable products at the beginning of her search. We index the products in the assortment by $A_{i} = \{1, \dots, J_{i}\}$, where $J_{i}$ indicates the size of assortment $A_{i}$.

Suppose that customer $i$ has clicked on $k$ products in $A_{i}$. We denote the customer’s click history after her $k^{th}$ click by $C_{i, k}$. We also denote by $\bar{C_{i, k}} = A_{i} \setminus C_{i, k}$ the set of products that the customer has not yet clicked on after her $k^{th}$ click. Note that at the beginning of the search, we have $k = 0$, $C_{i, k} = \emptyset$, and $\bar{C_{i, k}} = A_{i}$. We make the following assumption inspired by our data:

Assumption 1
Customers do not purchase a product that they have not clicked on.

Assumption 1 implies that if, after clicking on $k$ items, the customer decides to purchase a product, that product will belong to the set $C_{i, k}$. This assumption is supported by the JD.com data, as only 5% of customers who place an order purchase a product they have not clicked on. We exclude such customers from our analysis.

4.2. Observed and Unobserved Utilities

The key idea of the article is to disentangle and separately estimate the attractiveness of a product to a customer before and after the click. Specifically, when customers search for a product, only a subset of product features (information) is observable on the search page. Such observable information drives the attractiveness (utility) of the product to the customer on the search page, before the click. We thus refer to the pre-click utility of the product as the observed utility. However, such observed information is usually not enough for customers to make their purchase decisions. Thus, customers incur a search cost and click on a product to investigate the product page of that item, which provides much more detailed information about the product. Such information can change the utility of the product to the customer. We refer to the post-click utility of a product as the total utility and call the change in product’s utility after the click the unobserved utility. In other words, the product page contains information that is not observable on the search page and thus, can change the utility of the product to the customer after the click.

To formalize the notions above, let $x_{j} \in R^{1 \times l}$ denote the vector of observed characteristics of product $j$ on the search page, with $l$ denoting the number of product attributes observable on the search page. Such observed characteristics might include price, brand, and average customer reviews. Also, let $α_{i} \in R^{1 \times l}$ denote the pre-click preference vector of customer $i$ , which is a function of customer $i$ ’s attributes—we discuss the functional form of $α_{i}$ in Section 5.1 when discussing the estimation strategy. The (pre-click) observed utility of product $j$ for customer $i$ is then given by:

$$α_{i} x_{j}^{T}. \tag{1}$$

The observed utility $α_{i} x_{j}^{T}$ represents customer $i$’s utility for the observed features $x_{j}$, which are visible on the search page. The observed utilities are product- and customer-specific as they are a function of the product and customer attributes through $x_{j}$ and $α_{i}$, respectively.

If customer $i$ decides to click on product $j$ and investigate its product page, she incurs a search cost $c_{i}$ to realize the unobserved utility of the product through the information that is only available on the product page. Specifically, the (post-click) unobserved utility of product $j$ for customer $i$ is given by:

$$δ_{i} x_{j}^{T} + ξ_{j}. \tag{2}$$

The first term $δ_{i} x_{j}^{T}$ in (2) represents the change in customer $i$’s utility for the observed features $x_{j}$ after the click, where $δ_{i} \in R^{1 \times l}$. Given that the product page contains detailed information about the product that is only observable to the customer after the click, the customer’s preference for the observed features $x_{j}$ might change after the click¹. For example, the product’s average customer review displayed on the search page may be high, but the customer’s evaluation of the product changes (decreases) after reading the detailed customer reviews on the product page. We refer to $δ_{i} x_{j}^{T}$ as the change-in-observed-utility. The second term $ξ_{j}$ in (2) captures the utility of the customer for product $j$ based on the information that is revealed after clicking on the product and exploring the product page. Specifically, the product page contains detailed information about the product and possibly the seller. Accounting for all such information is not possible as the information on the product page might even be unstructured and hard to capture with data. For example, there might be several photos or videos, as well as detailed customer reviews in the form of texts, Q&A, and pictures. Thus, the term $ξ_{j}$ controls for all such information on the product page that is visible only to the customer but not the researcher. We refer to $ξ_{j}$ as the product-specific unobserved utility. We note that the unobserved utilities $δ_{i} x_{j}^{T} + ξ_{j}$ are both customer and product dependent: similar to the observed utility, the term $δ_{i} x_{j}^{T}$ is both product- and customer-specific through $x_{j}$ and $δ_{i}$, respectively; however, $ξ_{j}$ is only product-specific and acts as a product “fixed effect.”

The customer’s final purchase decision will be based on all the information she has collected about a product. We call this the total utility of the product and define it as follows:

$$β_{i} x_{j}^{T} + ξ_{j}, \tag{3}$$

where $β_{i} := α_{i} + δ_{i}$, and $β_{i} \in R^{1 \times l}$. Here, $β_{i} x_{j}^{T}$ represents customer $i$’s utility for the observed features $x_{j}$ after the click. Because we only observe the click and purchase events in the data, for identification purposes, we will estimate $α_{i}$ and $β_{i}$. This allows us to estimate the change-in-observed-utility coefficient $δ_{i}$ by calculating $β_{i} - α_{i}$. In the next section we formalize the customer choice model.
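The relationship between the three utilities in (1)–(3) can be illustrated with a small numeric sketch. All numbers and function names below are hypothetical and chosen only to show a product whose pre-click utility is low while its post-click utility is high; they are not estimates from the paper.

```python
def dot(weights, features):
    """Inner product of a preference vector and a product feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def observed_utility(alpha_i, x_j):
    """Pre-click (observed) utility, equation (1): alpha_i x_j^T."""
    return dot(alpha_i, x_j)


def unobserved_utility(delta_i, x_j, xi_j):
    """Change in utility revealed by the click, equation (2): delta_i x_j^T + xi_j."""
    return dot(delta_i, x_j) + xi_j


def total_utility(beta_i, x_j, xi_j):
    """Post-click (total) utility, equation (3): beta_i x_j^T + xi_j."""
    return dot(beta_i, x_j) + xi_j


# Hypothetical values for one customer i and one product j with l = 2 features.
alpha_i = [-0.5, 0.2]                                # pre-click preferences
beta_i = [-0.3, 0.6]                                 # post-click preferences
delta_i = [b - a for a, b in zip(alpha_i, beta_i)]   # delta_i = beta_i - alpha_i
x_j = [1.0, 2.0]                                     # features shown on the search page
xi_j = 0.8                                           # product-specific unobserved utility

pre_click = observed_utility(alpha_i, x_j)
revealed = unobserved_utility(delta_i, x_j, xi_j)
post_click = total_utility(beta_i, x_j, xi_j)

print(pre_click, revealed, post_click)
```

By construction, the observed utility plus the unobserved utility equals the total utility, which is why estimating $α_{i}$ and $β_{i}$ is enough to recover $δ_{i}$.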

4.3. Customer Actions

Our model is a sequential discrete choice model where customers repeatedly decide between purchasing, clicking, or leaving the platform without a purchase. We assume that customers are utility-maximizers and choose the option with the highest utility at any stage of the search. Specifically, the customer assigns a utility to each option and selects the one with the highest utility.

Let $a_{i, k}$ denote the action of customer $i$ after her $k^{th}$ click. The customer has three possible actions: (i) exiting the system without purchasing a product; (ii) purchasing a product from the set of clicked products in $C_{i, k}$ and then exiting the system; (iii) continuing to search by clicking on one of the unexplored products in $\bar{C_{i, k}}$. We next discuss the details of the customer actions and their corresponding utilities.

Exit without Purchase: Customer $i$ can choose to end her search without purchasing any product (i.e., opting for the no-purchase option denoted by product 0). We denote this action by $a_{i, k} = \mathrm{buy}(0)$. The utility of this action is given by $u(\mathrm{buy}(0) \mid C_{i, k}) = 0 + ϵ_{i, 0, k}$, where $ϵ_{i, 0, k}$ is a random shock that is i.i.d. across customers and click epochs. As is customary for discrete choice models, we set the mean utility of this action to 0 for identification purposes.

Purchase and Exit: Customer $i$ can choose to purchase a product $j$ from the set of explored (clicked) products in $C_{i, k}$ and end the search. We denote such an action by $a_{i, k} = \mathrm{buy}(j)$ for $j \in C_{i, k}$. If customer $i$ purchases product $j \in C_{i, k}$, her utility is given by

$$u(\mathrm{buy}(j) \mid C_{i, k}) = β_{i} x_{j}^{T} + ξ_{j} + ϵ_{i, j, k}^{p}, \quad j \in C_{i, k}. \tag{4}$$

Recall that $β_{i} x_{j}^{T} + ξ_{j}$ denotes the expected total utility of product $j$ for customer $i$ after clicking on product $j$ and investigating its product page. The variable $ϵ_{i, j, k}^{p}$ denotes the idiosyncratic random shock of customer $i$ for purchasing product $j$ after her $k^{th}$ click (the superscript $p$ stands for “purchase”). Consistent with the extant discrete choice literature (Nair 2007), we assume that the idiosyncratic shocks are i.i.d. across customers, products, and click epochs. We also assume that the idiosyncratic shocks are known to the customer before her action, which is also consistent with the literature.

Click: Customer $i$ can choose to continue her search by clicking on one of the unexplored products $j$ from the set $\bar{C_{i, k}}$. We denote this action by $a_{i, k} = \mathrm{click}(j)$ for $j \in \bar{C_{i, k}}$. As discussed in the previous section, the unobserved utility of a product is a function of $ξ_{j}$, which accounts for all the information that is only visible to the customer after the click (on the product page). However, the customer does not know $ξ_{j}$ before the click. We assume that all customers have a common belief $μ$ about $ξ_{j}$ before clicking on the product. Customers then follow a one-step lookahead policy to decide their click strategy (we elaborate on this modeling choice in Section 4.5). Specifically, we assume that the customer’s utility from clicking on product $j \in \bar{C_{i, k}}$ is equal to her expected maximum utility if she were to click on product $j$, and then stop her search and purchase the option with the highest utility among the clicked products (including the no-purchase alternative). In other words, the customer’s expected utility from a click is given by her expected maximum utility from a subsequent purchase/no-purchase decision, if she were to make one additional click and stop her search afterwards.

Suppose that customer $i$ has already clicked on products in the set $C_{i, k}$ . Let $V_{C_{i, k}; j}$ denote the expected click utility of customer $i$ if she decides to click on product $j \in \bar{C_{i, k}}$ . Following Ben-Akiva and Lerman (1985) and assuming the type-I extreme value distribution for the idiosyncratic random shocks ( $ϵ_{i, j, k}^{p}$ and $ϵ_{i, 0, k}$ ), the expected click utility $V_{C_{i, k}; j}$ is thus given by:

$$V_{C_{i, k}; j} = \ln \Big( 1 + \sum_{l \in C_{i, k}} \exp (β_{i} x_{l}^{T} + ξ_{l}) + \exp (α_{i} x_{j}^{T} + μ) \Big). \tag{5}$$

Note that the terms inside the logarithmic function above correspond to the attractiveness (i.e., exponentiated-mean-utility) of different options available to the customer after clicking on product $j$. Specifically, 1 is the attractiveness of the no-purchase option, the second term is the sum of the attractiveness of purchasing one of the clicked items, and the last term is the attractiveness of purchasing product $j$. It is important to note that the customer has to calculate $V_{C_{i, k}; j}$ before clicking on product $j$, thus she uses her beliefs $α_{i} x_{j}^{T}$ and $μ$ about the unknown (unobserved) values $β_{i} x_{j}^{T}$ and $ξ_{j}$, respectively, to calculate $V_{C_{i, k}; j}$. We thus have that customer $i$’s utility from clicking on product $j \in \bar{C_{i, k}}$ is given by:

$$u(\mathrm{click}(j) \mid C_{i, k}) = - c_{i} + V_{C_{i, k}; j} + ϵ_{i, j, k}^{c}, \quad j \in \bar{C_{i, k}}, \tag{6}$$

where $c_{i}$ denotes the search cost of customer $i$, and $V_{C_{i, k}; j}$ is customer $i$’s expected utility from clicking on product $j$ as defined in (5). Finally, $ϵ_{i, j, k}^{c}$ denotes the random shock from clicking on product $j$ (the superscript $c$ stands for “click”). We assume that $ϵ_{i, j, k}^{c}$ are i.i.d. across customers, products, and click epochs.
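As a minimal sketch of equations (5) and (6), the snippet below computes the expected click utility and the deterministic part of the click utility for one unexplored product. The function and argument names are hypothetical, and the inputs are toy values rather than estimates.

```python
import math


def expected_click_utility(clicked_total_utils, observed_util_j, mu):
    """Equation (5): log of the summed attractiveness after one more click.

    clicked_total_utils: total utilities beta_i x_l^T + xi_l of already-clicked products.
    observed_util_j:     observed utility alpha_i x_j^T of the candidate product j.
    mu:                  common prior belief about the unknown xi_j.
    """
    no_purchase = 1.0  # exp(0), attractiveness of the no-purchase option
    clicked = sum(math.exp(u) for u in clicked_total_utils)
    candidate = math.exp(observed_util_j + mu)  # believed attractiveness of product j
    return math.log(no_purchase + clicked + candidate)


def mean_click_utility(search_cost, clicked_total_utils, observed_util_j, mu):
    """Deterministic part of equation (6): -c_i + V, before adding the click shock."""
    return -search_cost + expected_click_utility(clicked_total_utils, observed_util_j, mu)


# Toy case: nothing clicked yet, observed utility 0 and belief mu = 0.
v_first_click = expected_click_utility([], observed_util_j=0.0, mu=0.0)
u_first_click = mean_click_utility(0.3, [], observed_util_j=0.0, mu=0.0)
print(v_first_click, u_first_click)
```

In this toy case the expected click utility equals $\ln 2$, since only the no-purchase option and the candidate product (each with attractiveness 1) enter the logarithm.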

4.4. Customer Choice Probabilities

Customers are utility maximizers and choose the action that provides them with the highest utility. As a result, the optimal action of customer $i$ after her $k^{t h}$ click denoted by ${\tilde{a}}_{i, k}$ is given by

$${\tilde{a}}_{i, k} = \operatorname*{argmax}_{a_{i, k} \in \{\mathrm{buy}(0),\ \mathrm{buy}(j)_{j \in C_{i, k}},\ \mathrm{click}(j)_{j \in \bar{C_{i, k}}}\}} u(a_{i, k} \mid C_{i, k}). \tag{7}$$

We assume that the idiosyncratic shocks for all actions are i.i.d. and have type-I extreme value distributions with zero mean. Consequently, the probability of choosing action $a_{i, k}$ by customer $i$, denoted by $P_{i}(a_{i, k} \mid C_{i, k})$, is a logistic function and given by

$$P_{i}(a_{i, k} \mid C_{i, k}) = \frac{\exp \big( E(u(a_{i, k} \mid C_{i, k})) \big)}{1 + \sum_{l \in \bar{C_{i, k}}} \exp (- c_{i} + V_{C_{i, k}; l}) + \sum_{l \in C_{i, k}} \exp (β_{i} x_{l}^{T} + ξ_{l})}, \tag{8}$$

where the expectation in the numerator of (8) is taken over the distribution of the random shocks $ϵ$. Hence, we have $E(u(\mathrm{click}(j) \mid C_{i, k})) = - c_{i} + V_{C_{i, k}; j}$, $E(u(\mathrm{buy}(j) \mid C_{i, k})) = β_{i} x_{j}^{T} + ξ_{j}$, and $E(u(\mathrm{buy}(0) \mid C_{i, k})) = 0$.
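The choice probabilities in (8) can be sketched as a softmax over the mean utilities of all feasible actions. The snippet below is an illustrative implementation under that reading; the function name, the dictionary-based inputs, and the toy numbers are all assumptions made for the example.

```python
import math


def choice_probabilities(clicked, unclicked, search_cost, mu):
    """Action probabilities after the current click history, following equation (8).

    clicked:   dict product -> total utility beta_i x_l^T + xi_l (already explored).
    unclicked: dict product -> observed utility alpha_i x_l^T (not yet explored).
    Returns a dict keyed by ("buy", j) or ("click", j); product 0 is no purchase.
    """
    clicked_attr = sum(math.exp(u) for u in clicked.values())
    weights = {("buy", 0): 1.0}  # exp(0) for the no-purchase option
    for j, total_u in clicked.items():
        weights[("buy", j)] = math.exp(total_u)
    for j, observed_u in unclicked.items():
        # Expected click utility from equation (5), using the belief mu for xi_j.
        v = math.log(1.0 + clicked_attr + math.exp(observed_u + mu))
        weights[("click", j)] = math.exp(-search_cost + v)
    denom = sum(weights.values())
    return {action: w / denom for action, w in weights.items()}


# Toy state: product 1 already clicked; products 2 and 3 still unexplored.
example = choice_probabilities(
    clicked={1: 0.5},
    unclicked={2: 0.0, 3: 1.0},
    search_cost=0.3,
    mu=0.1,
)
for action, prob in sorted(example.items()):
    print(action, round(prob, 4))
```

The probabilities sum to one over the no-purchase, purchase, and click actions, and in this toy state the unexplored product with the higher observed utility receives the higher click probability.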

As discussed in Section 3 and illustrated in Figure 2 and Table 3, the click-rank of products seems to provide insights about the order of attractiveness of products to customers. Thus, we propose a sequential search model in which the order of clicks matters in the customer search process. Specifically, the available information on the search page (i.e., the observed utilities) is the main driver of the customers’ click decisions. In other words, a product with a higher observed utility should have a higher probability of being clicked. This is formalized in the following proposition.

Proposition 1

The click probability of a product is strictly increasing in its observed utility. That is, $P_{i}(\mathrm{click}(j) \mid C_{i, k})$ is strictly increasing in $α_{i} x_{j}^{T}$, for any customer $i$, product $j \in \bar{C_{i, k}}$, and any click history $C_{i, k}$.

Proposition 1 shows that the click-rank of products for a customer should be, on average, in the order of the products’ attractiveness on the search page (i.e., the observed utilities $α_{i} x_{j}^{T}$ ). That is, a product with a higher observed utility should have a lower click-rank on average. Thus, the click-rank of products impacts the customer choice probabilities, and helps with identifying the customer preference vector $α_{i}$ . Proposition 1, however, does not imply that the purchase probability of products is necessarily in the order of click-ranks. This is because the unobserved information on the product page (i.e., $δ_{i} x_{j}^{T} + ξ_{j}$ ) can change the customer’s utility for the product after the click. Hence, by separately estimating the products’ attractiveness before and after the click, one can measure how the utility of products changes for customers after the click. This, in turn, allows us to identify underrated and overrated products as we illustrate in Section 6.2.
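The monotonicity in Proposition 1 can be seen directly from (5) and (8). The following is a short sketch written from those two equations, not a reproduction of the paper's proof. Write $v_{j} := α_{i} x_{j}^{T}$, hold the utilities of all other products fixed, and let $S$ collect the terms in the denominator of (8) that do not involve product $j$:

$$
\begin{aligned}
\frac{\partial V_{C_{i, k}; j}}{\partial v_{j}} &= \frac{\exp (v_{j} + μ)}{1 + \sum_{l \in C_{i, k}} \exp (β_{i} x_{l}^{T} + ξ_{l}) + \exp (v_{j} + μ)} > 0, \\
P_{i}(\mathrm{click}(j) \mid C_{i, k}) &= \frac{\exp (- c_{i} + V_{C_{i, k}; j})}{S + \exp (- c_{i} + V_{C_{i, k}; j})}, \\
S &:= 1 + \sum_{l \in \bar{C_{i, k}},\, l \neq j} \exp (- c_{i} + V_{C_{i, k}; l}) + \sum_{l \in C_{i, k}} \exp (β_{i} x_{l}^{T} + ξ_{l}).
\end{aligned}
$$

Because $S > 0$ does not depend on $v_{j}$ and the map $t \mapsto t / (S + t)$ is strictly increasing for $t > 0$, the click probability is strictly increasing in $V_{C_{i, k}; j}$, and hence in $v_{j}$.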

4.5. Model Discussion

We assume that customers follow a one-step lookahead search policy to decide their click strategy. This policy is a simplification of a search policy where customers solve a discrete-time dynamic program to calculate their expected click utilities. In that dynamic program, $V_{C_{i, k}; j}$ represents the customer’s expected future utility from clicking on product $j$ , assuming that the customer gives herself the option of continuing her search after clicking on product $j$ by potentially clicking on all unexplored items. We favor the one-step lookahead search policy over the dynamic program for the following reasons:

Search Costs and Click Behavior: In our model, customers face search costs, meaning that every additional click incurs a cost, which discourages customers from exploring the entire set of available products. Given that most customers typically click on only a few products before making a decision (as evident from Table 2), a one-step lookahead search strategy provides a more practical approach for modeling the customer search behavior compared to the full dynamic program, which would assume that customers fully optimize over all future clicks.

Diminishing Marginal Returns: Customers often experience diminishing marginal returns as they explore more products, which is reflected in the decreasing likelihood of purchase as click-rank increases (as shown in our reduced-form findings in Table 3, and Figure 2). This suggests that the additional value of future clicks becomes less significant after each successive click. Thus, optimizing for one additional click, rather than over a full sequence, is sufficient for capturing most of the relevant search behavior.

Computational Tractability: The one-step lookahead search policy makes the search model more tractable for estimation and for running counterfactuals, because we avoid the curse of dimensionality involved in evaluating or approximating the dynamic program’s value functions, which is computationally very burdensome, especially as the number of products and the degree of customer heterogeneity increase.

Consistency with Literature: This one-step lookahead model is consistent with the “reservation utility” search model of Weitzman (1979) that has been widely used in the marketing literature (Chen and Yao 2017; Ursu 2018). In that model, customers calculate the reservation utility of an unsearched product by equating the marginal gain from searching that product with the marginal cost. That is, the click utility of a product is given by the marginal gain in utility if the customer decides to click on the product and collects more information from the product page. The strength of our model compared to that of Weitzman (1979) is that it allows us to disentangle and separately estimate the product utilities before and after the click, whereas in the reservation utility model, the distribution of unobserved utilities is assumed to be known to the customer and i.i.d. (standard normal) across customers and products (Chen and Yao 2017; Ursu 2018; Weitzman 1979).

We also note that the inclusion of the term $ξ_{j}$ in the unobserved utility is similar to the setup of a choice model with product fixed effect (Moon et al. 2018). However, the customer does not know $ξ_{j}$ before clicking on product $j$ . As discussed in Section 4.3, we assume that all customers have a common belief $μ$ as to what the unknown $ξ_{j}$ would be if they were to click on the product. Thus, even though $ξ_{j}$ acts as a product fixed effect, it only impacts the customer’s purchase decision, but not the click decision. Thus, the overall impact of $ξ_{j}$ on the customer’s choice is different from that in an MNL or a BLP model with product fixed effects.

Finally, we assume the customer’s pre- and post-click preference vectors $α_{i}$ and $β_{i}$ to be independent of the customer’s click history. That is, each customer $i$ ’s preferences for product features do not change during the search (click) process. Indeed, a more general model could control for such click-history-dependent patterns; however, that would make the customer choice model much more complex and potentially intractable. In addition, a much richer dataset is needed for developing such a model. For example, the dataset needs to capture multiple customer visits (purchases) as well as detailed information about how customers evaluate different product features during the search process. In the case of our data, given that the majority of customers make only a few clicks (as shown in Table 2), the assumption of click-history-independent preference vectors seems reasonable.

5. Estimation

In this section, we discuss our estimation strategy and results. Specifically, in Section 5.1 we discuss the details of our estimation procedure. We take a data-driven approach to select a representative (random) sample of the whole data to estimate our model. The summary of our sampling approach is discussed in Section 5.2. The estimation results are presented in Section 5.3.

5.1. Estimation Strategy

In this section we lay out a maximum likelihood procedure to estimate the parameters of the model. To account for customer heterogeneity, we assume that the preference vector of customers ( $α_{i}$ and $β_{i}$ ) and their search cost ( $c_{i}$ ) may differ based on the customer attributes. In our data set, the customer information is comprised of their demographic features as discussed in Section 3. Let $D_{i} \in R^{1 \times m}$ denote the vector of customer demographics, where $m$ denotes the number of customer demographic attributes. We assume that the parameters $α_{i}$ , $β_{i}$ , and $c_{i}$ are linear functions of $D_{i}$ :

$$\begin{aligned} α_{i} & = π_{α}^{0} + D_{i} π_{α}, \\ β_{i} & = π_{β}^{0} + D_{i} π_{β}, \\ c_{i} & = π_{c}^{0} + D_{i} π_{c}, \end{aligned} \tag{9}$$

where $π_{α}^{0} \in R^{1 \times l}$, $π_{α} \in R^{m \times l}$, $π_{β}^{0} \in R^{1 \times l}$, $π_{β} \in R^{m \times l}$, $π_{c}^{0} \in R$, $π_{c} \in R^{m \times 1}$, and recall that $l$ denotes the number of product attributes. The assumption of linear dependence between the model parameters above ($α_{i}, β_{i}, c_{i}$) and the customer feature vector $D_{i}$ is standard in the literature; see the seminal work of Nevo (2000), as well as De los Santos et al. (2012), Honka and Chintagunta (2017), and Ursu (2018) for a few examples. Hence, the structural parameters of the model to be estimated are as follows:

The vector of customer preference parameters before the click: $Θ_{α} = [π_{α}^{0}, π_{α}]$ .

The vector of customer preference parameters after the click: $Θ_{β} = [π_{β}^{0}, π_{β}]$ .

The vector of search cost parameters: $Θ_{c} = [π_{c}^{0}, π_{c}]$ .

The vector of unobserved product-specific utilities: $Ξ = [ξ_{1}, \dots, ξ_{M}]$.

The common belief $μ \in R$ .

Suppose that the sequence of actions of customer $i$ in the data is given by $[a_{i, k}]_{k = 1, \dots, K_{i}}$, where $K_{i}$ denotes the total number of actions of customer $i$. Given that the final decision of a customer is to either purchase a product or exit the system without any purchase, the parameter $K_{i}$ is equal to the number of customer clicks plus one. In other words, the number of clicks of customer $i$ is $K_{i} - 1$. In the data, a customer clicks on at least one product, that is, $K_{i} \geq 2$. Let $Θ = [Θ_{α}, Θ_{β}, Θ_{c}]$, and let $L (Θ, Ξ, μ)$ denote the likelihood function. The maximum likelihood estimation (MLE) problem is therefore given by

$$\max_{Θ, Ξ, μ} L (Θ, Ξ, μ) = \prod_{i \in \{1, \dots, N\}} \Big( \prod_{k \in \{1, \dots, K_{i}\}} P_{i}(a_{i, k} \mid C_{i, k}) \Big), \tag{10}$$

where the term inside the parentheses above denotes the likelihood of the actions of customer $i$, and $P_{i}(a_{i, k} \mid C_{i, k})$ is defined in (8).
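To illustrate how the likelihood in (10) is assembled from the choice probabilities in (8), the sketch below evaluates the log-likelihood contribution of a single customer: one term per click, followed by one term for the final purchase or no-purchase decision. It is a simplified illustration, not the estimation code used in the paper; the function names and dictionary-based inputs are hypothetical, and utilities are passed in directly rather than built from $Θ$, $Ξ$, and customer demographics.

```python
import math


def action_weights(clicked, unclicked, search_cost, mu):
    """Unnormalized logit weights exp(E[u(a | C)]) for every feasible action, as in (8)."""
    clicked_attr = sum(math.exp(u) for u in clicked.values())
    weights = {("buy", 0): 1.0}  # no-purchase option
    for j, total_u in clicked.items():
        weights[("buy", j)] = math.exp(total_u)
    for j, observed_u in unclicked.items():
        v = math.log(1.0 + clicked_attr + math.exp(observed_u + mu))  # equation (5)
        weights[("click", j)] = math.exp(-search_cost + v)
    return weights


def customer_log_likelihood(observed, total, clicks, purchase, search_cost, mu):
    """Log of the inner product in (10) for one customer.

    observed: dict product -> alpha_i x_j^T for every product in the assortment.
    total:    dict product -> beta_i x_j^T + xi_j for every product in the assortment.
    clicks:   products in the order they were clicked.
    purchase: purchased product, or 0 for no purchase (must be 0 or a clicked
              product, in line with Assumption 1).
    """
    clicked, unclicked = {}, dict(observed)
    log_lik = 0.0
    for j in clicks:
        w = action_weights(clicked, unclicked, search_cost, mu)
        log_lik += math.log(w[("click", j)] / sum(w.values()))
        clicked[j] = total[j]  # the click reveals the total utility of product j
        del unclicked[j]
    w = action_weights(clicked, unclicked, search_cost, mu)
    log_lik += math.log(w[("buy", purchase)] / sum(w.values()))
    return log_lik


# Toy customer: one product, clicked once, then left without purchasing.
toy_ll = customer_log_likelihood(
    observed={1: 0.0}, total={1: 0.0}, clicks=[1], purchase=0,
    search_cost=0.0, mu=0.0,
)
print(toy_ll)
```

Summing such contributions over all customers gives the log of the objective in (10), which a numerical optimizer would then maximize over the model parameters.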

Let us discuss the impact of dimensionality on the estimation. Recall that $M$ denotes the number of products, $l$ denotes the number of product attributes, and $m$ denotes the length of the customer demographics vector. Given the size of the parameter vectors above, the number of elements of $Θ$ is $(m + 1) (2 l + 1)$. Because the size of vector $Ξ$ is equal to the number of SKUs ($M$), the total number of parameters to be estimated (including $μ$) is $M + (m + 1) (2 l + 1) + 1$. Thus, the number of parameters to be estimated (and roughly speaking, our estimation computational time) grows linearly in the number of products ($M$), and quadratically in $q$, where $q := \min \{l, m\}$ is the minimum of the number of customer attributes and product attributes. As we discuss next, we take a representative sample of data for estimation by taking a data-driven approach to pick a subset of products and customers from the data set. This reduces the computational time of estimating our model, while allowing us to illustrate the key takeaways and insights from our model. Given that the choice probabilities (8) are logistic functions, and the number of model parameters is polynomial in the input size, it should be feasible to estimate our model on the whole data set.
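As a worked instance of the parameter count, the snippet below evaluates $M + (m + 1)(2l + 1) + 1$. The values $M = 126$, $l = 11$, and $m = 5$ are read from the sample described in Section 5.2 and from the index ranges $f_{1}$ to $f_{11}$ and $d_{1}$ to $d_{5}$ in Section 5.3; treating them as the exact dimensions used in estimation is an assumption of this example, and the function name is hypothetical.

```python
def parameter_count(num_products: int, num_features: int, num_demographics: int) -> int:
    """Total number of parameters: M + (m + 1)(2l + 1) + 1."""
    per_vector_rows = num_demographics + 1           # intercept row plus one row per demographic
    theta_alpha = per_vector_rows * num_features     # (m + 1) * l pre-click parameters
    theta_beta = per_vector_rows * num_features      # (m + 1) * l post-click parameters
    theta_cost = per_vector_rows                     # (m + 1) search cost parameters
    xi = num_products                                # one unobserved utility per SKU
    mu = 1                                           # common belief
    return theta_alpha + theta_beta + theta_cost + xi + mu


# Assumed dimensions: M = 126 SKUs, l = 11 product features, m = 5 demographics.
total_params = parameter_count(num_products=126, num_features=11, num_demographics=5)
print(total_params)
```

Under these assumed dimensions the model has 265 parameters, of which 126 are the product-specific terms in $Ξ$.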

5.2. Estimation Sample

As discussed in the previous section, the number of parameters of the MLE problem, and consequently the complexity of the estimation problem, increases with the number of SKUs. Given that online retailers have thousands of products, scalability of the estimation problem could be of concern. However, when shopping online for a specific product (category), customers do not necessarily evaluate/search products from different categories in the same shopping visit. In other words, each customer searches within a specific product category (e.g., earbuds or laptops) comprised of substitutable products. As a result, customers’ click sequences as well as purchases will belong to a specific category of substitutable products. Hence, the retailer can estimate the parameters for each category of substitutable products in isolation. Moreover, customers might use different filtering tools on the website in order to narrow down the general assortment of products that they browse and click. The seller can leverage such information to further narrow the set of available products to a smaller subset of substitutable products that are deemed “most relevant” (e.g., products that have been clicked and/or purchased more often). Such insights on products can significantly reduce the number of products that should be considered for parameter estimation which, in turn, significantly reduces the computational burden of the MLE problem in (10).

One of the shortcomings of the provided data from JD.com is that even though all products belong to the same category, the name and details of that particular product category are not disclosed to us. Moreover, the attributes of the products in the provided data set are abstract numbers without definitions. As a result, identifying substitutable products is more cumbersome than in a setting where all products are identifiable. To address this issue, we take a data-driven approach to identify a subset of substitutable products. To do so, we first focus on products with average prices close to the median price. We then select a subset of SKUs that have frequently appeared together in customers’ click histories, which indicates that these products have a high degree of substitution. Finally, we take a random sample of customers who have exclusively clicked on these products. This results in a subset of the data with 30,000 customers and 126 SKUs. We note that in the selected sample of transactions, each customer purchases at most one product; thus, there are no multiple purchases in the selected data. Details of this data selection are discussed in Appendix EC.2. The goal here is to pick a subset of data that is representative of the entire data set to illustrate the value of our proposed framework and estimation strategy to obtain useful managerial insights.

5.3. Estimation Results

To estimate the parameters of the model, we solve the optimization problem in (10) using a MATLAB optimization package. We solve the optimization from 20 random starting points. The optimization algorithm converges to the same solution for all starting points. To calculate the standard errors, we use the non-parametric bootstrap method (see Chapter 3 of Manski and McFadden 1981). The log-likelihood of the MLE in (10) is −167,829.35.
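The non-parametric bootstrap mentioned above can be sketched generically as follows: resample customers with replacement, re-run the estimator on each resample, and take the standard deviation of the resulting estimates. The code is a minimal illustration under stated assumptions. In the paper's setting the estimator would re-solve (10) on each resample; here a hypothetical stand-in estimator (the purchase rate in a toy sample) is used so that the example runs on its own, and the number of resamples is arbitrary.

```python
import random
import statistics


def bootstrap_standard_error(customers, estimator, n_boot=200, seed=0):
    """Non-parametric bootstrap standard error of `estimator`.

    customers: list of per-customer records (the resampling unit).
    estimator: function mapping a list of records to a scalar estimate.
    """
    rng = random.Random(seed)
    n = len(customers)
    estimates = []
    for _ in range(n_boot):
        # Draw n customers with replacement and re-estimate on the resample.
        resample = [customers[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)


def purchase_rate(sample):
    """Stand-in estimator (hypothetical): share of customers who purchased."""
    return sum(sample) / len(sample)


# Toy data: 100 customers, 30 of whom purchased (1 = purchase, 0 = no purchase).
toy_customers = [1] * 30 + [0] * 70
se = bootstrap_standard_error(toy_customers, purchase_rate)
print(round(se, 4))
```

Resampling at the customer level keeps each customer's full click and purchase sequence together, which matches the unit over which the outer product in (10) is taken.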

The estimates and standard errors of the customer preference parameters before and after the click are shown in Tables 4 and 5, respectively. In these tables, the indices $f_{1}$ to $f_{11}$ label the parameters corresponding to features 1 to 11 of the SKUs, and the indices $d_{1}$ to $d_{5}$ label the parameters corresponding to demographic features $d_{1}$ to $d_{5}$ of customers—all customer and product features, except price, are indicator variables capturing the categorical attributes of customers and products; details of product and customer attributes are discussed in Appendix EC.2. For example, customer $i$ ’s pre-click sensitivity to feature $f_{2}$ of products is given by $π_{α}^{0} (2) + D_{i} π_{α} (2)$ , where $π_{α}^{0} (2)$ denotes the second element of $π_{α}^{0}$ , $π_{α} (2)$ denotes the second column of matrix $π_{α}$ (which is transposed in Table 4), and $D_{i}$ is the vector of customer $i$ ’s demographic features. As can be seen from Tables 4 and 5, most of the customer preference parameters are statistically significant, which shows that the pre- and post-click customers’ preferences for product features indeed depend on their demographic attributes. Moreover, such preferences differ across customers which indicates the value of capturing heterogeneity in the customers’ preferences.

Table 4.
The estimates and standard errors (in parentheses) for the parameters of customer preferences before the click ( $π_{α}^{0}$ , $π_{α}$ ).

		$π_{α}$
Estimates	$π_{α}^{0}$	$d_{1}$	$d_{2}$	$d_{3}$	$d_{4}$	$d_{5}$
$f_{1}$	−0.386^*** (0.032)	−0.286^** (0.084)	0.188 (0.096)	−1.351^*** (0.164)	−1.370^*** (0.392)	−0.401^*** (0.015)
$f_{2}$	−0.670^*** (0.013)	0.028^*** (0.001)	0.129^*** (0.001)	0.088^*** (0.000)	−1.554^*** (0.000)	1.098^*** (0.000)
$f_{3}$	4.345^*** (0.143)	0.855^*** (0.042)	0.788^*** (0.042)	1.139^*** (0.095)	6.950^*** (0.055)	−5.702^*** (0.106)
$f_{4}$	−2.523^*** (0.000)	0.368^*** (0.000)	0.019^*** (0.000)	−0.330^*** (0.000)	−1.954^*** (0.000)	1.784^*** (0.000)
$f_{5}$	−0.881^*** (0.003)	−0.659^*** (0.000)	−0.686^*** (0.000)	−0.475^*** (0.000)	0.079^*** (0.000)	0.079^*** (0.000)
$f_{6}$	−1.247^*** (0.000)	0.148^*** (0.000)	0.035^*** (0.000)	−0.029^*** (0.000)	−0.635^*** (0.000)	0.806^*** (0.000)
$f_{7}$	−1.153^*** (0.120)	−0.001 (0.032)	0.333^*** (0.032)	0.341^*** (0.000)	−2.185^*** (0.000)	2.332^*** (0.000)
$f_{8}$	3.599^*** (0.190)	0.356^*** (0.047)	0.780^*** (0.046)	1.077^*** (0.095)	−0.046 (0.048)	2.282^*** (0.106)
$f_{9}$	−0.695^*** (0.046)	−0.047^*** (0.013)	0.256^*** (0.013)	−0.090^*** (0.000)	−3.166^*** (0.000)	3.806^*** (0.000)
$f_{10}$	−0.291^*** (0.000)	−0.432^*** (0.000)	−4.784^*** (0.000)	−5.565^*** (0.000)	0.473^*** (0.000)	0.472^*** (0.000)
$f_{11}$	−2.489^*** (0.027)	−0.074^*** (0.006)	0.445^*** (0.006)	0.102^*** (0.001)	13.041^*** (0.012)	−13.126^*** (0.000)

^** and ^*** denote significance at the 5% and 1% levels, respectively.
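To make the reading of Table 4 concrete, the following sketch evaluates the pre-click sensitivity $π_{α}^{0} (2) + D_{i} π_{α} (2)$ for feature $f_{2}$ using the point estimates in the $f_{2}$ row. The demographic vector `D_hyp` is a purely hypothetical customer profile chosen for illustration, and the helper name `pre_click_sensitivity` is likewise an assumption.

```python
import numpy as np

# Row f_2 of Table 4: baseline pi_alpha^0(2), then the d_1..d_5 entries of pi_alpha(2).
pi_alpha0_f2 = -0.670
pi_alpha_f2 = np.array([0.028, 0.129, 0.088, -1.554, 1.098])

def pre_click_sensitivity(D_i):
    """Return pi_alpha^0(2) + D_i . pi_alpha(2) for a demographic indicator vector D_i."""
    return pi_alpha0_f2 + float(np.dot(D_i, pi_alpha_f2))

# Hypothetical customer with indicators d_1 and d_5 switched on (illustrative only).
D_hyp = np.array([1, 0, 0, 0, 1])
sensitivity_hyp = pre_click_sensitivity(D_hyp)       # -0.670 + 0.028 + 1.098

# A customer with all indicators off has the baseline sensitivity.
sensitivity_base = pre_click_sensitivity(np.zeros(5))
```

For this hypothetical profile the sensitivity to $f_{2}$ changes sign relative to the baseline, which illustrates how demographic interactions can reverse a preference.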

Table 5.

The estimates and standard errors (in parentheses) for the parameters of customer preferences after the click ( $π_{β}^{0}$ , $π_{β}$ ).

Estimates	$π_{β}^{0}$	$d_{1}$	$d_{2}$	$d_{3}$	$d_{4}$	$d_{5}$
$f_{1}$	−0.012^*** (0.000)	−0.015^** (0.015)	0.013^** (0.015)	0.002^*** (0.001)	0.001 (0.001)	−0.002^** (0.001)
$f_{2}$	−1.459^*** (0.031)	2.059^*** (0.208)	−1.544^*** (0.227)	0.039 (0.080)	0.462^*** (0.116)	0.367^** (0.109)
$f_{3}$	−1.547^*** (0.037)	−0.319^*** (0.197)	0.240^** (0.174)	0.158 (0.082)	0.572^*** (0.118)	0.093 (0.126)
$f_{4}$	−1.766^*** (0.059)	−0.419^*** (0.079)	0.011 (0.076)	0.312^** (0.092)	1.156^*** (0.157)	0.127 (0.120)
$f_{5}$	−1.506^*** (0.070)	1.283^*** (0.175)	−1.036^*** (0.235)	−0.114^* (0.098)	−1.239^*** (0.242)	0.161 (0.196)
$f_{6}$	−1.811^*** (0.052)	−0.435^*** (0.074)	0.035 (0.104)	0.454^*** (0.070)	0.090 (0.100)	−0.176 (0.084)
$f_{7}$	−1.326^*** (0.048)	0.034 (0.200)	0.256^* (0.184)	0.139^*** (0.061)	−0.254^* (0.092)	−0.049 (0.106)
$f_{8}$	−1.377^*** (0.041)	−0.074 (0.199)	−0.131 (0.194)	−0.072 (0.074)	−0.326^*** (0.092)	−0.118 (0.102)
$f_{9}$	−1.350^*** (0.037)	−0.484^*** (0.050)	0.695^*** (0.089)	0.049 (0.088)	−0.412^*** (0.099)	0.230^*** (0.082)
$f_{10}$	−0.828^*** (0.080)	−0.443^*** (0.002)	−4.794^*** (0.002)	0.896 (0.660)	0.020 (0.325)	−1.108^*** (0.151)
$f_{11}$	−1.397^*** (0.043)	0.590 (0.200)	−0.396 (0.206)	0.242^*** (0.056)	−0.424^*** (0.097)	−0.340 (0.102)

^*, ^**, and ^*** denote significance at the 10%, 5%, and 1% levels, respectively.

The estimates and standard errors for the parameters of customer search costs ( $π_{c}^{0}$ , $π_{c}$ ) are shown in Table 6. Recall that $π_{c}$ captures the sensitivity of search cost to customer demographics. As can be seen from this table, some of the search cost parameters are statistically significant, which shows that the search cost is (slightly) heterogeneous across customers. The search cost values, however, do not differ much across customer types—this is expected given the values in Table 6, and that the demographic features are dummy indicator variables. Specifically, the average search cost across customer types is 3.04 with a standard deviation of 0.005.

Table 6.

The estimates and standard errors for the parameters of customer search cost ( $π_{c}^{0}$ , $π_{c}$ ).

		Estimates	St. Err.
$π_{c}^{0}$		3.036^***	0.001
$π_{c}$	$d_{1}$	−0.052^**	0.017
	$d_{2}$	0.036	0.017
	$d_{3}$	0.001	0.002
	$d_{4}$	0.008^**	0.003
	$d_{5}$	0.004	0.003

^** and ^*** denote significance at the 5% and 1%, respectively.

The common belief parameter $μ$ is estimated to be 5.19 with a p-value well below 1%. Thus, the customers’ expectation of the utility they would derive from the information available only on the product page is 5.19. The estimates and standard errors for the product-specific unobserved utilities $ξ_{j}$ are reported in Table EC.1 in Appendix EC.3. We find that the majority of these parameters are statistically significant. Our estimates show that the product-specific unobserved utilities $ξ_{j}$ vary significantly across products: they range from $-2.24$ to 2.50, with an average of 0.47 and a standard deviation of 0.97. We investigate the impact of unobserved utilities on the customers’ purchase decisions in more detail when we discuss the managerial implications of our paper in the next section.

6. Managerial Implications

Our estimation results reveal interesting insights about the customer click and purchase behavior. In this section we discuss such empirical insights and their managerial implications. In particular, in Section 6.1 we illustrate how clicks provide valuable information to customers. Section 6.2 discusses how our model can be used to identify different types of products based on their observed and unobserved utilities and in particular identify the underrated “diamond-in-the-rough” products. Finally, we illustrate in Section 6.3 how our model can be operationalized in practice and used for improving the assortment decisions. We have also performed several robustness checks, which are discussed in Appendix EC.4.

6.1. Value of Click

Our framework enables the retailer to measure not only the (pre-click) observed utilities of products for customers but also their unobserved utilities, which are revealed only after clicking on the product. As mentioned before, customers’ click behavior is not impacted by the unobserved utilities of products, as they are not known to the customer prior to the click. Nevertheless, the unobserved utilities affect the total utility of the product for the customer and, in turn, impact the purchase decision. Specifically, we find that the utility of observed product features to customers after the click (i.e., $β_{i} x_{j}^{T}$ ) is on average $-3.60$ across customers and products, ranging from $-5.40$ to $-2.27$ . Such negative estimated utility values are common in online retail settings, as the majority of customer transactions end up without a purchase (Agrawal et al. 2019); in our sampled data used for estimation, 93% of customer click sequences result in no purchase. We also find that the product-specific unobserved utility $ξ_{j}$ is 0.47 on average and ranges from $-2.24$ to 2.50 across products. We recall from (3) that the total utility of product $j$ for a customer $i$ is given by $β_{i} x_{j}^{T} + ξ_{j}$ . Now, comparing the range and average values of $β_{i} x_{j}^{T}$ and $ξ_{j}$ , we can see that the product-specific unobserved utilities $ξ_{j}$ are large enough to change the customer utility for some products after the click and, in turn, impact the customer’s purchase decision. In other words, the specific information that is available only on the product pages can change the customer’s final purchase decision. Thus, clicking on products is valuable to customers as it allows them to reveal the unobserved utility of products and potentially change their final purchase decision.

To shed further light on the value of click to customers, we also investigate whether there is any correlation between the (pre-click) observed utility of products (i.e., $α_{i} x_{j}^{T}$ ) and the unobserved utilities (i.e., $δ_{i} x_{j}^{T} + ξ_{j}$ ). As mentioned before, we cannot estimate the change-in-observed-utility vector $δ_{i}$ directly for identification purposes. What we do instead is to first estimate customer $i$ ’s preference vectors $β_{i}$ and $α_{i}$ . We then calculate $δ_{i}$ as $β_{i} - α_{i}$ . Thus, $δ_{i} x_{j}^{T}$ and $α_{i} x_{j}^{T}$ are expected to be highly correlated. We therefore focus on the correlation between $α_{i} x_{j}^{T}$ and $β_{i} x_{j}^{T}$ . To do so, for each product $j$ we calculate its average observed utility across all customers before and after the click. Specifically, we calculate

\begin{aligned} \bar{α} x_{j}^{T} = \frac{\sum_{i = 1}^{N} α_{i} x_{j}^{T}}{N} \quad \text{and} \quad \bar{β} x_{j}^{T} = \frac{\sum_{i = 1}^{N} β_{i} x_{j}^{T}}{N}, \end{aligned}

(11)

where we recall that $N$ denotes the total number of customers in the data. In essence, the parameters $\bar{α} x_{j}^{T}$ and $\bar{β} x_{j}^{T}$ represent the observed utility of product $j$ for an average customer (on this online platform) before and after the click. We find that the correlation between $\bar{α} x_{j}^{T}$ and $\bar{β} x_{j}^{T}$ is 0.721 across products. Thus, an average customer’s preferences for the product features observable on the search page are highly correlated before and after the click. More interestingly, though, we find that the correlation between $\bar{α} x_{j}^{T}$ and $ξ_{j}$ is $-0.046$ across products. This shows that there is very little correlation between the pre-click observed product utility from the information on the search page and the post-click product-specific unobserved utility from the information that is only available on the product page. In other words, a product with a high face value (i.e., high observed utility) may no longer be attractive to a customer after it is clicked, and vice versa. Therefore, our empirical results show that clicking on products is quite valuable for customers as it could lead to finding a more attractive product or discarding a less attractive one by learning the (pre-click) unobserved characteristics of products.
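The averaging in (11) and the two cross-product correlations can be sketched as follows. All arrays are synthetic placeholders (the estimated $α_{i}$ , $β_{i}$ , $x_{j}$ , and $ξ_{j}$ are not reproduced here), so the resulting correlations are not expected to match the 0.721 and $-0.046$ reported above; the dimensions and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, K = 200, 30, 4  # customers, products, features (all synthetic)

X = rng.normal(size=(J, K))                    # product feature vectors x_j
alpha = rng.normal(size=(N, K))                # pre-click preference vectors alpha_i
beta = alpha + 0.5 * rng.normal(size=(N, K))   # post-click preference vectors beta_i
xi = rng.normal(size=J)                        # product-specific unobserved utilities xi_j

# Equation (11): average each product's observed utility over all customers.
avg_pre = (alpha @ X.T).mean(axis=0)    # \bar{alpha} x_j^T, one value per product
avg_post = (beta @ X.T).mean(axis=0)    # \bar{beta} x_j^T, one value per product

# Pearson correlations across products.
corr_pre_post = np.corrcoef(avg_pre, avg_post)[0, 1]
corr_pre_xi = np.corrcoef(avg_pre, xi)[0, 1]
```

Because the utilities are linear in the features, averaging utilities over customers is the same as applying the average preference vector to each product, which is why the notation $\bar{α} x_{j}^{T}$ is natural.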

6.2. Diamonds in the Rough Versus Overrated Products

Our framework enables the retailer to identify two specific types of products which we call “diamonds in the rough” and “overrated.” In what follows, we discuss what we mean by diamond-in-the-rough (henceforth, DIR) and overrated products, and the implication of identifying such products for the retailer’s assortment decisions and revenue.

DIR products are generally underrated products that do not have a high observed utility ( $α_{i} x_{j}^{T}$ ) but have a high total utility ( $β_{i} x_{j}^{T} + ξ_{j}$ ) due to their high unobserved utility ( $δ_{i} x_{j}^{T} + ξ_{j}$ ) that will be revealed to customers only after the click. As a result, a DIR product has a low chance of being clicked; however, given that such a product has a high total utility, it has a high chance of being purchased if the product is clicked. On the contrary, an overrated product is one with a high observed but low total utility. Such overrated products are likely to receive a lot of clicks but not many purchases because the customer finds these products attractive before the click, but no longer attractive after clicking on them and exploring their product pages.

To illustrate the ideas better, we use our estimation results to rank the 126 SKUs based on the following values: (a) product’s average observed utility $\bar{α} x_{j}^{T}$ given by (11) as a proxy for its attractiveness before the click; (b) product’s average total utility ( $\bar{β} x_{j}^{T} + ξ_{j}$ ) as a proxy for its attractiveness after the click, where $\bar{β} x_{j}^{T}$ is defined in (11). In other words, a higher $\bar{α} x_{j}^{T}$ means a higher probability of click for the product, while a higher $\bar{β} x_{j}^{T} + ξ_{j}$ means a higher probability of purchase, after the product is clicked. Table 7 shows the top 20 products with the highest average total utility and their ranking based on the average observed utility—a lower ranking means a higher utility.

Table 7.
The top 20 products with the highest average total utility and their ranking based on the average observed utility—a lower ranking means a higher utility.

SKU ID	Ranking based on average total utility ( $\bar{β} x_{j}^{T} + ξ_{j}$ )	Ranking based on average observed utility ( $\bar{α} x_{j}^{T}$ )
SKU 125	1	50
SKU 67	2	44
SKU 110	3	1
SKU 64	4	54
SKU 113	5	48
SKU 107	6	45
SKU 123	7	107
SKU 83	8	105
SKU 59	9	7
SKU 117	10	31
SKU 108	11	28
SKU 95	12	49
SKU 105	13	97
SKU 71	14	52
SKU 18	15	14
SKU 39	16	3
SKU 114	17	10
SKU 13	18	34
SKU 17	19	26
SKU 70	20	57

The DIRs are products with a high ranking index based on the average observed utility but a low ranking index based on the average total utility. For example, SKU 125 and SKU 67 are two such products. They are ranked 1st and 2nd, respectively, based on the average total utility and hence, have a high chance of being purchased if a customer clicks on them; however, their ranking based on the average observed utility is 50th and 44th, respectively, which means they have a low chance of being clicked. Using a similar analysis, we can identify products that we call overrated: these are products that have a high chance of being clicked but not a high chance of being purchased eventually. One of these products is SKU 46 (not displayed in Table 7). This product is ranked 2nd based on the average observed utility, but its ranking based on the average total utility is 59th. In what follows, we propose a metric called “rank-ratio” which can help identify the DIR and overrated products more systematically.

Rank-Ratio Metric. After estimating our model and ranking products based on their average observed and total utilities (as in Table 7), we divide each product’s average total utility ranking by its average observed utility ranking and call this the “rank-ratio” of a product. For example, the rank-ratio of SKU 125 in the first row of Table 7 is given by $1 / 50 = 0.02$ . A lower rank-ratio indicates that the product is more likely to be a DIR as its average total utility ranking is small relative to its average observed utility ranking, or said differently, its average total utility is high while its average observed utility is low. On the contrary, a product with a high rank-ratio is more likely to be an overrated product as its average observed utility is high, but its average total utility is low. The products with medium rank-ratios lie somewhere in between DIR and overrated products. We call these the “medium-ranked” products. Indeed, this product classification serves as a general guideline and in practice, business and domain knowledge can also be used to fine-tune the identification of different types of products. Among the 126 products that we have used in our estimation, we classify the 30 products with the lowest rank-ratio as the DIRs, as this roughly corresponds to the bottom quartile of rank-ratios. We classify the 30 products with the highest rank-ratio as the overrated products, and classify the remaining 66 products as the medium-ranked products. We use this product classification in the numerical assortment experiments in Section 6.3.
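The rank-ratio computation can be sketched in a few lines. The rankings below are copied from Table 7, plus SKU 46 as quoted in the text; since only these 21 of the 126 products have published rankings, the sort is restricted to them and is meant as an illustration of the metric rather than a reproduction of the full classification. The variable names are assumptions.

```python
# SKU -> (rank by average total utility, rank by average observed utility),
# taken from Table 7 and, for SKU 46, from the accompanying text.
rankings = {
    125: (1, 50), 67: (2, 44), 110: (3, 1), 64: (4, 54), 113: (5, 48),
    107: (6, 45), 123: (7, 107), 83: (8, 105), 59: (9, 7), 117: (10, 31),
    108: (11, 28), 95: (12, 49), 105: (13, 97), 71: (14, 52), 18: (15, 14),
    39: (16, 3), 114: (17, 10), 13: (18, 34), 17: (19, 26), 70: (20, 57),
    46: (59, 2),
}

# Rank-ratio = total-utility rank divided by observed-utility rank.
rank_ratio = {sku: total / observed for sku, (total, observed) in rankings.items()}

# Low rank-ratio suggests a DIR candidate; high rank-ratio suggests an overrated one.
by_ratio = sorted(rank_ratio, key=rank_ratio.get)
lowest_five = by_ratio[:5]
highest = by_ratio[-1]
```

Among the listed products, the five lowest rank-ratios belong to SKUs 125, 67, 123, 64, and 83, and the highest belongs to SKU 46, in line with the DIR and overrated examples discussed in the text.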

If the retailer could bring the DIR products to customers’ attention, the chance of click and consequently purchase of these products could be increased. Thus, the retailer might benefit from experimenting with giving more visibility to DIR products on the search page. There are different mechanisms through which the retailer could shine the spotlight on such products. For example, the retailer can highlight these products on the search page or add tags such as “spotlight product” (similar to “Amazon’s Choice” tags on Amazon.com or “Our Experts Recommend” at Best Buy) to entice customers to click on them. Another option is to display such products in the top positions on the search page. On the contrary, the retailer potentially could demote the overrated products on the search page (or remove them from the assortment altogether) so that customers do not spend time browsing but not purchasing them. In Section 6.3, we illustrate how the assortment decisions can be improved using our model by accounting for the unobserved product utilities and differentiating the DIR from overrated products.

6.3. Insights for Assortment Planning

In this section we illustrate how our model can be operationalized in practice and used for improving assortment decisions—this can be achieved by personalizing the assortment offerings for each customer type (Bernstein et al. 2019, 2022), or highlighting (promoting) certain products from the assortment on the search page. We focus on our subset of 126 substitutable products as a representative sample of the data and run simulations to compare the expected revenue of the optimal assortment under our model and the following three benchmarks: $(i)$ an MNL model that only uses the observed product utilities to find the optimal assortment; that is, this benchmark finds the optimal assortment under an MNL model where the product utilities are given by $α_{i} x_{j}^{T}$ ; $(i i)$ an MNL model that uses the products’ total utilities to find the optimal assortment; that is, this benchmark finds the optimal assortment under an MNL model where the product utilities are given by $β_{i} x_{j}^{T} + ξ_{j}$ ; and $(i i i)$ actual assortments displayed by JD.com.

The comparison with the two MNL benchmarks illustrates the importance of accounting for the customer search behavior when making assortment decisions. Specifically, the first MNL benchmark assumes that the most attractive products on the search page have the highest chance of being purchased, ignoring the possibility that the information provided on the product pages (i.e., the unobserved utilities) might change the customers’ utilities for products. The second MNL benchmark adjusts the product utilities by accounting for the unobserved part, but still does not consider that customers engage in a costly search process by considering only a few products from the full assortment by clicking on them and exploring them in more detail—the first MNL benchmark has the same shortcoming as well. The comparison with the assortments displayed by JD.com illustrates the potential gain relative to the status quo.

We will discuss the details of the numerical experiments next, but in short, we find that the optimal assortments under our model can significantly improve the expected revenue and result on average in more than 35% improvement in revenue over the MNL optimal assortments, and more than 24% improvement over the actual assortments.

6.3.1. Comparison With MNL Optimal Assortments

We run simulation experiments where we assume that the ground truth is based on our model; that is, customers search for products according to our model. We then use Monte Carlo simulation to estimate the expected revenue generated from different assortments. We compare the expected revenue of the optimal assortment under our model, with that of the two MNL benchmarks discussed above. We note that when calculating the expected revenues under different assortments, we use the same model (i.e., the model described in Section 4) as the ground truth so that the comparison is fair. The difference in revenues comes from the fact that the displayed assortments are optimal under different models.

We first illustrate the idea through an example and then report the results across 1,000 random instances. We use the rank-ratio metric to identify the DIR, overrated, and medium-ranked products, as discussed in Section 6.2. We then select the 5 DIR products with the lowest rank-ratio, the 5 overrated products with the highest rank-ratio, and 5 randomly selected medium-ranked products. Specifically, the full assortment consists of SKUs 64, 67, 83, 123, 125 as the DIRs, SKUs 46, 91, 92, 102, 119 as the overrated products, and SKUs 3, 4, 7, 8, 9 as the medium-ranked products. We compare the optimal assortments of size 10 suggested by different benchmarks for one customer profile as a representative example. We find that the optimal assortment under different benchmarks includes all products except the following SKUs: $\{7, 8, 46, 91, 119\}$ under our model, $\{9, 64, 83, 123, 125\}$ under the first MNL benchmark that only uses the product observed utilities, and $\{46, 91, 92, 102, 119\}$ under the second MNL benchmark that uses the product total utilities. Comparing the SKUs that are not included in these assortments, we observe that the first MNL benchmark excludes four of the five DIRs, as this benchmark is oblivious to the fact that such DIR products indeed have high total utilities despite their low observed utilities. The second MNL benchmark excludes all the overrated products as such products have a low total utility. Finally, the optimal assortment under our model does not include three overrated and two medium-ranked products. As a result, the expected revenue of our optimal assortment is three times more than that of the first MNL benchmark and 30% more than that of the second MNL benchmark.

To investigate a more comprehensive set of instances and reduce the chance of cherry-picking, we generate 1,000 random assortment instances of size $N = 15$ where 5 DIR products are randomly selected from the bottom-30 rank-ratios, 5 overrated products are randomly selected from the top-30 rank-ratios, and 5 medium-ranked products are randomly selected from the remaining 66 products. We compare the expected revenues of the optimal assortments of size $J = 10$ under our model and the two MNL benchmarks. We find that the optimal assortments under our model result in more than 39% and 35% higher revenues on average (across different customer profiles) than the first and second MNL benchmarks, respectively, after discarding the outliers. As might be expected, the second MNL benchmark (which uses the total utilities) performs better than the first one (which only uses the observed utilities) as it uses more information to find the optimal assortments.

6.3.2. Comparison With Actual Assortments

We replicate the assortment experiments from the previous section but compare our optimal assortments with the actual assortments displayed by the platform (i.e., JD.com). There are 30,000 customers in the sampled data that we have used for estimation in Section 5.3. For each of these customers, we consider the first 15 products (based on the order of the customer click) that JD.com has displayed to that customer as the full assortment of products. We then consider the first 10 products as the “actual” (most relevant) assortment displayed to the customer by JD.com.

JD.com uses its own algorithm to decide what assortment to display to each customer and does not assume that our model is the ground truth; thus, the identified actual assortments represent the status quo. We then find the optimal assortment of size 10 (among the full assortment of 15 products) through enumeration and find the expected revenue gap between the actual and optimal assortments. We find that on average (across 30,000 customers), the optimal assortments suggested by our model improve the expected revenue by more than 24% (after discarding the outliers) over the actual assortments used by JD.com, based on our model as the ground truth.
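The enumeration step can be sketched as below. For brevity, the revenue function uses plain MNL purchase probabilities (as in the benchmarks) rather than the sequential search model of Section 4, whose expected revenue the paper estimates by Monte Carlo simulation; swapping in a simulation-based revenue function would leave the enumeration loop unchanged. The utilities, prices, and function names are hypothetical.

```python
from itertools import combinations
import math

def mnl_expected_revenue(assortment, utility, price):
    """Expected revenue per customer under MNL with a no-purchase option of utility 0."""
    weights = {j: math.exp(utility[j]) for j in assortment}
    denom = 1.0 + sum(weights.values())  # the 1.0 is exp(0) for no purchase
    return sum(price[j] * w for j, w in weights.items()) / denom

def best_assortment(products, utility, price, size):
    """Enumerate every subset of the given size and keep the revenue-maximizing one."""
    return max(
        combinations(products, size),
        key=lambda s: mnl_expected_revenue(s, utility, price),
    )

# Hypothetical toy instance: four products with equal utility and increasing prices.
products = [0, 1, 2, 3]
utility = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}
price = {0: 10.0, 1: 20.0, 2: 30.0, 3: 40.0}

best = best_assortment(products, utility, price, size=2)
best_revenue = mnl_expected_revenue(best, utility, price)
```

With equal utilities, every pair has the same denominator, so the two highest-priced products are selected. For a full assortment of 15 products and a capacity of 10, the same loop visits 3,003 subsets, which is small enough for exhaustive search.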

6.3.3. Impact of Assortment Size on Revenue

We also study how the revenue improvement changes with the assortment size. To do so, we repeat the exercise from Section 6.3.2 by taking a random sample of 1,000 customers and calculating the revenue improvement of the optimal assortment suggested by our model over the actual assortment used by JD.com. The main difference now is that we vary the assortment (capacity) size from 2 to 10 (while still using a full assortment of size 15, as in Section 6.3.2). The revenue improvements (in percentage) are reported in Table 8 for different assortment sizes. As can be noted from the table, the revenue improvement seems to have an inverted-U shape as a function of the assortment size. That is, as the assortment (capacity) size increases, the revenue improvement generally increases up to a point and then starts decreasing. The intuition for this is that as the number of possible assortments that can be displayed to the customer increases, there is more potential for revenue gain by offering a different (optimal) assortment. Therefore, the inverted-U shape is the result of increasing the assortment capacity size (i.e., choosing a larger subset of products), which first increases and then decreases the number of feasible assortments that can be offered. For example, for an assortment size of 2, there are 105 possible assortments that can be selected from the full set of 15 products, while for an assortment size of 7 or 8, there are 6,435 possible assortments that can be offered.
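The assortment counts quoted above are binomial coefficients, which can be checked directly. The short sketch below tabulates the number of subsets of each capacity size drawn from a full set of 15 products; the variable names are illustrative.

```python
from math import comb

FULL_SET_SIZE = 15

# Number of feasible assortments for each capacity size from 2 to 10.
counts = {size: comb(FULL_SET_SIZE, size) for size in range(2, 11)}

# Capacity sizes at which the number of feasible assortments peaks.
peak_sizes = [s for s, c in counts.items() if c == max(counts.values())]
```

The count rises from 105 at size 2 to a peak of 6,435 at sizes 7 and 8, then falls to 3,003 at size 10, which mirrors the rise and fall of the feasible set described in the text.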

Table 8.
Revenue improvements of the optimal assortments over JD.com assortments for different assortment sizes in a sample of 1,000 random customers.

Assortment size	Revenue improvement
2	15.2%
3	26.4%
4	30.9%
5	35.1%
6	36.1%
7	35.7%
8	31.2%
9	32.3%
10	26.6%

To sum up, the results in Section 6.3 show the potential of using our model for improving the assortment decisions. This, however, is with the caveat that our results are likely an overestimation of what might happen in practice because they are obtained under the assumption that the ground truth is based on our model. Thus, we cannot necessarily claim that these exact revenue improvements are generalizable to other e-commerce settings; however, our developed methodology can be applied to and leveraged in other product search (e-commerce) settings to identify underrated/overrated products, and our results showcase the value of operationalizing this model in improving the assortment decisions. In most practical settings, customers indeed engage in a costly search process by clicking on different products to collect more information. Our results show that in such settings accounting for the customer search behavior as well as the unobserved product utilities (which will be revealed after the click) can change the composition of the optimal assortment and significantly increase the revenue.

7. Conclusion

In this article, we propose a structural model to study the customer click and purchase behavior in an online retail setting. Our work is motivated by the data sets from JD.com which were shared by the MSOM society. These data sets provide information on close to 400,000 customers and 30,000 SKUs in one specific (unnamed) product category during the month of March in 2018. We combine customers’ click and order data to study their search behavior. In particular, we construct the click sequence of customers in order to disentangle the drivers of customer click versus purchase decisions.

We propose a structural framework to model the customer click and purchase behavior using a sequential discrete choice model. In particular, we assume that the customer’s utility from each product has an observed and unobserved part (in addition to a random shock). The observed part of the utility is known to the customer prior to the click based on the product characteristics displayed on the search page—such observed characteristics might include general product information such as price and brand. However, the unobserved characteristics (utility) of the product can only be learned by the customer if she clicks on the product and explores the product page. Such unobserved characteristics might include detailed customer reviews, product specifications, etc., which are only displayed on the product page after the click. Given that the search is costly for the customer, she must decide whether to continue the search by clicking on an unexplored product, or stop the search and make a purchase/no-purchase decision. The customer assigns a utility to each option and chooses the one with the highest utility at any stage of the search. We assume that customers follow a one-step lookahead policy to decide their search strategy. Specifically, we assume that customers anticipate their expected maximum utility from a subsequent purchase/no-purchase decision, if they were to make one additional click.

We take a representative sample of the data and estimate the model parameters on that set. Our estimation results show that the value of click for customers can be quite significant. This is evidenced by the fact that the estimates of the product-specific unobserved utilities vary significantly across products and can significantly change the product utility after the click; moreover, the product-specific unobserved utilities have almost no correlation with the observed utilities. Most importantly, our structural framework allows us to disentangle the observed and unobserved parts of product utilities and identify the diamonds in the rough: these are underrated products with low observed utilities but high total utilities (i.e., sum of observed and unobserved utilities) due to their high unobserved utilities. Given that the click decisions are only impacted by observed utilities but purchase decisions are driven by total utilities, such diamond-in-the-rough products have a low chance of being clicked, but a high chance of being purchased, if clicked. Consequently, the retailer might benefit from bringing such products into the spotlight and promoting them to customers on the search page. Through simulation studies we illustrate how our model can be operationalized and used in practice for improving assortment decisions. In particular, we find that by accounting for the unobserved product utilities and customer search cost, our model can change the composition of the optimal assortment and significantly increase the revenue. Specifically, we focus on a subset of 126 substitutable products as a representative sample of the data and find that based on our model, the optimal assortments (suggested by our model) improve the expected revenue by more than 24% over the actual assortments displayed by JD.com, and by more than 35% on average over two MNL benchmarks that use either the product observed or total utilities to find the optimal assortments.

Future research can study and quantify the impact on sales of spotlighting the diamond-in-the-rough products in a randomized experiment. To this end, one of the shortcomings of the JD.com data set is that the product location information on the search page is not available; in the online appendix, we perform a robustness check using a proxy for product location. Thus, although our proposed framework allows us to identify the underrated diamond-in-the-rough products and use the insights for improving the assortment decisions, it is not possible for us to directly measure the sales impact of promoting such products on the search page. This can be investigated in future research where the retailer experiments with spotlighting the diamond-in-the-rough products by giving them more visibility on the search page. Moreover, if a more detailed set of product and customer features is available in the data, one can incorporate non-linear terms in the utility function to further improve the model flexibility. In addition, the data set captures only one visit (purchase) from each customer, and provides limited information as to how customers evaluate products during their search process. Thus, having access to a much richer data set could help make the obtained results and insights even more robust. Finally, because the data is from March 2018, which coincides with the Chinese New Year, the data period may not reflect typical consumer purchasing processes.

Supplemental Material

sj-pdf-1-pao-10.1177_10591478251350097 - Supplemental material for Diamonds in the Rough: Leveraging Click Data to Spotlight Underrated Products

Supplemental material, sj-pdf-1-pao-10.1177_10591478251350097 for Diamonds in the Rough: Leveraging Click Data to Spotlight Underrated Products by Sajad Modaresi, Seyed Morteza Emadi and Vinayak Deshpande in Production and Operations Management

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Sajad Modaresi

Seyed Morteza Emadi

Vinayak Deshpande

Supplemental Material

Supplemental material for this article is available online (doi: ).

Notes

How to cite this article

Modaresi S, Emadi SM and Deshpande V (2026) Diamonds in the Rough: Leveraging Click Data to Spotlight Underrated Products. Production and Operations Management 35(2) 606–624.

| Estimates | $π_{α}^{0}$ | $π_{α}$, $d_{1}$ | $π_{α}$, $d_{2}$ | $π_{α}$, $d_{3}$ | $π_{α}$, $d_{4}$ | $π_{α}$, $d_{5}$ |
|---|---|---|---|---|---|---|
| $f_{1}$ | −0.386$^{***}$ (0.032) | −0.286$^{**}$ (0.084) | 0.188 (0.096) | −1.351$^{***}$ (0.164) | −1.370$^{***}$ (0.392) | −0.401$^{***}$ (0.015) |
| $f_{2}$ | −0.670$^{***}$ (0.013) | 0.028$^{***}$ (0.001) | 0.129$^{***}$ (0.001) | 0.088$^{***}$ (0.000) | −1.554$^{***}$ (0.000) | 1.098$^{***}$ (0.000) |
| $f_{3}$ | 4.345$^{***}$ (0.143) | 0.855$^{***}$ (0.042) | 0.788$^{***}$ (0.042) | 1.139$^{***}$ (0.095) | 6.950$^{***}$ (0.055) | −5.702$^{***}$ (0.106) |
| $f_{4}$ | −2.523$^{***}$ (0.000) | 0.368$^{***}$ (0.000) | 0.019$^{***}$ (0.000) | −0.330$^{***}$ (0.000) | −1.954$^{***}$ (0.000) | 1.784$^{***}$ (0.000) |
| $f_{5}$ | −0.881$^{***}$ (0.003) | −0.659$^{***}$ (0.000) | −0.686$^{***}$ (0.000) | −0.475$^{***}$ (0.000) | 0.079$^{***}$ (0.000) | 0.079$^{***}$ (0.000) |
| $f_{6}$ | −1.247$^{***}$ (0.000) | 0.148$^{***}$ (0.000) | 0.035$^{***}$ (0.000) | −0.029$^{***}$ (0.000) | −0.635$^{***}$ (0.000) | 0.806$^{***}$ (0.000) |
| $f_{7}$ | −1.153$^{***}$ (0.120) | −0.001 (0.032) | 0.333$^{***}$ (0.032) | 0.341$^{***}$ (0.000) | −2.185$^{***}$ (0.000) | 2.332$^{***}$ (0.000) |
| $f_{8}$ | 3.599$^{***}$ (0.190) | 0.356$^{***}$ (0.047) | 0.780$^{***}$ (0.046) | 1.077$^{***}$ (0.095) | −0.046 (0.048) | 2.282$^{***}$ (0.106) |
| $f_{9}$ | −0.695$^{***}$ (0.046) | −0.047$^{***}$ (0.013) | 0.256$^{***}$ (0.013) | −0.090$^{***}$ (0.000) | −3.166$^{***}$ (0.000) | 3.806$^{***}$ (0.000) |
| $f_{10}$ | −0.291$^{***}$ (0.000) | −0.432$^{***}$ (0.000) | −4.784$^{***}$ (0.000) | −5.565$^{***}$ (0.000) | 0.473$^{***}$ (0.000) | 0.472$^{***}$ (0.000) |
| $f_{11}$ | −2.489$^{***}$ (0.027) | −0.074$^{***}$ (0.006) | 0.445$^{***}$ (0.006) | 0.102$^{***}$ (0.001) | 13.041$^{***}$ (0.012) | −13.126$^{***}$ (0.000) |

| SKU ID | Ranking based on average total utility ($\bar{β} x_{j}^{T} + ξ_{j}$) | Ranking based on average observed utility ($\bar{α} x_{j}^{T}$) |
|---|---|---|
| SKU 125 | 1 | 50 |
| SKU 67 | 2 | 44 |
| SKU 110 | 3 | 1 |
| SKU 64 | 4 | 54 |
| SKU 113 | 5 | 48 |
| SKU 107 | 6 | 45 |
| SKU 123 | 7 | 107 |
| SKU 83 | 8 | 105 |
| SKU 59 | 9 | 7 |
| SKU 117 | 10 | 31 |
| SKU 108 | 11 | 28 |
| SKU 95 | 12 | 49 |
| SKU 105 | 13 | 97 |
| SKU 71 | 14 | 52 |
| SKU 18 | 15 | 14 |
| SKU 39 | 16 | 3 |
| SKU 114 | 17 | 10 |
| SKU 13 | 18 | 34 |
| SKU 17 | 19 | 26 |
| SKU 70 | 20 | 57 |
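One way to read the two rankings above is to compute, for each SKU, the gap between its observed-utility rank and its total-utility rank, and to flag SKUs whose gap is large. The following sketch does this for the 20 SKUs listed. The threshold `MIN_RANK_GAP` and the helper `flag_diamonds` are hypothetical choices made for illustration; the paper does not prescribe a specific cutoff.

```python
# (SKU ID, rank by average total utility, rank by average observed utility),
# transcribed from the ranking table above.
rankings = [
    (125, 1, 50), (67, 2, 44), (110, 3, 1), (64, 4, 54), (113, 5, 48),
    (107, 6, 45), (123, 7, 107), (83, 8, 105), (59, 9, 7), (117, 10, 31),
    (108, 11, 28), (95, 12, 49), (105, 13, 97), (71, 14, 52), (18, 15, 14),
    (39, 16, 3), (114, 17, 10), (13, 18, 34), (17, 19, 26), (70, 20, 57),
]

# Hypothetical threshold: a SKU is flagged when its observed-utility rank
# is at least this many places worse than its total-utility rank.
MIN_RANK_GAP = 40

def flag_diamonds(rows, min_gap):
    """Return (sku, gap) pairs with gap >= min_gap, largest gap first."""
    flagged = [
        (sku, observed_rank - total_rank)
        for sku, total_rank, observed_rank in rows
        if observed_rank - total_rank >= min_gap
    ]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

diamonds = flag_diamonds(rankings, MIN_RANK_GAP)
for sku, gap in diamonds:
    print(f"SKU {sku}: observed rank is {gap} places below total rank")
```

With this assumed cutoff, SKUs such as 123 and 83 (ranked 7th and 8th on total utility but 107th and 105th on observed utility) are flagged, whereas SKU 110, which ranks highly on both measures, is not.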

Table 2. Summary statistics for the click data across users.

| | Average | Std. deviation | Median | 90th percentile | Max |
|---|---|---|---|---|---|
| Number of unique SKUs clicked | 2.75 | 2.82 | 2.00 | 6.00 | 17 |

Table 8. Revenue improvements of the optimal assortments over JD.com assortments for different assortment sizes in a sample of 1,000 random customers.

| Assortment size | Revenue improvement |
|---|---|
| 2 | 15.2% |
| 3 | 26.4% |
| 4 | 30.9% |
| 5 | 35.1% |
| 6 | 36.1% |
| 7 | 35.7% |
| 8 | 31.2% |
| 9 | 32.3% |
| 10 | 26.6% |
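The pattern in Table 8 can be summarized with a short calculation. The sketch below takes the reported improvements as given, finds the assortment size with the largest improvement, and computes the change from each size to the next. The variable names are illustrative only.

```python
# Revenue improvement (%) by assortment size, transcribed from Table 8.
improvement_by_size = {
    2: 15.2, 3: 26.4, 4: 30.9, 5: 35.1, 6: 36.1,
    7: 35.7, 8: 31.2, 9: 32.3, 10: 26.6,
}

# Assortment size with the largest reported improvement.
best_size = max(improvement_by_size, key=improvement_by_size.get)

# Change in improvement (percentage points) when moving to each size
# from the next smaller one.
sizes = sorted(improvement_by_size)
changes = {
    size: round(improvement_by_size[size] - improvement_by_size[prev], 1)
    for prev, size in zip(sizes, sizes[1:])
}

print(f"Largest improvement at assortment size {best_size}")
for size, delta in changes.items():
    print(f"size {size}: {delta:+.1f} percentage points vs. size {size - 1}")
```

The reported improvement rises up to an assortment size of 6 and is lower at every larger size, so the gain over the JD.com assortments is not monotone in assortment size.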