Abstract
Human decision making in the newsvendor context has been analyzed intensively in laboratory experiments, where various decision biases have been identified. However, it is unclear whether the biases also exist in practice. We analyze the ordering decisions of a manufacturer who faces a multiproduct newsvendor problem with an aggregate service‐level constraint. We find that the manufacturer broadly exhibits the same biases as subjects in the laboratory and is prone to another bias that has not been identified before, that is, group aggregation. The bias can be attributed to the multi‐product problem of the manufacturer, and refers to the observation that the service levels are not optimized for individual products, but rather for product groups. Our data allow us to analyze the performance of a manufacturer in detail and we find that target service levels are achieved effectively, but not efficiently. We provide rationales for the manufacturer's ordering behavior, discuss managerial implications, and quantify the financial benefits of debiasing ordering decisions.
INTRODUCTION
The newsvendor problem is one of the fundamental problems in operations management. The basic model considers a decision maker who is facing stochastic demand for a perishable product and must decide how much of the product to order to maximize expected profit. The model was introduced by Edgeworth (1888) and many variations and extensions of the model have been developed (Choi, 2012).
In their seminal behavioral operations paper, Schweitzer and Cachon (2000) analyzed ordering decisions of human decision makers in a newsvendor setting. They conducted experiments and found that orders deviated from the normative predictions of the newsvendor model. The subjects overreacted to recent demand realizations and their average order quantities were pulled toward the mean demand. The biases are robust and have been observed under different demand distributions (Benzion et al., 2008), with different subject pools (Bolton et al., 2012; Lee et al., 2018; Moritz et al., 2013), under various framings (Katok & Wu, 2009; Kremer et al., 2010), and are stable over time (Bolton & Katok, 2008; Lurie & Swaminathan, 2009; Ockenfels & Selten, 2015). The common explanations of the biases are demand chasing (Bolton & Katok, 2008; Lau & Bearden, 2013), anchoring (Bostian et al., 2008), and inventory error minimization (Ho et al., 2010; Kremer et al., 2014). For a comprehensive review on behavioral inventory papers, we refer to Becker‐Peth and Thonemann (2018).
There exist rich bodies of literature on normative newsvendor models and newsvendor laboratory experiments, but the ordering behavior of actual decision makers in a newsvendor environment has not been analyzed. We take a first step toward filling this gap. In this paper, we analyze the orders of a manufacturer who distributes bakery products via the stores of one of Europe's largest retailers. The problem that the manufacturer faces is a newsvendor‐type problem: Demand is uncertain, products are perishable, and the products' shelf life and selling period are one day without replenishment during the day. However, unlike in the standard newsvendor model, the manufacturer does not face an unconstrained optimization problem, but must ensure a minimum service level. More precisely, the manufacturer has to ensure that a certain fraction of the products is in stock at the end of the day, that is, he must achieve an aggregated target service level.
The newsvendor problem with the specific aggregate service‐level constraint that the manufacturer faces has not been analyzed before in the literature. We analytically derive the optimal ordering decision for the problem. In the optimal solution, service levels are differentiated across products based on the demand uncertainties, unit revenues, and unit costs of products. We compare the actual order quantities with the optimal quantities and find that the manufacturer's ordering decisions are effective, but not efficient. The manufacturer achieves the aggregated target service level generally, but at costs well above the optimal cost. The efficiency gap can be attributed to three effects: behavioral forecasting, inventory error minimization, and group aggregation. The first two effects were identified in previous laboratory experiments and we find empirical evidence for them in the order data of the manufacturer. The third effect, group aggregation, has not been identified before. It refers to the observation that service levels are differentiated per product groups, but not per individual product. This approach simplifies the task but harms efficiency.
Our results have important implications for behavioral operations management research. Research in this area has relied on analytical models and laboratory experiments, and the behavioral operations management community has discussed extensively how well the results of laboratory experiments translate into practice. We analyze the inventory decisions of an actual manufacturer in a newsvendor environment and show that the biases that have been identified in the laboratory also exist in practice.
Our results are of interest not only to researchers but also to practitioners. We quantify the magnitude of the financial benefits of eliminating decision biases by comparing the manufacturer's performance with the performance that the manufacturer would have achieved if he had implemented an unbiased solution. The results show that eliminating behavioral biases increases the operating profit substantially.
ANALYTICAL MODEL
The manufacturer that we analyze in this paper distributes
Depending on the contribution margins of the products, the expected profit maximizing orders can result in lower service levels than the retailer requires. To avoid low availability of products at the end of a day, the retailer requires that the manufacturer achieves a service level per store of at least
The manufacturer's optimization problem can be solved for each store individually, so we next consider a single store. The problem is solved in two stages: the manufacturer first estimates the demand distribution for each product and then solves a multiproduct newsvendor problem.
Demand forecasting
We denote the demand for product
In retail environments like the one that we consider, demand is typically autocorrelated (van Donselaar et al., 2010). It can be modeled as an ARIMA(0,1,1) time series, for which single exponential smoothing is the mean squared error minimizing forecasting method (Chatfield, 2001), and we use this approach in our model.
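As an illustration of the forecasting method named above, the following minimal sketch implements textbook single exponential smoothing. The function name, the smoothing parameter `alpha`, and the starting value `initial_forecast` are illustrative assumptions rather than the paper's notation, and the sketch assumes fully observed demand, whereas the manufacturer observes censored demand.

```python
def ses_forecasts(demands, alpha, initial_forecast):
    """One-step-ahead forecasts from single exponential smoothing.

    Element t is the forecast made before demands[t] is observed; the last
    element is the forecast for the next, not yet observed period.
    """
    forecasts = [initial_forecast]
    for demand in demands:
        previous = forecasts[-1]
        # New forecast = weighted average of the latest demand and the old forecast.
        forecasts.append(alpha * demand + (1.0 - alpha) * previous)
    return forecasts


history = [10.0, 20.0, 30.0]  # made-up demand observations
print(ses_forecasts(history, alpha=0.5, initial_forecast=10.0))
# -> [10.0, 10.0, 15.0, 22.5]
```

A larger `alpha` puts more weight on the most recent observation, which is the lever that the behavioral forecasting models discussed later adjust.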
The manufacturer receives censored demand information, that is, he does not observe demands
Based on the estimate of the demand of the previous period,
The variance
We neglect demand substitution effects, which seems reasonable in the setting that we analyze. In our application, substitution rates are small and the impact on the resulting order quantities and profits is negligible. We evaluated the effect of substitution on the manufacturer's profit and found that order quantities differ by 1.1% on average and that the realized profit would be 0.6% higher than under the optimal solution without substitution. To keep the following models analytically tractable and the behavioral analyses technically feasible, we disregard substitution effects in the model.
Newsvendor system approach
The manufacturer's objective is to maximize expected profit, subject to a constraint on the service level. The probability that product
The optimization problem is
If Constraint (4) is not binding, optimal order quantities are determined for each product individually by the standard newsvendor formula, that is, The optimal order quantities fulfill the following conditions:
The theorem states the optimality conditions in terms of the order quantities. Some of our analyses will be based on service levels. For such analyses, it is more convenient to use the service‐level definition
The optimality conditions have an intuitive interpretation. The numerator is the expected cost increase of a marginal order quantity increase in product
In the optimal solution, the service levels are differentiated based on the characteristics of the products and the demand, which is formally stated in Theorem 2. The optimal service level increases in the unit revenue, decreases in the unit cost, and decreases in the standard deviation of the demand.
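The comparative statics can be illustrated numerically. The sketch below is not the paper's derivation: it assumes normally distributed demand, measures the service level of a product as its in-stock probability, and requires the average in-stock probability across products to reach the target. Under these assumptions, the ratio of marginal expected cost to marginal service-level gain is equalized across products through a Lagrange multiplier, which is found by bisection. All function and parameter names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the safety factor z


def marginal_ratio(z, revenue, cost, sigma):
    """Marginal expected cost of a larger order per unit of service-level gain."""
    marginal_cost = sigma * (cost - revenue * (1.0 - STD_NORMAL.cdf(z)))
    return marginal_cost / STD_NORMAL.pdf(z)


def safety_factor(multiplier, revenue, cost, sigma, z_max=6.0, tol=1e-10):
    """Bisection for the z at which marginal_ratio equals the multiplier.

    The search starts at the unconstrained newsvendor safety factor, above
    which the ratio is increasing in z.
    """
    low = STD_NORMAL.inv_cdf((revenue - cost) / revenue)
    high = z_max
    while high - low > tol:
        mid = 0.5 * (low + high)
        if marginal_ratio(mid, revenue, cost, sigma) < multiplier:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


def optimal_service_levels(products, target):
    """Service level per (revenue, cost, sigma) product under an average target."""
    def levels(multiplier):
        return [STD_NORMAL.cdf(safety_factor(multiplier, *p)) for p in products]

    def average(values):
        return sum(values) / len(values)

    if average(levels(0.0)) >= target:  # constraint not binding
        return levels(0.0)
    low, high = 0.0, 1.0
    while average(levels(high)) < target:  # bracket the multiplier
        high *= 2.0
    for _ in range(100):  # bisection on the multiplier
        mid = 0.5 * (low + high)
        if average(levels(mid)) < target:
            low = mid
        else:
            high = mid
    return levels(high)


# Made-up products: (unit revenue, unit cost, demand standard deviation)
example = [(2.0, 1.2, 10.0), (2.0, 0.6, 10.0), (2.0, 1.2, 30.0)]
print([round(s, 3) for s in optimal_service_levels(example, target=0.70)])
```

In this example, the product with the lower unit cost receives the highest service level and the product with the larger demand standard deviation receives the lowest one, in line with the directions stated in Theorem 2. An order quantity follows from a service level as the mean demand plus the safety factor times the standard deviation.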
BEHAVIORAL MODELS
The planning task consists of two subtasks, demand forecasting and inventory optimization. Literature has identified several decision biases for these subtasks that might be relevant in our setting. For behavioral forecasting, we will analyze system neglect behavior and for behavioral inventory management, we will look at group aggregation, anchoring, and inventory error minimization.
Behavioral demand forecasting
Behavioral operations management literature suggests that actual ordering decisions are more than optimally adjusted toward recent demand realizations, an effect referred to as demand chasing (for instance, Bolton & Katok, 2008; Bostian et al., 2008; Schweitzer & Cachon, 2000). Other studies with stationary demand forecasts focus on how subjects sample historical data to estimate future demand (Tong & Feiler, 2017). They find that subjects naively sample too few observations from historical data to estimate the mean point forecast. Another study on stationary demand forecasts analyzes how censored demand settings impact the estimate of point forecasts (Feiler et al., 2013). They find that subjects show a censorship bias, that is, they underestimate the extent of unobserved lost sales and “rely too heavily on the observed censored sample” (Feiler et al., 2013). For a comprehensive review on studies in behavioral forecasting, see Goodwin et al. (2018).
Focusing on demand series forecasting, Kremer et al. (2011) analyze how subjects forecast autocorrelated time series similar to ours. They found that subjects' forecasting behavior in correlated demand environments is consistent with the mechanics of a single exponential smoothing forecast. However, subjects overadjust in settings where they should not adjust (corresponding to small
Based on their findings, we model forecasting as
Behavioral inventory optimization
Based on the demand forecasts, the order quantities are optimized. In an optimal solution, the order quantities are chosen such that the target service level is reached and expected profits are maximized. The service levels then depend on the demand uncertainties, unit revenues, and unit costs of the products (Theorem 2). However, literature on behavioral inventory management suggests various deviations from expected profit maximizing behavior. Prominent observations include anchoring (Bolton & Katok, 2008) and ex post inventory error minimization (Kremer et al., 2014). Before we look at these factors, we introduce a factor that is specific to our setting, which we refer to as group aggregation.
Group aggregation
In our setting, the manufacturer must optimize service levels for 23 products. This is analytically challenging. A potential simplification would be to split the products into
The optimization problem of the manufacturer can be formulated as follows:
This results in the following optimality condition:
Such a model can be seen as a kind of heuristic for the decision maker. Using aggregated product groups leads to similar target service levels within a group. This results in less than optimal within‐group differentiation compared to optimal individual product‐based differentiation. The actual target service levels of the groups and between groups depend on the group composition. We will refer to the model with group aggregation as Model 2.
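The grouping heuristic can be sketched in the same spirit. The code below is an illustration under stated assumptions, not the paper's formulation: demand is normal, all products in a group share one safety factor (and hence one in-stock probability), and the product-weighted average of the group service levels must reach the target. All names and example values are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def group_safety_factor(multiplier, group, z_max=6.0, tol=1e-10):
    """Common safety factor for one group of (revenue, cost, sigma) products."""
    weighted_cost = sum(sigma * cost for _, cost, sigma in group)
    weighted_revenue = sum(sigma * revenue for revenue, _, sigma in group)
    # Unconstrained optimum of the group when all products share one z.
    low = STD_NORMAL.inv_cdf(1.0 - weighted_cost / weighted_revenue)
    high = z_max
    while high - low > tol:
        mid = 0.5 * (low + high)
        marginal_cost = weighted_cost - weighted_revenue * (1.0 - STD_NORMAL.cdf(mid))
        ratio = marginal_cost / (len(group) * STD_NORMAL.pdf(mid))
        if ratio < multiplier:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


def group_service_levels(groups, target):
    """One service level per group such that the product-weighted average meets the target."""
    n_products = sum(len(group) for group in groups)

    def levels(multiplier):
        return [STD_NORMAL.cdf(group_safety_factor(multiplier, g)) for g in groups]

    def average(group_levels):
        return sum(len(g) * s for g, s in zip(groups, group_levels)) / n_products

    if average(levels(0.0)) >= target:  # constraint not binding
        return levels(0.0)
    low, high = 0.0, 1.0
    while average(levels(high)) < target:
        high *= 2.0
    for _ in range(100):
        mid = 0.5 * (low + high)
        if average(levels(mid)) < target:
            low = mid
        else:
            high = mid
    return levels(high)


# Made-up groups of (unit revenue, unit cost, demand standard deviation)
high_margin = [(2.0, 0.6, 10.0), (2.5, 0.8, 20.0)]
low_margin = [(2.0, 1.4, 10.0), (1.8, 1.2, 15.0), (2.2, 1.5, 25.0)]
print([round(s, 3) for s in group_service_levels([high_margin, low_margin], target=0.70)])
```

With only one decision per group, the search space shrinks from one service level per product to one per group, which is the simplification the heuristic offers; the price is that differences between products within a group are ignored.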
Anchoring
Tversky and Kahneman (1974) observed that people who solve a decision task often start with an initial solution that is based on simple features and then adjust the solution toward the optimal solution. Because the final solution is often anchored on the initial solution and not adjusted all the way toward the optimal solution, the heuristic is referred to as the anchoring and insufficient adjustment heuristic.
The anchoring and insufficient adjustment heuristic has been used to explain ordering decisions in the newsvendor problem. A natural anchor in the expected cost minimization models that have been used in the literature is mean demand (e.g., Bolton & Katok, 2008; Bostian et al., 2008; Schweitzer & Cachon, 2000). However, unlike the newsvendor problems considered in previous behavioral research, the problem we consider has a service‐level constraint. The retailer regularly communicates the target service level to the manufacturer and informs him if he misses the target. Therefore, the target service level is a candidate for a natural anchor for the manufacturer's ordering decisions.
We model anchoring using an approach similar to that of Bostian et al. (2008) and introduce an anchoring factor
As a result, we conclude that increasing the anchoring factor
Inventory error minimization
The behavioral operations management literature has suggested inventory error preferences as a potential explanation for ordering behavior (Kremer et al., 2010, 2014). Ho et al. (2010) argue that psychological costs are associated with leftovers and stockouts and that the psychological aversion to leftovers is greater than the disutility for stockouts. This model is a generalization of the model used by Schweitzer and Cachon (2000), where the psychological underage and overage costs are the same.
We use a model similar to that of Ho et al. (2010) to analyze whether inventory error minimization can explain the manufacturer's ordering behavior. We denote the psychological cost associated with a unit of leftover inventory by
The model of Ho et al. (2010) has no service‐level constraint and inventory error minimization pulls orders toward the mean demand. In our setting, service‐level differentiation in the optimal solution is (besides demand uncertainty) driven by differences in unit revenues and unit costs of the product, and thus by differences in the underage and overage costs. To analyze the impact of psychological costs in our setting, it is important to consider that critical ratios that include psychological underage and overage costs vary less between products than critical ratios that do not include the psychological costs. Assuming the same demand variance between products, this results in more similar target service levels between products. Because of the aggregated service‐level constraint, this results in service levels that are pulled toward aggregated target service level
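A small numerical illustration of why psychological costs compress critical ratios is given below. It assumes that the psychological costs are simply added to the monetary underage and overage costs; the parameter names and values are hypothetical, and the two psychological costs are set equal only to keep the example simple.

```python
def critical_ratio(revenue, cost, psych_underage=0.0, psych_overage=0.0):
    """Critical ratio with optional psychological costs added to both cost terms."""
    underage = (revenue - cost) + psych_underage  # cost of one unit short
    overage = cost + psych_overage                # cost of one unit left over
    return underage / (underage + overage)


# Made-up products: (unit revenue, unit cost)
products = [(2.0, 0.6), (2.0, 1.4)]

plain = [critical_ratio(r, c) for r, c in products]
adjusted = [critical_ratio(r, c, psych_underage=1.0, psych_overage=1.0)
            for r, c in products]

print("spread without psychological costs:", round(max(plain) - min(plain), 3))
print("spread with psychological costs:   ", round(max(adjusted) - min(adjusted), 3))
```

In this example, the spread between the two critical ratios shrinks from about 0.4 to about 0.2 once the psychological costs are included, so the implied target service levels of the two products move closer together.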
Table 1 summarizes the behavioral models that we will test, and we will use the normative solution of Section 2.2 as a benchmark. Model 1 adds the first decision bias and considers behavioral forecasting, but optimal approaches for determining order quantities. As we will demonstrate, behavioral forecasting improves the model fit considerably, so we use it in all behavioral models. Models 2, 3, and 5 each add an inventory optimization bias to Model 1. Model 2 adds group aggregation, Model 3 adds anchoring, and Model 5 adds inventory error minimization. These models allow us to analyze the significance of individual optimization biases and their effects on the model fits. In Models 4 and 6, we analyze combinations of the biases. Model 4 uses group aggregation and anchoring and Model 6 uses group aggregation and inventory error minimization. Anchoring and inventory error minimization offer alternative explanations for ordering behavior. Because it is unclear how models that contain both decision biases can be interpreted, we do not analyze them.
Table 1. Overview of behavioral models analyzed
To summarize, we will analyze four behavioral factors. (1) Behavioral forecasting: We expect decision makers to put too much weight on recent demand realizations. This will result in biased forecasts with lower forecast accuracy and performance. (2) Group aggregation: Simplifying the task by grouping products together, rather than optimizing 23 different service levels, will lead to product clusters that have similar service levels. Service levels will differ between groups, but within‐group differentiation will be small. (3) Anchoring: In our setting, the decision makers need to ensure a certain target service level. Therefore, anchoring on mean demand is not feasible because it would lead to service levels that are too low. However, we expect decision makers to anchor on the overall target service level. That would lead to too little differentiation between products, with all products being pulled‐to‐target. (4) Inventory error minimization: Similar to other behavioral newsvendor studies, this factor is an alternative explanation for the expected pull‐to‐target effect. Adjusting actual cost by adding psychological costs reduces differences in product costs and, consequently, results in more similar target service levels.
EMPIRICAL DECISION ANALYSIS
After developing the analytical and behavioral models, we analyze the empirical decisions of the manufacturer in this section. We first describe the details of the case and then analyze the service levels achieved per store and per product to see whether the results are in line with the analytical model. Then, we analyze the behavioral models and their predictions and test which biases can be observed in the decisions of the empirical newsvendor.
Setting and data
The manufacturer has a product portfolio with 23 bakery items (breads, buns, rolls, pastries, etc.) that are sold at 66 stores of a retailer. The products have a shelf life of one day and the manufacturer replenishes the retailer's shelves every morning before the stores open. The manufacturer decides on the order quantities and must ensure that on average at least
The manufacturer is a family‐owned business with about 150 employees and more than 20 years of experience in the production, delivery, and inventory planning of perishable bakery products. The production quantity decision is made on the day before the items are delivered to the stores, and the items are then produced during the night. Early in the morning, the manufacturer delivers the items to the stores and picks up any leftover inventory from the previous day. The manufacturer's information system allows him to observe past sales for each product at each store. The data are then forwarded to the manufacturer's production department, which analyzes the data, tracks performance, and makes the production quantity decisions. The department consists of several employees who make these decisions, but it is neither tracked nor transparent to the retailer which of the manufacturer's employees made a given decision. All employees of the production department have several years of work experience in this field and mainly rely on their judgment when making production quantity decisions.
The retailer is one of the main customers of the manufacturer. Although there is no monetary penalty if the manufacturer fails to achieve the target service level, the manufacturer is aware that continuously underachieving it could risk losing the contract with the retailer. The service level can be tracked both by the manufacturer and the retailer. If the retailer observes repeated underachievement of the service level, they discuss the issue with the manufacturer and identify potential solutions.
We collected daily order quantities and hourly sales from November 15, 2010 to December 7, 2012. The stores were open from Monday through Saturday from 8 a.m. to 8 p.m. They were closed on public holidays, which affected sales on the day before and the day after a public holiday, so we excluded these days from our analyses. We also had to exclude two of the stores. One was used by the manufacturer to supply the workers of a nearby company and we could not separate the deliveries for the workers from the replenishment quantities of the store. The other had a bug in the data collection module of the information system, which meant we could not obtain reliable sales data from that store.
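Order quantities and sales of this kind suffice to compute a realized service level. The sketch below assumes that a product counts as in stock at the end of a day if its total sales on that day are below the quantity delivered; this operationalization and the function name are illustrative assumptions, and the numbers are made up.

```python
def in_stock_fraction(orders, sales):
    """Share of (product, day) records that end the day with leftover stock."""
    records = list(zip(orders, sales))
    # A record is in stock at the end of the day if fewer units were sold than delivered.
    in_stock = sum(1 for order, sold in records if sold < order)
    return in_stock / len(records)


orders = [20, 15, 30, 10]  # delivered quantities, one entry per product and day
sales = [18, 15, 25, 10]   # units sold on the same product-days
print(in_stock_fraction(orders, sales))
# -> 0.5
```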
The manufacturer's product assortment can be classified into three main product types—bread, rolls, and pastry. There are 11 different types of breads that differ by flour (e.g., wheat, rye, spelt), additional ingredients such as seeds, and other characteristics (e.g., organic, cut into slices, half/whole loaf). Additionally, the assortment consists of four types of rolls and eight different pastries. Of the 23 products in the portfolio, 16 are produced by the manufacturer and 7 are purchased from an external supplier by the manufacturer. We refer to these products as Make and Buy products, respectively, a segmentation that will be important in our subsequent analyses. The customer cannot distinguish between Make or Buy products because the packaging for all products is similar for products of the same type and does not differ by Make or Buy category.
A commonly used classification in inventory management is ABC analysis. Products are clustered into three categories (A, B, C) based on their contribution to total cost. The top 20% of products (i.e., the ones with the highest total cost) are classified as A products, the next 30% are classified as B products, and the last 50% are labeled as C products (Lysons & Farrington, 2006; Teunter et al., 2010). Applying ABC analysis to our setting results in 5 A products, 7 B products, and 11 C products. In empirical settings, A products often account for 80% of total cost, B products account for the next 15%, and C products only for the remaining 5%. This is different in our setting, where A products account for 42%, B products for 32%, and C products for 26% of total cost. This indicates that the classification is qualitatively comparable to other settings, but the differences between products are smaller in magnitude. Column 10 in Table 2 shows the classification for our setting. We see that this classification is different from the Make–Buy categorization. We will use both classifications in later analyses.
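The classification rule can be sketched as follows. The rounding of the class sizes is an assumption chosen for illustration; it happens to reproduce the 5, 7, and 11 products reported above for a portfolio of 23 products. The cost values are placeholders, not the manufacturer's data.

```python
def abc_classes(total_costs, share_a=0.2, share_b=0.3):
    """Label each product A, B, or C by its rank in total cost (highest first)."""
    n = len(total_costs)
    n_a = round(share_a * n)  # assumed rounding rule for the class sizes
    n_b = round(share_b * n)
    ranked = sorted(range(n), key=lambda i: total_costs[i], reverse=True)
    labels = [None] * n
    for rank, index in enumerate(ranked):
        if rank < n_a:
            labels[index] = "A"
        elif rank < n_a + n_b:
            labels[index] = "B"
        else:
            labels[index] = "C"
    return labels


placeholder_costs = list(range(1, 24))  # 23 products with distinct total costs
labels = abc_classes(placeholder_costs)
print(labels.count("A"), labels.count("B"), labels.count("C"))
# -> 5 7 11
```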
Table 2. Product characteristics
The characteristics of the products are summarized in Table 2. Mean and standard deviation of (estimated) demand are denoted by
Service levels by store
Figure 1 shows the average service levels that the manufacturer achieves in each store. The dashed line indicates the target service level of 70%. We observe that the actual average service levels are often close to the target service level. They range from 66.5% (Store 49) to 73.6% (Store 59), with an overall average of 69.3%. To test the differences between actual store service levels and target service levels, we use the Wilcoxon signed‐rank test because a test of normality of store service levels revealed a significant deviation from the normal distribution for 52 of 64 stores (Shapiro–Wilk test with

Figure 1. Average service levels by store (with 95% confidence intervals)
We conclude that the decisions are effective because the service levels are close to the target service level at the store level. To analyze whether the service levels are differentiated as suggested by the analytical model, that is, efficient, we next compare the actual with the optimal service levels at the product level.
Service‐level differentiation by product
In the optimal solution, the manufacturer considers demand uncertainties, unit revenues, and unit costs when making service‐level decisions (Theorem 2). Because the factors differ across products, the optimal service levels differ across products. The left graph in Figure 2 shows the average optimal service levels for all products and compares them with the average actual service levels.

Figure 2. Average actual versus optimal service levels in the product portfolio
We observe heterogeneity in the actual average service levels, which indicates that the manufacturer differentiates service levels by product. However, the correlation between the average actual and optimal service levels of 0.281 is significantly below 1 (
Determining the optimal service levels for 23 individual products is complex, and a simpler approach is to differentiate service levels by grouping products into categories. In discussions with the retailer and manufacturer, products were often categorized into Make and Buy products. Although customers cannot distinguish between the two categories, the retailer and manufacturer are aware of the differences in profitability. This is also reflected by Make products having a higher average critical ratio than Buy products (
The middle and right graphs in Figure 2 show the results. We observe a higher average actual service level for Make than for Buy products. Within each product category, the average actual and optimal service levels are not significantly correlated (Kendall's tau,
Summarizing the analyses of this section, we find evidence that the manufacturer has made effective but inefficient decisions. The manufacturer is essentially achieving the target service level, but the differentiation of the products is not optimal and rather focused on the distinction between Make and Buy products. Therefore, we will apply the grouping model (Section 3.2.1) to this special case of two groups. We will extend the analysis to other groupings in Section 4.5. In the next section, we will discuss the data in more detail and test our behavioral models for the manufacturer's decisions.
Evaluation of behavioral models
Before estimating the behavioral parameters and comparing the behavioral models, we will discuss some specifications and general insights regarding the two subtasks forecasting and inventory management.
Behavioral forecasting
The manufacturer faces autocorrelated demand and has not observed sales of the previous day when deciding the order quantity for the current day. For instance, when determining the order quantity for Friday, the manufacturer has not yet seen Thursday's demand and must rely on information from Wednesday and earlier days and weeks.
Figure 3 shows the autocorrelation of demand. It suggests that the manufacturer's best choice is to use demand information from the same weekday in previous weeks because the autocorrelation of the demand is the highest for a time lag of six days. Note that stores are closed on Sundays so that six days correspond to one week. The figure also shows that weekly autocorrelation is higher than daily autocorrelation (dotted bar), which has commonly been observed in grocery retailing environments (van Donselaar et al., 2006, 2010). Therefore, we will use a time lag of one week in the forecasting model. For notational convenience, we denote the current day by

Figure 3. Autocorrelation coefficients of demand for various time lags
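The lag analysis can be reproduced in a few lines. The sketch below computes sample autocorrelation coefficients for a synthetic demand series with a six-day week; the series is made up for illustration and is not the manufacturer's data.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation coefficient of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum((series[t] - mean) * (series[t - lag] - mean)
                     for t in range(lag, n))
    return covariance / variance


# Synthetic demand: a Monday-to-Saturday pattern plus a small deterministic disturbance.
weekday_pattern = [100, 90, 95, 110, 130, 160]
demand = [weekday_pattern[t % 6] + (t % 4) for t in range(120)]

for lag in range(1, 7):
    print(lag, round(autocorrelation(demand, lag), 3))

best_lag = max(range(1, 7), key=lambda lag: autocorrelation(demand, lag))
print("lag with highest autocorrelation:", best_lag)
```

Because the synthetic series repeats every six selling days, the coefficient is highest at a lag of six, which mirrors the weekly pattern described for the stores that are closed on Sundays.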
Behavioral inventory management—Group aggregation
We compare the actual service levels with the service levels of the analytical solution. For our grouping model (Section 3.2.1), we assume a Make and Buy grouping as discussed in Section 4.3. We find that service‐level differentiation between products is not as strong as predicted by the analytical model. Figure 4 illustrates the service levels of the two product groups for the optimal solution, the Make‐Buy group aggregation model, and the actual orders. We observe that the group aggregation model results in service levels that are closer to the actual service levels than those of the optimal solution. We note that this effect does not need any estimated parameter, but is the result of optimizing target service levels only for the two product groups.

Figure 4. Average service levels for Make and Buy products for optimal differentiation, group optimization, and average actual service level
However, Figure 4 also shows that there remains a gap between the service levels of the group aggregation model and the actual average service levels. This difference might be explained by anchoring (on target service level) and inventory error minimization and we will analyze the behavioral models in detail next. We note that anchoring on mean demand cannot explain the ordering behavior that we observe. If the manufacturer anchored order quantities on mean demand and insufficiently adjusted them toward the target service level, we would observe service levels between 50% and 70%. However, the manufacturer essentially achieves the target service level of 70%, which indicates that the manufacturer does not use mean demand as an anchor. This supports our modeling in Section 3.2.2.
Model estimation and evaluation
Before estimating the different behavioral models from Table 1, we first analyze how well the optimal decision model fits the empirical data. The rational model will serve as a reference point.
Column Optimal of Table 3 reports the fit of the normative solution, that is, the solution with a behavioral forecasting factor of
Table 3. Maximum‐likelihood estimation of behavioral model parameters and quality of fits
Abbreviations: BIC, Bayesian information criteria; MCS, model confidence set.
Values are the results of an optimization, not of a parameter estimation. Therefore, no significance can be reported.
Model with “x” is included in the MCS, for other models the p‐value for exclusion is reported.
To estimate the behavioral forecasting factor
Column Model 1 of Table 3 shows the results of the parameter estimation and the value of the likelihood. The behavioral forecasting factor with the highest likelihood is
Model 1 has a smaller BIC than the optimal model and we conclude that including behavioral forecasting explains the manufacturer's ordering behavior better than the optimal model without. The magnitude of the differences in the BICs is large and we will include behavioral forecasting in all other models. We cannot use the chi‐square test to compare our models because not all of them are nested. To compare all models analytically, we will use the model confidence set (MCS) at the end of this section (Hansen et al., 2011).
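For reference, the comparison criterion can be written out explicitly. The sketch below uses the standard definition of the BIC, the number of parameters times the log of the number of observations minus twice the maximized log-likelihood; the log-likelihoods, parameter counts, and sample size in the example are invented for illustration and are not the values in Table 3.

```python
import math


def bic(log_likelihood, n_parameters, n_observations):
    """Bayesian information criterion; smaller values indicate a better trade-off."""
    return n_parameters * math.log(n_observations) - 2.0 * log_likelihood


# Invented example: the second model adds one parameter and fits better.
bic_small = bic(log_likelihood=-5200.0, n_parameters=1, n_observations=1000)
bic_large = bic(log_likelihood=-5150.0, n_parameters=2, n_observations=1000)
print(round(bic_small, 1), round(bic_large, 1))
```

The penalty term grows with the number of parameters, so an additional behavioral parameter lowers the BIC only if it improves the likelihood by more than the penalty.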
Column Model 2 of Table 3 shows the results of the group model optimization. The optimal group service levels are the results of the optimization. The behavioral forecasting factor was determined by maximum‐likelihood estimation using the same approach that we used for estimating
In Models 3 and 4, the anchoring factor
To analyze whether group anchoring, which uses anchoring in addition to group aggregation, improves the fit of the model, we compare the BICs of Models 2 and 4. The BIC of Model 4 is 48 below that of Model 2, indicating that anchoring in addition to group aggregation explains actual orders better than without anchoring.
Inventory errors: To estimate the psychological underage and overage cost, we use an approach similar to the one for the behavioral forecasting parameter
Including group aggregation improves the fit further, which gives Model 6 the best fit of all the models that we analyzed. Incorporating the group aggregation bias, the value of the psychological costs decreases significantly and
We also analyzed whether Make and Buy products have different forecasting biases and estimated group‐specific
In total, we analyzed six behavioral models, and the results indicate that including behavioral forecasting and group aggregation is important for understanding the manufacturer's ordering behavior. The results also indicate that including anchoring or inventory error minimization in the models further improves the fit. These two factors result in actual service levels being pulled‐to‐target (i.e., Buy products being pulled upward, and Make products being pulled downward). However, compared with the model including behavioral forecasting and group aggregation (Model 2), the additional improvements obtained by including anchoring (Model 4) or inventory error minimization (Model 6) are comparable, and it is not obvious which model provides the best fit.
Selecting the model based on the BIC does not reveal the uncertainty of this selection (Hansen et al., 2011). To determine whether the differences in the model fits are significant, we use the MCS introduced by Hansen et al. (2011). The MCS conducts a sequence of hypothesis tests based on bootstrap samples and eliminates the models that are significantly outperformed at a given
The results of the MCS are also shown in Table 3. For our data, the MCS consists of a single model, Model 6. This model performs weakly significantly better than Model 4 (
Other grouping heuristics
In Section 3.2.1, we argued that the categorization into Make and Buy products is a natural differentiation of products for the manufacturer. Additionally, the analysis in the previous section indicated that this grouping heuristic fits actual decisions well. However, other groupings are possible, and we analyze several that seem reasonable.
Clustering the products by product type could be an appropriate categorization. Setting target service levels for breads (type 1), rolls (type 2), and pastries (type 3) would provide an alternative intuitive clustering. Although this requires three target service levels, the decision process is still significantly easier than determining 23 targets.
Management literature often uses ABC analysis to differentiate inventory policies for different products. Table 2 also shows the ABC classification for the 23 products in our assortment. Using this categorization, a grouping heuristic could optimize the target service levels for these categories.
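As an illustration of how an ABC classification can be derived, the sketch below ranks products by revenue and assigns classes by cumulative revenue share. The revenue-based ranking, the 80% and 95% cutoffs, and the product names are assumptions made for this example. The text does not state which criterion or thresholds underlie the classification in Table 2.

```python
def abc_classify(revenue_by_product, a_cut=0.80, b_cut=0.95):
    """Assign A/B/C classes by cumulative revenue share (assumed criterion).

    A product is classified by the cumulative share reached *before* it is
    added, so the top-ranked product always falls into class A.
    """
    total = sum(revenue_by_product.values())
    ranked = sorted(revenue_by_product.items(), key=lambda kv: kv[1], reverse=True)

    classes = {}
    cumulative = 0.0
    for product, revenue in ranked:
        share_before = cumulative / total
        if share_before < a_cut:
            classes[product] = "A"
        elif share_before < b_cut:
            classes[product] = "B"
        else:
            classes[product] = "C"
        cumulative += revenue
    return classes


# Hypothetical revenues for five products (illustration only).
example = {"p1": 50, "p2": 30, "p3": 10, "p4": 6, "p5": 4}
print(abc_classify(example))
```

A grouping heuristic would then set one target service level per class instead of one per product.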
A very basic alternative clustering would be to use only one group. This means that all products receive the same target service level. We refer to this as the “naive” approach because it uses the target service level (of 70%) for each of the products.
We conducted analyses comparable to those in Section 4.4.3 for the three alternative grouping models. We used the classifications to determine the optimal target service level and the resulting order quantities for each day in our data set. We then conducted a maximum‐likelihood estimation for these predictions on our data set. Table 4 shows the change in BIC when using the alternative groupings compared to the Make–Buy classification. We find that the alternative groupings do not explain the manufacturer's decisions as well as the Make–Buy grouping. Figure B.1 also compares the actual average product service levels with the predicted service levels achieved by the different grouping heuristics. The graphs show that predicted product service levels are closer to actual service levels for the Make–Buy clustering than for the other groupings analyzed.
Table 4. Comparing the fit of alternative grouping models
Figure 5. Determining the profit impact of the detected decision biases
Figure 6. How to improve performance: profit impact of different grouping heuristics compared with optimal differentiation (normalized to 100%)
Managerial implications—Impact on profit and potential recommendations
Our analyses indicate that the manufacturer's ordering decisions are affected by three biases: behavioral forecasting, group aggregation, and inventory error minimization. These biases are significant and explain actual ordering decisions better than the other biases or combinations of biases that we analyzed. However, from a managerial perspective, not only the significance of effects but also their monetary impact is important. Therefore, we evaluate the impact of the three different behavioral factors on the manufacturer's profitability.
We simulate the use of different decision models and calculate the resulting profit for our data set. We forecast demand for each product in each store, determine the resulting order quantity, and calculate the resulting profit based on actual demand. To calculate profits for the different models for our data set, we must estimate demand (given the unobservable lost sales) based on sales data. We estimate demand based on the approach of Lau and Lau (1996), which uses stockout times and hourly demand information from previous periods to estimate unobservable lost sales. Note that this approach is different from the one used in the analytical model because stockout times are not available to the manufacturer and therefore cannot be used in demand forecasting. However, an approach considering stockout timing provides more accurate demand estimates (Jain et al., 2015) and enables an accurate profit comparison between different analytical models and the manufacturer's decisions. Figure 5 shows the reduction of profit allocated to the different behavioral aspects. As a benchmark, we indexed the profit of the optimal solution at 100. This means that, with the data available to the manufacturer (historical number of units sold) and using nonbiased forecasting and optimal product differentiation, the manufacturer would achieve a profit of 100.
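The simulation logic described above can be sketched as follows. This is a simplified illustration, not the profit function used in the study: it assumes a perishable product with a constant selling price and unit cost, no salvage value, and no penalty costs. The function names and all numbers are hypothetical.

```python
def simulate_profit(order_quantities, demands, price, unit_cost):
    """Total profit over all periods for given orders and (estimated) demands.

    Assumption: unsold units are worthless and unmet demand is simply lost.
    """
    total = 0.0
    for q, d in zip(order_quantities, demands):
        sales = min(q, d)                 # cannot sell more than ordered or demanded
        total += price * sales - unit_cost * q
    return total


def index_profit(model_profit, optimal_profit):
    """Express a model's profit relative to the optimal benchmark (= 100)."""
    return 100.0 * model_profit / optimal_profit


# Hypothetical two-period example (illustration only).
demands = [8, 15]
benchmark = simulate_profit([8, 15], demands, price=2.0, unit_cost=1.0)
biased = simulate_profit([10, 12], demands, price=2.0, unit_cost=1.0)
print(f"Indexed profit of the biased orders: {index_profit(biased, benchmark):.1f}")
```

In this toy example the benchmark orders match demand exactly. In the study, the benchmark is the optimal solution based on the data available to the manufacturer, which does not know demand in advance.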
To estimate the impact of the different behavioral factors, we calculate the profits of partial models, including the different factors sequentially. We calculate the predicted order quantities using the behavioral models (with the estimated parameters in Table 3) and simulate the performance for our data set. Applying behavioral forecasting, but keeping the optimal differentiation, results in a profit decrease of 2.5% compared to the optimal model. When further including group aggregation on these biased forecasts, profits decrease by another 5.3%. The effect of inventory error minimization is small compared to the other two effects (only 0.1% profit loss). The results suggest that substantial profit gains can be achieved by reducing decision biases.
The total profit that the manufacturer actually achieved is 92.2. As a robustness check, and as a fair comparison with our partial models, we calculated the profits using order quantities resulting from the model containing all three decision biases (Model 6 in Table 3). The resulting average profits are 92.1, which is close to the actual profits. The results indicate a profit loss of 7.9% that can be attributed to the decision biases.
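The reported figures can be checked with simple arithmetic. The sketch below assumes that the three sequential losses are percentage points of the optimal profit (indexed at 100) and can therefore be added. This reading is an interpretation, but it is consistent with the numbers in the text.

```python
# Sequential profit losses reported in the text, read as percentage
# points of the optimal profit (an interpretive assumption).
losses = {
    "behavioral forecasting": 2.5,
    "group aggregation": 5.3,
    "inventory error minimization": 0.1,
}

total_loss = round(sum(losses.values()), 1)
model_profit = round(100.0 - total_loss, 1)

print(f"Total loss attributed to biases: {total_loss}")
print(f"Indexed profit of the full behavioral model: {model_profit}")
```

Under this reading, the full behavioral model lands at the 92.1 reported for Model 6, next to the 92.2 that the manufacturer actually achieved.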
Behavioral forecasting or including psychological inventory error costs does not help to reduce complexity. For instance, behavioral forecasting distorts demand forecasts without significantly reducing effort. In such cases, debiasing strategies might be applicable to improve performance. Decomposing forecasting and inventory decisions (Lee & Siemsen, 2017) or using multiple independent forecasters (Kremer et al., 2011) might reduce the forecasting bias and improve overall performance.
However, we have seen that group aggregation has a major profit impact. In general, grouping heuristics might simplify the decision tasks of the manufacturer, but they result in efficiency losses. To analyze the impact of such heuristics on profits, we simulate the performance of different grouping heuristics. Note that we used nonbiased forecasting for these analyses to isolate the impact of the grouping heuristics. Figure 6 shows the profit losses for four grouping models compared to optimal differentiation. Using a naive no‐differentiation approach (i.e., targeting 70% for all products) reduces profits by 10.1%, whereas Make–Buy grouping results in a profit loss of 5.3%. Therefore, adding this simple differentiation (Make vs. Buy) already leads to a substantial improvement in profits over naive optimization with rather limited additional effort (only two different target service levels). Increasing the number of clusters, for example, to three, decreases the efficiency loss further, but the marginal improvement decreases. Using the ABC‐analysis‐based clustering and the clustering by product type (bread, rolls, pastries) that were introduced in Section 4.5 results in profit losses of 3.3% and 3.9%, respectively. This means that the manufacturer could increase his profits by adding a third group. However, the marginal gains of adding another group decrease: the Make–Buy grouping alone recovers 4.8 of the 10.1 percentage points lost under the naive approach, whereas a third group recovers at most another 2.0 points. This implies that given the increasing complexity and the decreasing marginal benefits of adding groups, grouping heuristics might be considered ecologically rational (Gigerenzer & Todd, 2012), which is related to Simon's (1986) idea of satisficing. Adding more groups would increase profitability of the manufacturer, but (perceived) additional required effort might prevent the manufacturer from doing so.
Additionally, Chen and Li (2018) compare the performance of human decision makers when making a single decision versus multiple simultaneous decisions. They find that performance decreases when making multiple decisions. This indicates that increasing the number of groups from two to three might not result in the additional profit indicated in Figure 6 due to the increased complexity of the decision task.
DISCUSSION
Ordering behavior in newsvendor settings has been analyzed extensively in laboratory environments over the past two decades (Donohue et al., 2020). The experiments were usually conducted with students who entered orders in a computer over a short period of time to earn a moderate amount of money. In practice, experienced managers place orders for real products on a daily basis and their performance affects their incomes and their careers. Previous experiments therefore left unclear whether the decision biases observed in the laboratory were also present in practice.
In this paper, we address this issue by analyzing the ordering behavior of an actual manufacturer. The results of our analyses indicate that the decision biases that have been observed in laboratory experiments are also present at the manufacturer (e.g., behavioral forecasting and ex‐post inventory error minimization). We identified an additional bias: group aggregation. Although the manufacturer is prone to these biases, his decisions resulted in effective solutions with service levels that were close to the target service level of 70%. This result is of some interest in its own right. One of the most robust findings of the behavioral operations management literature is that decision makers choose order quantities that are pulled toward the expected demand (e.g., Bolton & Katok, 2008; Bostian et al., 2008; Schweitzer & Cachon, 2000). Translated to the manufacturer's situation, it suggests that the service levels are below the target service level and are pulled toward 50%. This, however, is not what we observed.
The main reason for this is the specific setting that the manufacturer faces. The manufacturer operates under a service level contract, whereas most laboratory experiments (showing the pull‐to‐center bias) use profit‐based contracts such as wholesale price or buyback contracts. As Bolton et al. (2016) show in lab experiments, decision makers achieve target service levels more effectively and more efficiently under a service‐level contract than using a wholesale price contract. Potential reasons are that the service‐level contract provides an anchor that the wholesale price contract does not provide and that the expected profit curve is steeper. Related to this, Lee and Siemsen (2017) find a strong performance increase when providing the optimal target service level in profit‐based environments such as the wholesale price contract setting. Although our setting does not include service‐level penalties (that are used in Bolton et al., 2016), the manufacturer still has an explicit service‐level constraint of 70% that he is not allowed to fall below. This results in overall average service levels that are not pulled‐to‐center, but rather pulled‐to‐target, which means that differentiation between products is not strong enough.
Looking at the efficiency loss of the decision maker, we find that the manufacturer incurs a profit loss of 5.3% compared to our analytical model. One might argue that the performance is actually not too bad compared to subjects in newsvendor lab experiments. However, we want to highlight three important aspects here. First, previous lab studies using single product newsvendor settings report a range of efficiencies between 80% (Bolton & Katok, 2008) and 89% (Bolton et al., 2012, for trained subjects) depending on experience and prior knowledge. We acknowledge that our empirical setting is more complex, but the decision maker is also much more experienced than subjects in the lab. Therefore, seeing higher efficiencies is not very surprising. Using a single‐product service‐level contract, Bolton et al. (2016) report efficiencies between 89% and 97.2%. This shows that service‐level contracts lead to higher performance also in the lab environment.
The second important aspect that needs to be considered when comparing our empirical results with previous lab data is that we have provided a model for optimizing order quantities for the multiproduct problem that the manufacturer faces. Like all analytical models, our model relies on a number of assumptions. We expect that more comprehensive models would improve profits further, but they are also much more complex. This would increase the efficiency loss of the decision maker compared to the optimal model. Lab studies compare actual decision making against the normative benchmark.
Third, the manufacturer is subject to self‐selection and market selection, whereas subjects in laboratory studies are typically selected on a first‐come, first‐served basis out of a pool of students looking for some short‐term financial benefit. Thus, the consequences of ordering suboptimally are quite different for students and for the manufacturer. If the manufacturer does not achieve the target service level, he loses business with the retailer and is replaced by another manufacturer. Therefore, it is not surprising that we observe a manufacturer who is achieving the target service level with a rather moderate efficiency loss. If efficiency had been far below optimum, other companies would probably have taken over the business already.
Having highlighted the differences between our empirical setting and existing lab studies, as well as the impact of different grouping heuristics, we acknowledge that it might be insightful to analyze decision making in this context in more detail in future lab studies. Multiproduct cases with differentiation between products have not been studied extensively. Such lab experiments could complement our findings and further improve the understanding of behavioral decision making in operations management. They might also allow researchers to analyze behavioral factors such as cognitive limitations, satisficing, or time pressure in more detail.
PROOFS
ADDITIONAL GRAPHS—PREDICTIVE FIT OF PRODUCT SERVICE LEVEL
ACKNOWLEDGMENTS
We thank the department editor Elena Katok, the senior editor, and two anonymous referees for their constructive comments to improve the paper. We also thank the German Research Foundation for financial support through the research unit “Design & Behavior” (FOR 1371) and Germany's Excellence Strategy—EXC 2126/1.
