Abstract
When using Bayesian hierarchical modeling, a popular approach for Item Response Theory (IRT) models, researchers typically face a tradeoff between the precision and accuracy of the item parameter estimates. Given the pooling principle and variance-dependent shrinkage, the expected behavior of Bayesian hierarchical IRT models is to deliver more precise but biased item parameter estimates, compared to those obtained in nonhierarchical models. Previous research, however, points out the possibility that, in the context of the two-parameter logistic IRT model, the aforementioned tradeoff does not have to be made. With a comprehensive simulation study, we provide an in-depth investigation into this possibility. The results show a superior performance, in terms of bias, RMSE and precision, of the hierarchical specifications compared to the nonhierarchical counterpart. Under certain conditions, the bias in the item parameter estimates is independent of the bias in the variance components. Moreover, we provide a bias correction procedure for item discrimination parameter estimates. In sum, we show that IRT models create a unique situation where the Bayesian hierarchical approach indeed yields parameter estimates that are not only more precise, but also more accurate, compared to nonhierarchical approaches. We discuss this beneficial behavior from both theoretical and applied points of view.
Keywords
Bayesian hierarchical modeling is a popular approach for Item Response Theory (IRT) models. These models are quite complex and, depending on sample size and test length, consist of many parameters of the same type (e.g., item discriminations and item difficulties, as well as person parameters), making them excellent candidates for Bayesian hierarchical specifications. Due to their hierarchical prior structure and the associated pooling process, which maximizes the information in a given dataset, Bayesian hierarchical models yield item parameter estimates that are more precise than those of their nonhierarchical counterparts (e.g., Katahira, 2016). This is typically reflected by narrower 95% highest density intervals (HDI) of the parameter estimates.
There is a tradeoff, however, because the increased precision (i.e., smaller standard error) is associated with a decreased accuracy (i.e., larger bias) of the parameter estimates. The pooling process depends on the variance of the individual item parameter estimates. To the extent their variance decreases, their estimates shrink towards their grand mean, that is, the mean of their hyperprior distribution (Efron & Morris, 1977). Thus, since individual item parameters always vary to some degree (Fox, 2010), their estimates obtained with Bayesian hierarchical models exhibit a certain amount of bias, proportional to the amount of shrinkage. Hence, the expected (typical) behavior of Bayesian hierarchical models is to deliver more precise but biased individual parameter estimates, compared to the parameter estimates obtained with their nonhierarchical counterparts.
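The variance-dependent shrinkage described above can be illustrated with a minimal sketch of a normal-normal model with known variances. This is a textbook simplification and not the hierarchical 2PL itself; the function name `shrunken_estimate` and all numeric values are hypothetical and chosen only for illustration.

```python
def shrunken_estimate(estimate, se, grand_mean, tau):
    """Posterior mean of one parameter under a normal-normal model.

    estimate   : unpooled estimate of the individual parameter
    se         : standard error of that estimate (assumed known)
    grand_mean : mean of the population (hyperprior) distribution
    tau        : standard deviation of the parameters around the grand mean
    """
    # Weight given to the parameter's own estimate; it approaches 0 as tau shrinks.
    weight = tau**2 / (tau**2 + se**2)
    return weight * estimate + (1.0 - weight) * grand_mean


# The same unpooled estimate (2.0) is pulled harder towards the grand mean (1.0)
# as the between-item standard deviation tau decreases.
for tau in (1.0, 0.5, 0.1):
    print(tau, round(shrunken_estimate(2.0, 0.5, 1.0, tau), 3))
```

In this sketch the pooled estimate moves from 1.8 to about 1.04 as tau drops from 1.0 to 0.1. If the true parameter really is 2.0, the bias grows with the amount of shrinkage, which is the tradeoff discussed in this paragraph.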
However, Koenig et al. (2020) found that their optimized hierarchical two-parameter logistic (OH2PL) IRT model for small-sample item calibration outperformed its nonhierarchical counterpart, especially in terms of bias of the item discrimination parameters. This was an interesting finding, because it contradicts the typical and theoretically expected behavior of Bayesian hierarchical models (larger bias of all parameters compared to nonhierarchical models).
It is possible, however, that applying the Bayesian hierarchical approach to IRT models creates a unique situation where there is no tradeoff between accuracy and precision. Reasons for this unique situation may relate to a combination of characteristics of the item parameters of IRT models with the current practice of Bayesian hierarchical modeling (i.e., current recommendations for model specifications and the specification of priors for variance components). Koenig et al. (2020) did not investigate this possibility further. Therefore, the objective of this paper is to investigate the question whether Bayesian hierarchical IRT models indeed behave differently than their general counterparts, in the sense that the aforementioned tradeoff between precision and accuracy does not have to be made in general, or whether the behavior is a consequence of the interplay between item parameter and model characteristics when applying the Bayesian hierarchical approach to IRT models. We further want to explore the specific reasons for this atypical, but beneficial behavior of Bayesian hierarchical IRT models.
In the following sections, we (1) illustrate the core characteristics and specification of the Bayesian H2PL, (2) outline the typical characteristics of parameters in IRT contexts and priors of current Bayesian hierarchical IRT models, and (3) describe our comprehensive simulation study. We then (4) present the results of our simulation and discuss them in relation to their benefits for accurate item calibration in small samples, computerized adaptive testing (CAT), and Bayesian hierarchical IRT modeling in general. Scripts to replicate this simulation, our data, and results are available as an online supplement at https://osf.io/ybk2f/ (Jackman, 2009)
Pooling, Shrinkage, and Bias in the Context of the Hierarchical 2PL Item Response Theory Model
Suppose a sample of
Both discrimination and difficulty parameters are item-specific. Thus, in a test of K items there are
A common implementation of the hierarchical 2PL IRT model is as follows. For the abilities
Level 1:
Level 2:
In this hierarchical structure, the individual item parameters share an inherent dependency with their respective grand means (Betancourt & Girolami, 2015). This dependency maximizes the information available for the estimation of the individual item parameters. The increase in information leads to an increased precision of the individual parameter estimates, that is, narrower 95% HDIs or smaller standard errors.
Another consequence of the dependency of individual item parameters and their grand mean is that, for instance, an item discrimination parameter
The amount of bias introduced into the estimation of the individual item parameters depends on the relation of the true variance
Typical Characteristics of Parameters and Current Specifications of Bayesian Hierarchical Item Response Theory Models
To derive possible explanations of the atypically better performance of the Bayesian H2PL model compared to its nonhierarchical counterpart (as noted by Koenig et al., 2020), we have to consider the typical characteristics of the item parameters, as well as processes in the context of the current practice of specifying Bayesian hierarchical IRT models.
First, the item discrimination and difficulty parameters are known to fall in a relatively narrow range. The item discriminations, for example, typically fall in the interval
Second, current specifications of Bayesian H2PL models are tailored towards avoiding bias in both the individual parameter estimates
Lastly, current specifications of Bayesian H2PL models are noncentered. In noncentered specifications, the first level of the model consists of a standard normal prior distribution for the abilities
Level 1:
Level 2:
In sum, favorable and optimized model specifications that avoid bias in estimated variance components
Purpose of the Study
Consequently, the primary purpose of this paper is an in-depth investigation of the question whether the curious behavior of the Bayesian H2PL is (a) an indication of a generally different behavior of Bayesian hierarchical IRT models, compared to that of their general counterparts, or (b) a consequence of the interplay between item parameter and model characteristics as outlined above. Moreover, we aim to provide insights regarding the specific reasons for this curious behavior in the context of the Bayesian H2PL.
Therefore, we follow a two-step approach answering two primary research questions. First, we investigate whether the hierarchical specifications of the 2PL, namely, the optimized Bayesian hierarchical 2PL (OH2PL; Koenig et al., 2020) and the standard Inverse Wishart specification (SH2PL), yield less biased item parameter estimates than their nonhierarchical counterpart. We chose the OH2PL and the SH2PL as examples of current approaches to Bayesian hierarchical IRT modeling (e.g., Gilholm et al., 2021). In this step, we compare the performance (relative and absolute bias, root mean squared error [RMSE]) of the hierarchical specifications of the 2PL with different specifications of the nonhierarchical 2PL model to check whether the advantages are robust across a broad range of data conditions. We also look at the widths of the 95% HDIs of the resulting item parameter estimates across model specifications to assess the precision of the estimates. Second, we take a closer look at the relation of the relative and the absolute bias in the individual parameter estimates
As a further contribution to the literature, we seek to clarify whether there is a critical value of the true variance components
Moreover, we present a bias correction procedure for individual item discrimination parameter estimates in cases in which they are biased because of their variance component being underestimated. As mentioned before, bias should be more pronounced when the true variance is underestimated due to larger unintended shrinkage. Such a procedure constitutes another important contribution to optimize the Bayesian H2PL model further, especially for its use in small-sample situations.
Method
Simulation Design
The fully crossed design of the study consisted of the following factors. (1) The variance in the item discriminations
Nonhierarchical Specifications of the 2PL Model
The prior configurations of the nonhierarchical models were chosen to keep the different model specifications comparable, and reflect prior configurations common in IRT modeling (e.g., Levy & Mislevy, 2016). In all model specifications, the ability parameters were given a standard normal prior
The nonhierarchical Bayesian specifications of the 2PL model only have a single level consisting of prior distributions for the individual item parameters. Because Koenig et al. (2020) found differences in the performance (compared to the hierarchical specification) to be specific to the item discrimination parameter, the specifications differ primarily in the prior distribution for the individual discrimination parameters [NH2PL1, NH2PL2, NH2PL3, respectively:
Data Generation and Analysis
Data were generated under a unidimensional 2PL with correlated item parameters. To obtain realistic item discrimination and difficulty parameters (
We used Stan (Carpenter et al., 2017) and the R interface RStan (Stan Development Team, 2020) to estimate the hierarchical and nonhierarchical models. Stan employs the No-U-Turn-Sampler (NUTS; Hoffman & Gelman, 2014), which is an adaptive variant of Hamiltonian Monte Carlo (HMC). In HMC, Hamiltonian systems are simulated to sample from target distributions (Neal, 2011). By introducing the momentum as an auxiliary variable, HMC is able to utilize the local geometry of the target distribution in order to traverse the posterior density more efficiently (Gelman et al., 2014). This usually requires, however, hand-tuning of key parameters of the standard HMC algorithm. The No-U-Turn-Sampler implemented in Stan eliminates this requirement by adaptively tuning the necessary parameters. Thus, applied researchers can focus on the model specification, and not on the setup of the MCMC algorithm (Annis et al., 2017). For more details about the standard HMC algorithm and its adaptive variant, interested readers are referred to Hoffman and Gelman (2014), where both algorithms are illustrated in great detail. Three chains with 3000 draws each (1000 burn-in cycles) were set up. Moreover, different random starting values were supplied to each chain. Convergence was assumed when the R-hat diagnostic was smaller than 1.05 (Vehtari et al., 2021). For a comprehensive overview of the frequency of non-convergent solutions across model specifications, see Supplement 1. Non-convergent solutions were excluded from further analysis.
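To make the R-hat convergence criterion more tangible, the following minimal sketch computes a classical split R-hat from raw draws. It is a simplification: Vehtari et al. (2021) recommend a rank-normalized version, which Stan reports and which is not reproduced here. The function name `split_rhat` and the simulated chains are hypothetical; the chain length of 2000 mirrors the retained draws per chain only for illustration.

```python
import math
import random
import statistics


def split_rhat(chains):
    """Classical split R-hat for a list of equally long chains of draws."""
    half = len(chains[0]) // 2
    # Split every chain into halves so that drift within a chain is also detected.
    halves = [c[:half] for c in chains] + [c[half:2 * half] for c in chains]
    within = statistics.mean(statistics.variance(h) for h in halves)
    between = half * statistics.variance([statistics.mean(h) for h in halves])
    # Pooled estimate of the posterior variance.
    var_plus = (half - 1) / half * within + between / half
    return math.sqrt(var_plus / within)


rng = random.Random(1)
# Three chains that explore the same distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
# Three chains of which one is stuck in a different region.
stuck = [[rng.gauss(shift, 1.0) for _ in range(2000)] for shift in (0.0, 0.0, 3.0)]

print(split_rhat(mixed) < 1.05)  # well-mixed chains pass the 1.05 threshold
print(split_rhat(stuck) < 1.05)  # the stuck chain inflates R-hat above 1.05
```

Under this simplified diagnostic, a solution such as `stuck` would count as non-convergent and be excluded.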
Evaluation Criteria
To test the aforementioned assumptions, we calculated the average raw bias
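The evaluation criteria can be sketched as follows. The definitions below are the conventional ones (mean signed error, mean absolute error, and root mean squared error across items) and are assumptions, since the exact formulas may differ in detail from those used in the study; function names and values are hypothetical.

```python
import math


def raw_bias(estimates, truths):
    """Mean signed difference between estimates and true values."""
    return sum(e - t for e, t in zip(estimates, truths)) / len(truths)


def absolute_bias(estimates, truths):
    """Mean absolute difference between estimates and true values."""
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(truths)


def rmse(estimates, truths):
    """Root mean squared error of the estimates."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths))


# Hypothetical discriminations that are shrunk towards their mean of 1.0.
true_alpha = [0.5, 1.0, 1.5]
shrunk_alpha = [0.7, 1.0, 1.3]

print(round(raw_bias(shrunk_alpha, true_alpha), 3))
print(round(absolute_bias(shrunk_alpha, true_alpha), 3))
print(round(rmse(shrunk_alpha, true_alpha), 3))
```

In this toy example the signed errors of the two marginal items cancel, so the raw bias is essentially zero while the absolute bias is about 0.13. This is why both criteria are worth reporting when shrinkage towards a grand mean is involved.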
Results
Hierarchical Specifications Consistently Outperform the Nonhierarchical 2PL
Figure 1 shows the bias of the item discrimination parameter estimates of both specifications of the Bayesian H2PL in comparison with their nonhierarchical counterpart, across all simulation conditions. It becomes evident that the performance of both the OH2PL and the SH2PL was better (and never worse) than the performance of the different specifications of the nonhierarchical 2PL. This general pattern also held when investigating the absolute bias of the item discrimination parameters (Figure 2). The advantages of the hierarchical specifications were especially pronounced with
Figure 1. Raw bias in item discrimination parameter estimates across simulation conditions.
Figure 2. Absolute bias in item discrimination parameter estimates across simulation conditions.
Figure 3. RMSE in item discrimination parameter estimates across simulation conditions.
Figure 4. 95% HDI of the item discrimination parameter estimates across simulation conditions.



Bias in Item Parameters Partly Independent from Bias in Variance Components
Correlations Between the Bias in the Estimates of the Variance Components and the Relative and Absolute Bias in the Item Parameter Estimates Across Sample Sizes and Test Lengths.
Hence, the second condition supporting an atypical behavior of Bayesian hierarchical IRT models (namely,
Bias in Individual Item Parameter Estimates and the True Variance Components
The violin plots in Figure 5 illustrate the change in relative bias in the individual item parameter estimates along increasing true values of the associated variance components
Figure 5. Relative Bias in the Item Parameter Estimates by True Variance Components. Note. Left panel: Item discrimination parameters. Right panel: Item difficulty parameters. OH2PL with solid lines, SH2PL with dotted lines. The dashed lines indicate the interval [–.1, +.1].
From Figure 5 (right panel) we learn that the relative bias in the item difficulty parameter
A potential reason for this increase can be found when looking at the bias in the variance component
Figure 6. Bias of Estimated Variance Component
To investigate this potential explanation further, we ran two four-way ANOVAs with the relative and absolute bias as dependent variables and specification, sample size, test length, and true variance component
Mean Bias Across Specification, Sample Size, Test Length, and Variance Components.
Note. N = Sample Size. K = Test Length; Small =
Thus, we may summarize that the third condition for confirming an atypical behavior of Bayesian hierarchical IRT models was also only partially fulfilled. This applied to both hierarchical specifications of the 2PL.
A Bias Correction Procedure
Interestingly, we found a relationship between the bias in the item discrimination parameter estimates and the bias in the variance component
Figure 7. Bias in
At first sight, a linear trend seemed to apply, but a closer look based on distribution-free measures such as the series of boxplots (Figure 7, top row) revealed a certain non-linearity, that is, a logarithmic kind of association.
Because we knew the true values of all parameters in our simulation study, we could calculate the actual bias of both
However, in real-life applications the bias of
Discussion and Conclusion
Our goal in this study was to provide an in-depth investigation of the question whether Bayesian hierarchical IRT models behave differently than their general counterparts in terms of the accuracy of the individual parameter estimates. We found (1) the Bayesian hierarchical specifications of the 2PL to yield individual parameter estimates consistently less biased compared to their nonhierarchical counterpart (especially in smaller samples), and (2) the bias in the individual item parameter estimates being partly independent from the bias in their associated true variance components. However, as shown by the relation between the bias in the individual discrimination parameter estimates and their true variance components, both are independent only when
Thus, from a theoretical point of view, the results of this study indicate that the connection between variance, shrinkage, and bias, a common characteristic of Bayesian hierarchical models (e.g., Rouder et al., 2017), albeit not completely absent, is not that pronounced in Bayesian hierarchical IRT models. In other words, shrinkage of the individual estimates towards their respective grand means does not lead, on average, to a marked increase in bias in the individual item parameter estimates. The difference between the results regarding the relative and absolute bias can be explained by the fact that only the latter explicitly captures the bias of discrimination parameters on the margins of the parameter distribution. Interestingly, even in terms of absolute bias the advantage of the hierarchical specifications over their nonhierarchical counterparts remains. Thus, while the behavior is at its core not different from general hierarchical models, the Bayesian hierarchical specifications of the 2PL provide a means to overcome the tradeoff between precision and accuracy of the individual item parameter estimates.
What does this rather theoretical finding mean for applied educational and psychological measurement? In the following, we briefly outline three consequences resulting from our finding that are relevant for applied IRT modeling.
First, using hierarchical Bayesian approaches for item calibration reduces item calibration error, one of the primary sources of biased ability estimates in computerized adaptive testing (CAT; e.g., Frey, 2023). More specifically, with the hierarchical Bayesian approach it is possible to avoid capitalization on chance in item selection due to spuriously large discrimination parameters (Patton et al., 2013). Given shrinkage, the overestimation of the item discrimination parameter is less likely to occur. As shown in this paper, the shrinkage associated with the item discrimination parameters does not lead to markedly biased parameter estimates in typical conditions (i.e.,
Second, consequently, using the hierarchical Bayesian approach is likely to avoid capitalization on item calibration error by the maximum information criterion in CAT (Patton et al., 2013). Since the item discrimination parameter plays a dominant role, unbiased parameter estimates are crucial for an accurate calculation of the Fisher information. Thus, the hierarchical Bayesian approach combined with the bias correction procedure outlined in this paper directly contributes to a more accurate calculation of the information contained in an item bank, especially in small samples. This translates into advantages regarding ability estimates, as shown by Wagner et al. (2022). Typically, item calibration error is largest when calibration samples are small; as shown in this paper, however, smaller sample sizes are not associated with larger calibration errors when utilizing the hierarchical Bayesian approach. This in turn leads to more flexibility when it comes to the calibration of new item banks with small samples, for example, when using continuous calibration methods (e.g., Fink et al., 2018).
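The dominant role of the discrimination parameter can be made concrete with the standard item information function of the 2PL, I(θ) = a²P(θ)[1 − P(θ)]. The sketch below assumes the common slope-difficulty parameterization P(θ) = 1/(1 + exp(−a(θ − b))), which may differ from the parameterization used in this study; the function names and numeric values are hypothetical.

```python
import math


def prob_correct(theta, a, b):
    """Probability of a correct response under the 2PL (assumed parameterization)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = prob_correct(theta, a, b)
    return a * a * p * (1.0 - p)


# At theta == b the information equals a**2 / 4, so an error in the
# discrimination enters the information function quadratically.
true_info = item_information(0.0, 1.0, 0.0)
inflated_info = item_information(0.0, 1.2, 0.0)
print(round(true_info, 3), round(inflated_info, 3))
```

In this sketch a discrimination that is overestimated by 20% (1.2 instead of 1.0) inflates the information at the item's difficulty by 44%, which is the kind of spurious advantage a maximum information criterion would capitalize on.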
Third, the benefits of using the hierarchical Bayesian approach are relatively independent of the specification of its prior structure. The advantages of the OH2PL over the standard Inverse Wishart specifications still exist, but they are small: both overestimate
Our bias correction procedure is easy to apply. The prediction model for
Taken together, the results of this study show that the curious behavior of the hierarchical Bayesian approach can be utilized to improve the accuracy and precision of the resulting item parameter estimates, not only in the context of the 2PL model, but also more complex models such as the GPCM. This in turn is beneficial for the precision of ability estimation and renders it especially appealing for situations where test information is crucial. Moreover, the hierarchical Bayesian approach facilitates applications of IRT models in situations that would not be feasible with alternative methods, for example, when recruiting large calibration samples is not possible (e.g., university exams or in clinical contexts).
To conclude, we could show that the characteristics of parameters typically found in applications of IRT models in combination with Bayesian hierarchical modeling indeed create a unique situation where the resulting item parameter estimates are not only more precise, but also more accurate, compared to nonhierarchical approaches. The contributions of our simulation study can serve as a reference for applied researchers on when and how to use Bayesian hierarchical approaches in IRT modeling without having to worry about potentially biased item parameter estimates. This should be appealing for a wide range of psychometric applications and psychological research.
Supplemental Material
Supplemental Material - Benefits of the Curious Behavior of Bayesian Hierarchical Item Response Theory Models—An in-Depth Investigation and Bias Correction
Supplemental Material for Benefits of the Curious Behavior of Bayesian Hierarchical Item Response Theory Models—An in-Depth Investigation and Bias Correction by Christoph König and Rainer W. Alexandrowicz in Applied Psychological Measurement.
Footnotes
Acknowledgments
We would like to thank the Editor in Chief Dr John R. Donoghue and the anonymous reviewers for their valuable, constructive and helpful comments on our manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
References
