Abstract
Thurstonian item response theory (Thurstonian IRT) is a well-established approach to latent trait estimation with forced choice data of arbitrary block lengths. In the forced choice format, test takers rank statements within each block. This rank is coded with binary variables. Since each rank is awarded exactly once per block, stochastic dependencies arise: for example, when options A and B have ranks 1 and 3, option C must have rank 2 in a block of length 3. Although the original implementation of the Thurstonian IRT model can recover parameters well, it is not completely true to the mathematical model and Thurstone’s law of comparative judgment, as impossible binary answer patterns have a positive probability. We refer to this problem as stochastic dependencies; it is due to unconstrained item intercepts. In addition, there are redundant binary comparisons resulting in what we call logical dependencies, for example, if within a block
Introduction
Personality questionnaires are ubiquitous in most areas of psychological assessment and education. Constructs like personality and motivation can explain variance in achievement beyond ability. However, accurate scoring of responses to personality questionnaires is necessary for decisions based on them to be reliable and fair. The multidimensional forced-choice (MFC) format has become popular as a response format in personality questionnaires. In the MFC format, participants have to rank order items measuring different attributes. The MFC format avoids response biases such as halo effects (Brown et al., 2017) or extreme response style (Brown & Maydeu-Olivares, 2018). Furthermore, it reduces faking compared to rating scales (Cao & Drasgow, 2019; Wetzel et al., 2021).
The scoring of MFC data according to classical test theory results in ipsative scores. Ipsative scoring distorts correlation-based analyses (Clemans, 1966) and the scores should not be compared between persons (Closs, 1996; Johnson et al., 1988). Normative scoring of MFC data has become possible with advances in the computation of item response theory (IRT) models. Brown (2016) gives an overview of IRT models for forced-choice response formats. The most popular IRT model for MFC data is the Thurstonian IRT (T-IRT) model (Brown & Maydeu-Olivares, 2011). Thurstonian IRT scoring of MFC data results in normative scores (Brown & Maydeu-Olivares, 2013; Frick et al., 2023). Moreover, the Thurstonian IRT model is the most widely applicable IRT model for MFC data since it can accommodate various response formats and instructions (Brown & Maydeu-Olivares, 2018). To estimate the Thurstonian IRT model, the rankings are re-coded into pairwise item comparisons. Each possible combination of two items is considered once. As an example, this is illustrated for a block of length 3 with the realized ranking
In the original implementation (Brown & Maydeu-Olivares, 2011, 2012), the intercepts for the pairwise comparisons are unconstrained, that is, a set of item intercepts that comply with them does not have to exist. However, since the data are indeed item rankings, they imply stochastic dependencies that this model specification does not account for, as we will show in the following.
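The requirement that pairwise intercepts "comply with" a set of item intercepts can be made concrete with a small check. The following is a minimal sketch, not the authors' code: it assumes, purely for illustration, that each pairwise intercept is the difference of two item intercepts, written here as `gamma[(i, k)] = mu[i] - mu[k]` (the opposite sign convention leads to the same check). The function name and the numeric values are hypothetical.

```python
from itertools import combinations


def intercepts_are_consistent(gamma, n_items, tol=1e-9):
    """Return True if a set of item intercepts reproducing `gamma` exists.

    `gamma` maps each item pair (i, k) with i < k to the intercept of that
    binary comparison. Assumed (hypothetical) convention:
    gamma[(i, k)] = mu[i] - mu[k].
    """
    # Fix mu[0] = 0 for identification; the comparisons involving item 0
    # then determine every other item intercept.
    mu = [0.0] + [-gamma[(0, k)] for k in range(1, n_items)]
    # All remaining pairwise intercepts must be reproduced by these values.
    return all(
        abs(gamma[(i, k)] - (mu[i] - mu[k])) <= tol
        for i, k in combinations(range(n_items), 2)
    )


# Block of length 3: consistency requires gamma_02 = gamma_01 + gamma_12.
constrained = {(0, 1): 0.5, (1, 2): -0.2, (0, 2): 0.3}
unconstrained = {(0, 1): 0.5, (1, 2): -0.2, (0, 2): 1.0}

print(intercepts_are_consistent(constrained, 3))
print(intercepts_are_consistent(unconstrained, 3))
```

In a block of length 3 the three pairwise intercepts are free parameters in the unconstrained specification, whereas the constrained version leaves only two of them free.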
In addition, in the original implementation, the person parameters were estimated with maximum likelihood or maximum a posteriori estimation. This person parameter estimation is based on the product of dependent normal distributions (for block sizes
This research aims to investigate the effect of considering logical dependencies, both analytically (for block size 3) and in a simulation (for block sizes 3 and 4). We show analytically that logical dependencies do not play a role as long as stochastic dependencies are considered. We propose a new Bayesian implementation of the Thurstonian IRT model using a multivariate distribution. In the following, we will first present the Thurstonian IRT model, then explicate our definitions of stochastic and logical dependencies and give an intuition of the related proof. The full proof can be found in the Appendix. Afterward, we will present a simulation that compares three implementations: considering both types of dependencies, neglecting logical ones, and the original that neglects both types of dependencies. We will end with a discussion on the implications of our investigation for item and person parameter estimation in practice.
The Thurstonian IRT Model
The Thurstonian IRT model is based on the law of comparative judgment (Thurstone, 1927), which states that test takers compare statements pairwise when ranking multiple options. This justifies viewing a person’s ranking as a series of binary item comparisons. It is assumed that the rank of each statement within a block is determined by the test taker’s latent utility toward that statement. In addition to statement-specific characteristics, a test taker’s utility toward a statement is influenced by latent traits
where
Note that there is no further error term as the ranking is considered to be transitive. The probability of test taker
where
To represent the full response pattern in a block of length
When modeling multiple
where
and since the covariance matrix
Dependencies in the Thurstonian IRT Model
The Thurstonian IRT model utilizes that each ranking can be unequivocally decoded into a pattern of binary comparisons. The other way around, this does not hold true: By transitivity, out of all
Stochastic Dependencies
In the T-IRT model, intercepts
The comparison with
holds true. While this is possible, if
Logical Dependencies
Another concern is the fact that the information between considered binary comparisons is partially redundant which results in logical dependencies. To circumvent these dependencies, one can take only neighboring ranking comparisons into account, which results in
For an intuitive argument, assume
If stochastic dependencies are considered, the Dirac measure is always one and the density of the subset of utility differences is identical to the density of all utility differences. Hence, the density of
More so, this subset is sufficient to derive the full answer pattern. Why would a different subset of size
Conditional on other parameters, the answer pattern probabilities are products of the probabilities of single binary comparisons. The proposed estimation process directly uses the multivariate structure of the binary comparisons, accounting for logical dependencies in the process. This leads to a more economical use of data. While this could make the estimation process more efficient, it does not affect the estimated model, as shown in the Appendix.
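The claim that the neighboring-rank comparisons are sufficient to derive the full answer pattern can be checked with a short sketch. It is an illustration only, not the estimation code: the function names are hypothetical, and the coding convention (outcome 1 if the first item of a pair is preferred, rank 1 = most preferred) is an assumption.

```python
from itertools import combinations


def all_pairs(ranks):
    """Code a ranking as binary outcomes for every item pair (i, k), i < k.

    ranks[i] is the rank of item i (1 = most preferred); the outcome is 1
    if item i is preferred over item k (assumed convention).
    """
    items = range(len(ranks))
    return {(i, k): int(ranks[i] < ranks[k]) for i, k in combinations(items, 2)}


def neighbor_pairs(ranks):
    """Keep only comparisons of items with adjacent ranks.

    Each tuple is (preferred item, next-preferred item), so a block of
    length n yields n - 1 comparisons.
    """
    order = sorted(range(len(ranks)), key=lambda i: ranks[i])
    return list(zip(order, order[1:]))


def decode(neighbors):
    """Rebuild the full ranking from the chain of neighboring comparisons."""
    successor = dict(neighbors)
    # The most preferred item is the only one that never appears as a loser.
    start = (set(successor) - set(successor.values())).pop()
    order = [start]
    while order[-1] in successor:
        order.append(successor[order[-1]])
    ranks = [0] * len(order)
    for rank, item in enumerate(order, start=1):
        ranks[item] = rank
    return ranks


ranks = [1, 3, 2]  # A has rank 1, B rank 3, C rank 2
print(all_pairs(ranks))
print(neighbor_pairs(ranks))
print(decode(neighbor_pairs(ranks)))
```

Because `decode` recovers the ranking, applying `all_pairs` to its result reproduces every binary comparison, including those not contained in the neighboring subset.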
To illustrate the theoretical implications of considering stochastic and logical dependencies, three path diagrams consisting of two blocks of length 3 are displayed in Figure 1. The left diagram shows the original T-IRT model, the one in the middle shows the version considering only stochastic dependencies, and the right one shows the version considering both stochastic and logical dependencies, shown as an example for the realized pattern

Path diagrams for the original T-IRT model (left), the version considering only stochastic dependencies (middle) and the version considering both stochastic and logical dependencies (right), for the answer pattern
Simulation Study
To compare the performance of these three T-IRT implementations, original, stochastic dependencies, and stochastic & logical dependencies, a simulation study is performed. To gain insight into the performance depending on the conditions specified by the test constructor, we vary the block length, the test length, and the sample size in two settings each. The block length
with
The simulation study and its evaluation were carried out using the software R (R Core Team, 2024) version 4.2.2. All versions of the Thurstonian IRT model were implemented in Stan (Stan Development Team, 2024). The interface between R and Stan was established through the rstan package (Stan Development Team, 2022) version 2.26.1. The package mvtnorm (Genz et al., 2024) version 1.1-3 was used for data simulation. The microbenchmark function from the package microbenchmark (Mersmann, 2024) version 1.4.10 was used to evaluate the computational time of each implementation. In addition, the package ggplot2 (Wickham, 2016) version 3.4.2 was used for visualization. The R code for the simulation as well as the Stan code of the three Thurstonian IRT model implementations are available on OSF https://osf.io/8fndw/.
Simulation Results
We compare deviations between estimates to evaluate the performance of the three implementations based on their parameter estimates. Figure 2 illustrates the differences between parameter estimates alongside the respective Monte-Carlo standard errors (MCSEs). The parameter estimates from the implementations stochastic dep. and stochastic & logical dep. are more similar to each other than either is to the estimates from the original implementation. This is expected, as we showed for block length 3 that these two implementations are equivalent. Differences between them are within or even below the MCSE. This strongly indicates that the equivalence also holds for block lengths larger than 3. The size of the difference between these two implementations and the original implementation depends on the simulation settings and the parameter group. For the factor loadings

Average absolute difference between estimates of the estimation methods (dots) and the average MCSEs (lines) per setting and replication.
Generally, differences between implementations decrease with increasing information, in the form of sample size and test length. When increasing the block length, the differences between the original implementation and the others increase. This behavior has two possible explanations. First, with increasing block length, dependencies increase, therefore ignoring them leads to a larger mismatch between the estimates. Second, since the number of full comparisons is held constant, the information contained in the test is smaller for larger block lengths, and therefore differences become larger.
The MCSE for the implementation stochastic & logical dep. is lower than for the other two implementations. This indicates that sampling is more efficient when considering stochastic and logical dependencies. This can also be seen when investigating convergence. While, according to the effective sample size and
Average Run Times in Hours and Their Standard Deviations Over 50 Replications per Simulation Setting.
Across settings, the required computation time is smaller by at least a factor of 10 in the implementation stochastic & logical dep. than in the other two implementations. The difference between those two is much smaller; however, the implementation stochastic dep. is always slightly faster than the original implementation. In addition, the ratio between implementations is stable for both sample sizes. This indicates that the larger the sample gets, the greater the added value of the implementation stochastic & logical dep. Unsurprisingly, the longer the test, the higher the required computation time.
The computation time changes only mildly with the block length. However, an interesting phenomenon is observable. While computation time mildly increases with increasing block length for the two implementations stochastic dep. and original, it decreases for the implementation stochastic & logical dep. This behavior can be explained by the simulation setup. A larger block length leads to more dependencies between the statement comparisons if all binary comparisons are considered, which increases the computation time for those implementations. However, since the total number of binary comparisons is held constant, the number of independent comparisons decreases with block length, resulting in a faster computation time for the implementation stochastic & logical dep.
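The counting argument behind this explanation can be sketched as follows. The total of 36 binary comparisons is a hypothetical value chosen so that it divides evenly for both block lengths; it is not taken from the simulation design. A block of length n contributes n(n-1)/2 pairwise comparisons but only n-1 neighboring ones.

```python
def comparison_counts(block_length, total_pairwise):
    """Count blocks and neighboring comparisons for a fixed number of
    pairwise comparisons.

    `total_pairwise` is an illustrative, hypothetical test size.
    """
    per_block = block_length * (block_length - 1) // 2
    if total_pairwise % per_block != 0:
        raise ValueError("total_pairwise must be a multiple of n(n-1)/2")
    n_blocks = total_pairwise // per_block
    return {
        "blocks": n_blocks,
        "all_pairs": total_pairwise,
        "neighboring": n_blocks * (block_length - 1),
    }


for n in (3, 4):
    print(n, comparison_counts(n, total_pairwise=36))
```

With 36 pairwise comparisons in total, block length 3 gives 12 blocks and 24 neighboring comparisons, whereas block length 4 gives 6 blocks and 18 neighboring comparisons, so the reduced set shrinks as blocks get longer.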
Empirical Example
To demonstrate that all three versions of the Thurstonian IRT model implementation can be fitted to real-world data and to explore the practical implications of these implementations, the models were applied to data from a personality test. The test is a modified version of the Big Five Inventory-2 (BFI-2) in a forced-choice design, measuring the five personality traits openness, conscientiousness, extraversion, agreeableness, and neuroticism. Each of these traits is assessed with 12 statements, resulting in a total of 60 statements. The statements are presented in blocks of 3, resulting in 20 blocks. Each block consists of positively and negatively coded statements. The proportion of positively and negatively coded items is almost balanced, with 29 positively and 31 negatively coded statements. The test was administered to 1,031 participants, of whom 94 were excluded from the analysis due to missing values. The data were collected in the study by Kupffer et al. (2024). In this study, participants were asked to complete six questionnaires, one of which was the BFI-2. The data were collected in an online survey that was conducted over a 2-week period in September and October 2017. All participants were from English-speaking countries, with a mean age of 36 years (SD = 11), and 46% of the participants were male. More details about the data collection process and the test can be found in the original study documentation from Kupffer et al. (2024).
All the implementations of the Thurstonian IRT model converged when fitting them to the data. This can be seen in Table 2, as all
Effective Sample Size (
The point estimates show, analogous to the simulation results, a high agreement between the parameter estimates of all three implementations (see Figure 3). The estimates of all models align nearly perfectly, which shows that the parameter estimates are not strongly affected by the choice of implementation.

Scatterplot of estimated model parameters for each implementation plotted against those resulting from the implementation stochastic & logical dep.
To see whether the small differences affect the model fit, we compare the widely applicable information criterion (WAIC) for the three models. As illustrated in Table 3, the implementation stochastic & logical dependencies should be preferred, as it resulted in the highest predictive accuracy, followed by the implementation stochastic dependencies; the original implementation had the lowest predictive accuracy. However, part of the difference in fit is due to the difference in the constants of the multivariate normal distributions. As we note in Equation (A7) in the Appendix, the general density of a singular normal distribution is
Estimated WAIC Values in the Empirical Example per Implementation.
Note. WAIC = widely applicable information criterion.
Discussion
The objective of this study is to investigate whether the parameter estimates of the Bayesian Thurstonian IRT model are affected by the consideration of dependencies within blocks. In the originally defined Thurstonian IRT model, two types of dependencies occur. The first type is stochastic dependencies, which result in illogical answer patterns and can be avoided by constraining the utility intercepts. The second type is logical dependencies at the test taker level, which arise from redundant information in binary comparisons. These can be eliminated by considering only item comparisons with neighboring ranks. A theoretical comparison was made between the likelihoods of implementations that consider and neglect logical dependencies on the test taker level while considering stochastic dependencies. The comparison showed that for a block length of 3, the likelihoods are identical. Since both implementations are based on the same item and trait parameters with identical prior distributions, this results in equal posterior estimates. The authors assume that the proof generalizes to block lengths larger than 3.
To investigate the effect of constrained intercepts on parameter estimation, a simulation study was conducted. The study showed that accounting for stochastic dependencies leads to estimates that are as accurate, if not slightly more accurate, than those from the original T-IRT model. Since constraining the model has only been proposed by Brown and Maydeu-Olivares (2011) to enable person parameter estimation, there is no theoretical reasoning not to constrain the intercepts. Furthermore, by constraining the intercepts, the model adheres to Thurstone’s (1927) Law of Comparative Judgment. As such, it reflects the ranking process of individuals more accurately, as they cannot give intransitive rankings. Therefore, since current software enables us to consider stochastic dependencies, these should be considered in the Thurstonian IRT model.
The two implementations that consider stochastic dependencies but either consider or neglect logical dependencies are highly similar for all investigated settings. Especially interesting is that this similarity did not change with block lengths (3 and 4). While this supports the assumption of equivalent parameter estimates, the computational efficiency in the forms of convergence and computation time differs strongly between these implementations. When additional logical dependencies were considered, the computation time decreased drastically alongside a decrease in the MCSEs.
Therefore, considering both stochastic and logical dependencies in Thurstonian IRT model estimation has several advantages without any drawbacks. We recommend that users use a T-IRT implementation that considers logical dependencies. For those interested in Bayesian model estimation, this paper provides the necessary code. The provided Bayesian implementation has the major advantage that it can be easily extended. Researchers interested in extensions like multi-group models can adapt the Stan code by changing only a few lines. Nevertheless, this idea can also be employed for frequentist estimation. Future research could implement this idea in frequentist models to enable an even faster implementation.
Appendix
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors gratefully acknowledge the computing time provided on the Linux HPC cluster at Technical University Dortmund (LiDO3), partially funded in the course of the Large-Scale Equipment Initiative by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) as project 271512359.
