Sage Journals: Discover world-class research

Abstract

PyTorch and TensorFlow are two widely adopted modern deep learning frameworks that provide comprehensive computational libraries for developing and fitting complex models. Motivated by the technical barriers in recent item response theory (IRT) work and the lack of practice-oriented tutorials, we demonstrate how modern deep learning platforms can be used for Bayesian IRT parameter estimation by providing a didactic yet in-depth introduction to PyTorch and TensorFlow in a psychometric context, framing IRT models as graphical models, and offering step-by-step guidance that bridges probabilistic machine learning and psychometrics. In this study, we illustrate how to leverage these platforms to estimate widely used psychometric models in educational testing, psychological measurement, and behavioral assessment, namely dichotomous and polytomous IRT models and their multidimensional extensions. We compare Hamiltonian Monte Carlo and variational inference estimators for these models in a unified computational environment. Simulation studies show that both approaches yield parameter estimates with low mean squared error and bias in low-dimensional settings, while also indicating that VI might underestimate aspects of posterior uncertainty in higher-dimensional scenarios. Nonetheless, for practitioners who prioritize computational efficiency and scalability, especially when Graphics Processing Unit (GPU) acceleration is available, VI remains a compelling option. Three empirical case studies further demonstrate how PyTorch- and TensorFlow-based implementations compare with established IRT software in applied settings. We conclude by discussing the broader potential of integrating contemporary deep learning tools and perspectives into psychometric research.

Keywords

item response theory deep learning Markov Chain Monte Carlo variational inference PyTorch TensorFlow

1. Introduction

Deep learning (DL; LeCun et al., 2015; Nalisnick et al., 2023) has significantly advanced many fields, including computer vision (Krizhevsky et al., 2012), natural language processing (Manning, 2015), and, more recently, large language models (Floridi & Chiriatti, 2020; Vaswani et al., 2017). Its success is often attributed to highly expressive models with enormous numbers of parameters, the availability of massive labeled datasets, and efficient, scalable optimization algorithms. As flexible and scalable function approximators (Cybenko, 1989), DL methods are increasingly attractive in the social, educational, and behavioral sciences, where computational models are becoming more complex to accommodate novel assessment designs, more fine-grained research questions, and growing data sizes.

1.1 Deep Learning Computational Frameworks

DL frameworks such as PyTorch (Paszke et al., 2019) and TensorFlow (Abadi et al., 2016) have greatly reduced the effort required to develop and estimate complex models. Originally designed for training deep neural networks, these open-source libraries have evolved into powerful platforms for efficient numerical computation and automatic differentiation (AutoDiff; Paszke et al., 2017), making them suitable for a wide range of large-scale data analysis tasks. AutoDiff computes exact derivatives via the chain rule within computational graphs. It evaluates derivatives during code execution with machine precision, with reverse-mode AutoDiff (underlying backpropagation) efficiently computing gradients for models with many parameters and relatively few outputs. Incorporated into PyTorch and TensorFlow, AutoDiff enables automatic gradient computation with minimal additional code (Paszke et al., 2017), which is crucial not only for training deep neural networks but also for gradient-based optimization in statistical modeling, where manual derivation of complex gradients is difficult. Hence, these frameworks provide efficient and scalable tools for large datasets and complex models that characterize contemporary data analysis.

Both PyTorch and TensorFlow also provide probabilistic programming libraries, Pyro for PyTorch and TensorFlow Probability (TFP) for TensorFlow, which extend their functionality to probabilistic modeling and inference. These libraries allow users to specify probabilistic models using familiar syntax and to perform inference with advanced algorithms such as Hamiltonian Monte Carlo (HMC) and variational inference (VI). In doing so, they support a wide range of traditional parametric models and flexible deep probabilistic models built from simple building blocks. AutoDiff is central to these methods (Wang et al., 2018, 2022), especially for algorithms that require repeated gradient evaluations of complex statistical models (Paszke et al., 2017). Furthermore, the DL structures implemented in these frameworks naturally support multimodal data and the extraction of large-scale, heterogeneous information (Gao et al., 2020).

1.2 Computational Challenges in Item Response Theory Models

In educational testing, psychological measurement, and behavioral assessment, item response theory (IRT) models aim to infer latent traits (e.g. abilities and attitudes) of individuals from observable responses to test items. Marginal maximum likelihood (MML) is a popular frequentist estimation method in the IRT literature, treating person abilities as latent variables (random effects) and widely used in practice. However, computing the marginal likelihood requires integrating over the latent traits, leading to complex numerical integration that can become intractable for large-scale assessments or high-dimensional latent structures. To improve the computational efficiency of MML, a variety of algorithms have been proposed, such as the Metropolis–Hastings Robbins–Monro algorithm (Cai, 2010) and variational Expectation–Maximization (Jeon et al., 2017).

Bayesian IRT models (Fox, 2010) are increasingly popular because they allow flexible specification of complex models, but Bayesian estimation further increases computational demands by requiring computation of posterior distributions over parameters, typically involving integration over high-dimensional spaces. Markov Chain Monte Carlo (MCMC) algorithms, such as the Metropolis–Hastings algorithm (Hastings, 1970; Metropolis et al., 1953), are widely used to approximate these integrals, yet they can be computationally intensive and slow to converge in the high-dimensional settings typical of IRT. These challenges are exacerbated by large datasets and complex model structures (e.g. multidimensional traits), underscoring the importance of efficient gradient computation and scalable optimization for improving parameter estimation in IRT models.

1.3 DL Frameworks Improving Computational Efficiency for IRT Models

The computational strengths and flexibility of PyTorch and TensorFlow offer promising tools for addressing the computational challenges in IRT. Their support for AutoDiff reduces the need to manually derive complex gradients in likelihood functions. Using the probabilistic programming libraries Pyro and TFP, researchers can specify IRT models within these frameworks and leverage advanced inference algorithms such as HMC and VI. These algorithms are implemented efficiently, taking full advantage of AutoDiff and optimized computational kernels, which allows for faster convergence and better scalability; for example, HMC uses gradient information to explore the parameter space more effectively than traditional random-walk MCMC (Neal, 2011). VI provides another powerful numerical method, widely used for training large-scale Bayesian neural networks: it approximates the posterior by finding the closest distribution within a specified family, turning inference into an optimization problem that is generally faster and more scalable (Blei et al., 2017). Although MCMC-based estimates are often slightly more accurate at the cost of higher computation (C. Ma et al., 2024; Urban & Bauer, 2021), VI, as an optimization-based approach, can benefit substantially from GPU acceleration available in these frameworks. Moreover, because the model specification and computational backend remain largely unchanged, these frameworks provide a convenient and fair platform for comparing traditional MCMC methods with rapidly evolving VI approaches simply by swapping inference routines, thereby minimizing confounding implementation differences.

While the HMC algorithm has already been introduced on the R platform using Stan (Carpenter et al., 2017) to estimate IRT models (Y. Luo & Jiao, 2018), Python frameworks such as PyTorch and TensorFlow provide more adaptable and scalable environments for constructing sophisticated probabilistic models. Their extended libraries, including Pyro and TFP, not only facilitate the implementation of HMC and VI but also seamlessly integrate a wide range of neural network architectures. Both frameworks are under continuous development by leading DL research communities, with state-of-the-art neural network designs and learning algorithms regularly incorporated. This ongoing evolution streamlines the construction of complex, nonlinear, and hierarchical models and broadens the scope for extending IRT models beyond traditional forms, including applications that make use of multimodal data.

On the technical front, recent IRT research has begun to explore how advanced computing techniques, particularly VI and related methods, can be adapted and improved for IRT estimation (N. Luo & Ji, 2025; B. Ma et al., 2022; Urban & Bauer, 2021). However, these contributions tend to emphasize methodological innovation and are often presented in highly technical language with Python/PyTorch implementations, making them less accessible to applied researchers and psychometric practitioners. As a result, there remains a noticeable gap in practical, accessible discussions and tutorials (Y. Luo & Jiao, 2018; McClure, 2023) that explicitly bridge probabilistic machine learning, DL, and IRT models. The goal of this article is to demonstrate how DL computational platforms can be effectively employed to estimate parameters in IRT models. Our contribution is twofold: first, we provide the psychometric community with a didactic yet in-depth introduction to PyTorch and TensorFlow specifically in a psychometric context (in contrast to more general reviews such as Pang et al., 2020), while also offering computer science and machine learning researchers a tutorial on commonly used psychometric models; second, by framing IRT models through the lens of graphical models (a central framework in probabilistic machine learning) we provide intuitive discussions and visualizations that aid in understanding and interpreting complex models and encourage interdisciplinary collaboration between psychometrics and DL. This contribution is particularly timely given the rapid advances in DL and artificial intelligence and the fact that recent applications of DL in IRT are often highly technical and present a steep learning curve for researchers without prior exposure to these methods (Cho et al., 2021; Converse et al., 2021; N. Luo & Ji, 2025; C. Ma et al., 2024; Urban & Bauer, 2021). By providing step-by-step guidance and practical demonstrations within flexible DL frameworks, we aim to promote wider adoption of these advances and to foster innovative approaches in measurement and assessment.

The rest of this article is structured into five main sections to provide a comprehensive introduction and tutorial on using DL computation frameworks for fitting various IRT models. The first section briefly introduces IRT model construction and Bayesian parameter estimation. The second section provides a detailed tutorial showing how to estimate parameters in PyTorch and TensorFlow. The third and fourth sections compare the estimation performance of HMC and VI implemented in these two frameworks using various simulated and empirical datasets, respectively. The article concludes with a comprehensive discussion highlighting the strengths and limitations of DL frameworks and providing general guidance for choosing HMC and VI methods in practice.

2. Overview of IRT Models and Bayesian Analysis

This section outlines the specifications of dichotomous and polytomous item response models, specifically the graded response model (GRM) and the partial credit model (PCM). Additionally, it provides a brief introduction to Bayesian parameter estimation.

Using the observed response matrix as input data, the model estimates both item parameters and explores the latent traits of examinees through the structure of IRT. A key assumption is that examinees are independent of one another. Furthermore, given an examinee’s latent trait, their responses to different items are assumed to be independent. IRT models, with their inherent dependencies among parameters, can be viewed as special cases of probabilistic graphical models. In this context, DL frameworks have been employed to facilitate parameter estimation (Johnson et al., 2016; Paszke et al., 2017).

2.1 IRT Models

IRT models can date back to Rasch’s introduction of the Rasch model in 1960. This model has excellent statistical properties due to the natural exponential family form and has influenced how researchers think about measuring psychological traits (Rasch, 1960). Around the same time, Lord and Novick (1968) provided a general framework for IRT models and introduced the two-parameter (2PL) and three-parameter logistic (3PL) models, developed by Birnbaum (1968). These models include not only item difficulty but also how well an item can distinguish between different levels of latent trait, and for the 3PL model, the likelihood of guessing the correct answer (Birnbaum, 1968; Lord & Novick, 1968).

The IRT models mentioned above belong to the dichotomous IRT models, which deal with binary responses. IRT models can be further categorized according to the dimensions of latent traits and individual’s responses to a set of items. Later developments expanded IRT to include polytomous models to handle items with multiple categories of responses, such as questions with more than two possible answers. Important polytomous models include the GRM (Samejima, 1969), the PCM (Masters, 1982), and the nominal response model (Bock, 1972). More general and complex IRT models raise the dimension of latent traits, such as the multidimensional 2PL IRT model (Reckase, 2009).

2.1.1 Dichotomous IRT Models—Unidimensional

Model parameters for IRT usually include each examinee’s latent ability and each item’s parameters, such as discrimination, difficulty, and the pseudo-guessing parameter, which we introduce below. Dichotomous IRT models are those with a binary response, with “ $1$ ” representing “correct” and “ $0$ ” denoting “incorrect.” For a unidimensional 3PL IRT model (Birnbaum, 1968) with $n$ examinees and $q$ items, the probability that examinee $i$ answers item $j$ correctly is

P_{i, j} = p (u_{i, j} = 1 | θ_{i}, a_{j}, b_{j}, c_{j}) = c_{j} + \frac{(1 - c_{j})}{1 + \exp (- a_{j} (θ_{i} - b_{j}))},

(1)

where $u_{i, j} \in {0, 1}$ is the response of examinee $i$ to item $j$ ; $θ_{i}$ is the latent ability of examinee $i$ ; $a_{j} > 0$ , $b_{j}$ , and $c_{j} \in [0, 1]$ are the discrimination, difficulty, and the pseudo-guessing parameter or lower asymptote of item $j$ . The 2PL IRT model (Birnbaum, 1968) is obtained by assigning 0 to $c_{j}$ for all $j$ , and if $a_{j}$ is further constrained to be the same for all $j$ , the model is reduced to the 1PL IRT model, often known as the Rasch model (Rasch, 1960) (we omit the philosophical and technical differences between the Rasch and 1PL models here).

Graphical representations play a role in understanding complex latent variable models in probabilistic machine learning. These visual tools, often expressed as directed or undirected graphs, provide a structured framework that represents the probabilistic dependencies between observed data and latent variables. In probabilistic machine learning, this approach is particularly useful because it offers an intuitive illustration of how parameters, latent variables, and observations interact, rather than relying solely on abstract equations. Figure 1 illustrates the graphical model demonstrations for these IRT models. The shaded variables are the observed variables, and all dependencies between variables are shown by arrows.

Figure 1.

Graphical model demonstrations for the dichotomous IRT models (unidimensional case) (a) 1PL (b) 2PL (c) 3PL.

The statistical model can be summarized in the full likelihood function. Let $a = {(a_{1}, \dots, a_{q})}^{T}$ , $b = {(b_{1}, \dots, b_{q})}^{T}$ and $c = {(c_{1}, \dots, c_{q})}^{T}$ . The likelihood function of the response, given by the response vector of examinee $i$ , denoted as $u_{i} = {(u_{i, 1}, \dots, u_{i, q})}^{T}$ , is: given $P_{i, j}$ in Equation 1,

L_{i} (θ_{i}, a, b, c | u_{i}) = Π_{j = 1}^{q} P_{i, j}^{u_{i, j}} {(1 - P_{i, j})}^{1 - u_{i, j}} .

(2)

Let $U = [u_{i, j}]_{i = 1, \dots, n}^{j = 1, \dots, q}$ , and $θ = (θ_{1}, \dots, θ_{n})$ . Due to the assumption of examinee independence and item conditional independence, the likelihood of the response matrix $U$ can be factorized into

L_{full} (θ, a, b, c | U) = Π_{i = 1}^{n} L_{i} (θ_{i}, a, b, c | u_{i}) = Π_{i = 1}^{n} Π_{j = 1}^{q} P_{i, j}^{u_{i, j}} {(1 - P_{i, j})}^{1 - u_{i, j}}

(3)

2.1.2 Dichotomous IRT Models—Multidimensional

The examinee usually requires multiple abilities to answer a question accurately, so a more general construct is to assume the examinee’s latent ability and the item’s discrimination as a set of values rather than a single number. For instance, items for a mathematics test might depend on two skill constructs: arithmetic problem-solving and algebraic symbol manipulation. Taking the 2PL IRT model as an illustrated example, with the modification, where $θ_{i}$ is a vector of latent abilities of examinee $i$ , $a_{j}$ is a vector of discrimination parameters of item $j$ , and $d_{j}$ is the intercept of item $j$ , the multidimensional 2PL IRT model (M2PL; Reckase, 2009) states the probability that examinee $i$ gives the correct answer for item $j$ is

p (u_{i, j} = 1 | θ_{i}, a_{j}, d_{j}) = \frac{1}{1 + \exp (- (a_{j}^{T} θ_{i} + d_{j}))} .

(4)

Figure 2(a) shows the graphical demonstration for the multidimensional 2PL IRT model. The full likelihood function is similar to the unidimensional case mentioned above, except that the parameters are replaced with their multivariate counterparts. Multivariate distributions and matrix multiplication should be included to extend the dimensions in the code implementation.

Figure 2.

Graphical model demonstrations for the (a) multidimensional 2PL IRT model (b) GRM and PCM.

2.1.3 Polytomous Item Response Models

Polytomous item response models consider responses with more than two categories, that is, not just correct and incorrect. The GRM and the PCM are introduced below. They can be demonstrated using the same graphical model in Figure 2(b).

The GRM (Samejima, 1969), one type of the ordinal response models, is a general version of the 2PL IRT model and handles ordinal polytomous categories relating to constructed-response or selected-response items. To be specific, examinees are supposed to obtain multiple levels of ordinal scores such as $0, 1, 2, 3, 4$ . With two levels of scores, the GRM reduces to the 2PL model. The formula of the specific probability of choosing the $k$ -th category from $m_{j}$ possible categories of item $j$ can be derived as follows:

\begin{matrix} P_{i, j, k} = p (u_{i, j} = k | a_{j}, θ_{i}, b_{j, k}) = p (u_{i, j} \geq k | a_{j}, θ_{i}, b_{j, k}) - p (u_{i, j} \geq k + 1 | a_{j}, θ_{i}, b_{j, k + 1}) \\ = \frac{1}{1 + \exp (- a_{j} (θ_{i} - b_{j, k}))} - \frac{1}{1 + \exp (- a_{j} (θ_{i} - b_{j, k + 1}))}, \end{matrix}

(5)

where $b_{j, k}$ specifies the $k$ -th threshold that indicates the transition point between adjacent response categories (e.g. category difficulty of item $j$ ) which is strictly increasing with respect to $k$ , and $k = 1, \dots, m_{j}$ . The probability of responses below the lowest category is null and the probability of responses below the highest category is 1 so

p (u_{i, j} \geq 1 ∣ a_{j}, θ_{i}, b_{j, 1}) = 1, p (u_{i, j} \geq m_{j} + 1 ∣ a_{j}, θ_{i}, b_{j, m_{j} + 1}) = 0 .

(6)

The PCM (Masters, 1982) is another type of ordinal response models and is constructed on the basis of items that can be partially correct. For a multicategory item $j$ , student $i$ must go through several steps to provide a correct response. There is a certain order among these steps and examinees cannot skip any intermediate step when responding to the next step¹. However, the difficulty levels of these steps may not necessarily increase progressively. For any two successive steps $v$ and $v - 1$ , assume that the examinee $i$ correctly solves the first $v - 1$ steps and is now solving the $v$ -th step. Then, if $i$ reaches the correct answer, the process continues. Otherwise, $i$ stops at $v - 1$ and gets $v - 1$ scores. Therefore, the PCM can be viewed as a sequence of dichotomous 2PL models. PCM is defined by the specific probability of choosing the $k$ -th category from $m_{j}$ possible categories of item $j$ :

P_{i, j, k} = p (u_{i, j} = k | a_{j}, θ_{i}, b_{j, k}) = \frac{\exp (\sum_{v = 1}^{k} a_{j} (θ_{i} - b_{j, v}))}{\sum_{c = 1}^{m_{j}} \exp (\sum_{v = 1}^{c} a_{j} (θ_{i} - b_{j, v}))},

(7)

where $b_{j, 1} = 0$ and $b_{j, v}$ specify the difficulty thresholds between score categories $v - 1$ and $v$ . The formula can then be modified into the following: denoting $z_{j, k} (θ_{i}) = a_{j} (θ_{i} - b_{j, k})$ ,

\begin{matrix} p (u_{i, j} = k | a_{j}, θ_{i}, b_{j, k}) = \frac{\exp [z_{j, 1} (θ_{i})] \cdot \exp [\sum_{v = 2}^{k} z_{j, v} (θ_{i})]}{\exp [z_{j, 1} (θ_{i})] + \sum_{c = 2}^{m_{j}} \exp (z_{j, 1} (θ_{i}) + \sum_{v = 2}^{c} z_{j, v} (θ_{i}))} \\ = \frac{\exp [\sum_{v = 2}^{k} z_{j, v} (θ_{i})]}{1 + \sum_{c = 2}^{m_{j}} \exp (\sum_{v = 2}^{c} z_{j, v} (θ_{i}))} . \end{matrix}

(8)

For GRM and PCM, the likelihoods given by the whole response matrix $U$ have the same form but the difference exists in the specific definition of $p (u_{i, j} = k | a_{j}, θ_{i}, b_{j, k})$ . Consider $U = [u_{i, j}]$ where $u_{i, j} \in {1, \dots, m_{j}}$ ; $θ = {(θ_{1}, \dots, θ_{n})}^{T}$ ; $a = {(a_{1}, \dots, a_{q})}^{T}$ ; $B = (b_{1}, \dots, b_{q}),$ where $b_{j} = {(b_{j, 1}, \dots, b_{j, m_{j}})}^{T}$ for $j = 1, \dots, q$ . The full likelihood function is:

L_{full} (θ, a, B | U) = Π_{i = 1}^{n} L_{i} (θ_{i}, a, B | u_{i}) = Π_{i = 1}^{n} Π_{j = 1}^{q} [\sum_{k = 1}^{m_{j}} 1_{k} (u_{i, j}) p (u_{i, j} = k | a_{j}, θ_{i}, b_{j}, k)]

(9)

where $1_{k} (u_{i, j}) = 1$ if $u_{i, j} = k,$ and otherwise, it is $0$ .

The multidimensional extensions of GRM and PCM can be achieved by replacing the unidimensional latent ability $θ_{i}$ and the discrimination parameter $a_{j}$ with their multidimensional counterparts $θ_{i}$ and $a_{j}$ , similar to the M2PL model. The specific probability functions for GRM and PCM can be modified accordingly by substituting $a_{j} θ_{i}$ with $a_{j}^{T} θ_{i}$ in Equations 5 and 7.

2.2 Bayesian Approach for Parameter Estimation

The Bayesian approach to parameter estimation provides a robust framework for incorporating prior knowledge and updating beliefs in light of new data (Patz & Junker, 1999a, 1999b). This is achieved by characterizing parameters not as fixed points, but as probability distributions. The core of this approach rests on Bayes’ theorem, which synthesizes prior beliefs about parameters $x$ (e.g. $θ, a, b$ ), denoted by the prior distribution $p (x)$ , with information from observed data $u$ , captured by the likelihood function $L (x | u)$ . The result is the posterior distribution $p (x | u)$ , which represents our updated knowledge of the parameters after observing the data:

p (x | u) = \frac{L (x | u) p (x)}{p (u)} \propto L (x | u) p (x) .

(10)

A significant computational challenge arises from the denominator, $p (u) = \int L (x | u) p (x) dx$ , known as the marginal likelihood or evidence. This term requires integrating over all possible parameter values, which is often analytically intractable and computationally prohibitive for complex, high-dimensional models. To overcome this, we turn to numerical approximation methods. The two predominant approaches are sampling-based methods, such as HMC, and optimization-based methods, like Variational Inference.

HMC (Neal, 2011) is an advanced MCMC algorithm designed for efficient sampling from the posterior distribution. The general goal of MCMC is to construct a Markov chain whose stationary distribution is the target posterior. After a sufficient number of steps, samples drawn from this chain can be treated as samples from the posterior itself. Unlike simpler MCMC methods that employ a random walk, HMC uses gradient information from the log-posterior to guide its exploration of the parameter space, making it a highly efficient sampler (Carpenter et al., 2017). By introducing an auxiliary momentum variable and simulating Hamiltonian dynamics, HMC can propose distant, yet high-probability, new states, leading to faster convergence and less correlation between successive samples. A key challenge in standard HMC is the need to manually tune simulation parameters, particularly the number of integration steps, $L$ . The No-U-Turn Sampler (NUTS) is an important extension that automates this process (Hoffman & Gelman, 2014).

Variational inference offers a different, often faster, alternative to sampling. Instead of attempting to draw samples from the true posterior $p (x | u)$ , VI reframes inference as an optimization problem. The goal is to approximate the intractable posterior with a simpler, tractable distribution $q_{ψ} (x)$ from a chosen family (e.g. a multivariate Gaussian), parameterized by $ψ$ . The objective is to find the parameters $ψ$ that make the approximation $q_{ψ} (x)$ as close as possible to the true posterior $p (x | u)$ . The “closeness” between these two distributions is typically measured by the Kullback–Leibler (KL) divergence, $D_{KL} (q_{ψ} ∥ p)$ . Directly minimizing this KL divergence is not feasible because it depends on the intractable posterior, but it is mathematically equivalent to maximizing a computable objective function called the evidence lower bound (ELBO):

ELBO (ψ) = E_{q_{ψ} (x)} [\log p (u | x)] - D_{KL} [q_{ψ} (x) ∥ p (x)] .

(11)

The ELBO consists of two intuitive terms. The first term, the expected log-likelihood, encourages the variational distribution to explain the observed data. The second term is the KL divergence between the variational distribution and the prior, which acts as a regularizer, penalizing deviations from our prior beliefs. By using gradient-based optimization methods to maximize the ELBO with respect to $ψ$ , VI can efficiently find an optimal approximation to the posterior. While MCMC methods asymptotically converge to the true posterior, VI provides an approximation that is often highly accurate and can be significantly faster to compute, making it particularly suitable for large-scale datasets and models.

3. Fitting IRT Models Using PyTorch and TensorFlow

This section provides a detailed explanation for the code snippets in Supplementary Appendix (available in the online version of this article) with respect to the implementation of the IRT model construction and parameter estimation in two DL frameworks. We use PyTorch and TensorFlow, as well as their probabilistic programming libraries extensions, Pyro and TFP, to build statistical models and perform inference from scratch. Table 1 summarizes the key input requirements for configuring HMC and VI.

Table 1.

Input Summary for HMC and VI Configuration in PyTorch and TensorFlow

Method	Model definition	Hyperparameters
Method	Model definition	Sampling	Optimization
PyTorch HMC	function “model”	num_samples, warmup_steps	—
TensorFlow HMC	function “target_log_prob_fn”	num_results, num_burnin_steps	—
PyTorch VI	function “model”	num_particles	optim (lr), num_iterations
TensorFlow VI	function “target_log_prob_fn”	sample_size	optimizer (learning_rate), num_steps
	Variational distribution	Parameter constraints
PyTorch HMC	—	Log-normal distribution for positive discrimination, “OrderedTransform()” to ensure increasing difficulty thresholds for GRM
TensorFlow HMC	—	Log-normal distribution for positive discrimination, “tf.sort()” to ensure increasing difficulty thresholds for GRM
PyTorch VI	Function “guide”	Positive or positive definite constraint for scale (covariance)
TensorFlow VI	“Distribution” instance “surrogate_posterior”	“TransformedVariable()” to ensure positive or positive definite scale (covariance)

Note. HMC = Hamiltonian Monte Carlo; VI = Variational Inference; GRM = Graded Response Model.

3.1 Unidimensional 3PL IRT Model

A step-by-step illustration is first presented for the unidimensional 3PL IRT model. For IRT models, parameters can be sampled from specific distributions once the hyperparameters for distributions, like the mean and variance of the distributions, are set. For instance, we adopt weakly informative priors distributions of examinees’ ability $θ_{i}$ , and items’ discrimination $a_{j} > 0$ , difficulty $b_{j}$ and pseudo-guessing $c_{j} \in [0, 1]$ parameters commonly used in Bayesian IRT (Natesan et al., 2016): for $i = 1, \dots, n$ and $j = 1, \dots, q$ ,

\begin{matrix} θ_{i} ~ Normal (0, 1^{2}), a_{j} ~ Lognormal (0, 0 . 5^{2}) since a_{j} \geq 0, \\ b_{j} ~ Normal (0, 1^{2}), c_{j} ~ Normal (0.5, 0 . 5^{2}), constrained within the interval [0, 1] . \end{matrix}

(12)

Examples of how to set up distributions like the Equation 12 and start sampling are presented first in codes 1 and 2. Also, in order to perform a simulation study in the next section, these are also the first step, showing how to sample true parameters for the evaluation of parameter estimation and to simulate a response matrix based on the true parameters and the Equation 1. Although examinees’ ability and items’ parameters have different shapes, in PyTorch and TensorFlow, broadcasting allows tensors with different shapes to be automatically expanded to have compatible dimensions for elementwise operations and does not need explicit replication of data (Code 1, line 30; Code 2, line 31). This feature simplifies operations on tensors of varying sizes, enabling more efficient and concise code for computations such as addition, multiplication, and other arithmetic operations.

3.1.1 HMC Setting

After setting the prior distributions, the next step is to compute the likelihood function as the statistical model according to Equation 3. This requires the probability computation of the response matrix $U$ by Equation 1. To specify the posterior distribution and perform sampling, the prior and likelihood should be encapsulated. For implementation details, the differences arise from the specific APIs and conventions of PyTorch and TensorFlow. While both Pyro and TFP have a Bayesian probabilistic modeling approach to perform MCMC inference employing NUTS as the kernel, the way NUTS is instantiated and integrated into the sampling process differs.

The instantiation of NUTS in Pyro requires defining a probabilistic model (“model”) and specifying the observed data (“responses”). The “model” is usually a function taking “responses” as input and capsuling priors and the model (Code 3, line 6–23). Two crucial hyperparameters for MCMC are the number of posterior samples and the number of warm-up steps. Warm-up steps are used for tuning the sampler to find a suitable parameter space. A sufficiently large number of samples guarantee that the parameter space is fully explored. With the kernel, the number of samples to draw from the posterior distribution, and the number of warm-up iterations set, the MCMC process runs with the “responses” input and returns posterior samples. The example code snippet is shown below. More details are shown in Code 3.

1 # Define the 3PL-IRT model and encapsulate the priors and model.
2 def model(responses):
3 # Specify the prior distributions.
4 # Tensor inputs for ’loc’ and ’scale’ directly set up
5 # independent distributions for each examinee and item.
6 ability = pyro.sample(“ability,”dist.Normal (loc=torch.zeros(num_students), scale=torch.ones (num_students)))
7 difficulty = pyro.sample(“difficulty,”dist.Normal (loc=torch.zeros(num_questions), scale=torch.ones (num_questions)))
8 discrimination = pyro.sample(“discrimination,”dist. LogNormal(loc=torch.zeros(num_questions),scale=0.5* torch.ones(num_questions)))
9 pseudoguessing = pyro.sample(“pseudoguessing,”dist. Normal(loc=0.5*torch.ones(num_questions),scale=0.5* torch.ones(num_questions)))
10 # Restrict the range of pseudoguessing
11 pseudoguessing = torch.clamp(pseudoguessing,0,1)
12
13 # Specify the probability for the likelihood function.
14 prob = torch.sigmoid((ability.view( -1, 1) - difficulty) * discrimination) * (1 - pseudoguessing) + pseudoguessing
15 # Utilize Bernoulli distribution to compute the likelihood function.

16 # ’dist.Independent()’ with ’reinterpreted_batch_ndims=2’ _ treats each
17 # student’s response to each question as an independent _ Bernoulli trial,
18 # even though they are in a 2D matrix.
19 pyro.sample(“response,”dist.Independent(dist.Bernoulli_(prob),reinterpreted_batch_ndims=2), obs=responses)
20
21 # Perform MCMC inference.
22 # The ’model’ should be first processed by the kernel.
23 nuts_kernel = NUTS(model)
24 # The ’MCMC’ function support different MCMC sampling _ methods,
25 # which is specified by the ’kernel’ hyperparameter.
26 # Also, specify the number of posterior samples to collect
27 # and the number of warmup steps.
28 mcmc = MCMC(kernel=nuts_kernel, num_samples=1,000, _warmup_steps=500)
29 # Run the MCMC sampler on the observed data (responses).
30 mcmc.run(responses)

The argument “target_log_prob_fn” for “NoUTurnSampler” in TensorFlow requires a function taking a set of parameters and evaluating the log joint probability of the target distribution given those parameters. The exact form of this function depends on specific probabilistic models (Code 4, line 2–36). A “SimpleStepSizeAdaptation” wrapper dynamically adapts the step size during the initial burn-in phase, enhancing the efficiency of the MCMC process. The function “sample_chain” runs the MCMC process to get required number of posterior samples, starting from an initial state (“initial_state”) and incorporating the adaptive kernel. The trace function extracts information about whether each sample is accepted, providing insights into the convergence and acceptance rate of the sampler. Another example code snippet is shown in Code 4.

3.1.2 VI Setting

Pyro provides built-in modules for doing variational inference in the class “SVI” (abbreviation of stochastic variational inference), which offers straightforward methods for variational learning and evaluation. The user needs to provide three things: the “model,” the “guide,” and an “optimizer.” The “model” is similar to its MCMC implementation counterpart mentioned above. The “guide” is the variational distribution, serving as an approximation to the posterior (Code 5, line 22–37). The “optimizer” is chosen from PyTorch’s optimization library to maximize the ELBO. To be specific, the Adam algorithm (Kingma & Ba, 2015) is chosen for the “optimizer.” Apart from these three elements, the “loss,” which stands for the specific form of the ELBO, can be further adjusted, such as by tuning parameters like “num_particles,” standing for the number of samples used to form the ELBO (gradient) estimators. After setting up the “SVI” instance, the number of steps to run the optimizer is assigned to “num_iterations” and in each iteration, the method “step” is called to take a gradient step on the loss function to maximize the ELBO. To complete the illustration for the unidimensional IRT model, Code 5 is for the 2PL IRT model and uses simple variational inference in PyTorch.

TFP also has a module “tfp.vi” for variational inference. The core function is “fit_surrogate_posterior.” The major parameters include “target_log_prob_fn,” which is similar to the argument for “tfp.mcmc.NoUTurnSampler”; “surrogate_posterior,” a “Distribution” instance defining a variational distribution, and “optimizer,”“num_steps” for number of steps to run the optimizer, and “sample_size” for the number of Monte Carlo samples to use in estimating the ELBO. However, the use of variational distributions in TensorFlow often causes compatibility issues and cannot be set up as directly as in PyTorch, so we only provide one simple example snippet for VI in TensorFlow shown in Code 6 and focus on Pyro for VI in the later introduction and evaluation.

3.2 Multidimensional 2PL IRT Model

The multidimensional extension of IRT models increases the complexity of computation. For instance, it is assumed that the examinee might require two types of abilities to give the correct answers for the items. Therefore, for each examinee, the dimension of the ability parameter is 2, and for each item, the discrimination parameter also has two dimensions. With mean vector $μ$ and covariance matrix $Σ$ :

μ = (0, 0) Σ = [\begin{matrix} 1.0 & 0.5 \\ 0.5 & 1.0 \end{matrix}],

(13)

a priori independent distributions for examinees’ latent abilities and items’ discrimination and difficulty are set as follows: for $i = 1, \dots, n$ and $j = 1, \dots, q$ ,

θ_{i} ~ Normal (μ, Σ), a_{j} ~ Lognormal (μ, Σ), d_{j} ~ Normal (0, 0 . 5^{2}) .

(14)

The likelihood function is computed according to Equation 4.

For HMC, the only difference in code implementation compared to the 3PL IRT model is the inclusion of priors and the model. The code snippets for setting up the “model” function in PyTorch and “joint_log_prob” function in TensorFlow are shown in Codes 7 and 8. For VI in PyTorch, given that the “model” function can be the same as the function in Code 7 for the HMC case, the modification of the code is related to the variational distribution, which is given in the Code 9.

3.3 Graded Response Model

The GRM assumes that each item can have multiple ordinal response categories, with each category representing an increasing level of difficulty. As a result, a multivariate distribution should be applied to the difficulty parameters to properly capture these levels. For example, if the GRM defines three ordinal score levels for each item, the vector of difficulty parameters for item $j$ can be expressed as $b_{j} = {(b_{j, 1}, b_{j, 2}, b_{j, 3})}^{T}, j = 1, \dots, q$ .

Given the constraint shown in Equation 6, only a subset of the difficulty parameters needs to be modeled. Specifically, they can be sampled from a two-dimensional multivariate normal distribution with mean vector $μ$ and covariance matrix $Σ$ in Equation 13. Independent prior distributions are assumed for the examinees’ ability and for the items’ discrimination and difficulty parameters, defined as follows:

θ_{i} ~ Normal (0, 1^{2}), a_{j} ~ Lognormal (0, 0 . 5^{2}), b_{j} ~ Normal (μ, Σ) .

(15)

The likelihood function is set up as the statistical model according to Equations 5 and 9. The code snippet for the encapsulation of the prior distributions and likelihood function is presented in Codes 10 and 11. Similarly to the previous two models, for VI in PyTorch, the “model” in Code 10 can remain the same, and the variational distributions can be specified as in Code 12.

3.4 Partial Credit Model

The PCM also assumes multiple response categories for each item but removes the requirement for increasing difficulty parameters across categories. We consider a test of 10 items with $500$ examinees: five items with two categories and five items with three categories. The difficulty parameter for the first step for each item is fixed at 0, so only the remaining parameters (i.e. one for the first five items and two for the second five) are sampled. For the first five items, the second difficulty parameter is sampled from a standard normal distribution: $b_{j, 2} ~ Normal (0, 1)$ . For the other items with three response levels, the last two difficulty parameters are sampled from a multivariate normal distribution with mean vector $μ$ and covariance matrix $Σ$ in Equation 13.

The prior distributions for the parameters can be set following Equations 12 or 15. The computation of the likelihood function is guided by Equations 8 and 9. A Python dictionary can be used to store and organize the structure of the tests and the sizes of their components. Codes 13 and 14 for implementing HMC sampling are shown in Supplementary Appendix (available in the online version of this article). For VI in PyTorch, again, only the variational distribution part should be modified as shown in Code 15.

4. Simulation Studies

4.1 Basic Comparision Based on Different IRT Models

This section presents the results of simulation studies using the HMC and VI methods introduced in the previous section. Four models are taken as examples: the unidimensional and multidimensional 2PL IRT models, GRM, and PCM. As this article serves as a tutorial, relatively simple cases with small sample sizes and small numbers of dimensions and repetitions are considered. All experiments were conducted using Google Colab, a cloud-based platform providing free access to computational resources. The experiments ran on a standard Colab environment with a virtual machine powered by a 12GB of RAM and a single-core CPU.

Across all studies, we fix $n = 500$ examinees and $q = 10$ items. We begin with the unidimensional 2PL study. True ability, discrimination, and difficulty are drawn from Equation 12 with pseudo-guessing fixed at $c_{j} = 0$ . Responses are computed by Equation 1. Data generation adapts Codes 1 and 2 with the pseudo-guessing term removed, and the design is replicated 100 times. Bayesian estimation uses the priors in Equation 12 and is implemented with HMC (Codes 3, 4) and VI (Code 5), again omitting pseudo-guessing.

For the remaining studies, we follow a common protocol: priors follow each model’s assumptions in its subsection, we generate 100 response matrixes, and estimate with HMC and VI using the listed code (i.e. Multidimensional 2PL in Section 3.2: Codes 7, 8, 9; GRM in Section 3.3: Codes 10, 11, 12; PCM in Section 3.4: Codes 13, 14, 15). The estimation repeated 100 times for each model.

4.1.1 Evaluation Criteria

Simulated data are randomly generated, so the repetition of the process across simulated data sets provides insights into the stability and robustness of the performance. In order to evaluate the performance of the estimation process, mean square error (MSE) and bias are often calculated for each parameter. The MSE directly quantifies how different the estimates are from the true values, and the bias reflects whether or how the estimates skew off from the true values.

For a true parameter $b_{j}$ , where $j = 1, \dots, 10$ , taking the item’s difficulty parameters in the first study as an example, 100 simulated response matrixes are utilized to infer 100 estimated values of $b_{j}$ , which are denoted as ${\hat{b}}_{j}^{(l)}$ , where $l = 1, \dots, 100$ . For the $l$ -th simulated response matrix, we define the MSE of the difficulty parameter computed across items as follows:

MSE ({\hat{b}}^{(l)}, b) = \frac{1}{10} \sum_{j = 1}^{10} {({\hat{b}}_{j}^{(l)} - b_{j})}^{2}

(16)

where ${\hat{b}}^{(l)} = ({\hat{b}}_{1}^{(l)}, \dots, {\hat{b}}_{10}^{(l)})$ and $b = (b_{1}, \dots, b_{10})$ .

The results based on 100 simulated response matrixes can be considered as 100 Monte Carlo samples of estimates. The mean value is a point estimate of the true MSE using Monte Carlo estimation. Since the number of simulations is not large, Monte Carlo Standard Error (MCSE) is required to approximate the estimation noise and quantify how large the uncertainty is in the estimate of quantities such as MSE and bias (Koehler et al., 2009). The slight MCSE in the experiments supports, for the purpose of this tutorial, that it is sufficient to show the DL frameworks’ effectiveness with 100 simulated datasets. Different MSEs across the 100 simulated response matrixes are further summarized by calculating the mean and MCSE across replications:

MSE (\hat{b}, b) = \frac{1}{100} \sum_{l = 1}^{100} MSE ({\hat{b}}^{(l)}, b) .

(17)

MCSE [MSE (\hat{b}, b)] = \frac{1}{\sqrt{100}} \sqrt{\frac{1}{100} \sum_{l = 1}^{100} [MSE ({\hat{b}}^{(l)}, b) - MSE (\hat{b}, b)]^{2}} .

(18)

Bias also assesses a model’s ability to capture the underlying patterns in the data, and a bias value close to 0 indicates unbiased parameter estimation without skew. Again, for instance, with $l = 1, . ., 100$ simulated response matrixes, the bias of the item’s difficulty parameters is defined as

Bias (\hat{b}, b) = \frac{1}{100} \sum_{l = 1}^{100} Bias ({\hat{b}}^{(l)}, b) = \frac{1}{100} \sum_{l = 1}^{100} [\frac{1}{10} \sum_{j = 1}^{10} ({\hat{b}}_{j}^{(l)} - b_{j})] .

(19)

The MCSE of bias across replications can also be obtained as

MCSE [Bias (\hat{b}, b)] = \frac{1}{\sqrt{100}} \sqrt{\frac{1}{100} \sum_{l = 1}^{100} [Bias ({\hat{b}}^{(l)}, b) - Bias (\hat{b}, b)]^{2}} .

(20)

For high-dimensional parameters, the definitions of MSE and bias must be slightly modified. For example, considering the true parameters $A = (a_{1}, \dots, a_{10})$ , and the estimated parameters ${\hat{A}}^{(l)} = ({\hat{a}}_{1}^{(l)}, \dots, {\hat{a}}_{10}^{(l)})$ , $l = 1, \dots, 100$ , for the $l$ -th simulated response matrix, the MSE of discrimination parameters across items for the two-dimensional 2PL model is

MSE ({\hat{A}}^{(l)}, A) = \frac{1}{10} \sum_{j = 1}^{10} [\frac{1}{2} \sum_{k = 1}^{2} {({\hat{a}}_{j, k}^{(l)} - a_{j, k})}^{2}],

(21)

and the bias of discrimination parameters across items is defined by removing the square in Equation 21, similar to the first study.

4.1.2 Results

For the HMC method, as an extension of the MCMC method, a trace plot can evaluate whether the sampler sufficiently explores the parameter space without accepting or rejecting excessive proposals and check the convergence state. The trace plot is shown in Figure A1 in Supplementary Appendix (available in the online version of this article). The values change at an acceptable frequency and within a small range, indicating HMC has adequately explored the space of parameters and achieved convergence.

To assess bias and MSE, the mean and MCSE of MSE and bias of estimated difficulties and discrimination throughout 100 simulated response matrices are calculated as shown in Figure 3. For the HMC method, the performances in PyTorch and TensorFlow do not have an obvious difference. The estimates of parameters using HMC and VI approaches are accurate according to the fairly low MSE. The low bias indicates that MCMC inference and variational inference efficiently capture the underlying patterns in the data. The MCSE values are all small, indicating that the estimates of MSE and bias are stable across replications so 100 repetitions are sufficient to ensure the effectiveness of DL frameworks.

Figure 3.

MSE and bias of estimated parameters throughout 100 simulated response matrixes (a) MSE (b) Bias.

To evaluate the computational efficiency, the computation time for each simulation study is recorded, and the plot of computation times for the four simulation studies is summarized in Figure 4. VI is generally faster than HMC, especially for more complex models. The computation time for PyTorch and TensorFlow is similar, except for the multidimensional 2PL model, where TensorFlow outperforms PyTorch and is even faster than the TensorFlow codes in the first study.

Figure 4.

Computation time by model types and methods.

4.2 Simulation Study: Prior Sensitivity Analysis

This section presents a simulation study to evaluate the prior sensitivity of HMC and VI for the 2PL IRT model. Prior sensitivity analysis is crucial in Bayesian statistics, as it helps assess how different prior choices influence posterior estimates. This is particularly important in IRT applications with small item pools, where the amount of information available from the data may be limited, making the choice of priors more impactful.

We generate a 2PL IRT dataset with $n = 500$ examinees and $q = 10$ items. For each simulation, true abilities and item parameters are first drawn as $θ_{i} ~ N (0, 1)$ , $b_{j} ~ N (0, 1)$ , and $a_{j} ~ LogNormal (0, 0.35)$ . To avoid extreme item characteristics in a short test, we truncate $b_{j}$ to $[- 2.5, 2.5]$ and $a_{j}$ to $[0.5, 3.0]$ . These true parameters are then held fixed, and 100 independent response matrices are generated.

The prior sensitivity analysis is designed to reflect a practical situation in which the analyst may be unsure about how strongly to regularize item parameters when the number of items is small. Hence, we vary the scale of the priors for discrimination and difficulty while holding the ability prior constant: $θ_{i} ~ N (0, 1)$ , $b_{j} ~ N (0, σ_{b}^{2})$ , and $a_{j} ~ LogNormal (0, σ_{a}^{2})$ . The examined scales for $a_{j}$ and $b_{j}$ are $σ_{a}^{2} \in {0.5, 1, 1.5}$ and $σ_{b}^{2} \in {0.5, 1, 2}$ , respectively, yielding nine prior combinations. Smaller scales impose stronger shrinkage toward the centered prior locations, and larger scales represent weaker regularization. This range covers a reasonable spectrum from mildly informative to comparatively diffuse priors for short tests.

For HMC with NUTS, we use the implementation in Pyro, with 400 posterior samples following 300 warm-up iterations, one chain. For variational inference, we use SVI in Pyro with learning rate 0.03, 4,000 iterations; posterior summaries are obtained from 1,000 draws from the variational approximation. These settings are chosen to keep the two approaches comparable in computational budget while reflecting typical default choices for small-scale IRT applications.

Figure 5 shows the MSE results, and Figure A2 in the Appendix (available in the online version of this article) shows the bias results. Both methods recover difficulty reasonably well across prior scales, with biases close to zero and MSEs mostly between 0.028 and 0.056. The smallest difficulty prior scale (0.5) produces slight positive bias in $b_{j}$ for both HMC and VI (about 0.020), consistent with mild shrinkage toward zero, while the most diffuse difficulty prior (scale 2) yields modest negative bias and increased MSE, particularly when combined with larger discrimination scales. This pattern is consistent with the intuition that, with only 10 items, relaxing the difficulty prior too much can increase estimation noise.

Figure 5.

Discrimination and difficulty MSE across prior scale combinations for HMC and VI.

The more pronounced differences between methods arise for discrimination. When the discrimination prior is tight (scale 0.5), both methods show small positive bias in $a_{j}$ , but VI is consistently closer to zero and has lower MSE (e.g. at $(σ_{a}^{2}, σ_{b}^{2}) = (0.5, 0.5)$ the discrimination bias is 0.090 for HMC versus 0.063 for VI, and the MSE is 0.042 versus 0.033). At the balanced setting $(σ_{a}^{2}, σ_{b}^{2}) = (1, 1)$ , VI again shows small bias and a lower discrimination MSE (0.034) compared with HMC (0.056). As the discrimination prior becomes more diffuse (scale 1.5), the HMC discrimination MSE increases sharply and becomes highly variable across repetitions, whereas VI remains comparatively stable. For example, at $(σ_{a}^{2}, σ_{b}^{2}) = (1.5, 1)$ the HMC discrimination MSE rises to 0.13 with substantial dispersion across runs, while VI remains around 0.036. The MSE boxplots reinforce this contrast: VI’s discrimination error distribution stays narrow across prior scales, while HMC exhibits wider spread and more extreme outliers as the discrimination prior is loosened.

Taken together, these findings suggest that, under the present design with a short test and moderate sample size, VI provides more robust discrimination recovery to plausible prior-scale choices, while both methods are broadly comparable for difficulty. The elevated and unstable discrimination MSE under HMC with larger discrimination scales likely reflects the combination of weaker regularization and limited item information, which can make the posterior geometry harder to explore efficiently with the current sampling budget (one chain and 400 retained draws). In this setting, a moderate prior scale for item parameters (around 1) appears to offer a sensible trade-off between shrinkage and flexibility, yielding small bias and relatively low MSE for both approaches. However, VI’s apparent robustness to prior-scale variation may be partly driven by its tendency to concentrate mass in high-density regions of the posterior, thereby dampening the influence of weaker priors in the tails of the parameter space.

4.3 Simulation Study: Large-Scale Assessment

This section evaluates method performance in a setting that is closer to the scale of many contemporary testing programs. When the number of items and examinees increases, computational efficiency and numerical stability become central practical concerns, even for conceptually standard unidimensional IRT models. A large-response design therefore provides an appropriate benchmark for assessing whether DL platform-based implementations deliver reliable inference while maintaining feasible runtime in realistic, high-information applications.

We conduct a large-scale simulation under a unidimensional 2PL IRT model with $n = 5, 000$ examinees and $q = 100$ items. This design reflects typical large-assessment scenarios and provides a direct test of computational scalability in TensorFlow- and PyTorch-based implementations. For each replication, true person and item parameters are generated as $θ_{i} ~ N (0, 1)$ , $b_{j} ~ N (0, 1)$ , and $a_{j} ~ LogNormal (0, 0.35)$ . Then, we clip $b_{j}$ to $[- 2.5, 2.5]$ and $a_{j}$ to $[0.5, 3.0]$ , and fix the resulting parameter set and generate 100 independent response matrixes.

For three estimation approaches: TensorFlow HMC, PyTorch HMC, and PyTorch SVI, we set weakly informative priors: $θ_{i} ~ N (0, 1)$ , $b_{j} ~ N (0, 1)$ , and $a_{j} ~ LogNormal (0, 1)$ . For HMC, both TensorFlow and PyTorch implementations use 400 posterior samples after 300 warmup steps, with two chains, max_tree_depth = 10, and target_accept_prob = 0.8. For VI, we use 2,000 optimization iterations with learning rate 0.04, 20 particles for stochastic estimation of the ELBO, and 500 posterior samples drawn from the fitted variational distribution for evaluation.

Across replications, the MSE and bias of discrimination and difficulty parameters, along with computation time for each method, are summarized in Figure 6. All three methods achieve very small estimation error for both discrimination and difficulty. The MSE distributions in the accompanying boxplots are tightly concentrated, indicating stable recovery in this high-information setting. The two HMC implementations yield nearly identical accuracy, with difficulty MSE around 0.0027 (Standard deviation [SD] ≈ 0.0005) and discrimination MSE around 0.0021 (SD ≈ 0.0003). The VI approach matches HMC for discrimination MSE at approximately 0.0021 (SD ≈ 0.0003) and shows only a marginal increase for difficulty, around 0.0028 (SD ≈ 0.0005). Bias patterns are similarly close across methods for difficulty, with small positive bias of roughly 0.015 for all three approaches. For discrimination, the two HMC approaches show small positive bias (about 0.0067–0.0068 with SD ≈ 0.0050), whereas VI shows a small negative bias of similar magnitude (about $- 0.0078$ with SD ≈ 0.0046).

Figure 6.

MSE and bias of discrimination and difficulty parameters in a large-scale 2PL IRT simulation, along with computation time across three methods.

The most visible contrast lies in computation time. The TensorFlow HMC implementation requires the longest runtime, averaging about 647 s (SD ≈ 56). The PyTorch HMC implementation runs substantially faster, averaging about 475 s (SD ≈ 50), while maintaining the same level of accuracy. PyTorch VI achieves the shortest average runtime at roughly 422 s, although its variability is larger (SD ≈ 94), which is also reflected in the wider spread of the time boxplot. Overall, the results indicate that when the number of items and examinees increases to a scale that is common in large educational assessments, the PyTorch-based implementations retain accuracy comparable to a TensorFlow HMC benchmark while offering clear computational advantages, and VI provides an additional speed gain with only minor, directionally interpretable differences in discrimination bias.

4.4 Simulation Study: High-Dimensional Latent Traits

This section examines a multidimensional setting in which the latent structure is substantially richer than in standard unidimensional designs. As the number of traits increases, the parameter space expands rapidly and inference must recover both item parameters and the dependence structure among abilities. Evaluating performance in a seven-dimensional model therefore helps clarify whether deep learning platform-based implementations remain accurate and efficient when the latent space is relatively large.

We simulate responses under a confirmatory M2PL model with $n = 2, 000$ students, $q = 70$ items, and $K = 7$ latent traits. Student abilities are generated from a multivariate normal distribution with mean zero and an equicorrelated correlation matrix $Σ$ , where diagonal elements equal 1 and all off-diagonal elements equal a common correlation parameter. This construction produces moderate cross-trait dependence while keeping the covariance structure controlled and interpretable for recovery. Item difficulties are generated from $N (0, 1)$ and clipped to $[- 2.5, 2.5]$ to avoid extreme items. For discriminations, we first draw a positive magnitude for each item from a lognormal distribution and clip it to $[0.5, 3.0]$ , then impose a block-confirmatory structure by assigning exactly one nonzero loading per item: items 1–10 load on trait 1, items 11–20 load on trait 2, and so on, with 10 items per trait. The response probability follows the Equation 4. Specifically, for each item $j$ , the discrimination vector is set as $a_{j} = {(0, \dots, 0, a_{j}, 0, \dots, 0)}^{T}$ , which contains a single nonzero discrimination aligned with the item’s designated trait. We fix the generated $θ$ , $A$ (the matrix where each row is $a_{j}^{T}$ ), and $b$ as true values and then produce 100 independent response matrixes.

We compare PyTorch HMC and PyTorch VI under priors that mirror the data-generating structure. We place an Lewandowski-Kurowicka-Joe (LKJ) prior on the ability correlation matrix via LKJCholesky(K, 1.0) to express a weakly informative, uniform preference over correlation structures, and draw student abilities from $N (0, Ω)$ using the sampled Cholesky factor. For items, we use $b_{j} ~ N (0, 1)$ and $a_{j} ~ LogNormal (0, 1)$ , then construct the full discrimination matrix by inserting the nonzero element into the prespecified trait position for each item. For HMC, we use 600 posterior samples after 200 warm-up iterations, one chain, max_tree_depth = 10, and target_accept_prob = 0.8. For VI, we use a learning rate of 0.006, 3,000 iterations, and 200 posterior draws from the variational approximation for evaluation.

Figure 7 summarizes the MSE and bias of ability correlation, item difficulty, and discrimination parameters, along with computation time for each method across the 100 replications. The results show that both methods recover the key components of the model with small errors despite the seven-dimensional latent space. For the ability correlation parameters, HMC produces essentially unbiased estimates on average (bias $\approx 0.0000$ with SD $\approx 0.0076$ ), while VI shows a modest negative bias (about $- 0.0214$ with SD $\approx 0.0084$ ). This difference is visible in the bias boxplot for ability correlation. However, the corresponding MSE values remain close, with HMC around 0.0009 (SD $\approx 0.0003$ ) and VI around 0.0011 (SD $\approx 0.0005$ ), indicating that the practical impact of the VI bias is limited in this setting. For item difficulty, both methods achieve low error, with MSE near 0.0037 for HMC and 0.0040 for VI; VI shows slightly larger positive bias (about 0.0065) than HMC (about 0.0015), which is also consistent with the central tendency of the bias boxplots. For discrimination, HMC and VI exhibit opposite but similarly small bias directions (approximately $- 0.0106$ for HMC and $0.0104$ for VI), while their MSE values are again close (roughly 0.0080 versus 0.0084). The MSE boxplots suggest that these differences are stable across replications rather than driven by a small number of extreme runs.

Figure 7.

MSE and bias of discrimination, difficulty, and ability correlation parameters in a confirmatory M2PL IRT simulation, along with computation time across two methods.

Computation time separates the two approaches most clearly. HMC requires substantially longer runtime on average (about 588 s) and shows high variability across replications (SD $\approx 384$ ), reflecting the increased cost of exploring a higher-dimensional posterior. VI reduces the average runtime to about 54 s with much smaller dispersion (SD $\approx 3.4$ ), which is evident in the narrow time boxplot. Taken together, these findings indicate that PyTorch VI provides a large efficiency gain in a seven-trait M2PL model while maintaining accuracy close to HMC for item parameters and MSE of ability correlations, albeit with a small directional bias in the recovered correlation structure, which may warrant that VI might underestimate certain aspects of posterior uncertainty in higher dimensions.

5. Empirical Study

This section details an empirical investigation utilizing the “bfi” dataset from the psychpackage in R (Revelle, 2024) to illustrate parameter estimation for the PCM using both PyTorch and TensorFlow. The findings are further contrasted with estimates obtained from WinBUGS and R’s package ltm, as reported by Li and Baser (2012). The “bfi” dataset comprises responses from 2,800 subjects to 25 personality self-report items, corresponding to the five hypothesized dimensions of personality, namely the Big Five traits (Goldberg, 1992). In this study, our focus is confined to the five items that assess neuroticism, a trait defined by a predisposition to experience negative emotions. These items evaluate tendencies such as anger (e.g. “Get angry easily”), irritation and mood instability (e.g. “Get irritated easily” and “Have frequent mood swings”), depression (e.g. “Often feel blue”), and anxiety (e.g. “Panic easily”).

Each item is rated on a six-point scale, ranging from “1. Very inaccurate” to “6. Very accurate.” In the context of IRT, these response categories function as indicators of the latent construct of neuroticism, with higher cumulative scores signifying greater severity of the trait. Unlike conventional summary scores, IRT modeling permits a more refined interpretation by accounting for each item’s severity threshold (i.e. the difficulty $b_{j, k}$ ) and its sensitivity to variations in neuroticism (i.e. the discrimination $a_{j}$ ).

For illustrative purposes, we present the discrimination estimates for all items and the difficulty estimates for the first item as representative examples of our parameter estimation outcomes. In the absence of definitive “true” parameter values, we compared the posterior distributions derived from PyTorch HMC, TensorFlow HMC, and PyTorch VI. As shown in Figure 8, the posterior distributions produced by these methods are highly congruent.

Figure 8.

Posterior distributions derived from PyTorch HMC, TensorFlow HMC, and PyTorch VI (a) Discrimination (b) Difficulty (Item 1).

Furthermore, for each neuroticism item, we compute the posterior mean as the point estimate. For simplicity, the average difficulty estimates across the response thresholds are calculated as $\frac{1}{5} \sum_{k = 1}^{5} b_{j, k}$ . As demonstrated in Table 2, the results obtain from PyTorch and TensorFlow are similar, and they also align closely with the five-factor structure in estimates generated by the MCMC approach in WinBUGS and R’s package ltm as reported by Li and Baser (2012). This convergence underscores the robustness and consistency of the parameter estimation methodology employed in this study.

Table 2.

Parameter Estimates Derived from PyTorch HMC, TensorFlow HMC, PyTorch VI, WinBUGS and R

Parameter	PyTorch HMC	TensorFlow HMC	PyTorch VI	WinBUGS	R ltm
Discrimination
Angry	1.402	1.545	1.401	1.576	1.589
Irritated	1.607	1.780	1.612	1.970	1.969
Swings	1.072	1.035	1.069	0.958	0.931
Blue	0.454	0.460	0.453	0.438	0.417
Panic	0.469	0.480	0.470	0.458	0.448
Difficulty
Angry	0.418	0.413	0.414	0.409	0.440
Irritated	−0.086	−0.081	−0.084	−0.094	−0.071
Swings	0.235	0.225	0.230	0.248	0.281
Blue	0.402	0.331	0.400	0.374	0.422
Panic	0.596	0.543	0.595	0.613	0.702

Note. HMC = Hamiltonian Monte Carlo; VI = Variational Inference.

Two more empirical studies are presented in Supplementary Appendix (available in the online version of this article) to further illustrate the application of the DL framework in practice. One study analyzes an educational assessment dataset to illustrate the practical use of the GRM implemented in PyTorch and compares the results with those from R’s package mirt. The other study analyzes a large-scale psychological assessment dataset to show the advantages of GPU acceleration.

6. Discussion

DL offers powerful tools for modeling complex relations in data, especially when leveraging highly expressive models and large datasets. PyTorch and TensorFlow, together with their probabilistic ecosystems (e.g. Pyro and TFP), provide flexible and scalable environments for statistical modeling, including IRT models.

In this article, we provided a detailed introduction to using PyTorch and TensorFlow to fit common IRT models. Our simulation studies demonstrate excellent estimation performance, with low MSEs and biases for model parameters. Both the MCMC method using NUTS and the direct application of VI efficiently estimate model parameters. A prior sensitivity analysis further illustrated that HMC method might be more sensitive to choices of prior scales than VI in moderate-scale assessments. Two large-scale simulation studies showcased the scalability of these frameworks for unidimensional and multidimensional IRT models. Three empirical study further illustrated how these two frameworks function in practice.

Two notable observations emerged. First, regarding user-friendliness, PyTorch and TensorFlow differ in practice. Based on the length of code snippets and the complexity of code design, especially for VI implementations, PyTorch appears to offer a more accessible interface, likely aided by its rich set of predefined functions. This aligns with evidence from broader comparisons of the two ecosystems (Novac et al., 2022). Second, our analysis indicates that VI is substantially faster than MCMC, particularly for large-scale or high-dimensional models, while maintaining comparable accuracy in parameter recovery. However, it is important to recognize VI’s well-documented tendency to concentrate posterior mass in regions of high density and underestimate posterior variances (Blei et al., 2017), a pattern that is reflected in our results through weaker sensitivity to the prior scale in the prior sensitivity analysis and a negative bias in the recovery of ability correlations in the high-dimensional M2PL simulation. In psychometrics, related concerns and potential remedies have been discussed for multidimensional IRT and extensions of Gaussian VI (Cho et al., 2021; C. Ma et al., 2024). Therefore, VI should be used with caution when the primary goal is statistical inference rather than point estimation. In general, for practitioners prioritizing computational efficiency and scalability, VI presents a compelling option, whereas MCMC remains a robust choice when accurate uncertainty quantification is essential.

Although we did not empirically compare our implementations to dominant psychometric software (e.g. flexMIRT or ConQuest Adams et al., 2020; Chung & Houts, 2020) or to alternative highly optimized methods for large-scale or complex IRT models, adopting these DL frameworks remains meaningful. A key reason is that the AutoDiff tools in PyTorch and TensorFlow play a crucial role in efficiently computing gradients, facilitating algorithms that require derivative computations, such as HMC. These tools are not restricted to fully Bayesian analyses with priors on all parameters. The same computational infrastructure can support frequentist IRT estimation approaches, including MML-type objectives and joint maximum likelihood-style optimization. For example, Gaussian variational EM provides an efficient likelihood-based alternative for MIRT calibration in a frequentist spirit (Cho et al., 2021), and recent work has used neural architectures to amortize JMLE for item factor models (Molenaar et al., 2025). Related large-scale JML developments in psychometrics also highlight how optimization-based approaches can be competitive when dimensionality is high (Chen et al., 2019). In addition, VI objectives in these frameworks are differentiable and can, in principle, benefit from modern accelerators (e.g. GPUs), which further motivates learning and adopting this computational environment.

More broadly, leveraging DL frameworks not only enables efficient computation but also opens opportunities to integrate neural network architectures into IRT modeling. This integration may help capture complex, nonlinear relationships and better handle diverse data modalities, potentially expanding the scope of traditional IRT. Specifically, the likelihood function or parameter probability density can be modeled with flexible neural modules (Cho et al., 2021; Liu et al., 2022; Urban & Bauer, 2021). Advanced architectures such as convolutional neural networks (LeCun et al., 1990) and long short-term memory networks (Hochreiter & Schmidhuber, 1997) offer a natural way to represent structured inputs that extend beyond traditional item formats. This flexibility has the potential to enrich the IRT literature by enabling principled modeling of multimodal data within a unified estimation framework.

The fusion of IRT and DL also offers a practical path for handling models with a large number of parameters and complex input representations, with clear value when multimodal information is tied to explicit measurement targets. Rather than displacing response data, multimodal inputs can be used to derive or augment psychometrically meaningful variables that enter familiar IRT structures. For example, IRT models informed by item text or format demonstrate how auxiliary representations can improve measurement and interpretation (B. Ma et al., 2022; Cheng et al., 2019). A concrete multimodal use case is spoken language proficiency: examinee audio responses can be processed by neural models to extract substantively aligned features (e.g. fluency, pronunciation, prosody) or to map responses into rubric-based categorical scores, which can then be calibrated as polytomous items or incorporated as predictors in explanatory IRT frameworks. This provides explicit grounding in the psychometric goal of improving measurement of speaking ability while retaining the interpretability of IRT. Related work evaluating automated speech assessment systems with IRT-based fairness and bias analyses further indicates a realistic pathway for integrating audio-derived evidence into operational measurement (Kwako et al., 2022). Similar strategies could be extended to other modalities (e.g. image-based tasks, sentiment-informed textual feedback, or physiological signals) when those data streams are conceptually linked to well-defined constructs and scoring rubrics in assessment contexts.

We hope this article assists researchers in understanding and conceptualizing Bayesian IRT analysis using powerful DL computing frameworks. While HMC and VI have been increasingly discussed in the IRT literature, we provide an accessible introduction and concrete code examples for common dichotomous and polytomous IRT models, alongside a practical comparison of the two dominant DL platforms. This foundation may help readers engage with recent advances in Bayesian estimation using modern VI variants and neural-enhanced IRT approaches (Cho et al., 2021; Liu et al., 2022; C. Ma et al., 2024; Urban & Bauer, 2021).

Despite the promise of integrating IRT with DL frameworks, several challenges and limitations remain. Incorporating techniques from other fields into traditional IRT requires reconciling differences in methodology, data structures, and analytical conventions, often demanding substantial effort and specialized knowledge. Moreover, the inherent complexity of DL models can reduce interpretability, a critical concern in educational and psychological assessment where transparency is essential. Finally, our study does not provide a direct empirical comparison of parameter accuracy or computational efficiency against conventional estimation methods (e.g. MMLE) implemented in established IRT software such as flexMIRT or ConQuest (Adams et al., 2020; Chung & Houts, 2020). We leave these comparisons for future research.

Supplemental Material

sj-pdf-1-jeb-10.3102_10769986261439301 – Supplemental material for Fitting Bayesian Item Response Theory Models Using Deep Learning Computational Frameworks

Supplemental material, sj-pdf-1-jeb-10.3102_10769986261439301 for Fitting Bayesian Item Response Theory Models Using Deep Learning Computational Frameworks by Nanyu Luo, Yuting Han, Jinbo He, Xiaoya Zhang and Feng Ji in Journal of Educational and Behavioral Statistics

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Feng Ji was supported by the Seed Grant of the International Network of Educational Institutes (Grant Number 522011), the Connaught Fund (Grant Number 520245), and the Social Sciences and Humanities Research Council (SSHRC) of Canada (Grant Number 215119, Canada Research Chair: CRC-2024-00169).

ORCID iDs

Nanyu Luo

Yuting Han

Feng Ji

Notes

Authors

NANYU LUO is a PhD candidate in the Department of Applied Psychology and Human Development at the University of Toronto, Toronto, ON M5S 1V6, Canada; email: n.luo@mail.utoronto.ca. His research interests include psychometric modeling, educational measurement, and the application of machine learning, deep learning, and LLMs in psychological and educational assessments.

YUTING HAN is a lecturer in the Cognitive Science and Allied Health School at Beijing Language and Culture University, Beijing 100083, China; email: hanyuting716@gmail.com. Her research interests include process data modeling, applications of pretrained language models in psychological assessments, and computerized adaptive testing technologies.

JINBO HE is an associate professor in the Department of Biosciences and Bioinformatics at the School of Science, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China; email: Jinbo.He@xjtlu.edu.cn. His research interests include health psychology (e.g. eating behaviors, body image, obesity, as well as their links with psychosocial well-being, and related prevention/intervention strategies) and quantitative methods (e.g. advanced statistics, big data analytics, longitudinal cohort analysis, and machine learning).

XIAOYA ZHANG is an assistant professor in the Department of Family, Youth and Community Sciences at the University of Florida, Gainesville, FL 32611, USA; email: xiaoyazhang@ufl.edu. Her research interests include individual differences and population heterogeneity in susceptibility to positive and negative experiences, the broader determinants and impacts of adversity on youth mental and behavioral health, and precision-based prevention, using advanced statistical and computational methods.

FENG JI is an assistant professor in the Department of Applied Psychology and Human Development, at University of Toronto, Toronto, ON M5S 1A1, Canada; email: f.ji@utoronto.ca. His research interests include quantitative and statistical methods in behavioral, educational, and social sciences research.

References

Abadi

Barham

Chen

Davis

Dean

Devin

Ghemawat

Irving

Isard

Kudlur

Levenberg

Monga

Moore

Murray

D. G.

Steiner

Tucker

Vasudevan

Warden

Zheng

(2016). Tensorflow: A system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), Savannah, GA, United States (pp. 265–283). USENIX Association. https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf

Adams

R. J.

M. L.

Cloney

Berezner

Wilson

M. R.

(2020). Acer conquest: Generalised item response modelling software. Computer software. https://www.acer.org/au/conquest

Birnbaum

(1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M.

Lord

M. R.

Novick

(Eds.), Statistical theories of mental test scores (pp. 397–479). Addison-Wesley.

Blei

D. M.

Kucukelbir

McAuliffe

J. D.

(2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877.

Bock

R. D.

(1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37(1), 29–51.

Cai

(2010). Metropolis-hastings Robbins-Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35(3), 307–335.

Carpenter

Gelman

Hoffman

M. D.

Lee

Goodrich

Betancourt

Brubaker

Guo

Riddell

(2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 1–32.

Chen

Zhang

(2019). Joint maximum likelihood estimation for high-dimensional exploratory item factor analysis. Psychometrika, 84(1), 124–146.

Cheng

Liu

Chen

Huang

Chen

(2019). Dirt: Deep learning enhanced item response theory for cognitive diagnosis. In Proceedings of the 28th ACM international conference on information and knowledge management (pp. 2397–2400). http://home.ustc.edu.cn/∼huangzhy/files/papers/SongCheng-CIKM2019.pdf

10.

Cho

A. E.

Wang

Zhang

(2021). Gaussian variational estimation for multidimensional item response theory. British Journal of Mathematical and Statistical Psychology, 74(Suppl. 1), 52–85.

11.

Chung

Houts

(2020). flexmirt: A flexible modeling package for multidimensional item response models. Measurement: Interdisciplinary Research and Perspectives, 18(1), 40–54.

12.

Converse

Curi

Oliveira

Templin

(2021). Estimation of multidimensional item response theory models with correlated latent variables using variational autoencoders. Machine Learning, 110(6), 1463–1480.

13.

Cybenko

(1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.

14.

Floridi

Chiriatti

(2020). Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30(4):681–694.

15.

Fox

J.-P.

(2010). Bayesian item response modeling: Theory and applications. Springer.

16.

Gao

Chen

Zhang

(2020). A survey on deep learning for multimodal data fusion. Neural Computation, 32(5), 829–864.

17.

Goldberg

L. R.

(1992). The development of markers for the big-five factor structure. Psychological Assessment, 4(1), 26–42.

18.

Hastings

W. K.

(1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97–109.

19.

Hochreiter

Schmidhuber

(1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

20.

Hoffman

M. D.

Gelman

(2014). The no-u-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.

21.

Jeon

Rijmen

Rabe-Hesketh

(2017). A variational maximization–maximization algorithm for generalized linear mixed models with crossed random effects. Psychometrika, 82(3), 693–716.

22.

Johnson

M. J.

Duvenaud

D. K.

Wiltschko

Adams

R. P.

Datta

S. R.

(2016). Composing graphical models with neural networks for structured representations and fast inference. In F.

Pereira

Burges

Bottou

Weinberger

(Eds.), Advances in neural information processing systems (Vol. 29, pp. 2946–2954). Curran Associates, Inc.

23.

Kingma

D. P.

(2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, San Diego, CA, United States. https://arxiv.org/pdf/1412.6980

24.

Koehler

Brown

Haneuse

S. J.-P. A.

(2009). On the assessment of Monte Carlo error in simulation-based statistical analyses. The American Statistician, 63(2), 155–162.

25.

Krizhevsky

Sutskever

Hinton

G. E.

(2012). Imagenet classification with deep convolutional neural networks. In F.

Pereira

Burges

Bottou

Weinberger

(Eds.), Advances in neural information processing systems (Vol. 25, pp. 1097–1105). Curran Associates, Inc.

26.

Kwako

Wan

Zhao

Chang

K.-W.

Cai

Hansen

(2022). Using item response theory to measure gender and racial bias of a BERT-based automated English speech assessment system. In Proceedings of the 17th workshop on innovative use of NLP for building educational applications (BEA 2022), Seattle, Washington (pp. 1–7). Association for Computational Linguistics. https://aclanthology.org/2022.bea-1.1.pdf

27.

LeCun

Bengio

Hinton

(2015). Deep learning. Nature, 521(7553), 436–444.

28.

LeCun

Boser

Denker

J. S.

Henderson

Howard

R. E.

Hubbard

Jackel

L. D.

(1990). Handwritten digit recognition with a back-propagation network. In F.

Pereira

Burges

Bottou

Weinberger

(Eds.), Advances in neural information processing systems (Vol. 2, pp. 396–404). Curran Associates, Inc.

29.

Baser

(2012). Using r and Winbugs to fit a generalized partial credit model for developing and evaluating patient-reported outcomes assessments. Statistics in Medicine, 31(18), 2010–2026.

30.

Liu

Wang

(2022). Estimating three- and four-parameter Mirt models with importance-weighted sampling enhanced variational auto-encoder. Frontiers in Psychology, 13, Article 935419.

31.

Lord

F. M.

Novick

M. R.

(1968). Statistical theories of mental test scores. Addison-Wesley.

32.

Luo

(2025). Generative adversarial networks for high-dimensional item factor analysis: A deep adversarial learning algorithm. Psychometrika, 90(5), 1765–1788.

33.

Luo

Jiao

(2018). Using the stan program for Bayesian item response theory. Educational and Psychological Measurement, 78(3), 384–408.

34.

Hettiarachchi

G. P.

Ando

(2022). Format-aware item response theory for predicting vocabulary proficiency. In Proceedings of the 15th international conference on educational data mining, Durham (pp. 695–700). International Educational Data Mining Society. https://educationaldatamining.org/edm2022/proceedings/2022.EDM-posters.84/2022.EDM-posters.84.pdf

35.

Ouyang

Wang

(2024). A note on improving variational estimation for multidimensional item response theory. Psychometrika, 89(1), 172–204.

36.

Manning

C. D.

(2015). Computational linguistics and deep learning. Computational Linguistics, 41(4), 701–707.

37.

Masters

G. N.

(1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174.

38.

McClure

(2023). Bayesian IRT in jags: A tutorial. Journal of Behavioral Data Science, 3(1), 84–107.

39.

Metropolis

Rosenbluth

A. W.

Rosenbluth

M. N.

Teller

A. H.

Teller

(1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6), 1087–1092.

40.

Molenaar

Grasman

R. P. P. P.

Cúri

(2025). Autoencoders for amortized joint maximum likelihood estimation of confirmatory item factor models. Multivariate Behavioral Research, 60(4), 657–677.

41.

Nalisnick

Smyth

Tran

(2023). A brief tour of deep learning from a statistical perspective. Annual Review of Statistics and Its Application, 10(1), 219–246.

42.

Natesan

Nandakumar

Minka

Rubright

J. D.

(2016). Bayesian prior choice in IRT estimation using MCMC and variational Bayes. Frontiers in Psychology, 7, Article 422.

43.

Neal

R. M.

(2011). MCMC using Hamiltonian dynamics. In S.

Brooks

Gelman

G. L.

Jones

X.-L.

Meng

(Eds.), Handbook of Markov Chain Monte Carlo (Chapter 5, pp. 113–162). Chapman and Hall/CRC.

44.

Novac

O.-C.

Chirodea

M. C.

Novac

C. M.

Bizon

Oproescu

Stan

O. P.

Gordan

C. E.

(2022). Analysis of the application efficiency of TensorFlow and PyTorch in convolutional neural network. Sensors, 22(22), Article 8872.

45.

Pang

Nijkamp

Y. N.

(2020). Deep learning with TensorFlow: A review. Journal of Educational and Behavioral Statistics, 45(2), 227–248.

46.

Paszke

Gross

Chintala

Chanan

Yang

DeVito

Lin

Desmaison

Antiga

Lerer

(2017). Automatic differentiation in PyTorch. Presented at the NIPS 2017 Autodiff Workshop. https://openreview.net/pdf?id=BJJsrmfCZ

47.

Paszke

Gross

Massa

Lerer

Bradbury

Chanan

Killeen

Lin

Gimelshein

Antiga

Desmaison

Köpf

Yang

DeVito

Raison

Tejani

Chilamkurthy

Steiner

Fang

Chintala

(2019). Pytorch: An imperative style, high-performance deep learning library. In F.

Pereira

Burges

Bottou

Weinberger

(Eds.), Advances in neural information processing systems (Vol. 32, pp. 8024–8035). Curran Associates, Inc.

48.

Patz

R. J.

Junker

B. W.

(1999a). A straightforward approach to Markov Chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24(2), 146–178.

49.

Patz

R. J.

Junker

B. W.

(1999b). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24(4), 342–366.

50.

Rasch

(1960). Probabilistic models for some intelligence and attainment tests. Danmarks Pædagogiske Institut. (Reprinted by University of Chicago Press, 1980)

51.

Reckase

M. D.

(2009). Multidimensional item response theory. Springer.

52.

Revelle

(2024). psych: Procedures for psychological, psychometric, and personality research (R package version 2.4.6). Northwestern University.

53.

Samejima

(1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika, 34(Suppl. 1), 1–97.

54.

Urban

C. J.

Bauer

D. J.

(2021). A deep learning algorithm for high-dimensional exploratory item factor analysis. Psychometrika, 86(1):1–29.

55.

Vaswani

Shazeer

Parmar

Uszkoreit

Jones

Gomez

A. N.

Kaiser

Ł.

Polosukhin

(2017). Attention is all you need. In F.

Pereira

Burges

Bottou

Weinberger

(Eds.), Advances in neural information processing systems (Vol. 30, pp. 6000–6010). Curran Associates, Inc.

56.

Wang

Graves

Rosseel

Merkle

E. C.

(2022). Computation and application of generalized linear mixed model derivatives using lme4. Psychometrika, 87(3):1173–1193.

57.

Wang

Strobl

Zeileis

Merkle

E. C.

(2018). Score-based tests of differential item functioning via pairwise maximum likelihood estimation. Psychometrika, 83(1):132–155.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

1.40 MB