Abstract
In this article, I introduce the
Keywords
1 Introduction
Item response theory (IRT) is a family of latent-variable models used to explain test behavior by explicitly distinguishing item properties from the properties of test takers. IRT data analysis is a common approach in psychological, educational, medical, and sociological research, wherever the collected data take the form of responses to a psychometric test. Test construction, computerized adaptive testing, analysis of incomplete testing designs, test equating, and differential item functioning (DIF) analysis are just a few applications where IRT models are especially useful. The versatility of IRT stems from the fact that it naturally handles missing data and allows controlling for measurement unreliability.
The rising popularity of IRT has been accompanied by the rise of Stata commands to conduct IRT-related tasks. In older versions of Stata, users could perform some unidimensional IRT analyses by formulating the IRT model in terms of a generalized linear mixed-effects model (Zheng and Rabe-Hesketh 2007). In Stata 14, a built-in
However, there are some limitations to the native
This article will explain some of the inner workings of
2 The uirt command
2.1 Description
2.2 Syntax
2.3 Options
A short description of
Options of
2.4 Postestimation
Some
one might split the analysis into four steps with the same result:
Running these postestimation commands only once after
Postestimation commands of
3 Item-fit analysis
3.1 Background
Let
where fj is a function that describes the probability of observing an item response yj conditional on the latent trait level of the test taker, θ, and Ψg is the a priori distribution of θ in group g. The shape of fj depends on a vector of item parameters
The validity of inferences made with any model is conditional on the extent to which it fits the data. The structure of data that are gathered in psychometric testing and the form of the model presented in (1) imply an item-by-item strategy of fit assessment in IRT. If a particular item does not fit the data well, a researcher can choose another family of fj for that item, discard the item from the analyses, or, when it is in a test-development stage, modify the test item.
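The structure of the model referred to as (1) can be illustrated with a short numerical sketch. It assumes two-parameter logistic response functions for fj, a normal a priori distribution Ψg, and a plain equally spaced grid in place of proper quadrature. All function names and item parameters are hypothetical and are not part of uirt.

```python
import math

def irf_2pl(theta, a, b):
    """P(y = 1 | theta) under a 2PL item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def normal_pdf(theta, mean=0.0, sd=1.0):
    """Density of the assumed a priori distribution of theta in a group."""
    z = (theta - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def conditional_likelihood(responses, items, theta):
    """Product of f_j(y_j | theta) over items; None marks a missing response."""
    like = 1.0
    for y, (a, b) in zip(responses, items):
        if y is None:
            continue  # missing responses simply drop out of the product
        p = irf_2pl(theta, a, b)
        like *= p if y == 1 else 1.0 - p
    return like

def marginal_likelihood(responses, items, mean=0.0, sd=1.0,
                        lo=-6.0, hi=6.0, n=241):
    """Integrate the conditional likelihood over theta on an equally spaced grid."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        theta = lo + k * step
        total += conditional_likelihood(responses, items, theta) \
            * normal_pdf(theta, mean, sd) * step
    return total

# Hypothetical (a, b) parameters for three dichotomous items.
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]
pattern_probability = marginal_likelihood([1, 0, None], items)
```

Because a missing response contributes no factor to the product, incomplete response vectors need no special treatment, which is the sense in which IRT handles missing data naturally.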
The
Let us begin with the graphical item-fit analysis. It is accomplished by computing the observed proportions of responses to each item category cj, cj ∊ {0, …, max(Yj)}, over quantile-based groups of θ, and plotting them against the estimated response functions
The
with numerical integration performed by Gauss–Hermite quadrature used for the denominator and Gauss–Legendre quadrature for the numerator.
The observed proportion for response category cj in interval Δk is obtained as a weighted mean over all m test takers who responded to item j,
where 1_cj(yij) is an indicator function, equal to 1 if yij = cj and 0 otherwise. The observed proportions (2) are plotted against the response category curves
To perform a statistical test for an item fit, we can use a similar strategy of weighting by the a posteriori group membership probability. However, instead of category proportions, the item mean is computed,
and it is paired with the model-based expected item mean,
where
To test a null hypothesis that the vector of observed means is equal to the expected means vector, one uses a Wald-type test statistic (Kondratek Forthcoming),
with an asymptotic covariance matrix
The
The
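The weighting idea described in this section can be sketched numerically. The code below is a simplified illustration, not the uirt implementation: it assumes 2PL items, a standard normal a priori distribution, intervals cut at quantiles of that distribution, a posterior conditioned on the full response vector, and a plain grid instead of Gauss–Hermite and Gauss–Legendre quadrature. All names and data are hypothetical.

```python
import math
from statistics import NormalDist

PRIOR = NormalDist(0.0, 1.0)  # assumed a priori distribution of theta

def irf_2pl(theta, a, b):
    """P(y = 1 | theta) for a 2PL item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def interval_weights(responses, items, cuts, lo=-6.0, hi=6.0, n=601):
    """A posteriori probability that theta lies in each interval set by cuts."""
    step = (hi - lo) / (n - 1)
    mass = [0.0] * (len(cuts) + 1)
    for k in range(n):
        theta = lo + k * step
        w = PRIOR.pdf(theta)
        for y, (a, b) in zip(responses, items):
            if y is None:
                continue  # missing responses do not enter the likelihood
            p = irf_2pl(theta, a, b)
            w *= p if y == 1 else 1.0 - p
        mass[sum(theta > c for c in cuts)] += w  # index of the interval
    total = sum(mass)
    return [m / total for m in mass]

def observed_proportions(data, items, j, category, cuts):
    """Posterior-weighted proportion of `category` on item j in each interval."""
    num = [0.0] * (len(cuts) + 1)
    den = [0.0] * (len(cuts) + 1)
    for responses in data:
        if responses[j] is None:
            continue  # only test takers who responded to item j count
        for k, w in enumerate(interval_weights(responses, items, cuts)):
            den[k] += w
            if responses[j] == category:
                num[k] += w
    return [nk / dk if dk > 0 else None for nk, dk in zip(num, den)]

# Hypothetical item parameters, response data, and tertile cut points.
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]
data = [[1, 1, 1], [1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 1, None], [1, 0, 1]]
cuts = [PRIOR.inv_cdf(1 / 3), PRIOR.inv_cdf(2 / 3)]
prop_correct = observed_proportions(data, items, 0, 1, cuts)
prop_incorrect = observed_proportions(data, items, 0, 0, cuts)
```

Each test taker contributes to every interval in proportion to the a posteriori probability of belonging to it, so no one is assigned to a single group on the basis of a point estimate.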
3.2 Example
In this section, we will examine the fit of three IRT models using
1PLM is a dichotomous case of a partial credit model, so to fit 1PLM to all items, we have to use the
After the model is fit, we can inspect the item-fit statistics. To compute
Additionally, for a dataset consisting of dichotomous items with no missing responses, the classical S-X² item-fit statistic is also available:
Both

Graphical item-fit analysis of items
Graphs for items
2PLM is the default in
(output omitted )
Instead of looking at the default results table, which is lengthy, we will inspect the estimated item parameters stored in the
There is a noticeable spread in estimated discrimination parameters, with the biggest changes relative to 1PLM happening to the previously discussed pair of items
To compare the 2PLM with 1PLM, we can conduct an LR test,
to conclude that, indeed, a model with item-specific discrimination parameters provides a significantly better overall model fit.
Let us now inspect the model fit at an item level for the six items that produced significant misfit under the 1PLM with the
We see that five previously misfitting items do not give significant test results. However, item q6 still does. We will thus perform a graphical item-fit analysis on item q6, also including the previously analyzed pair q1 and q7:
Resulting graphs are presented in figure 2. Comparison of graphs for

Graphical item-fit analysis of items
Fitting a 3PLM to all items of a test may be impossible without imposing priors on the pseudoguessing parameter, because the likelihood function of this parameter is very flat. One can impose prior distributions on parameters of dichotomous items in
The
Let us see how it works with our data:
The estimated item parameters in compact form are
We see that the explorative algorithm has fit the 3PLM model only to the
Now let us inspect how the item-fit statistics of
Indeed, by changing the model of a single item from 2PLM to 3PLM, we have arrived at a hybrid IRT model that does not produce significant item misfit for

Graphical item-fit analysis of item
4 DIF
4.1 Background
A general definition of DIF may be formulated as (see Penfield and Camilli [2007])
Equation (3) tells us that the distribution of the item response, conditional on the latent trait, differs between groups. Significant DIF is a signal that factors other than the latent trait influence the item response behavior and that these factors covary with group membership. The presence of DIF leads to questions about test validity. It may indicate that the item is biased toward some group of test takers, which poses a threat to test fairness; it may also reveal that the underlying latent trait is of higher dimensionality than assumed.
Because the IRT model is explicitly defined by terms that describe the item response probabilities conditional on the latent trait (1), it provides a straightforward framework for DIF analysis. An IRT model for DIF is obtained by introduction of group-specific parameters for the item that is being tested for DIF:
To test for the presence of DIF, one performs an LR test. It compares the restricted model (1), which has equal item parameters between groups, with the unrestricted model (4), which introduces group-specific parameters for item j. The usual scenario for DIF analysis deals with two groups, the focal and the reference group, G ∊ {f, r}. In such a case, the null hypothesis is
The alternative hypothesis is
The number of degrees of freedom of the LR statistic is equal to the difference in the number of estimated parameters between the two models.
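The LR test described above can be sketched in a few lines. The log-likelihood values and parameter counts below are hypothetical; the chi-squared tail probability is computed with closed-form expressions that hold for integer degrees of freedom, so no external library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    t = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-t) * sum_{i=0}^{df/2-1} t^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= t / i
            total += term
        return math.exp(-t) * total
    # Odd df: erfc(sqrt(t)) + exp(-t) * sum_{i=1}^{(df-1)/2} t^(i-1/2) / Gamma(i+1/2)
    total = math.erfc(math.sqrt(t))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-t) * t ** (i - 0.5) / math.gamma(i + 0.5)
    return total

def lr_test(ll_restricted, ll_unrestricted, k_restricted, k_unrestricted):
    """LR statistic, degrees of freedom, and p-value for two nested models."""
    stat = 2.0 * (ll_unrestricted - ll_restricted)
    df = k_unrestricted - k_restricted
    return stat, df, chi2_sf(stat, df)

# Hypothetical example: a 2PL item freed between two groups adds two parameters
# (one extra discrimination and one extra difficulty) to the unrestricted model.
stat, df, p_value = lr_test(ll_restricted=-1000.0, ll_unrestricted=-995.0,
                            k_restricted=20, k_unrestricted=22)
```

With these made-up values, the statistic is 10 on 2 degrees of freedom, which gives a p-value of exp(−5), or about 0.0067.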
However, a statistically significant result of an LR test for DIF carries no information on the actual degree to which measurement invariance is violated. Minuscule differences in group-specific item parameters may produce a positive test result if the sample size is large enough. A proper measure of effect size is necessary to determine whether a statistically significant DIF is of practical importance. One of the effect size measures used in DIF analysis is the P-DIF index, in which the size of the effect is presented on the scale of the raw score of the item. The P-DIF effect size measure for IRT models was proposed by Wainer (1993):
The P-DIF j,f informs us about the expected difference between the mean of yj in the focal group based on the item parameters estimated for the focal group and the mean of the same item in the focal group but obtained according to the parameters that are estimated for the reference group. In short, it can be described as an increase in item mean in group f due to the effect of DIF. Positive values of P-DIF j,f indicate that DIF “favors” the focal group (they obtain a higher score on that item after the latent trait is controlled for), and negative values mean the opposite.
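Based on the verbal description above, and as a reconstruction rather than a quotation of Wainer's formula, the index can be written as follows. The symbol β for the group-specific item parameter vectors is notation assumed here, and Ψ_f denotes the latent trait distribution of the focal group.

```latex
\text{P-DIF}_{j,f} \;=\; \int \Bigl[\, E\bigl(y_j \mid \theta, \beta_{j,f}\bigr)
  - E\bigl(y_j \mid \theta, \beta_{j,r}\bigr) \Bigr]\, \Psi_f(\theta)\, d\theta
```

Swapping the roles of f and r changes both the sign of the bracketed difference and the distribution used as the integration weight.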
Note that P-DIF j,f is not equal to the negative of P-DIF j,r . If we compute P-DIF j,r, we change not only the order in which the response functions are subtracted but also the latent trait distribution over which the integral is taken. P-DIF j,f and P-DIF j,r weight the local differences between response functions by the density of the distribution of the chosen group, so, in general, they do not produce exactly opposite values.
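The asymmetry noted above is easy to verify numerically. The sketch below assumes a 2PL item and normal latent trait distributions; the item parameters, group means, and function names are hypothetical, and a plain grid replaces proper quadrature.

```python
import math

def irf_2pl(theta, a, b):
    """Expected score of a dichotomous 2PL item at a given theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def normal_pdf(theta, mean, sd):
    z = (theta - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def p_dif(params_own, params_other, mean, sd, lo=-8.0, hi=8.0, n=1601):
    """Difference in expected item mean, integrated over the own group's density."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        theta = lo + k * step
        diff = irf_2pl(theta, *params_own) - irf_2pl(theta, *params_other)
        total += diff * normal_pdf(theta, mean, sd) * step
    return total

# Hypothetical group-specific (a, b) parameters and latent distributions.
focal_params, ref_params = (1.0, -0.3), (1.6, 0.2)
focal_dist, ref_dist = (-0.5, 1.0), (0.0, 1.0)

pdif_f = p_dif(focal_params, ref_params, *focal_dist)  # weighted by focal density
pdif_r = p_dif(ref_params, focal_params, *ref_dist)    # weighted by reference density
```

In this made-up case, the item favors the focal group, so pdif_f is positive and pdif_r is negative, but their absolute values differ because the two response functions are furthest apart in a region where the focal group has more density.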
4.2 Example
Let us continue the analysis described in the previous example.
The table that displays model parameters now includes group-specific parameters of latent trait distribution. Parameters of the reference group (
Now we can proceed to perform DIF analysis with the
The
We see that DIF was significant for four out of nine items, with the highest effect size for item q1. Item q1 is estimated to be 10 percentage points easier in the focal group because of the DIF effect. The graph illustrating this case is presented in figure 4.

DIF graph for item q1
5 PVs
5.1 Background
The a posteriori density of the latent trait upon observing
Assuming the model holds, this distribution contains all the information on the latent trait of a test taker sampled from population g, who has responded
Using the point estimates
The drawbacks associated with using point estimates of a latent trait can be overcome by adopting the multiple imputation method described by Rubin (1987), which was developed to deal with missing data. Within the IRT approach, the missing data are the latent trait variable θ, and the multiple imputations of θ are random draws from (5). These are called plausible values (PVs). When PVs are used for analyses that relate the latent trait to some ancillary variables
where
and the a posteriori distribution (5) becomes
Patz and Junker (1999) and de la Torre (2009) developed Markov chain Monte Carlo techniques that enable drawing PVs from (7). These algorithms are designed to estimate all model parameters. They include chains not only for θi but also for the item parameters
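The basic idea of PVs as random draws from an a posteriori distribution can be illustrated without Markov chain Monte Carlo. The sketch below is a deliberately simplified stand-in: it treats hypothetical 2PL item parameters as known, uses a standard normal a priori distribution with no latent regression, and samples from the posterior discretized on a grid. It is not the algorithm of the cited papers, nor necessarily the one used by uirt.

```python
import math
import random

def irf_2pl(theta, a, b):
    """P(y = 1 | theta) for a 2PL item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def posterior_on_grid(responses, items, mean=0.0, sd=1.0,
                      lo=-6.0, hi=6.0, n=241):
    """Grid points and normalized posterior probabilities of theta."""
    step = (hi - lo) / (n - 1)
    grid = [lo + k * step for k in range(n)]
    dens = []
    for theta in grid:
        w = math.exp(-0.5 * ((theta - mean) / sd) ** 2)  # unnormalized prior
        for y, (a, b) in zip(responses, items):
            if y is None:
                continue
            p = irf_2pl(theta, a, b)
            w *= p if y == 1 else 1.0 - p
        dens.append(w)
    total = sum(dens)
    return grid, [d / total for d in dens]

def eap(responses, items):
    """Expected a posteriori point estimate: the posterior mean."""
    grid, probs = posterior_on_grid(responses, items)
    return sum(t * p for t, p in zip(grid, probs))

def draw_pvs(responses, items, n_draws=10, seed=None):
    """Draw plausible values from the discretized posterior."""
    rng = random.Random(seed)
    grid, probs = posterior_on_grid(responses, items)
    return rng.choices(grid, weights=probs, k=n_draws)

# Hypothetical item parameters and one response vector.
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]
pvs = draw_pvs([1, 1, 0], items, n_draws=10, seed=1)
```

The draws scatter around the EAP with the full posterior spread, which is why analyses based on PVs avoid the shrinkage that affects point estimates.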
5.2 Example
After fitting a two-group model to the
The following syntax will add the EAP estimator of the latent trait, its standard error, and a set of 10 PVs that are conditioned by a latent regression on the
The summary statistics reveal that the EAP is shrunk toward the mean of the latent distribution, while the means and standard deviations of the PVs are in accordance with the underlying latent distribution:
We must consider that the scale of the single-group model is different from the scale of the two-group model. The first is fixed at N(0, 1) globally, and the second is fixed at N(0, 1) within the male group. We have to rescale the current PVs, so the pooled mean and standard deviation of males align with the two-group model:
Data analysis with PVs involves a two-step procedure. In the first step, each PV is analyzed separately, and in the second step, the estimates and error variances are combined into a single result according to rules provided by Rubin (1987). To perform this task, we will use the
The effect for the
Let us now run a similar analysis with the EAP estimate of ability:
We can see that when simple point estimates of a latent trait are used to make inferences about the difference between the groups, the estimate is considerably shrunk toward 0 (−0.16 instead of −0.21). This is accompanied by a drop in standard error (from 0.07 to 0.05), so the inference about statistical significance of the effect is not affected much. However, if the actual size of the effect is of importance to the researcher, the bias that results from using point estimates is not negligible.
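The second step of the PV procedure, combining per-PV results according to Rubin (1987), can be sketched as follows. The function name and the input numbers are hypothetical and serve only to show the arithmetic.

```python
import math

def combine_rubin(estimates, variances):
    """Pool estimates and squared standard errors from m imputed analyses."""
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two estimates with matching variances")
    q_bar = sum(estimates) / m                               # pooled estimate
    u_bar = sum(variances) / m                               # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    total_var = u_bar + (1.0 + 1.0 / m) * b
    return q_bar, math.sqrt(total_var)

# Hypothetical regression coefficients and squared standard errors from three PVs.
pooled_estimate, pooled_se = combine_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what point estimates such as the EAP leave out, which is consistent with the smaller standard error reported for the EAP-based analysis.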
Supplemental Material
Supplemental Material, sj-zip-1-stj-10.1177_1536867X221106368 for uirt: A command for unidimensional IRT modeling by Bartosz Kondratek in The Stata Journal
6 Acknowledgment
Preparation of this article was made possible by the National Science Centre research grant number 2015/17/N/HS6/02965.
7 Programs and supplemental materials
To install a snapshot of the corresponding software files as they existed at the time of publication of this article, type
References
