Abstract

The cross-classified data structure is ubiquitous in education, psychology, and health outcome sciences. In these areas, assessment instruments that are made up of multiple items are frequently used to measure latent constructs. The presence of both the cross-classified structure and multivariate categorical outcomes leads to the so-called item-level data with cross-classified structure. An example of such data structure is the routinely collected student evaluation of teaching (SET) data. Motivated by the lack of research on multilevel IRT modeling with crossed random effects and the need of an approach that can properly handle SET data, this study proposed a cross-classified IRT model, which takes into account both the cross-classified data structure and properties of multiple items in an assessment instrument. A new variant of the Metropolis–Hastings Robbins–Monro (MH-RM) algorithm was introduced to address the computational complexities in estimating the proposed model. A preliminary simulation study was conducted to evaluate the performance of the algorithm for fitting the proposed model to data. The results indicated that model parameters were well recovered. The proposed model was also applied to SET data collected at a large public university to answer empirical research questions. Limitations and future research directions were discussed.

Keywords

item response theory (IRT), cross-classified data, multilevel modeling, student evaluation of teaching (SET)

1. Introduction

Data that span multiple levels are prevalent in the social and behavioral sciences. These multilevel data can have either hierarchical or nonhierarchical structures. Hierarchical data are characterized by the exact nesting of lower level units in one and only one higher level unit (e.g., students nested within schools). When the lower level units belong to two or more non-nested higher level units simultaneously, the data exhibit a cross-classified structure (Beretvas, 2011; Browne et al., 2001; Fielding & Goldstein, 2006; Goldstein, 1994), and the lower level units are said to be cross-classified by the crossed factors (e.g., students being cross-classified by schools and residential areas). Such cross-classified data are frequently encountered in education (e.g., Leckie, 2009; Levels et al., 2008; Rasbash et al., 2010), psychology (e.g., Pedan et al., 2007), and health outcome sciences (Barker et al., 2020). Also in these areas, assessment instruments that are made up of multiple items, such as survey questionnaires and clinical scales, are routinely employed to measure latent constructs.

In the presence of the cross-classified structure, responses to items in an assessment are the so-called item-level data with cross-classified structure. Such data are pervasive in education and allied disciplines (e.g., Dunn et al., 2015; Ecob et al., 2004; Lei et al., 2018; Spooren, 2010). A familiar example is the student evaluation of teaching (SET) survey data, which strongly motivated the study reported here. Across colleges and universities in North America, student evaluation surveys are employed regularly to measure teaching effectiveness and to inform institutional decision-making. Within an academic cycle, students take multiple courses and respond to the same set of survey items more than once to evaluate different instructors. At the same time, instructors teach more than one course and are evaluated by multiple students with the same survey items. The students are not perfectly nested within instructors, and instructors do not always see the same students. The responses to survey items are determined by unobserved influences from both the students and instructors, along with psychometric properties of the items (e.g., item location and discriminability).

The structure of SET data (as an example of item-level data with cross-classified structure) is further illustrated in Table 1. The columns of Table 1 represent students, and its rows represent instructors. Cells of the table represent observations and are cross-classified by students and instructors. Each of the observations consists of responses to all items in the survey. Therefore, the item responses are at level 1, while the two crossed factors, students and instructors, are at level 2. The data structure shown in Table 1 is different from and more complicated than the conventional item response data matrix, where item responses can be viewed as being cross-classified by persons and items. This is because in Table 1 the cross-classification of students and instructors happens at a higher level than the items. The item-level data with cross-classified structure are multivariate and, at the same time, categorical by nature.

Table 1.

Illustration of Item-Level Data With Cross-Classified Structure

		Student
		1	2	…	j	…	J
Instructor	1	$y_{11}$	$y_{21}$
	2				$y_{2 j}$		$y_{2 J}$
	…		$y_{.2}$
	k	$y_{k 1}$				$y_{k .}$
	…				$y_{. j}$
	K			$y_{K .}$			$y_{K J}$

Note. J and K denote the numbers of students and instructors in the data, respectively. Each cell of the data matrix consists of responses to I items.

In practice, when item-level data with cross-classified structure are analyzed, two classes of questions can be of interest. The first class asks about the effects of crossed factors and impacts of key covariates. An example of such research questions is “how gender and underrepresented minority (URM) status of students and instructors predict the teaching effectiveness?” The second class of research questions concerns the quality of assessment instruments (e.g., reliability) and psychometric properties of items. These two classes of research questions are usually addressed using two separate modeling approaches: the cross-classified random effects models (CCREMs; Goldstein, 1994; Raudenbush, 1993) for the first class and standard item response theory (IRT) models for the second. Neither of the two approaches, however frequently used they have been, can fully model the item-level data with cross-classified structure.

When the CCREM approach is utilized, responses to all items in the assessment instrument are combined into one score, either by adding up the responses or taking the average. Then, the summed/average scores are modeled as a function of random effects of crossed factors and fixed effects of predictors of interest. This CCREM approach allows partitioning the total variance of the summed/average scores into components contributed by different crossed factors and enables estimating fixed regression coefficients. However, while the summed/average scores are employed as the outcome, all items are treated in the same way and the differences in item properties are ignored. Classical summed/average score-based psychometrics does not capture the full range of nuance supported by IRT modeling.

Standard IRT models allow studying the psychometric properties of items. However, the cross-classified structure, when present, is typically overlooked (e.g., adopting a single-level IRT model while assuming independent observations), or misspecified (e.g., using a nested two-level IRT model and effectively ignoring one of the crossed factors). The impact of inappropriate modeling of cross-classified data with multivariate and categorical outcomes has not been fully explored in the literature. Within the CCREM literature, however, a number of studies (e.g., Luo & Kwok, 2009; Meyers & Beretvas, 2006; Ye & Daniel, 2017) already showed that the standard errors (SEs) of fixed effects would be underestimated and the estimates of variance components would be biased when the cross-classified data structure is misspecified. The implication for models that rely on multivariate categorical data such as IRT is clear: Estimates could be meaningfully impacted, and more research is needed.

Despite the clear need for an approach to properly handle item-level data with cross-classified structure (e.g., SET data), to our knowledge, little research exists in the literature. In contrast, IRT models developed for item-level data with a hierarchical structure (e.g., item responses nested within students and students nested within schools), namely, multilevel IRT models, have received much more attention (e.g., Fox, 2003, 2004, 2005; Fox & Glas, 2001; Kamata, 2001; Maier, 2001; Rabe-Hesketh et al., 2004). With multilevel IRT models, item responses are specified as a function of random effects associated with units at different levels and fixed effects of items. The random effects in multilevel IRT models (e.g., student and school random effects) are nested and not crossed.

The lack of research on IRT modeling with cross-classified structure can be partly attributed to the statistical and computational complexities in estimating multilevel latent variable models with crossed random effects. The difficulty arises from the presence of both the crossed random effects and the multivariate categorical outcomes. The marginal likelihood function from which structural parameters are estimated usually involves high-dimensional integrals that admit no closed-form solution. Therefore, evaluating the likelihood function requires numerical approximations. For quadrature-based approaches, the computational burden grows exponentially as the number of random effects increases. When the Markov chain Monte Carlo (Robert & Casella, 2013) method is utilized to estimate multilevel latent variable models with crossed random effects, special attention has to be paid to the specification of priors, among other computational considerations and tweaks. As Lambert (2006) pointed out, when estimating a multilevel model, the role of the prior distributions for variance parameters is crucial, especially when the number of level-2 units is small. In addition, no single vague prior, that is, one leading to a posterior mean or mode close to the maximum likelihood estimate (MLE), is likely to work in all scenarios.

In this study, we develop an IRT model, namely, the cross-classified IRT model, for appropriately modeling item-level data with cross-classified structure. This model accounts for the cross-classified data structure while evaluating psychometric properties of items. The proposed cross-classified IRT model assumes that the effects of crossed factors (e.g., students and instructors) are random, and effects of items are fixed. Therefore, it can be viewed as a multivariate nonlinear mixed effects model. In this sense, the well-known random item effect IRT models (De Boeck, 2008; Van den Noortgate et al., 2003) can be considered a special case of the proposed model with the outcome being univariate.

This study also aims to address the computational challenge in estimating multilevel latent variable models with crossed random effects, of which the proposed cross-classified IRT model is a special case, by introducing a variant of the Metropolis–Hastings Robbins–Monro (MH-RM) algorithm (Cai, 2010a, 2010b; Chung & Cai, 2021; Huang, 2021; Monroe & Cai, 2014; Yang & Cai, 2014) that implements a new imputation strategy.

The remainder of this article is structured as follows. In Section 2, we present the formulation of the proposed cross-classified IRT model. In Section 3, we first outline the general scheme of the MH-RM algorithm and then introduce the imputation strategy designed for estimating multilevel latent variable models with crossed random effects. In Section 4, we present a preliminary simulation study that evaluates the performance of the new variant of the MH-RM algorithm under various conditions. In Section 5, we demonstrate the proposed model with SET data collected at a large public university. In Section 6, we conclude this study and discuss limitations and future research directions.

2. A Multilevel IRT Model for Cross-Classified Data

This section introduces a cross-classified IRT model for item-level data with cross-classified structure. Without loss of generality, let us assume that the N observations in the data are cross-classified by two factors. The two crossed factors are conveniently named column-side and row-side factors. Let there be J unique column-side units and K unique row-side units. Note that the data do not have to be fully crossed. In other words, N does not necessarily equal $J \times K$, as in the case of cross-classified data. Each of the J column-side units is associated with an $m_{c} \times 1$ observed predictor/covariate vector $x_{j}$ $(j = 1, \dots, J)$, and each row-side unit has an $m_{r} \times 1$ predictor/covariate vector $z_{k}$ $(k = 1, \dots, K)$. Each observation consists of responses to I items, and there are $C_{i}$ response categories for item $i$ $(i = 1, \dots, I)$. In other words, we can imagine expanding Table 1 by stacking its nonempty cells (each of which represents an observation that consists of responses to I items), so that we have an item response matrix of size $N \times I$, along with two indices indicating the cross-classification, as well as additional data for the predictors.

2.1. Latent Structural Model

Let $η_{j}$ denote a $n_{c} \times 1$ vector of column-side latent variables associated with column-side units, and $ξ_{k}$ be a $n_{r} \times 1$ row-side latent variable vector associated with row-side units. The latent structural model expresses the relationships between observed predictors and the latent variables via two multivariate regression equations, one for each factor,

η_{j} = B x_{j} + ζ_{j},

ξ_{k} = Γ z_{k} + ∊_{k} .

In Equation 1, $B$ is a $n_{c} \times m_{c}$ regression coefficient matrix. $ζ_{j}$ is a random effect term and is assumed to follow a multivariate normal distribution, $ζ_{j} \sim N_{n_{c}} (0, Ψ_{c})$ . In other words, the column-side latent variable $η_{j}$ follows a multivariate normal distribution with mean $B x_{j}$ and covariance $Ψ_{c}$ . In Equation 2, $Γ$ is a $n_{r} \times m_{r}$ regression coefficient matrix and $∊_{k} \sim N_{n_{r}} (0, Ψ_{r})$ . Thus, the row-side latent variable $ξ_{k}$ follows a multivariate normal distribution with mean $Γ z_{k}$ and covariance $Ψ_{r}$ .

2.2. Measurement Model

For an item with two categories, the measurement model of the cross-classified IRT model extends the standard two parameter logistic (2PL) model. Let $y_{i j k}$ denote the response to item i in an observation that is cross-classified by the jth column-side unit and kth row-side unit. The conditional probability that $y_{i j k}$ equals 1 is

P_{θ} (y_{i j k} = 1 | η_{j}, ξ_{k}) = \frac{1}{1 + exp [- (α_{i} + λ_{c, i}^{'} η_{j} + λ_{r, i}^{'} ξ_{k})]} .

In Equation 3, $α_{i}$ is the intercept for item i, and $λ_{c, i}$ and $λ_{r, i}$ are, respectively, $n_{c} \times 1$ and $n_{r} \times 1$ slope vectors for item i. The two slope vectors correspond to the column-side and row-side latent variables. $θ$ denotes a vector that contains all estimable structural parameters, the item parameters included. It is used here to emphasize that the item response probabilities depend on these parameters. As indicated by Equation 3, the probability of endorsing item i depends on the latent variables associated with both column-side and row-side factors.

For an item with more than two strictly ordered categories, Samejima’s (1969) graded response model (GRM) is employed due to its popularity for modeling survey responses. Specifically, a set of cumulative probabilities are specified as follows:

P_{θ} (y_{i j k} \geq 0 | η_{j}, ξ_{k}) = 1,

P_{θ} (y_{i j k} \geq 1 | η_{j}, ξ_{k}) = \frac{1}{1 + exp [- (α_{i,1} + λ_{c, i}^{'} η_{j} + λ_{r, i}^{'} ξ_{k})]},

⋮

P_{θ} (y_{i j k} \geq C_{i} - 1 | η_{j}, ξ_{k}) = \frac{1}{1 + exp [- (α_{i, (C_{i} - 1)} + λ_{c, i}^{'} η_{j} + λ_{r, i}^{'} ξ_{k})]},

P_{θ} (y_{i j k} \geq C_{i} | η_{j}, ξ_{k}) = 0.

In Equation 4, $α_{i} = (α_{i,1}, \dots, α_{i, (C_{i} - 1)})^{'}$ is a $(C_{i} - 1) \times 1$ vector of intercepts for item i. $λ_{c, i}$ and $λ_{r, i}$ are the two slope vectors. Thus, the conditional probability that the response falls in category c, denoted by $π_{i j k c}$ , is the difference in two adjacent cumulative response probabilities,

π_{i j k c} = P_{θ} (y_{i j k} = c | η_{j}, ξ_{k}) = P_{θ} (y_{i j k} \geq c | η_{j}, ξ_{k}) - P_{θ} (y_{i j k} \geq c + 1 | η_{j}, ξ_{k}),

for $c \in \{0, 1, \dots, C_{i} - 1\}$. The GRM includes the 2PL model in Equation 3 as a special case with $C_{i} = 2$.
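As a minimal sketch of Equations 4 and 5 (not the authors' implementation), the category probabilities of a single item can be computed as follows for the unidimensional case $n_{c} = n_{r} = 1$. The function names and the latent variable values are illustrative assumptions; the intercepts are the generating values of Item 1 in Table 3.

```python
import math


def logistic(x):
    """Standard logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def grm_category_probs(alphas, lam_c, lam_r, eta, xi):
    """Category probabilities of one item under the graded response model.

    alphas: intercepts (alpha_{i,1}, ..., alpha_{i,C_i - 1}), in decreasing order
    lam_c, lam_r: scalar column-side and row-side slopes (assumed unidimensional)
    eta, xi: scalar column-side and row-side latent variables
    """
    shift = lam_c * eta + lam_r * xi
    # Cumulative probabilities P(y >= c) for c = 0, ..., C_i (Equation 4)
    cum = [1.0] + [logistic(a + shift) for a in alphas] + [0.0]
    # Differences of adjacent cumulative probabilities give P(y = c) (Equation 5)
    return [cum[c] - cum[c + 1] for c in range(len(alphas) + 1)]


# Hypothetical latent values; intercepts of Item 1 from Table 3
probs = grm_category_probs([3.25, 0.95, -0.41, -3.34], 1.0, 1.0, eta=0.5, xi=-0.2)
print([round(p, 3) for p in probs])
```

With a single intercept the same function reduces to the 2PL model in Equation 3, since the two category probabilities are then $1 - P$ and $P$.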

2.3. Observed and Complete Data Likelihood

It follows from Equation 5 that the conditional distribution of $y_{i j k}$ is a multinomial of trial size 1 and $C_{i}$ cells. The corresponding cell probabilities are $π_{i j k} = (π_{i j k 0}, \dots, π_{i j k, C_{i} - 1})^{'}$, which further depend on the latent variables. The conditional density of $y_{i j k}$ is

f_{θ} (y_{i j k} | η_{j}, ξ_{k}) = \prod_{c = 0}^{C_{i} - 1} π_{i j k c}^{χ_{c} (y_{i j k})},

where the indicator function is

χ_{c} (y_{i j k}) = \begin{cases} 1, & \text{if } y_{i j k} = c, \\ 0, & \text{otherwise}, \end{cases}

for $c \in \{0, 1, \dots, C_{i} - 1\}$.

Let $y_{j k} = (y_{1 j k}, y_{2 j k}, \dots, y_{I j k})^{'}$ be a vector of responses in the observation that is cross-classified by the jth column-side unit and kth row-side unit. Invoking the conditional independence assumption (Lord & Novick, 1968), we have that the item responses in $y_{j k}$ are independent conditional on the latent variables. Therefore, the conditional density of $y_{j k}$ can be written as the product of conditional densities of all I items,

f_{θ} (y_{j k} | η_{j}, ξ_{k}) = \prod_{i = 1}^{I} f_{θ} (y_{i j k} | η_{j}, ξ_{k}) .

Let $Y_{o}$ denote the $N \times I$ matrix of observed item responses. The observed data likelihood from which $θ$ is estimated is obtained by integrating out the latent variables,

L (θ | Y_{o}, X, Z) = \prod_{k = 1}^{K} \int \prod_{j = 1}^{J} \int \prod_{i = 1}^{I} f_{θ} (y_{i j k} | η_{j}, ξ_{k}) f_{θ} (η_{j} | x_{j}) d η f_{θ} (ξ_{k} | z_{k}) d ξ .

In Equation 9, $X$ is a $J \times m_{c}$ observed predictor matrix, the jth row of which is the predictor vector associated with the jth unique column-side unit. $Z$ is a $K \times m_{r}$ predictor matrix, whose kth row corresponds to the kth unique row-side unit. $f_{θ} (η_{j} | x_{j})$ and $f_{θ} (ξ_{k} | z_{k})$ are, respectively, the density functions of $η_{j}$ and $ξ_{k}$ , $η_{j} \sim N_{n_{c}} (B x_{j}, Ψ_{c})$ and $ξ_{k} \sim N_{n_{r}} (Γ z_{k}, Ψ_{r})$ .

The latent variables, $η_{j}$ and $ξ_{k}$ , can be viewed as missing data (Dempster et al., 1977). The observed item responses can be augmented by missing data (denoted by $Y_{m}$ ), so that the complete data are formed as $Y = (Y_{o}, Y_{m})$ . Assuming $Y_{m}$ are observed, the complete data likelihood is

L (θ | Y, X, Z) = \prod_{k = 1}^{K} \prod_{j = 1}^{J} \prod_{i = 1}^{I} f_{θ} (y_{i j k} | η_{j}, ξ_{k}) f_{θ} (η_{j} | x_{j}) f_{θ} (ξ_{k} | z_{k}),

= [\prod_{k = 1}^{K} \prod_{j = 1}^{J} \prod_{i = 1}^{I} f_{θ} (y_{i j k} | η_{j}, ξ_{k})] [\prod_{j = 1}^{J} f_{θ} (η_{j} | x_{j})] [\prod_{k = 1}^{K} f_{θ} (ξ_{k} | z_{k})] .

Equation 10 has a fully factored structure and consists of three parts, which correspond to the measurement model and two multivariate regression models. The dominating insight here is that if the latent variables were observed, the complete data likelihood function would be relatively simple to evaluate and maximize.
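The three-part structure of Equation 10 can be illustrated with a short sketch, assuming scalar latent variables, one covariate per side, and that only observed (nonempty) cells contribute to the measurement part. All function names, argument names, and numeric values below are hypothetical and serve only to show how the log-likelihood separates once the latent variables are treated as known.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def grm_logprob(y, alphas, lam_c, lam_r, eta, xi):
    """Log P(y = c) for one graded item with scalar latent variables."""
    shift = lam_c * eta + lam_r * xi
    cum = [1.0] + [1.0 / (1.0 + math.exp(-(a + shift))) for a in alphas] + [0.0]
    return math.log(cum[y] - cum[y + 1])


def complete_data_loglik(obs, items, eta, xi, x, z, beta, gamma,
                         psi_c=1.0, psi_r=1.0):
    """obs: list of (j, k, responses); items: list of (alphas, lam_c, lam_r)."""
    # Part 1: measurement model over observed cells and items
    ll = sum(grm_logprob(y, a, lc, lr, eta[j], xi[k])
             for j, k, responses in obs
             for y, (a, lc, lr) in zip(responses, items))
    # Part 2: column-side regression, eta_j ~ N(beta * x_j, psi_c)
    ll += sum(normal_logpdf(eta[j], beta * x[j], psi_c) for j in range(len(eta)))
    # Part 3: row-side regression, xi_k ~ N(gamma * z_k, psi_r)
    ll += sum(normal_logpdf(xi[k], gamma * z[k], psi_r) for k in range(len(xi)))
    return ll


# Hypothetical toy data: 2 column-side units, 2 row-side units, 3 observed cells
items = [([1.0, -1.0], 1.0, 1.0), ([0.5, -0.5], 1.0, 1.0)]
obs = [(0, 0, [1, 2]), (1, 0, [0, 1]), (0, 1, [2, 2])]
args = dict(eta=[0.3, -0.4], xi=[0.1, 0.2], x=[0.5, -0.5], z=[1.0, 0.0],
            beta=0.7, gamma=0.3)
ll = complete_data_loglik(obs, items, **args)
print(round(ll, 3))
```

Because the three parts enter additively on the log scale, item parameters and the two sets of regression parameters can be differentiated separately, which is what makes the complete data problem comparatively simple.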

3. Estimation of the Cross-Classified IRT Model

To estimate the proposed cross-classified IRT model, we introduce a variant of the MH-RM algorithm that adopts a new imputation strategy. In this section, we first outline the general scheme of the MH-RM algorithm and then describe the imputation strategy utilized in detail. In addition, we show approaches to obtain SEs of the MLEs of model parameters.

3.1. MH-RM Algorithm

The MH-RM algorithm combines the Metropolis–Hastings (MH; Hastings, 1970; Metropolis et al., 1953) sampler with the data augmented Robbins–Monro (RM; Robbins & Monro, 1951) stochastic approximation algorithm. It connects Fisher’s (1925) identity and Robbins and Monro’s (1951) root-finding algorithm for noise-corrupted regression functions and produces MLEs of model parameters. The MH-RM algorithm has been applied in many latent variable modeling contexts (e.g., Cai, 2010a, 2010b; Chung & Cai, 2021; Falk & Cai, 2016; Huang et al., 2022; Ju & Falk, 2019; Monroe & Cai, 2014; Yang & Cai, 2014).

Let $θ^{p}$ denote the parameter estimates by the end of the pth iteration of the MH-RM algorithm, with $θ^{0}$ serving as starting values. At its $(p + 1)$ th iteration, the MH-RM algorithm follows three steps: stochastic imputation, stochastic approximation, and RM update.

3.1.1. Stochastic imputation

Impute $R_{p + 1}$ sets of missing data from the posterior predictive distributions using the MH sampler to form $R_{p + 1}$ sets of complete data. The rth set of complete data in the $(p + 1)$th iteration is denoted by $Y_{r}^{p + 1}$ $(r = 1, \dots, R_{p + 1})$. In the context of the cross-classified IRT model, this step involves drawing column-side and row-side latent variables. In practice, $R_{p + 1} = 1$ is usually sufficient.

3.1.2. Stochastic approximation

Approximate the gradient of the complete data log-likelihood as the average of gradients of $R_{p + 1}$ sets of complete data obtained in the previous step,

{\tilde{s}}_{p + 1} = \frac{1}{R_{p + 1}} \sum_{r = 1}^{R_{p + 1}} s (θ^{p} | Y_{r}^{p + 1}) .

By Fisher’s (1925) identity, which states that the gradient of the observed data log-likelihood equals the conditional expectation of the gradient of the complete data log-likelihood, Equation 11 is an approximation of the observed data gradient. In this step, the complete data information matrix

H (θ | Y) = - \frac{\partial^{2} l (θ | Y)}{\partial θ \partial θ^{'}},

is also approximated as the sample average

H_{p + 1} = \frac{1}{R_{p + 1}} \sum_{r = 1}^{R_{p + 1}} H (θ^{p} | Y_{r}^{p + 1}) .

3.1.3. RM update

Let $Π_{p}$ denote a recursive approximation to the expectation of complete data information matrix by the end of the pth iteration. $Π_{p}$ is updated according to

Π_{p + 1} = Π_{p} + κ_{p} (H_{p + 1} - Π_{p}),

where $κ_{p}$ is a sequence of slowly decreasing gain constants and satisfies

κ_{p} \in (0, 1], \quad \sum_{p = 1}^{\infty} κ_{p} = \infty, \quad \text{and} \quad \sum_{p = 1}^{\infty} κ_{p}^{2} < \infty .

Then, $θ^{p}$ is updated recursively according to the following stochastic approximation

θ^{p + 1} = θ^{p} + κ_{p} (Π_{p + 1}^{- 1} {\tilde{s}}_{p + 1}) .

With the RM method, the “noise” introduced in the stochastic imputation step through imputing missing data is counteracted. In practice, a few burn-in iterations are run before the main iterations of the MH-RM algorithm. The main iterations are divided into three stages. Stage I utilizes a fixed gain constant $κ_{0}$ and aims to improve the starting values of the algorithm. Stage II is also a fixed gain constant stage and brings the estimates of model parameters closer to their MLEs. Stage III of the MH-RM algorithm uses the average of the parameter estimates from the Stage II iterations as its starting values, and the gain constant in this stage is decreasing.
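To make the RM update concrete, the following toy sketch applies Equations 14 and 16 to a one-parameter problem in which the "imputed" gradient is the true gradient plus random noise. It is not the cross-classified IRT estimator: the target value, the noise model, the constant information `h`, the stage lengths, and the gain sequence $κ_{p} = 1 / p$ (which satisfies Equation 15) are all assumptions chosen for illustration, and only one fixed-gain stage is shown rather than two.

```python
import random


def rm_toy(target=2.0, n_fixed=200, n_decreasing=5000, kappa0=0.5, seed=1):
    """Robbins-Monro root finding with a fixed-gain stage and a decreasing-gain stage."""
    rng = random.Random(seed)
    theta, pi = 0.0, 1.0  # starting value and recursive information approximation

    def noisy_score(th):
        # Stand-in for the imputed complete-data gradient: true gradient plus noise
        return (target - th) + rng.gauss(0.0, 1.0)

    # Fixed-gain stage: moves theta quickly toward the solution
    for _ in range(n_fixed):
        s, h = noisy_score(theta), 1.0
        pi += kappa0 * (h - pi)          # Equation 14
        theta += kappa0 * s / pi         # Equation 16

    # Decreasing-gain stage: kappa_p = 1 / p averages the noise away
    for p in range(1, n_decreasing + 1):
        kappa = 1.0 / p
        s, h = noisy_score(theta), 1.0
        pi += kappa * (h - pi)
        theta += kappa * s / pi
    return theta


print(round(rm_toy(), 3))
```

With the fixed gain, the iterates keep fluctuating around the solution; with the decreasing gain, successive noisy gradients are effectively averaged, so the iterates settle down.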

3.2. Implementation of the MH Sampler

To address the computational challenge in estimating crossed random effects, a new imputation strategy was applied, which couples the Metropolis-within-Gibbs algorithm (Patz & Junker, 1999a, 1999b) with the alternating imputation-posterior (AIP) algorithm (Cho & Rabe-Hesketh, 2011; Chung & Cai, 2021; Clayton & Rasbash, 1999). Specifically, the imputation process alternates between latent variables associated with different crossed factors. In the two-factor example, this process alternates between the column-side and row-side latent variables.

Let $η_{j}^{p} (j = 1, \dots, J)$ and $ξ_{k}^{p} (k = 1, \dots, K)$ denote the imputations of latent variables by the end of the pth iteration of the MH-RM algorithm. In the $(p + 1)$ th iteration, the row-side latent variables are fixed to values from the previous iteration $ξ_{k}^{p} (k = 1, \dots, K)$ , and a Gibbs sampler with the following steps is constructed:

Draw η_{1}^{p + 1} \sim f_{θ} (η_{1} | η_{2}^{p}, \dots, η_{J}^{p}, Y_{o}, X, Z, ξ^{p}),

⋮

Draw η_{j}^{p + 1} \sim f_{θ} (η_{j} | η_{1}^{p + 1}, \dots, η_{j - 1}^{p + 1}, η_{j + 1}^{p}, \dots, η_{J}^{p}, Y_{o}, X, Z, ξ^{p}),

⋮

Draw η_{J}^{p + 1} \sim f_{θ} (η_{J} | η_{1}^{p + 1}, \dots, η_{J - 1}^{p + 1}, Y_{o}, X, Z, ξ^{p}),

where $f_{θ} (η_{j} | η_{1}^{p + 1}, \dots, η_{j - 1}^{p + 1}, η_{j + 1}^{p}, \dots, η_{J}^{p}, Y_{o}, X, Z, ξ^{p})$ represents the full conditional density for $η_{j}^{p + 1}$. As it is not feasible to sample $η_{j}^{p + 1}$ directly, the Gibbs sampler has to be combined with the MH algorithm. That is to say, within each step of the Gibbs sampler, a candidate value of $η_{j}$, denoted by $η_{j}^{*}$, is generated by adding a perturbation simulated from a multivariate normal distribution to its current value. $η_{j}^{p + 1}$ is assigned the candidate value $η_{j}^{*}$ with the probability $\min \{1, R (η_{j}^{*}, η_{j}^{p})\}$, where

R (η_{j}^{*}, η_{j}^{p}) = \frac{\prod_{k = 1}^{K} \prod_{i = 1}^{I} f_{θ} (y_{i j k} | η_{j}^{*}, ξ_{k}) f_{θ} (η_{j}^{*} | x_{j})}{\prod_{k = 1}^{K} \prod_{i = 1}^{I} f_{θ} (y_{i j k} | η_{j}^{p}, ξ_{k}) f_{θ} (η_{j}^{p} | x_{j})},

and remains $η_{j}^{p}$ otherwise. The standard deviation of the multivariate normal distribution from which the perturbations are drawn, denoted by $h_{c}$, can be adjusted to control the rate at which candidates are accepted.

Imputations of the row-side latent variables are generated in the same fashion. Note that while the row-side latent variables are imputed, the column-side latent variables are fixed to values generated at the current iteration, $η_{j}^{p + 1}$ $(j = 1, \dots, J)$. The row-side proposal standard deviation $h_{r}$ controls the acceptance rate at the row side. If a third crossed factor is considered, the column-side and row-side latent variables would be fixed to their imputed values at the current iteration when imputations of latent variables associated with the third factor are generated.
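The column-side half of this alternating scheme can be sketched as a random-walk Metropolis-within-Gibbs pass, shown here under simplifying assumptions: scalar latent variables, graded items, and a likelihood restricted to the cells actually observed for each column-side unit. The function names, the toy data, and the choice $h_{c} = 1$ are hypothetical; the row-side pass would mirror this code with the roles of `eta` and `xi` exchanged.

```python
import math
import random


def grm_logprob(y, alphas, lam_c, lam_r, eta, xi):
    """Log P(y = c) for one graded item (Equations 4 and 5), scalar latents."""
    shift = lam_c * eta + lam_r * xi
    cum = [1.0] + [1.0 / (1.0 + math.exp(-(a + shift))) for a in alphas] + [0.0]
    return math.log(cum[y] - cum[y + 1])


def log_target(eta_j, cells_j, xi, items, prior_mean, psi_c):
    """Log numerator of R: likelihood of unit j's observed cells times its prior."""
    value = -0.5 * (eta_j - prior_mean) ** 2 / psi_c
    for k, responses in cells_j:
        for y, (alphas, lam_c, lam_r) in zip(responses, items):
            value += grm_logprob(y, alphas, lam_c, lam_r, eta_j, xi[k])
    return value


def sweep_column_side(eta, xi, cells_by_col, items, prior_means, psi_c, h_c, rng):
    """One pass over all column-side units with the row side held fixed."""
    accepted = 0
    for j in range(len(eta)):
        candidate = eta[j] + rng.gauss(0.0, h_c)  # random-walk proposal
        log_ratio = (
            log_target(candidate, cells_by_col[j], xi, items, prior_means[j], psi_c)
            - log_target(eta[j], cells_by_col[j], xi, items, prior_means[j], psi_c)
        )
        # Accept with probability min{1, R}
        if rng.random() < math.exp(min(0.0, log_ratio)):
            eta[j] = candidate
            accepted += 1
    return accepted / len(eta)


# Hypothetical toy setup: 2 column-side units, 2 row-side units, 4 three-category items
rng = random.Random(3)
items = [([1.0, -1.0], 1.0, 1.0)] * 4
xi = [0.2, -0.5]
cells_by_col = [[(0, [2, 2, 1, 2])], [(0, [0, 1, 0, 0]), (1, [0, 0, 1, 0])]]
eta = [0.0, 0.0]
rates = [sweep_column_side(eta, xi, cells_by_col, items, [0.0, 0.0], 1.0, 1.0, rng)
         for _ in range(200)]
print(round(sum(rates) / len(rates), 2), [round(e, 2) for e in eta])
```

Working with the log of $R$ avoids numerical underflow when many item responses enter the product, and increasing or decreasing `h_c` lowers or raises the acceptance rate, respectively.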

3.3. SE Estimation

SEs of parameter estimates are obtained by taking the square root of diagonal elements of the inverse of the observed data information matrix. Let $ℐ$ denote the observed data information matrix. Under the MH-RM algorithm, $ℐ$ is approximated through Louis’s (1982) formula

ℐ = - \frac{\partial^{2} l (θ | Y_{o})}{\partial θ \partial θ^{'}},

= E \{H (θ | Y)\} - E \{s (θ | Y) [s (θ | Y)]^{'}\} + E \{s (θ | Y)\} E \{[s (θ | Y)]^{'}\} .

Two types of SEs are defined based on how Equation 19 is used: (a) recursively approximated SEs and (b) post-convergence approximated SEs (Cai, 2010a; Yang & Cai, 2014).

3.3.1. Recursively approximated SEs

The observed data information matrix $ℐ$ is approximated as a by-product of iterations of the MH-RM algorithm. Specifically, in the $(p + 1)$ th iteration of the MH-RM algorithm, a recursive stochastic approximation of the first two terms of the right-hand side of Equation 19 is

{\hat{G}}_{p + 1} = {\hat{G}}_{p} + κ_{p} ({\tilde{G}}_{p + 1} - {\hat{G}}_{p}),

where

{\tilde{G}}_{p + 1} = \frac{1}{R_{p + 1}} \sum_{r = 1}^{R_{p + 1}} \{H (θ^{p} | Y_{r}^{p + 1}) - s (θ^{p} | Y_{r}^{p + 1}) [s (θ^{p} | Y_{r}^{p + 1})]^{'}\} .

In a similar fashion, the third term is recursively approximated as

{\hat{s}}_{p + 1} = {\hat{s}}_{p} + κ_{p} ({\tilde{s}}_{p + 1} - {\hat{s}}_{p}) .

Following Equation 19, the observed data information matrix is approximated as

ℐ_{p + 1} = {\hat{G}}_{p + 1} + {\hat{s}}_{p + 1} {\hat{s}}_{p + 1}^{'} .

3.3.2. Post-convergence approximated SEs

This approach directly applies Equation 19 after the MH-RM algorithm converges. Once convergence is achieved, additional MH-RM iterations are needed to generate R samples of latent variables/missing data to approximate the observed data information matrix. Specifically, the first and second terms are approximated as

E \{H (θ | Y)\} \approx \frac{1}{R} \sum_{r = 1}^{R} H (\hat{θ} | Y_{r}),

E \{s (θ | Y) [s (θ | Y)]^{'}\} \approx \frac{1}{R} \sum_{r = 1}^{R} \{s (\hat{θ} | Y_{r}) [s (\hat{θ} | Y_{r})]^{'}\} .

The third term vanishes because, by Fisher’s identity, $E \{s (θ | Y)\}$ equals the observed data gradient, which is zero at the MLE. Then, following the signs in Equation 19, the observed data information matrix is approximated as

ℐ \approx \frac{1}{R} \sum_{r = 1}^{R} H (\hat{θ} | Y_{r}) - \frac{1}{R} \sum_{r = 1}^{R} \{s (\hat{θ} | Y_{r}) [s (\hat{θ} | Y_{r})]^{'}\} .
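The post-convergence use of Louis's formula (Equation 19, with the third term dropped at the MLE) can be checked on a toy problem whose answer is known. The sketch below assumes unit-variance normal data with mean $θ$, of which some values are observed and `n_missing` values are missing; the observed data information is then simply the number of observed values. The function name, data values, and number of imputations are hypothetical, and the model is deliberately much simpler than the cross-classified IRT model.

```python
import math
import random


def louis_se_toy(observed, n_missing, n_draws=20000, seed=7):
    """Monte Carlo version of Louis's formula for the mean of N(theta, 1) data."""
    rng = random.Random(seed)
    theta_hat = sum(observed) / len(observed)          # MLE from the observed data
    n = len(observed) + n_missing
    obs_score = sum(y - theta_hat for y in observed)   # zero at the MLE
    mean_h, mean_ss = 0.0, 0.0
    for _ in range(n_draws):
        # Impute the missing values from their distribution given theta_hat
        imputed = [rng.gauss(theta_hat, 1.0) for _ in range(n_missing)]
        s = obs_score + sum(y - theta_hat for y in imputed)  # complete-data gradient
        mean_h += n / n_draws        # complete-data information is n for this model
        mean_ss += s * s / n_draws   # average outer product of the gradient
    info = mean_h - mean_ss          # first term minus second term of Equation 19
    return info, 1.0 / math.sqrt(info)


observed = [0.3, -1.2, 0.8, 1.9, -0.4, 0.1, 2.2, -0.7, 1.1, 0.5]
info, se = louis_se_toy(observed, n_missing=5)
print(round(info, 2), round(se, 3))
```

Here the complete data information is 15, the imputation variability of the gradient is about 5, and their difference recovers the observed data information of about 10, so the SE is close to $1 / \sqrt{10}$.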

4. Simulation Study

A simulation study was conducted to evaluate the performance of the MH-RM algorithm for estimating the cross-classified IRT model under various conditions (summarized in Table 2). Three factors were manipulated in the simulation study: (a) the number of column-side units J (e.g., students), (b) the number of row-side units K (e.g., instructors), and (c) the ratio of variances of the column-side and row-side latent variables. The three levels of J considered were 500, 1,000, and 2,000. To mimic real-world scenarios, where the numbers of column-side and row-side units tend to be unequal, three values of K were considered for each J, namely, 10%, 20%, and 40% of J. For $J = 500$, the three K levels were 50, 100, and 200. For $J = 1{,}000$, the three K levels were 100, 200, and 400. For $J = 2{,}000$, the three K levels were 200, 400, and 800. The three levels of variance ratio considered were 1, 2, and 4. The variance ratio was controlled by varying the column-side slope, while fixing the row-side slope and letting the latent variables have unit variances.

Therefore, a total of $3 \times 3 \times 3 = 27$ conditions were simulated. For each condition, 100 data sets were generated.

Table 2.

Simulation Conditions

Condition	J	K	Variance ratio	$λ_{c}$	$λ_{r}$
1	500	50	1	1	1
2			2	1.41	1
3			4	2	1
4	500	100	1	1	1
5			2	1.41	1
6			4	2	1
7	500	200	1	1	1
8			2	1.41	1
9			4	2	1
10	1,000	100	1	1	1
11			2	1.41	1
12			4	2	1
13	1,000	200	1	1	1
14			2	1.41	1
15			4	2	1
16	1,000	400	1	1	1
17			2	1.41	1
18			4	2	1
19	2,000	200	1	1	1
20			2	1.41	1
21			4	2	1
22	2,000	400	1	1	1
23			2	1.41	1
24			4	2	1
25	2,000	800	1	1	1
26			2	1.41	1
27			4	2	1

Note. J and K denote the numbers of column-side and row-side units, respectively. $λ_{c}$ and $λ_{r}$ represent the column-side and row-side slopes, respectively.
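The design in Table 2 can be reproduced programmatically. The sketch below enumerates the 27 conditions and derives the column-side slope as the square root of the variance ratio, which follows from fixing $λ_{r} = 1$ and using unit-variance latent variables. The function name and dictionary keys are illustrative assumptions.

```python
import math
from itertools import product


def simulation_conditions():
    """Enumerate the J x K x variance-ratio design in the order of Table 2."""
    conditions = []
    for J, fraction, ratio in product([500, 1000, 2000], [0.1, 0.2, 0.4], [1, 2, 4]):
        conditions.append({
            "condition": len(conditions) + 1,
            "J": J,
            "K": int(round(J * fraction)),           # K is 10%, 20%, or 40% of J
            "ratio": ratio,
            "lambda_c": round(math.sqrt(ratio), 2),  # ratio = lambda_c^2 / lambda_r^2
            "lambda_r": 1.0,
        })
    return conditions


conditions = simulation_conditions()
print(len(conditions), conditions[0], conditions[-1])
```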

4.1. Data Generation

For each simulation condition, responses to 10 five-category items were generated based on the proposed cross-classified IRT model using R (R Core Team, 2018). For simplicity, both the column-side and row-side latent variables were assumed to be unidimensional (i.e., $n_{c} = n_{r} = 1$ ) and only two continuous covariates were included, one for each latent variable.

To simulate item responses, J column-side and K row-side random effects were first drawn from a standard normal distribution (rounded to the second decimal place). The column-side covariates $x_{j}$ and row-side covariates $z_{k}$ were simulated from a standard normal distribution (rounded to the second decimal place). The generating values of the regression coefficients associated with $x_{j}$ and $z_{k}$ were 0.7 and 0.3, respectively. The simulated random effects and covariates were then plugged into Equations 1 and 2 to obtain J column-side latent variables and K row-side latent variables.

Under a fully crossed design, all $J \times K$ pairs of latent variables would be plugged into Equation 4 to simulate item responses. However, fully crossed data are rarely observed in practice; therefore, 10% of the latent variable pairs were randomly selected and retained. For example, in conditions with $J = 500$ and $K = 100$, the number of observations N in a simulated data set was $500 \times 100 \times 0.1 = 5{,}000$. The generating values of the intercepts of the 10 items were chosen to mimic real-world cases and are summarized in Table 3. Specifically, the generating values of $α_{i,1}$ ranged from 2.10 to 4.35. The generating values of $α_{i,2}$ ranged from 0.02 to 1.79. The generating values of $α_{i,3}$ ranged from −0.97 to 0.48. The generating values of $α_{i,4}$ ranged from −3.34 to −2.01. The generating value of the column-side slope $λ_{c}$ for all 10 items was 1 in conditions with a variance ratio of 1, 1.41 in conditions with a variance ratio of 2, and 2 in conditions with a variance ratio of 4. The generating value of the row-side slope $λ_{r}$ was 1 across all simulation conditions.

Table 3.

Generating Values of Item Intercepts

Item	Intercept 1	Intercept 2	Intercept 3	Intercept 4
1	3.25	0.95	−0.41	−3.34
2	4.35	1.79	0.48	−2.91
3	2.18	0.02	−0.97	−2.95
4	2.10	0.29	−0.88	−3.20
5	3.71	1.71	0.42	−2.01
6	2.83	1.16	0.21	−2.13
7	3.27	0.77	−0.04	−2.61
8	2.13	0.20	−0.59	−2.69
9	3.28	0.95	0.00	−2.07
10	4.04	1.61	0.29	−2.40
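The data-generating steps described above can be sketched in a few lines. The following Python sketch is illustrative only and is not the authors' R code. It assumes that Equations 1 and 2 take the form $η_{j} = β x_{j} + ζ_{j}$ and $ξ_{k} = γ z_{k} + ∊_{k}$ , and that Equation 4 is a graded response model in slope-intercept form with logistic cumulative probabilities. The function name `simulate_condition` and its arguments are hypothetical.

```python
import math
import random

# Generating intercepts for the 10 items, transcribed from Table 3.
TABLE3 = [
    [3.25, 0.95, -0.41, -3.34],
    [4.35, 1.79, 0.48, -2.91],
    [2.18, 0.02, -0.97, -2.95],
    [2.10, 0.29, -0.88, -3.20],
    [3.71, 1.71, 0.42, -2.01],
    [2.83, 1.16, 0.21, -2.13],
    [3.27, 0.77, -0.04, -2.61],
    [2.13, 0.20, -0.59, -2.69],
    [3.28, 0.95, 0.00, -2.07],
    [4.04, 1.61, 0.29, -2.40],
]


def simulate_condition(J, K, lam_c, lam_r=1.0, beta=0.7, gamma=0.3,
                       retained=0.10, seed=1):
    """Return a list of (j, k, responses) for the retained (j, k) pairs."""
    rng = random.Random(seed)

    def draw(n):
        # Standard normal draws rounded to the second decimal place.
        return [round(rng.gauss(0.0, 1.0), 2) for _ in range(n)]

    x, zeta = draw(J), draw(J)   # column-side covariate and random effect
    z, eps = draw(K), draw(K)    # row-side covariate and random effect
    eta = [beta * x[j] + zeta[j] for j in range(J)]   # assumed form of Equation 1
    xi = [gamma * z[k] + eps[k] for k in range(K)]    # assumed form of Equation 2

    # Keep a random 10% of the J x K latent variable pairs.
    pairs = [(j, k) for j in range(J) for k in range(K)]
    kept = rng.sample(pairs, round(retained * J * K))

    data = []
    for j, k in kept:
        responses = []
        for alphas in TABLE3:
            # Assumed graded response model: P(Y >= c) is logistic in
            # alpha_c + lam_c * eta_j + lam_r * xi_k, for c = 1, ..., 4.
            cum = [1.0 / (1.0 + math.exp(-(a + lam_c * eta[j] + lam_r * xi[k])))
                   for a in alphas]
            u = rng.random()
            # Category (coded 0-4) = number of cumulative probabilities above u.
            responses.append(sum(u < p for p in cum))
        data.append((j, k, responses))
    return data
```

Under these assumptions, a call such as `simulate_condition(J=500, K=100, lam_c=1.41)` would correspond to a condition with variance ratio 2 and would return 5,000 response records of 10 items each.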

4.2. Estimation Details

All simulated data were analyzed with flexMIRT® (Cai, 2017). Informed by this research, flexMIRT® implements the new variant of the MH-RM algorithm for cross-classified IRT modeling. Across all conditions, the numbers of iterations in the two fixed gain constant stages (Stages I and II) of the algorithm were 5,000 and 500. The gain constant applied in these two stages was 0.5, that is, $κ_{0} = 0.5$ . The maximum number of iterations in Stage III was 5,000. The starting values for item slopes, item intercepts, and regression coefficients were 1, 0 and 0.2, respectively. In the stochastic imputation step of each iteration, one set of latent variables was simulated. The proposal standard deviations h_c and h_r varied over conditions and were determined via experiments. The acceptance rates at the column and the row sides were controlled to be within the range of 0.5 and 0.7. A convergence window of 3 with a $1.0 \times 10^{- 4}$ tolerance was applied.
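The stopping rule mentioned above (a convergence window of 3 with a tolerance of $1.0 \times 10^{- 4}$ ) can be illustrated with a short sketch. The exact criterion implemented in flexMIRT® is not stated in this section; the Python function below assumes one plausible reading, namely that the largest absolute change in any parameter must stay below the tolerance for three consecutive iterations. The name `converged` is hypothetical.

```python
def converged(history, window=3, tol=1.0e-4):
    """Check a window-based stopping rule (assumed interpretation).

    history: list of parameter vectors, one per iteration, oldest first.
    Returns True when the largest absolute parameter change was below
    `tol` in each of the last `window` consecutive iterations.
    """
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    for prev, curr in zip(recent, recent[1:]):
        max_change = max(abs(c - p) for p, c in zip(prev, curr))
        if max_change >= tol:
            return False
    return True
```

A window of several iterations, rather than a single small change, guards against stopping on an isolated small step produced by Monte Carlo noise.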

4.3. Simulation Results

Table 4 shows the generating values of the column-side slope $λ_{c}$ and row-side slope $λ_{r}$ and the corresponding biases and root mean squared errors (RMSE). The slopes were well recovered in all simulation conditions. The bias of $λ_{c}$ ranged from −0.006 to 0.013. The RMSE of $λ_{c}$ decreased as the number of column-side units J increased. The averaged RMSE of the nine $J =$ 500 conditions (Conditions 1–9) was 0.049, the averaged RMSE when $J =$ 1,000 (Conditions 10–18) was 0.035, and the averaged RMSE when $J =$ 2,000 (Conditions 19–27) was 0.026. Holding J constant, the RMSE increased as the variance ratio increased. For example, when $J =$ 500, the averaged RMSEs when variance ratio $=$ 1 (Conditions 1, 4, and 7), 2 (Conditions 2, 5, and 8), and 4 (Conditions 3, 6, and 9) were, respectively, 0.033, 0.049, and 0.066. Controlling for J and the variance ratio, the RMSEs were similar in magnitude across different levels of the number of row-side units K. For example, when $J =$ 1,000 and the variance ratio $= 1$ , the RMSEs in the $K =$ 100, 200, and 400 conditions were 0.024, 0.024, and 0.025.

The bias of $λ_{r}$ was also very small, ranging from −0.011 to 0.009. The RMSE of $λ_{r}$ decreased as K increased. The averaged RMSEs in conditions where $K =$ 50 (Conditions 1–3), 100 (Conditions 4–6, and 10–12), 200 (Conditions 7–9, 13–15, and 19–21), 400 (Conditions 16–18 and 22–24), and 800 (Conditions 25–27), were, respectively, 0.113, 0.077, 0.052, 0.039, and 0.028. Holding K constant, the RMSEs were similar across different Js and variance ratios.

Table 4.

Estimates of Slopes

Condition	J	K	Variance ratio	$λ_{c}$ True	$λ_{c}$ Bias	$λ_{c}$ RMSE	$λ_{r}$ True	$λ_{r}$ Bias	$λ_{r}$ RMSE
1	500	50	1	1	.000	.035	1	−.003	.107
2			2	1.41	.003	.052	1	−.011	.121
3			4	2	.013	.070	1	.007	.112
4		100	1	1	−.004	.034	1	−.008	.081
5			2	1.41	−.005	.050	1	.006	.076
6			4	2	.010	.061	1	−.005	.072
7		200	1	1	.003	.030	1	−.007	.050
8			2	1.41	−.003	.045	1	−.009	.054
9			4	2	.002	.068	1	.000	.055
10	1,000	100	1	1	−.001	.024	1	.001	.081
11			2	1.41	.005	.034	1	.008	.079
12			4	2	.004	.049	1	.001	.072
13		200	1	1	.001	.024	1	−.007	.054
14			2	1.41	.005	.031	1	−.010	.049
15			4	2	−.002	.042	1	.003	.057
16		400	1	1	.003	.025	1	−.001	.036
17			2	1.41	.000	.034	1	.002	.042
18			4	2	.001	.049	1	.004	.040
19	2,000	200	1	1	.001	.016	1	.006	.051
20			2	1.41	.002	.024	1	−.003	.046
21			4	2	−.006	.040	1	.008	.052
22		400	1	1	.001	.018	1	−.004	.037
23			2	1.41	.002	.023	1	−.003	.038
24			4	2	.003	.037	1	.009	.040
25		800	1	1	−.001	.015	1	−.003	.030
26			2	1.41	−.002	.024	1	.002	.028
27			4	2	.004	.035	1	−.002	.026

Note. True = generating value; RMSE = root mean squared error.
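The averaged RMSEs quoted in the text can be recomputed directly from Table 4. The following Python sketch transcribes the RMSE columns of the table and averages them by J (for $λ_{c}$ ) and by K (for $λ_{r}$ ). The helper name `average_rmse` is hypothetical and is introduced only for illustration.

```python
from statistics import mean

# (J, K, variance ratio, RMSE of lambda_c, RMSE of lambda_r), from Table 4.
TABLE4 = [
    (500, 50, 1, .035, .107), (500, 50, 2, .052, .121), (500, 50, 4, .070, .112),
    (500, 100, 1, .034, .081), (500, 100, 2, .050, .076), (500, 100, 4, .061, .072),
    (500, 200, 1, .030, .050), (500, 200, 2, .045, .054), (500, 200, 4, .068, .055),
    (1000, 100, 1, .024, .081), (1000, 100, 2, .034, .079), (1000, 100, 4, .049, .072),
    (1000, 200, 1, .024, .054), (1000, 200, 2, .031, .049), (1000, 200, 4, .042, .057),
    (1000, 400, 1, .025, .036), (1000, 400, 2, .034, .042), (1000, 400, 4, .049, .040),
    (2000, 200, 1, .016, .051), (2000, 200, 2, .024, .046), (2000, 200, 4, .040, .052),
    (2000, 400, 1, .018, .037), (2000, 400, 2, .023, .038), (2000, 400, 4, .037, .040),
    (2000, 800, 1, .015, .030), (2000, 800, 2, .024, .028), (2000, 800, 4, .035, .026),
]


def average_rmse(rows, key_index, value_index):
    """Average the column `value_index` within groups defined by `key_index`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key_index], []).append(row[value_index])
    return {key: round(mean(values), 3) for key, values in groups.items()}


rmse_c_by_J = average_rmse(TABLE4, key_index=0, value_index=3)
rmse_r_by_K = average_rmse(TABLE4, key_index=1, value_index=4)
```

Grouping by J reproduces the values 0.049, 0.035, and 0.026 reported for $λ_{c}$ , and grouping by K reproduces 0.113, 0.077, 0.052, 0.039, and 0.028 for $λ_{r}$ .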

The biases and RMSEs of the intercepts of the 10 items exhibited similar patterns; therefore, in the interest of brevity, only the results of two items (Items 8 and 10) are presented in Tables 5 and 6. The biases and RMSEs of the intercepts of the two items were similar in magnitude. For example, when $J = 1,000$ , $K = 400$ , and variance ratio $=$ 1, the biases of the four intercepts were 0.005, 0.005, 0.004, and 0.005 for Item 8 and 0.008, 0.006, 0.004, and 0.006 for Item 10. The corresponding RMSEs were 0.067, 0.065, 0.065, and 0.070 for Item 8 and 0.068, 0.067, 0.065, and 0.065 for Item 10.

The intercepts were well recovered. For instance, the bias of intercept 1 of Item 8 ( $α_{8,1}$ ) ranged from −0.040 to 0.008 and was similar in magnitude across conditions. Regarding the RMSEs of the intercepts, a general trend was that the RMSEs decreased as N (i.e., $J \times K \times 0.1$ ) increased. For example, the averaged RMSEs of $α_{8,1}$ in conditions with N = 2,500 (Conditions 1–3), 5,000 (Conditions 4–6), 10,000 (Conditions 7–12), 20,000 (Conditions 13–15), 40,000 (Conditions 16–21), 80,000 (Conditions 22–24), and 160,000 (Conditions 25–27) were, respectively, 0.193, 0.134, 0.118, 0.092, 0.081, 0.066, and 0.056. However, among conditions that have the same N but different J and K, the RMSEs were slightly smaller in conditions with a more balanced design. In conditions where $J = 1,000$ and $K = 400$ (Conditions 16–18), the RMSEs that correspond to variance ratio $=$ 1, 2, and 4 were 0.067, 0.060, and 0.092. In conditions with $J = 2,000$ and $K = 200$ (Conditions 19–21), which also yielded 40,000 observations, the three RMSEs were 0.081, 0.091, and 0.097. For another two combinations of J and K that yielded the same N, $J = 500$ and $K = 200$ (Conditions 7–9) and $J = 1,000$ and $K = 100$ (Conditions 10–12), the RMSEs of the former combination were slightly smaller when variance ratio $=$ 1 and were similar to the RMSEs of the less balanced combination when variance ratio $=$ 2 and 4.
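The grouping of conditions by number of observations follows from $N = J \times K \times 0.1$ . As a quick check, the Python sketch below computes N for each of the nine combinations of J and K in the simulation design and collects the combinations that share the same N. The names `DESIGN`, `observations`, and `by_n` are hypothetical.

```python
from collections import defaultdict

# The nine (J, K) combinations used in the simulation design.
DESIGN = [
    (500, 50), (500, 100), (500, 200),
    (1000, 100), (1000, 200), (1000, 400),
    (2000, 200), (2000, 400), (2000, 800),
]


def observations(J, K, retained=0.10):
    """Number of retained (j, k) pairs when 10% of the J x K pairs are kept."""
    return round(J * K * retained)


# Map each N to the (J, K) combinations that produce it.
by_n = defaultdict(list)
for J, K in DESIGN:
    by_n[observations(J, K)].append((J, K))

# J / K ratio as a rough indicator of how unbalanced each design is.
imbalance = {(J, K): J / K for J, K in DESIGN}
```

Two values of N are shared by two designs each: 10,000 (from $J = 500$ , $K = 200$ and $J = 1,000$ , $K = 100$ ) and 40,000 (from $J = 1,000$ , $K = 400$ and $J = 2,000$ , $K = 200$ ). In each pair, the first design has the smaller J/K ratio and is the more balanced one.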

Table 5.

Estimates of Intercepts (Item 8)

Condition	J	K	Variance ratio	Bias ($α_{8,1} = 2.13$)	RMSE ($α_{8,1} = 2.13$)	Bias ($α_{8,2} = 0.20$)	RMSE ($α_{8,2} = 0.20$)	Bias ($α_{8,3} = −0.59$)	RMSE ($α_{8,3} = −0.59$)	Bias ($α_{8,4} = −2.69$)	RMSE ($α_{8,4} = −2.69$)
1	500	50	1	.014	.188	.012	.182	.013	.179	.011	.190
2			2	.024	.184	.017	.188	.017	.183	.014	.180
3			4	−.040	.208	−.025	.202	−.030	.200	−.040	.210
4		100	1	−.001	.130	−.010	.124	−.008	.124	−.005	.127
5			2	−.001	.138	.009	.136	.007	.134	.012	.139
6			4	−.003	.134	−.007	.128	−.003	.128	−.013	.138
7		200	1	−.005	.085	−.004	.088	−.001	.087	.003	.095
8			2	.004	.114	.006	.110	.004	.108	.011	.113
9			4	−.024	.139	−.024	.137	−.026	.139	−.026	.140
10	1,000	100	1	.000	.125	.006	.121	.008	.125	.005	.128
11			2	−.015	.110	−.017	.108	−.014	.107	−.016	.109
12			4	−.010	.136	−.016	.139	−.014	.141	−.017	.143
13		200	1	.008	.088	.010	.092	.009	.088	.013	.091
14			2	−.020	.088	−.020	.088	−.018	.086	−.023	.091
15			4	−.010	.099	−.007	.099	−.004	.101	−.004	.098
16		400	1	.005	.067	.005	.065	.004	.065	.005	.070
17			2	−.001	.060	−.002	.060	.000	.060	−.003	.060
18			4	−.005	.092	−.007	.092	−.005	.091	−.004	.090
19	2,000	200	1	−.014	.081	−.013	.080	−.012	.080	−.014	.080
20			2	−.001	.091	.000	.090	.000	.090	−.005	.090
21			4	.001	.097	.003	.096	.003	.097	.002	.098
22	2,000	400	1	.008	.068	.009	.070	.009	.071	.009	.072
23			2	.005	.068	.005	.066	.004	.066	.004	.067
24			4	.007	.063	.007	.064	.008	.064	.007	.064
25		800	1	−.008	.047	−.009	.047	−.009	.047	−.010	.047
26			2	−.010	.056	−.010	.056	−.011	.056	−.010	.056
27			4	−.019	.064	−.021	.066	−.020	.066	−.021	.065

Note. True = generating value; RMSE = root mean squared error.

Table 6.

Estimates of Intercepts (Item 10)

Condition	J	K	Variance ratio	Bias ($α_{10,1} = 4.04$)	RMSE ($α_{10,1} = 4.04$)	Bias ($α_{10,2} = 1.61$)	RMSE ($α_{10,2} = 1.61$)	Bias ($α_{10,3} = 0.29$)	RMSE ($α_{10,3} = 0.29$)	Bias ($α_{10,4} = −2.40$)	RMSE ($α_{10,4} = −2.40$)
1	500	50	1	.030	.210	.020	.184	.013	.180	.009	.181
2			2	.016	.195	.011	.189	.008	.184	.015	.175
3			4	−.044	.220	−.024	.211	−.027	.206	−.034	.209
4		100	1	.000	.147	−.003	.120	−.005	.120	−.003	.125
5			2	.013	.155	.009	.136	.011	.134	.013	.138
6			4	−.006	.146	−.007	.134	−.012	.131	−.003	.136
7		200	1	.001	.096	.002	.088	.001	.089	.000	.090
8			2	.007	.114	.004	.101	.005	.104	.008	.113
9			4	−.025	.141	−.024	.140	−.021	.135	−.021	.139
10	1,000	100	1	−.002	.136	.001	.128	.006	.126	.011	.128
11			2	−.009	.117	−.014	.104	−.015	.106	−.015	.108
12			4	−.014	.143	−.020	.140	−.018	.141	−.020	.141
13		200	1	.008	.099	.008	.091	.008	.091	.005	.096
14			2	−.021	.090	−.022	.089	−.021	.089	−.022	.093
15			4	.000	.101	−.002	.098	−.002	.099	−.004	.103
16		400	1	.008	.068	.006	.067	.004	.065	.006	.065
17			2	.001	.066	.001	.062	−.001	.060	.000	.059
18			4	−.003	.090	−.003	.092	−.005	.094	−.003	.093
19	2,000	200	1	−.010	.081	−.011	.079	−.012	.078	−.014	.080
20			2	−.004	.098	−.004	.096	−.002	.094	−.004	.096
21			4	.005	.097	.006	.099	.006	.097	.007	.097
22	2,000	400	1	.009	.073	.007	.071	.007	.070	.007	.071
23			2	.002	.068	.004	.068	.005	.069	.004	.069
24			4	.006	.063	.006	.063	.004	.063	.005	.065
25		800	1	−.008	.047	−.008	.047	−.009	.047	−.010	.047
26			2	−.010	.057	−.011	.056	−.011	.056	−.011	.057
27			4	−.019	.069	−.018	.068	−.020	.065	−.021	.067

Note. True = generating value; RMSE = root mean squared error.

Table 7 summarizes the true values, biases, and RMSEs of the regression coefficients that correspond to the column-side and row-side covariates. The bias of the column-side regression coefficient $β$ ranged from −0.012 to 0.006. The RMSE of $β$ decreased as J increased. The averaged RMSEs in conditions where $J =$ 500, 1,000, and 2,000, were respectively 0.052, 0.036, and 0.025. Holding J constant, the RMSEs were similar in magnitude across different combinations of Ks and variance ratios.

The bias of the row-side regression coefficient $γ$ was also small, ranging from −0.026 to 0.024. The RMSE of $γ$ decreased as K increased. The averaged RMSEs in conditions where $K =$ 50, 100, 200, 400, and 800 were, respectively, 0.157, 0.109, 0.072, 0.052, and 0.037. Controlling for K, the magnitudes of the RMSEs were similar regardless of the levels of J and the variance ratio.

Table 7.

Estimates of Regression Coefficients

Condition	J	K	Variance ratio	Bias ($β = 0.7$)	RMSE ($β = 0.7$)	Bias ($γ = 0.3$)	RMSE ($γ = 0.3$)
1	500	50	1	.004	.053	.024	.168
2			2	−.002	.057	.000	.145
3			4	−.012	.051	−.026	.158
4	500	100	1	.006	.053	.015	.117
5			2	−.001	.051	.001	.106
6			4	−.002	.053	.005	.105
7	500	200	1	−.006	.050	.014	.063
8			2	.005	.051	.002	.077
9			4	.003	.053	.005	.072
10	1,000	100	1	−.006	.034	−.013	.117
11			2	−.006	.037	−.008	.095
12			4	−.002	.037	.000	.114
13	1,000	200	1	−.001	.038	.000	.081
14			2	−.006	.035	−.003	.062
15			4	.004	.035	.001	.077
16	1,000	400	1	−.002	.037	−.007	.053
17			2	−.001	.034	.001	.053
18			4	−.005	.037	−.004	.051
19	2,000	200	1	−.001	.025	.000	.073
20			2	−.002	.025	−.005	.072
21			4	.001	.027	.007	.074
22	2,000	400	1	.003	.024	−.002	.053
23			2	.002	.025	−.007	.055
24			4	−.003	.026	.003	.047
25	2,000	800	1	.003	.024	−.001	.038
26			2	.002	.025	−.003	.035
27			4	.002	.022	.002	.039

Note. $β$ and $γ$ denote the regression coefficients associated with the column-side and row-side covariates, respectively.

5. Empirical Demonstration

The proposed cross-classified IRT model and the associated estimation algorithm were applied to SET data collected at a large public university in the academic year 2018–2019.

5.1. Sample and Measure

The analysis presented here focused on a seven-item subscale from the survey that aims to measure instructors’ teaching as related to diversity (e.g., “The diversity of my classmates enriched my learning in this course”). Each of the seven items has a five-point rating scale from 1 to 5 indicating the extent to which students agree with the statement (i.e., 1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree). In addition to the item responses, demographic information on students, instructors, and courses was available. The demographic variables included student gender, student URM status, student first generation status, instructor gender, instructor URM status, and instructor rank.

Data cleaning and stratified sampling based on demographics (e.g., instructor gender and student URM status) were performed on the original data in order to prevent the parameter estimates from being dominated by large subgroups. The final sample used for the analysis included 19,429 evaluations of 814 instructors completed by 6,462 students. The evaluations were cross-classified by instructors and students.

5.2. Research Questions and Analysis

The research question that survey developers at the higher education institution would like to address is how student-related covariates, such as student gender and URM status, and instructor-related covariates, such as instructor gender and URM status, affect students’ perceptions of the instructors’ teaching as related to diversity.

To answer this research question, the proposed cross-classified IRT model was applied. Specifically, the latent structural model was specified as

$η_{j} = β_{1} x_{1, j} + β_{2} x_{2, j} + ζ_{j},$ (27)

$ξ_{k} = γ_{1} z_{1, k} + γ_{2} z_{2, k} + ∊_{k},$ (28)

where $η_{j}$ is a theorized student-related latent variable and represents student j’s perception or sensitivity to diversity-related materials in class. $x_{1, j}$ indicates student j’s gender (0 = female; 1 = male). $x_{2, j}$ represents student j’s URM status (0 = non-URM; 1 = URM). $ζ_{j}$ is the student-side random effect. In Equation 28, $ξ_{k}$ is a theorized instructor-related latent variable and indicates how much diversity-related material students perceive in the course taught by instructor k. $z_{1, k}$ indicates instructor k’s gender (0 = female; 1 = male). $z_{2, k}$ represents instructor k’s URM status (0 = non-URM; 1 = URM). $∊_{k}$ is the instructor-side random effect. The regression coefficients $β_{1}$ and $β_{2}$ reflect the estimated influence of student gender and URM status on their perceptions of instructors’ teaching as related to diversity. Estimates of $γ_{1}$ and $γ_{2}$ indicate how instructor gender and URM status are factored into the students’ perception of diversity-related instruction.

For identification purposes, the variances of the latent variables $η_{j}$ and $ξ_{k}$ were fixed to one. In addition, the item slopes associated with the same latent variable were constrained to be equal across items, so that a total of two slopes were estimated. This equality constraint was imposed because the data matrix is too sparse to support item-specific slopes.

5.3. Results

The estimates of item parameters are summarized in Table 8. The estimates of the slopes that correspond to $η_{j}$ and $ξ_{k}$ were 3.01 and 2.19, respectively. Thus, the ratio of the variances of the student-related and instructor-related latent variables is $3.01^{2} / 2.19^{2} = 1.89$ . In other words, while there is sizable heterogeneity among students, there is still a substantial amount of variance that is attributable to the instructors.
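The variance ratio follows from the identification constraint: with the variances of $η_{j}$ and $ξ_{k}$ fixed to one, the variance that each latent variable contributes to the linear predictor is the square of its slope. The Python sketch below works through this arithmetic. The additional quantity `student_share`, the student-side proportion of the combined latent variance, is not reported in the article and is shown only as an illustration.

```python
def variance_components(lam_student, lam_instructor):
    """Variance contributions of two unit-variance latent variables.

    With Var(eta) = Var(xi) = 1, the term lam * latent has variance lam ** 2.
    """
    var_student = lam_student ** 2
    var_instructor = lam_instructor ** 2
    ratio = var_student / var_instructor
    student_share = var_student / (var_student + var_instructor)
    return ratio, student_share


# Slope estimates reported in Table 8.
ratio, student_share = variance_components(3.01, 2.19)
```

Here `ratio` rounds to 1.89, matching the value reported above, and `student_share` is roughly 0.65, so about a third of the combined latent variance lies on the instructor side.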

Table 8.

Item Parameter Estimates and Standard Errors (SEs)

Item	Slope ${\hat{λ}}_{c}$ (SE)	Slope ${\hat{λ}}_{r}$ (SE)	Intercept ${\hat{α}}_{i,1}$ (SE)	Intercept ${\hat{α}}_{i,2}$ (SE)	Intercept ${\hat{α}}_{i,3}$ (SE)	Intercept ${\hat{α}}_{i,4}$ (SE)
1	3.01 (.01)	2.19 (.01)	6.23 (.05)	4.22 (.03)	0.44 (.02)	−3.08 (.03)
2	3.01 (.01)	2.19 (.01)	6.12 (.05)	3.90 (.03)	0.72 (.02)	−2.88 (.03)
3	3.01 (.01)	2.19 (.01)	6.49 (.06)	4.05 (.03)	1.22 (.02)	−2.49 (.03)
4	3.01 (.01)	2.19 (.01)	6.05 (.05)	3.62 (.03)	0.42 (.02)	−2.99 (.03)
5	3.01 (.01)	2.19 (.01)	6.03 (.05)	3.59 (.03)	0.27 (.02)	−3.09 (.03)
6	3.01 (.01)	2.19 (.01)	7.41 (.08)	6.16 (.06)	1.96 (.02)	−1.36 (.02)
7	3.01 (.01)	2.19 (.01)	7.27 (.08)	6.10 (.06)	1.59 (.02)	−1.54 (.02)

The recursive approximation method was applied to obtain the SEs. The SEs associated with the slope estimates were small (all 0.01). This is because the slopes were constrained to be equal across items, so responses to all items contribute to these two estimates. The SEs associated with the first intercept of the seven items ${\hat{α}}_{i,1}$ ranged from 0.05 to 0.08. These SEs were slightly larger than the SEs of the other intercepts (which ranged from 0.02 to 0.06). Figure 1 presents the item characteristic curves, which show how the probabilities of selecting different responses change as a function of $η_{j}$ at five levels of $ξ_{k}$ (−2, −1, 0, 1, and 2). For example, Figure 1(a) shows the probabilities of choosing different categories when $ξ_{k} =$ −2, while Figure 1(e) shows how the probabilities change when the instructor latent variable equals 2. When $ξ_{k} =$ −2, Category 4 requires $η_{j}$ to be relatively high (around 2.5), whereas when $ξ_{k} =$ 2, Category 4 requires $η_{j}$ to be relatively low (around −0.2).

Figure 1.

Item characteristic curves of Item 1 at different levels of $ξ$ . (a) $ξ$ = −2. (b) $ξ$ = −1. (c) $ξ$ = 0. (d) $ξ$ = 1. (e) $ξ$ = 2.
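Curves of this kind can be computed from the Table 8 estimates. The Python sketch below is a minimal illustration that assumes a graded response model in slope-intercept form, in which the probability of responding in category c or higher is logistic in $α_{i,c} + λ_{c} η_{j} + λ_{r} ξ_{k}$ , with categories coded 0 to 4. This parameterization and coding are assumptions, and the function names are hypothetical.

```python
import math

# Item 1 intercept estimates and the two common slope estimates from Table 8.
ITEM1_ALPHAS = [6.23, 4.22, 0.44, -3.08]
LAM_C, LAM_R = 3.01, 2.19


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


def category_probs(eta, xi, alphas=ITEM1_ALPHAS, lam_c=LAM_C, lam_r=LAM_R):
    """Category probabilities under the assumed graded response model."""
    # Cumulative probabilities P(Y >= c), padded with P(Y >= 0) = 1
    # and P(Y >= 5) = 0.
    cum = [1.0] + [logistic(a + lam_c * eta + lam_r * xi) for a in alphas] + [0.0]
    # P(Y = c) is the difference between adjacent cumulative probabilities.
    return [cum[c] - cum[c + 1] for c in range(len(alphas) + 1)]


def half_point(alpha, xi, lam_c=LAM_C, lam_r=LAM_R):
    """Value of eta at which P(Y >= c) equals 0.5 for intercept alpha."""
    return -(alpha + lam_r * xi) / lam_c


# Probabilities over a grid of eta values at xi = -2, as in Figure 1(a).
curve_a = [category_probs(eta / 10.0, -2.0) for eta in range(-40, 41)]
top_half_point = half_point(ITEM1_ALPHAS[-1], xi=-2.0)
```

Under these assumptions, the highest category reaches a probability of 0.5 at $η_{j} ≈ 2.48$ when $ξ_{k} =$ −2, in line with the value of around 2.5 noted above.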

The estimates of regression coefficients and the associated SEs were ${\hat{β}}_{1} =$ 0.02 (.02), ${\hat{β}}_{2} =$ −0.02 (.03), ${\hat{γ}}_{1} =$ 0.19 (.01), and ${\hat{γ}}_{2} =$ 0.42 (.01). These estimates help address the research question regarding impacts of the covariates of interest. A positive and significant ${\hat{γ}}_{1}$ indicates that students perceive more diversity-related materials in courses taught by male instructors on average, controlling for other variables in the model. Students perceive more diversity-related materials in courses taught by instructors who are URM, compared with those taught by non-URM instructors. In contrast, the estimate of $β_{1}$ indicates that male and female students do not differ significantly in terms of how sensitive they are to diversity-related materials in class, controlling for other covariates. Students that are from URM groups and those who are not do not differ significantly in terms of how sensitive they are to diversity-related materials in class either.
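The significance statements above can be related to the reported estimates and SEs through a simple calculation. This section does not state which test was used; the Python sketch below assumes a two-sided Wald z test, in which each estimate is divided by its SE and compared with a standard normal distribution. The names `ESTIMATES`, `wald_z`, and `two_sided_p` are hypothetical.

```python
import math

# (estimate, SE) pairs as reported in the text.
ESTIMATES = {
    "beta_1": (0.02, 0.02),    # student gender
    "beta_2": (-0.02, 0.03),   # student URM status
    "gamma_1": (0.19, 0.01),   # instructor gender
    "gamma_2": (0.42, 0.01),   # instructor URM status
}


def wald_z(estimate, se):
    """Wald z statistic (assumed test; not stated in the article)."""
    return estimate / se


def two_sided_p(z):
    """Two-sided p value under a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


z_values = {name: wald_z(est, se) for name, (est, se) in ESTIMATES.items()}
p_values = {name: two_sided_p(z) for name, z in z_values.items()}
```

Under this assumed test, the z statistics for the two instructor-side coefficients are about 19 and 42, whereas those for the student-side coefficients are at most 1 in absolute value, which is consistent with the interpretation given in the text.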

6. Discussion

6.1. Research Summary

In this study, we developed a cross-classified IRT model that properly handles item-level data with cross-classified structure, which are prevalent in social and behavioral sciences. The proposed model consists of two components: (a) a latent structural model and (b) a measurement model. The latent structural model specifies the relationships between observed covariates and latent variables associated with the crossed factors in the data. The measurement model describes the influence of latent variables on observed item responses through standard IRT models, such as the 2PL model and the GRM. For illustration purposes, the model presented in this article considered two crossed factors. The proposed model can be readily generalized to incorporate more than two crossed factors.

We introduced a new variant of the MH-RM algorithm (Cai, 2008, 2010a, 2010b) to find the MLEs of parameters in the proposed cross-classified IRT model. Specifically, an imputation scheme that couples the Metropolis-within-Gibbs algorithm (Patz & Junker, 1999a, 1999b) with the AIP algorithm (Cho & Rabe-Hesketh, 2011; Chung & Cai, 2021; Clayton & Rasbash, 1999) was applied in the stochastic imputation step of the MH-RM algorithm. With this imputation strategy, latent variables associated with different crossed factors are sampled in alternation using the Metropolis-within-Gibbs sampler. The proposed estimation algorithm does not require the data to be fully crossed and can accommodate empty cells in the data matrix (i.e., missingness).
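The alternating sampling idea can be illustrated with a deliberately simplified sketch. The Python code below is not the flexMIRT® implementation and omits the Robbins–Monro update entirely. It assumes a single dichotomous item with a 2PL-type logistic model, standard normal priors without covariates, and random-walk proposals with standard deviations `h_c` and `h_r`; all function names are hypothetical. One sweep updates every column-side latent variable given the current row-side values and then every row-side latent variable given the updated column-side values. Only observed (j, k) cells enter the likelihood, so empty cells require no special handling.

```python
import math
import random


def log_lik(y, eta, xi, alpha, lam_c, lam_r):
    """Log-likelihood of one binary response under a 2PL-type model."""
    t = alpha + lam_c * eta + lam_r * xi
    # log P(y = 1) = -log(1 + exp(-t)); log P(y = 0) = -log(1 + exp(t)).
    return -math.log1p(math.exp(-t)) if y == 1 else -math.log1p(math.exp(t))


def update_side(own, other, obs_by_unit, own_is_column, h, pars, rng):
    """One random-walk Metropolis update for each unit on one side."""
    alpha, lam_c, lam_r = pars
    accepted = 0
    for u, obs in obs_by_unit.items():
        proposal = own[u] + h * rng.gauss(0.0, 1.0)
        # Log ratio of standard normal prior densities (assumed prior).
        log_ratio = -0.5 * (proposal ** 2 - own[u] ** 2)
        for v, y in obs:
            if own_is_column:
                new = log_lik(y, proposal, other[v], alpha, lam_c, lam_r)
                old = log_lik(y, own[u], other[v], alpha, lam_c, lam_r)
            else:
                new = log_lik(y, other[v], proposal, alpha, lam_c, lam_r)
                old = log_lik(y, other[v], own[u], alpha, lam_c, lam_r)
            log_ratio += new - old
        # Accept with probability min(1, exp(log_ratio)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            own[u] = proposal
            accepted += 1
    return accepted / len(obs_by_unit)


def alternating_sweep(eta, xi, data, h_c, h_r, pars, rng):
    """Update all eta given xi, then all xi given eta; return acceptance rates."""
    by_col, by_row = {}, {}
    for j, k, y in data:  # only observed cells; the design may be sparse
        by_col.setdefault(j, []).append((k, y))
        by_row.setdefault(k, []).append((j, y))
    rate_c = update_side(eta, xi, by_col, True, h_c, pars, rng)
    rate_r = update_side(xi, eta, by_row, False, h_r, pars, rng)
    return rate_c, rate_r
```

The acceptance rates returned by `alternating_sweep` are the quantities that would be monitored when tuning the proposal standard deviations, for example toward the 0.5 to 0.7 range used in the simulation study.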

A simulation study was conducted to evaluate the performance of the new variant of the MH-RM algorithm under various conditions. Simulation results indicated that model parameters can be well recovered with the estimation scheme in all conditions considered in the present study. To demonstrate the proposed approach, a constrained version of the proposed model along with the estimation method was applied to SET data collected at a large public university. The model parameter estimates allowed answering questions that are of substantive interest. It is recommended that researchers fit different models (e.g., different measurement models or different covariates) and compare the results before drawing conclusions.

6.2. Future Directions

The lack of an approach for appropriately handling SET data is one of the major motivations of this study. The proposed cross-classified IRT model is therefore offered to higher education researchers for better data analytic practice and improved outcomes. Compared with existing approaches, such as the CCREM approach, the model proposed in this study is more coherent, since the cross-classified structure and the psychometric properties of items are considered simultaneously, and it provides more information. The cross-classified IRT model is also more flexible in the sense that additional latent variables can be easily incorporated. In addition, since item-level data with cross-classified structure are ubiquitous in education and allied disciplines, where multidimensional measurement instruments are commonly applied, the proposed model can be broadly applied in these areas to address a variety of research questions. These research questions include but are not limited to questions regarding item properties, the reliability and validity of assessment instruments, and impacts of key covariates.

To the best of our knowledge, there is scant research that evaluates the consequences of misspecifying the cross-classified structure when the outcome is multivariate and categorical. To improve the higher education practice, more investigations are needed to compare the proposed cross-classified IRT model with existing approaches for analyzing SET data, including the CCREM, standard IRT models, and multilevel IRT models. For example, the proposed approach should be compared with the CCREM approach, which overlooks differences in item properties. Estimates of item parameters obtained through the proposed model also need to be compared with the estimates obtained with standard IRT models, which ignore the cross-classified structure. In addition to model parameter estimates, it is worth studying if the misspecifications would result in biased estimates of individual scores. For example, in the context of SET studies, the scores of instructors are of interest as these scores may be used for administrative and evaluative purposes. The instructor scores obtained with the proposed model can be compared with scores obtained with other approaches to improve institutional effectiveness.

The results of simulations in the present study indicate that the MH-RM algorithm is a promising algorithm for estimating multilevel latent variables with crossed random effects. To further evaluate the usefulness of the MH-RM algorithm and to provide guidelines on data collection design, simulations that consider different measurement models (e.g., 2PL model) and a wider range of conditions, especially the more general cases with multidimensional latent variables and multiple covariates, are desired. Conditions with unbalanced unit sizes should also be studied extensively, as in the SET and K–12 settings, the numbers of units of crossed factors could be very different (e.g., Lei et al., 2018; Murphy & Beretvas, 2015).

Another issue that has not been fully addressed in the present study is the missingness of data. As discussed, data that are fully crossed (e.g., all students evaluate all instructors) are rare in practice. Thus, it is of great importance to evaluate the impact of the mechanism, proportion, and pattern of missing data on parameter estimates and inferences. Results of these simulations, such as the minimum proportion of nonmissing data required to obtain accurate regression coefficients and precise individual scores, could inform future data collection design.

The present study focused on model building and parameter estimation. Future research should extend the proposed cross-classified IRT model so that additional routine analyses can be conducted. These analyses include dimensionality assessment, differential item functioning analysis, and model fit assessment. This study considered the GRM for polytomous items and the 2PL model (as a constrained case of the GRM) for dichotomous items. Measurement models other than the GRM and the 2PL model, such as the nominal response model (Bock, 1972) and the 3PL model, should also be explored.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Sijia Huang

References

Barker

K. M.

Dunn

E. C.

Richmond

T. K.

Ahmed

Hawrilenko

Evans

C. R.

(2020). Cross-classified multilevel models (CCMM) in health research: A systematic review of published empirical studies and recommendations for best practices. SSM-Population Health, 12, 100661.

Beretvas

S. N.

(2011). Cross-classified and multiple-membership models. In J. J. Hox & J. K. Roberts (Eds.), Handbook of advanced multilevel analysis (pp. 313–334). Routledge/Taylor & Francis Group.

Bock

R. D.

(1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51.

Browne

W. J.

Goldstein

Rasbash

(2001). Multiple membership multiple classification (MMMC) models. Statistical Modelling, 1, 103–124.

Cai

(2008). A Metropolis–Hastings Robbins–Monro algorithm for maximum likelihood non-linear latent structure analysis with a comprehensive measurement model [Unpublished doctoral dissertation] . The University of North Carolina at Chapel Hill.

Cai

(2010a). High-dimensional exploratory item factor analysis by a metropolis-hastings Robbins–Monro algorithm. Psychometrika, 75, 33–57.

Cai

(2010b). Metropolis–Hastings Robbins-Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35, 307–335.

Cai

(2017). flexMIRTÂ®: A numerical engine for multilevel item factor analysis and test scoring [Computer software manual]. Vector Psychometric Group.

Cho

S.-J.

Rabe-Hesketh

(2011). Alternating imputation posterior estimation of models with crossed random effects. Computational Statistics & Data Analysis, 55, 12–25.

10.

Chung

Cai

(2021). Cross-classified random effects modeling for moderated item calibration. Journal of Educational and Behavioral Statistics, 46, 651–681.

11.

Clayton

Rasbash

(1999). Estimation in large cross random-effect models by data augmentation. Journal of the Royal Statistical Society: Series A (Statistics in Society), 162, 425–436.

12.

De Boeck

(2008). Random item IRT models. Psychometrika, 73, 533–559.

13.

Dempster

A. P.

Laird

N. M.

Rubin

D. B.

(1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society—Series B, 39, 1–22.

14.

Dunn

E. C.

Richmond

T. K.

Milliren

C. E.

Subramanian

(2015). Using cross-classified multilevel models to disentangle school and neighborhood effects: An example focusing on smoking behaviors among adolescents in the United States. Health & Place, 31, 224–232.

15.

Ecob

Croudace

White

Evans

Harrison

Sharp

Jones

(2004). Multilevel investigation of variation in HoNOS ratings by mental health professionals: A naturalistic study of consecutive referrals. International Journal of Methods in Psychiatric Research, 13, 152–164.

16.

Falk

C. F.

Cai

(2016). A flexible full-information approach to the modeling of response styles. Psychological Methods, 21, 328–347.

17.

Fielding

Goldstein

(2006). Cross-classified and multiple membership structures in multilevel models: An introduction and review (Research Report No. 791). https://dera.ioe.ac.uk/6469/1/RR791.pdf.

18.

Fisher

R. A.

(1925). Theory of statistical estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 22, 700–725.

19.

Fox

J.-P.

(2003). Stochastic EM for estimating the parameters of a multilevel IRT model. British Journal of Mathematical and Statistical Psychology, 56, 65–81.

20.

Fox

J.-P.

(2004). Applications of multilevel IRT modeling. School Effectiveness and School Improvement, 15, 261–280.

21.

Fox

J.-P.

(2005). Multilevel IRT using dichotomous and polytomous response data. British Journal of Mathematical and Statistical Psychology, 58, 145–172.

22.

Fox

J.-P.

Glas

C. A.

(2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66, 271–288.

23.

Goldstein

(1994). Multilevel cross-classified models. Sociological Methods & Research, 22, 364–375.

24.

Hastings

W. K.

(1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.

25.

Huang

(2021). Estimation of cross-classified multilevel item response theory models with Metropolis-Hastings Robbins-Monro Algorithm [PhD dissertation] . University of California, Los Angeles.

26.

Huang

Luo

Cai

(2022). An explanatory multidimensional random item effects rating scale model. Educational and Psychological Measurement. https://doi.org/10.1177/00131644221140906

27.

Falk

C. F.

(2019). Modeling response styles in cross-country self-reports: An application of a multilevel multidimensional Nominal Response Model. Journal of Educational Measurement, 56, 169–191.

28.

Kamata

(2001). Item analysis by the hierarchical generalized linear model. Journal of Educational Measurement, 38, 79–93.

29.

Lambert

P. C.

(2006). Comment on article by Browne and Draper. Bayesian Analysis, 1, 543–546.

30.

Leckie

(2009). The complexity of school and neighbourhood effects and movements of pupils on school differences in models of educational achievement. Journal of the Royal Statistical Society: Series A (Statistics in Society), 172, 537–554.

31.

Lei

Leroux

A. J.

(2018). Does a teacher’s classroom observation rating vary across multiple classrooms? Educational Assessment, Evaluation and Accountability, 30, 27–46.

32.

Levels

Dronkers

Kraaykamp

(2008). Immigrant children’s educational achievement in western countries: Origin, destination, and community effects on mathematical performance. American Sociological Review, 73, 835–853.

33.

Lord

F. M.

Novick

M. R.

(1968). Statistical theories of mental test scores. Addison-Wesley.

34.

Louis

T. A.

(1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, 44, 226–233.

35.

Luo

Kwok

O.-m.

(2009). The impacts of ignoring a crossed factor in analyzing cross-classified data. Multivariate Behavioral Research, 44, 182–212.

36.

Maier

K. S.

(2001). A Rasch hierarchical measurement model. Journal of Educational and Behavioral Statistics, 26, 307–330.

37.

Metropolis

Rosenbluth

A. W.

Rosenbluth

M. N.

Teller

A. H.

Teller

(1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092.

38.

Meyers

J. L.

Beretvas

S. N.

(2006). The impact of inappropriate modeling of cross-classified data structures. Multivariate Behavioral Research, 41, 473–497.

39.

Monroe

Cai

(2014). Estimation of a Ramsay-curve item response theory model by the Metropolis–Hastings Robbins–Monro algorithm. Educational and Psychological Measurement, 74, 343–369.

40.

Murphy

D. L.

Beretvas

S. N.

(2015). A comparison of teacher effectiveness measures calculated using three multilevel models for Raters effects. Applied Measurement in Education, 28, 219–236.

41.

Patz

R. J.

Junker

B. W.

(1999a). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342–366.

42. Patz, R. J., & Junker, B. W. (1999b). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24, 146–178.

43. Pedan, Varasteh, L. T., & Schneeweiss (2007). Analysis of factors associated with statin adherence in a hierarchical model considering physician, pharmacy, patient, and prescription characteristics. Journal of Managed Care Pharmacy, 13, 487–496.

44. R Core Team. (2018). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. https://www.R-project.org/

45. Rabe-Hesketh, Skrondal, & Pickles (2004). Generalized multilevel structural equation modeling. Psychometrika, 69, 167–190.

46. Rasbash, Leckie, Pillinger, & Jenkins (2010). Children’s educational progress: Partitioning family, school and area effects. Journal of the Royal Statistical Society: Series A (Statistics in Society), 173, 657–682.

47. Raudenbush, S. W. (1993). A crossed random effects model for unbalanced data with applications in cross-sectional and longitudinal research. Journal of Educational Statistics, 18, 321–349.

48. Robbins & Monro (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22, 400–407.

49. Robert & Casella (2013). Monte Carlo statistical methods. Springer Science & Business Media.

50. Samejima (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 17.

51. Spooren (2010). On the credibility of the judge: A cross-classified multilevel analysis on students’ evaluation of teaching. Studies in Educational Evaluation, 36, 121–131.

52. Van den Noortgate, De Boeck, & Meulders (2003). Cross-classification multi-level logistic models in psychometrics. Journal of Educational and Behavioral Statistics, 28, 369–386.

53. Yang, J. S., & Cai (2014). Estimation of contextual effects through nonlinear multilevel latent variable modeling with a Metropolis–Hastings Robbins–Monro algorithm. Journal of Educational and Behavioral Statistics, 39, 550–582.

54. Daniel (2017). The impact of inappropriate modeling of cross-classified data structures on random-slope models. Journal of Modern Applied Statistical Methods, 16, 25.

Cross-Classified Item Response Theory Modeling With an Application to Student Evaluation of Teaching
