Abstract
Evaluating and comparing models with respect to their predictive performance is a cornerstone of Bayesian statistics. Two related and important techniques are leave-one-out cross-validation and stacking. Both quantify, for a set of candidate models, the ability to predict unseen observational units from the same data-generating process, and relate the models to one another on that basis. Recent advances in software development—in particular, the Stan modeling framework—have made it possible to apply these techniques easily to a wide range of models. However, in more complex models, such as the widely used classes of hierarchical models and mixture models, the choice of observational unit is not trivial and can create the need for numerical integration, particularly in non-normal models. We present a case study of Bayesian mixture item response models for cross-classified multirater data, in which the most parsimonious choice of observational unit required two-dimensional integration. We show that implementing a numerical quadrature scheme directly within the Stan model code, which is available as Supplemental Data, allows for efficient and accurate estimation of predictive performance.
Keywords
Bayesian predictive model evaluation is increasingly popular in the social and behavioral sciences, especially for evaluating and comparing complex statistical (e.g., psychometric) models. The leave-one-out cross-validation (LOO-CV; Bernardo & Smith, 2004; Geisser & Eddy, 1979; Vehtari & Ojanen, 2012; Vehtari et al., 2017) index can be viewed as a generalization of the Akaike information criterion (AIC; Akaike, 1998; Stone, 1977), providing an estimate of out-of-sample predictive accuracy. In contrast to the AIC, it quantifies the uncertainty contained in the assessment of predictive accuracy and takes into account all the information contained in the Bayesian analysis. Moreover, it provides additional diagnostics, such as stability coefficients and an effective number of parameters, informing researchers of the reliability of results as well as potentially misbehaved observational units and the degree of overfitting, respectively. For these reasons, the use of LOO-CV is preferred over traditional information criteria such as the AIC or, similarly, the Bayesian information criterion (BIC; Schwarz, 1978), in a fully Bayesian analysis (Gelman et al., 2014; McElreath, 2020; Weakliem, 1999).
LOO-CV amounts to calculating the sum of the log predictive densities for each observational unit, conditional on all other units (Vehtari & Ojanen, 2012). In Bayesian analysis, it is typically implemented using efficient approximations, such as Pareto-Smoothed Importance Sampling (PSIS-LOO; Vehtari et al., 2017), which avoids the need to re-fit the model multiple times, making it computationally efficient. Stacking weights (or model averaging weights) can be used alongside PSIS-LOO (Yao et al., 2018). They can improve predictive performance by optimally combining multiple models, thereby capturing a broader range of uncertainty and mitigating the impact of any single model’s misspecification (Piironen & Vehtari, 2017).
How observational units are defined is a choice left to the practitioner. Although more general forms of LOO-CV have been introduced (see, e.g., Bürkner et al., 2020, 2021), the usual conceptualization of LOO-CV and its implementation via PSIS require observational units to be exchangeable such that the likelihood can be factorized (Vehtari et al., 2017; Yao et al., 2018). This structure is required, for example, by the frequently used R (R Core Team, 2021) package loo (Vehtari, Gabry, et al., 2022), which is closely associated with the Stan modeling framework (Stan Development Team, 2022).
In simple models, such as linear regression, the only available choice of observational unit is the individual data points. For hierarchical models, however, each level represents a possible choice of unit (Vehtari & Ojanen, 2012; Vehtari et al., 2017). If, for example, students are nested in classes that are themselves nested in schools, cross-validation can refer to predicting new students from existing classes, new classes with new students from existing schools, or new schools altogether.
An important special case of hierarchical models in psychology is psychometric latent variable models (Hoyle, 2012; Irwing et al., 2018). Typically, they are used to analyze data in which respondents answer a set of items, which induces a natural nesting of item responses within persons. These models are used to measure a latent construct, and therefore each person is associated with a person-specific parameter.
When persons are taken as the observational unit, this is problematic for PSIS-LOO (and thus, efficient stacking implementations based on it), because leaving out a person means leaving out the information that is used to estimate this person’s parameter (Gelman et al., 2014). In this case, each observational unit is highly informative for the posterior distribution (Vehtari et al., 2017). Thus, unit-specific parameters must be eliminated to use PSIS-LOO (Millar, 2018)—either analytically, by approximation, or, when such efficient solutions are not available, by numerical integration. In psychometrics, this is the case for ordinal factor models and other models with non-normal likelihoods, such as those based on item response theory (IRT; van der Linden & Hambleton, 1997). These models were developed to analyze responses on items with ordered categories that are common in surveys as well as psychological testing.
One reason to define the person, rather than the individual item response, as the observational unit in a psychometric latent variable model lies in theoretical considerations. Specifically, leaving out individual responses implies evaluating the model’s ability to predict new responses by the same persons on the same items, whereas leaving out persons (i.e., response vectors on the given set of items) estimates the predictive accuracy for new responses by new persons on the same items. The latter is the theoretically more pertinent quantity because researchers are usually interested in evaluating how well a model generalizes to a target population of respondents (Merkle et al., 2019).
Another reason to perform cross-validation on the person level is that in some models, the likelihood cannot be factorized within persons. This is the case for mixture models assuming latent subpopulations of respondents that a model is tasked to infer from the data (De Ayala & Santiago, 2017; von Davier & Rost, 2006). In these models, the probability of a person giving some response on an item depends on the person’s class membership, so the individual responses of a person are no longer exchangeable once class membership is marginalized. Then, only full response vectors representing persons, with class membership parameters marginalized out of the individual likelihood contributions, can be treated as exchangeable units.
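To make this concrete, consider a generic mixture IRT model (the notation here is purely illustrative and differs from the one introduced in the Model Description section). With class proportions $\pi_c$, $c = 1, \dots, C$, and a person parameter $\theta_p$, the likelihood contribution of person $p$ with responses $y_{p1}, \dots, y_{pI}$ is

$$p(y_{p1}, \dots, y_{pI}) \;=\; \sum_{c=1}^{C} \pi_c \int \prod_{i=1}^{I} p(y_{pi} \mid \theta_p, c)\; p(\theta_p \mid c)\, d\theta_p ,$$

which factorizes across persons but not across the individual responses within a person.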
The problem of marginalizing out unit-specific parameters is exacerbated when multiple levels are nested within the observational unit. This is the case, for example, in multi-rater models (Eid et al., 2025). Here, each target (e.g., a teacher) is associated with multiple raters (e.g., students), and the target as well as each rater is equipped with its own parameters. When each rater is additionally associated with multiple targets, cross-classified models (Eid et al., 2025; Koch et al., 2016) can be fit, which separate rater, target, and interaction effects, all of which represent different latent variables. Taking either targets or raters as the observational unit then implies integrating out their specific parameters as well as the interaction effects, resulting in the need for two-dimensional integration.
In this paper, we present a case study of how cross-validation and stacking can be performed for Bayesian cross-classified IRT models that contain a mixture distribution assumption for the rater population. The analyzed data are teaching evaluations from our university, and the models were chosen to find and control for student response styles while also accounting for the ordinal nature of the responses as well as the cross-classified data structure in order to maximize the validity of the evaluation. The models are fit in the Stan modeling language, and we show how integrating out unit-specific parameters in two dimensions makes it possible to perform model evaluation and comparison by means of the loo package.
Predictive Model Evaluation
In this section, we briefly summarize the mathematical basis of PSIS-LOO and stacking to introduce the quantities of interest we calculated in our study. An extensive overview of Bayesian predictive methods is given by Vehtari and Ojanen (2012), and an abridged and more practically oriented version containing recommendations was published by Piironen and Vehtari (2017). More information on and introductions to PSIS-LOO can be found in Gelman et al. (2014) as well as Vehtari et al. (2017), and, for stacking, Yao et al. (2018).
PSIS-LOO
Consider a collection of
When individual units are predicted conditional on the complete data
can be calculated. When compared with the actual number of parameters a model has,
Exact LOO-CV requires re-fitting the model
Moreover, for each observation
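To illustrate how these quantities are obtained in practice, the following minimal R sketch uses the loo package on a pointwise log-likelihood array extracted from a fitted Stan model (all object names are hypothetical, and an rstan fit object is assumed; the actual code for our models is provided in the Supplemental Data):

```r
library(loo)

# Pointwise log-likelihood draws for the marginal per-rater likelihoods, stored by
# the generated quantities block of the Stan model and extracted as an
# iterations x chains x units array. Other interfaces provide equivalent extractors.
log_lik_grm3 <- extract_log_lik(fit_grm3, parameter_name = "log_lik",
                                merge_chains = FALSE)

# Relative effective sample sizes account for the autocorrelation of the draws.
r_eff_grm3 <- relative_eff(exp(log_lik_grm3))

psis_grm3 <- loo(log_lik_grm3, r_eff = r_eff_grm3)
print(psis_grm3)  # elpd_loo, its standard error, and p_loo

# Units whose Pareto-k diagnostic exceeds 0.7 indicate an unreliable
# importance-sampling approximation and may require exact re-fitting.
sum(pareto_k_values(psis_grm3) > 0.7)

# Compare models via differences in elpd_loo and their standard errors.
loo_compare(psis_grm3, psis_gpcm3)
```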
Stacking
In stacking, LOO-CV is performed with all
where
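In terms of the loo package, the stacking optimization operates on a matrix of pointwise LOO log predictive densities with one row per observational unit and one column per model. A minimal sketch, continuing the hypothetical example above:

```r
# Pointwise elpd contributions of each model, combined column-wise.
lpd_point <- cbind(
  GRM3  = psis_grm3$pointwise[, "elpd_loo"],
  GPCM3 = psis_gpcm3$pointwise[, "elpd_loo"]
)

# Stacking weights are non-negative and sum to 1.
stacking_weights(lpd_point)

# Equivalent convenience interface working directly on loo objects:
loo_model_weights(list(GRM3 = psis_grm3, GPCM3 = psis_gpcm3), method = "stacking")
```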
Elimination of Unit-Specific Parameters
As discussed in the introduction, when each observational unit is associated with its own unit-specific parameter(s), these parameters must be eliminated to perform PSIS-LOO by means of the loo R package. In certain special cases, this can be done by conditioning on sufficient statistics (Basu, 1977). Generally, however, the likelihood must be marginalized with respect to these parameters (Merkle et al., 2019; Vehtari et al., 2016), that is, they must be integrated out. In this case, the model can be represented in two ways: either based on the conditional likelihood, which still contains the nuisance parameters, or on the marginal one (cf. Merkle et al., 2019). Using the marginal likelihood while fitting the model reduces the size of the parameter space and thus stabilizes the MCMC sampling process (Merkle & Rosseel, 2018). However, when marginalization is computationally expensive, it can be faster to integrate out the parameters post-hoc; that is, for each draw from the posterior distribution, the likelihood is marginalized with the draw’s remaining parameter values plugged in.
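As a deliberately simple illustration of post-hoc marginalization (a hypothetical unidimensional graded response model with a single standard-normal person parameter; this is not the two-dimensional scheme required for our models, which is described later), the marginal log-likelihood of one person can be computed for a single posterior draw with base R's adaptive quadrature:

```r
# Marginal log-likelihood of one person's ordinal responses for one posterior
# draw of the item parameters, integrating out a standard-normal person parameter.
# (Hypothetical sketch; names and parameterization are illustrative only.)
marginal_loglik_person <- function(y, alpha, beta) {
  # y:     observed categories (1, ..., K) on the I items
  # alpha: item discriminations for this draw
  # beta:  I x (K - 1) matrix of ordered thresholds for this draw
  integrand <- function(theta) {
    sapply(theta, function(th) {
      lik <- 1
      for (i in seq_along(y)) {
        # GRM category probability as a difference of cumulative logistic terms.
        cum <- c(1, plogis(alpha[i] * (th - beta[i, ])), 0)
        lik <- lik * (cum[y[i]] - cum[y[i] + 1])
      }
      lik * dnorm(th)  # weight by the population density of the person parameter
    })
  }
  log(integrate(integrand, lower = -Inf, upper = Inf)$value)
}
```

Repeating this for every person and every posterior draw yields the pointwise log-likelihood matrix required by PSIS-LOO; performing the same computation inside the Stan program itself, as described below, avoids exporting and post-processing all posterior draws in R.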
For Gaussian models, closed-form expressions for the integrals might exist (Merkle & Rosseel, 2018) or efficient approximations can be used (Vehtari et al., 2016). For many other important classes of models, neither closed forms nor approximations exist, and numerical integration must be performed. Merkle et al. (2019) used adaptive Gaussian quadrature post-hoc to eliminate person parameters in psychometric models in order to perform PSIS-LOO. Stan comes with its own integrator function that uses double exponential quadrature as implemented in the Boost C++ library (Agrawal et al., 2017; Takahasi & Mori, 1973). In preliminary tests, we found double exponential quadrature, implemented directly in the generated quantities block of our Stan models, to perform more efficiently than adaptive Gaussian quadrature; the direct implementation also allowed us to optimize the computation by reusing intermediate calculations. The code for all models can be found in the Supplemental Data (available in the online version of this article).
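For reference, the double exponential (tanh–sinh) rule in its generic textbook form (Takahasi & Mori, 1973), rather than a transcription of our implementation, maps an integral over $(-1, 1)$ to the real line via $x = \tanh\!\left(\tfrac{\pi}{2}\sinh t\right)$, so that with step size $h$ and nodes $t_k = kh$,

$$\int_{-1}^{1} f(x)\,dx \;\approx\; h \sum_{k=-n}^{n} f\!\left(\tanh\!\left(\tfrac{\pi}{2}\sinh t_k\right)\right) \frac{\tfrac{\pi}{2}\cosh t_k}{\cosh^{2}\!\left(\tfrac{\pi}{2}\sinh t_k\right)} .$$

The transformed integrand decays double exponentially in $t$, so a modest number of nodes typically suffices; integrals over $(-\infty, \infty)$, as needed when integrating against normal population distributions, use the analogous substitution $x = \sinh\!\left(\tfrac{\pi}{2}\sinh t\right)$.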
Analyzing Teaching Evaluation Data
In educational research and evaluation, cross-classified structures arise naturally because students usually attend multiple courses and therefore submit multiple ratings (cf. Pagani & Seghieri, 2002). Figure 1 visualizes the data structure for a minimal teaching evaluation design with four teachers and three students. As can be gathered from this figure, every response vector (i.e., a row in the table) is nested within one rater as well as one target. A set of response vectors pertaining to the same target, however, is not nested within any single rater (indicated, as an example, by the black rectangle around all response vectors for target 2 in Figure 1). Conversely, a set of response vectors pertaining to the same rater is not nested within any single target (indicated, as an example, by the dashed rectangles around all response vectors for rater 2 in Figure 1). Thus, there are two possible hierarchies—one for raters and one for targets (NB: targets and raters can also be considered dummy-coded variables on an auxiliary top level; for details, see Skrondal & Rabe-Hesketh, 2004).

Figure 1
Cross-classified data structure for a minimal teaching evaluation design.
On the right, the updating of class membership probabilities is schematically illustrated for a mixture of raters as was implemented in the present work. Each rater is assigned prior class membership probabilities
While this structure renders classical multilevel models inappropriate, it also makes it possible to assess target and rater biases separately. This enables researchers to quantify the interdependency of targets and raters and to relate this measure to the variation in target and rater latent traits. Specifically, a high degree of interdependency indicates that responses cannot be well explained by the latent trait of the target (e.g., teaching ability) and is, therefore, detrimental to the validity of the evaluation—a controversial subject in the pertinent literature (Kromrey, 1994; Marsh, 1984, 1987; Rindermann, 2001; Wolbring, 2013a, 2013b).
Despite the ubiquity of this complex data structure in teaching evaluation, rather simplistic statistical tools (e.g., sum scores and multiple regression) are often used to analyze these data in practice, without controlling for confounding effects such as measurement error (Ziegler & Weis, 2015). This reflects a lack of suitable, tried-and-tested models and analysis techniques that are available to a broad audience. One candidate is cross-classified multilevel models (or crossed random effects models; Goldstein, 1994, 2010). They allow the explicit modeling of the dependencies induced by the data structure and separate target, rater, and target–rater interaction effects. Simulation studies show that fitting ordinary multilevel models, and thus ignoring the cross-classification, leads to biased estimates (in particular, inflated standard errors; Schultze et al., 2015). Multilevel confirmatory factor analysis models have also been recommended (Sengewald & Vetterlein, 2015). They make it possible to explicitly model measurement error and to specify measurement models on multiple levels (see, e.g., Koch et al., 2014, 2015, 2016).
Few publications have unified these two approaches. Koch et al. (2016) provided an extension to cross-classification for multitrait-multimethod designs with continuous observed variables based on the multilevel correlated trait-correlated method minus 1 [CTC(M − 1)] model by Eid et al. (2008). For these types of data, the model allows the explicit modeling of measurement error and cross-classified effects (latent traits), the combination of various types of methods (e.g., self-assessment and multiple rater assessments per target), estimation of convergent and discriminant validity, regression of cross-classified effects on covariates, as well as the calculation of variance coefficients and (un)reliability.
Nevertheless, teaching evaluations are based on Likert scales, and thus models for continuous outcomes are not appropriate (Liddell & Kruschke, 2018). Rather, because the assessment of a latent trait, such as teaching quality, is the main concern, IRT models should be used. Two major families of IRT models are cumulative models on the one hand and adjacent-category models on the other (Bürkner & Vuorre, 2019). They are often related to different assumptions about the underlying response process (see also Andrich, 1995). Whereas the first family is linked to factor analysis and the discretization of an auxiliary latent continuous variable (Takane & de Leeuw, 1987), the second can be related to an assumed decision process (see, e.g., Plieninger & Meiser, 2014). Often, it is not obvious a priori which class of models is more appropriate for a given application, and a choice is made based on which mathematical properties of either model are more in line with the research goals or, if available, on fit indices (Bürkner & Vuorre, 2019). Therefore, in the present work, we extend the model by Koch et al. (2016) to the most prominent representative of each of these families: the graded response model (GRM; Samejima, 1969, 1997) and the generalized partial credit model (GPCM; Muraki, 1992, 1997), respectively.
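For readers less familiar with these two families, their standard single-level forms are as follows (generic textbook notation under one common convention; the cross-classified mixture versions used in this article extend these linear predictors and are given in the Model Description section). For an item $i$ with discrimination $\alpha_i$, ordered thresholds $\beta_{i2} < \dots < \beta_{iK}$, and a latent trait $\theta$, the GRM defines cumulative probabilities

$$P(Y_i \geq k \mid \theta) = \operatorname{logit}^{-1}\!\big(\alpha_i(\theta - \beta_{ik})\big), \qquad k = 2, \dots, K,$$

with $P(Y_i = k \mid \theta) = P(Y_i \geq k \mid \theta) - P(Y_i \geq k + 1 \mid \theta)$, whereas the GPCM models adjacent categories directly,

$$P(Y_i = k \mid \theta) = \frac{\exp\!\left(\sum_{j=2}^{k} \alpha_i(\theta - \delta_{ij})\right)}{\sum_{l=1}^{K} \exp\!\left(\sum_{j=2}^{l} \alpha_i(\theta - \delta_{ij})\right)}, \qquad k = 1, \dots, K,$$

where the empty sum for $k = 1$ is defined as zero and the $\delta_{ij}$ are step (threshold) parameters.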
Additional complexity in teaching evaluation data comes from students differing qualitatively in their use of questionnaires (Bacci & Gnaldi, 2015), which results in differential item functioning. In particular, students may show certain response styles, such as a preference for the upper and lower ends of a scale, known as extreme responding, or a tendency to agree and thus to choose positively worded categories in the upper half of the scale, called acquiescence (for an overview, see Van Vaerenbergh & Thomas, 2013). Mixture IRT (Mix-IRT) models (De Ayala & Santiago, 2017; Rost, 1990; von Davier & Rost, 2006) combine latent class analysis (LCA) and IRT and are one of many approaches appropriate for modeling this kind of heterogeneity (cf. Henninger & Meiser, 2020). They represent an exploratory approach by assuming that each observational unit (e.g., students; for more details, see the Model Description section) belongs to one of a fixed number of groups (or classes, in the vocabulary of LCA) that need not be known beforehand but can be inferred from the data. Advantageously, this relaxes the assumptions of unidimensionality (cf. Rijmen & De Boeck, 2005) and local independence. At the same time, Mix-IRT models allow checking for measurement invariance because the latent classes can potentially differ in any of the parameters of the model, such as difficulty parameters or latent variances.
There are various examples of these models being applied to standard multilevel data (Cho & Cohen, 2010; Fox, 2005; Lee et al., 2018; Vermunt, 2008a, 2008b). However, few publications have considered Mix-IRT models in the context of teaching evaluation (Bacci & Gnaldi, 2015), and especially in conjunction with cross-classified data structures (Jin & Wang, 2017; Kelcey et al., 2014).
To account for all aforementioned intricacies of teaching evaluation data, we integrated these modeling techniques into a unified approach that we call mixture cross-classified item response theory (Mix-CC-IRT) models. In the following, we give their likelihoods after introducing their components.
Model Description
Consider the evaluation of target (i.e., teacher)
And for the GPCM by
Because the GRM is a cumulative model and probabilities cannot be negative, it needs to hold that
We assume that there are
In the present case, for each item
The parameters
Bayesian posterior sampling methods (such as Hamiltonian Monte Carlo as implemented in Stan; Stan Development Team, 2022) can be made more efficient by breaking down parameters into their independent and centered components. Thus, we further split the item parameters into common tendency and deviation parameters. For the thresholds, we set
As scaling parameters, the factor loadings need to be positive. In Stan, this constraint is implemented via a
and
Note that the only elementary parameters to vary across classes are
Collecting all parameters in
where the sets
For the application of PSIS-LOO and stacking, marginalized likelihood contributions for each rater need to be calculated by approximating the following quantity:
The parameters
Since calculating integrals at each proposed step of the Markov chain is very costly, the estimation of the model is based on the conditional, rather than the marginalized, likelihood. For this reason, the integral in Equation 12b is evaluated only in the “generated quantities” block of our Stan models. To reduce the number of individual function evaluations, we implemented the double exponential quadrature directly inside this block, which allowed us to avoid needlessly recalculating nonvarying parts of the likelihood (e.g., the sums of item parameters and teacher ability parameters within the composite parameters
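To make the structure of this computation concrete, the following R sketch performs the same kind of nested (two-dimensional) marginalization post-hoc for a single rater and a single posterior draw. It is a simplified, hypothetical illustration: it uses Gauss–Hermite rather than double exponential quadrature, assumes a plain GRM in which the linear predictor is the sum of a target ability, a rater effect, and a rater–target interaction effect, and ignores the mixture over rater classes; the full Stan implementation is in the Supplemental Data.

```r
library(statmod)  # gauss.quad() for Gauss-Hermite nodes and weights

# Marginal log-likelihood of all responses given by one rater, for one posterior
# draw, integrating out the rater effect and the rater-target interaction effects.
# Hypothetical sketch; names and the simplified linear predictor are illustrative.
log_lik_rater <- function(y, theta_t, alpha, beta, sigma_xi, sigma_zeta,
                          n_nodes = 21) {
  # y:       list over rated targets; y[[t]] holds the item responses for target t
  # theta_t: target ability parameters of the rated targets (one posterior draw)
  # alpha, beta: item discriminations and I x (K - 1) ordered thresholds (one draw)
  gh <- gauss.quad(n_nodes, kind = "hermite")      # rule for integrals against exp(-x^2)
  xi_nodes   <- sqrt(2) * sigma_xi   * gh$nodes    # rescaled to N(0, sigma_xi^2)
  zeta_nodes <- sqrt(2) * sigma_zeta * gh$nodes    # rescaled to N(0, sigma_zeta^2)
  w <- gh$weights / sqrt(pi)                       # normalized quadrature weights

  grm_lik <- function(resp, eta) {                 # likelihood of one response vector
    prod(vapply(seq_along(resp), function(i) {
      cum <- c(1, plogis(alpha[i] * (eta - beta[i, ])), 0)
      cum[resp[i]] - cum[resp[i] + 1]
    }, numeric(1)))
  }

  # Outer quadrature over the rater effect xi; for each node, an inner quadrature
  # over the interaction effect zeta is carried out separately for every target,
  # because the interactions are assumed conditionally independent given xi.
  outer_vals <- vapply(xi_nodes, function(xi) {
    per_target <- vapply(seq_along(y), function(t) {
      sum(w * vapply(zeta_nodes, function(zeta)
        grm_lik(y[[t]], theta_t[t] + xi + zeta), numeric(1)))
    }, numeric(1))
    prod(per_target)
  }, numeric(1))

  log(sum(w * outer_vals))
}
```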
Application
We fitted both
Ethical Statement
The data analysis was planned and approved as part of a research project funded by the German Research Foundation (project number 405463675). The data were used with permission from the Friedrich Schiller University Jena under a usage agreement and the analysis fully complied with the ethical guidelines of the German Psychological Society.
Data Availability
The data cannot be shared with third parties for privacy reasons.
Data Preparation
The data preparation process is described in detail in Supplemental Appendix A (available in the online version of this article). It resulted in
Fitting Procedure
For each model, four chains were used with 2,000 warmup and 8,000 sampling iterations per chain. Because Hamiltonian Monte Carlo (HMC) is highly effective, no thinning is necessary (except for the Mix-CC-
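For reference, a call with these settings might look as follows when using the cmdstanr interface (file and object names are hypothetical; any Stan interface with equivalent arguments can be used):

```r
library(cmdstanr)

mod <- cmdstan_model("mix_cc_grm.stan")  # hypothetical file name

fit <- mod$sample(
  data            = stan_data,  # responses plus rater, target, and item indices
  chains          = 4,
  parallel_chains = 4,
  iter_warmup     = 2000,
  iter_sampling   = 8000,
  seed            = 1,
  refresh         = 500
)
```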
Prior Specification and Identification of Mixture Parameters
The full prior specification can be found in Table A1 (available in the online version of this article). To achieve high sampler efficiency and satisfactory convergence of the Markov chains, appropriate reparameterizations (e.g., restricting the use of hyperparameters and soft constraints for parameter decompositions) and distributions with low variances were employed. A full justification of the prior model is given in Supplemental Appendix A (available in the online version of this article).
A known issue in mixture modeling is that latent classes are exchangeable, that is, the posterior distribution is invariant with respect to the permutation of class indices (Frühwirth-Schnatter, 2001). Apart from the post-hoc relabeling of classes (e.g., by means of the label.switching package; Papastamoulis, 2016), this issue can be dealt with by imposing an ordering constraint on any quantity varying across classes in the model. We used the latter technique to identify the latent classes, the details of which are given in Supplemental Appendix A (available in the online version of this article).
Results
Convergence
For all models, no divergent transitions were observed, no iteration reached the maximum treedepth, and the estimated Bayesian fraction of missing information was close to 0.6 for all chains (for more information about these measures, refer to Vehtari et al., 2021; Betancourt, 2018; as well as the Stan user’s guide, Stan Development Team, 2022). Moreover, all models but the
For the Mix-CC-
Selected Parameters
The focus of this article is on cross-validation and not on the analysis itself, which is why we only report posterior quantities of parameters that can inform the cross-validation results. In particular, it is important to understand how, within one type of model (i.e., either within the GRM or the GPCM), the two- and three-class solutions differ with respect to their predictions. In the following, a circumflex designates a posterior quantity. In the main body of the text, we report posterior means and 90% posterior intervals. Quantities whose Monte Carlo standard errors exceed 0.01 are underlined.
In the Mix-CC-
Overall, items were very easy and overall easiness did not differ between classes when taking into account the 90% posterior intervals. Indeed,
On the other hand, the sample variance of the centered thresholds (note the sum-to-zero constraint)
In the
In the
In the
Predictive Model Evaluation
Leave-One-Out Cross-Validation
First, we checked Pareto-
Table 1
Leave-One-Out Cross-Validation Results
Note. The second and third columns indicate the number of observational units with a Pareto-
As another diagnostic, the effective number of parameters estimated by PSIS-LOO,
Since the statistical quantities
As Table 1 suggests, the Mix-CC-GRM3 dominates the other models in terms of
On the other hand, the Mix-CC-GPCM3 had a higher predictive accuracy than the Mix-CC-GPCM2, even considering the standard error of the differences. For the remaining one-class models, the successive differences in
Stacking
We performed stacking in three different ways. First, we stacked all models together, resulting in six weights that sum to 1. Second, we stacked the models by number of classes, that is, we included only models with the same number of classes in the stacking procedure, resulting in three pairs of weights. Third, we stacked the one-, two-, and three-class solutions of each type of IRT model, resulting in two weight triples.
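In terms of the loo package, these three stacks correspond to calling the stacking procedure on different subsets of the PSIS-LOO objects (hypothetical object names as before):

```r
# (a) All six models together.
loo_model_weights(
  list(GRM1 = psis_grm1, GRM2 = psis_grm2, GRM3 = psis_grm3,
       GPCM1 = psis_gpcm1, GPCM2 = psis_gpcm2, GPCM3 = psis_gpcm3),
  method = "stacking"
)

# (b) By number of classes, e.g., the two three-class models.
loo_model_weights(list(GRM3 = psis_grm3, GPCM3 = psis_gpcm3), method = "stacking")

# (c) By model type, e.g., the one-, two-, and three-class GRMs.
loo_model_weights(
  list(GRM1 = psis_grm1, GRM2 = psis_grm2, GRM3 = psis_grm3),
  method = "stacking"
)
```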
The results are given in Table 2. They roughly mirror those of the PSIS-LOO procedure. Whereas the Mix-CC-GRM2 and Mix-CC-GRM3 share considerable parts of the prediction weight in Tables 2a and 2c, the Mix-CC-GPCM3 outperforms the Mix-CC-GPCM2 in both stacks. Interestingly, the one-class solutions take up around 10% of the weight, both in Table 2a, when the weights of the two one-class solutions are added, and in the stacking by model type in Table 2c. This indicates that there are observational units that are either not adequately predicted by the multi-class solutions or, conversely, that are predicted just as accurately by models with only one class.
Table 2
Stacking Weights
Note. GRM = graded response model; GPCM = generalized partial credit model.
Discussion
In this study, we showed how to perform predictive model evaluation using PSIS-LOO and stacking in mixture cross-classified IRT models. These procedures were made possible by integrating out rater and interaction parameters. Our implementation of the models, as well as the double exponential quadrature, led to satisfactory results. The corresponding Stan code is part of the Supplemental Data (available in the online version of this article) and can be extended and adapted to other applications.
The advantages of predictive model evaluation are particularly apparent in the context of our application because of the high number of degrees of freedom in the modeling process. First, specifying a higher number of latent classes can make the models more flexible and thus improve the fit to the data, which, at the same time, bears the risk of overfitting. The information provided by LOO-CV and stacking can help to find redundancies and identify overparameterizations. In our case, this meant comparing the effective and true number of parameters as well as considering the stacking weights for the one-class models.
Second, adjacent-category and cumulative models can be employed for the same purpose, and they share many features. It is often assumed that the choice between adjacent-category and cumulative models is a matter of researcher preference or interpretability (Andrich, 1995). The present results paint a more nuanced picture. The magnitude of the difference in predictive power depends on how many classes are used. In the one-class solutions, the Mix-CC-
In addition, this case study shows how stacking can provide valuable information beyond raw prediction scores. When stacked only against the Mix-CC-GRM1, the Mix-CC-GPCM1 is assigned negligible weight, reinforcing the evidence for a true benefit in prediction when choosing the former over the latter. In both the class-wise stacking and the stacking of all models together, the combined weights of the two types of models are of almost equal magnitude, which indicates that they performed similarly for a sizeable portion of observational units. Nevertheless, from the distribution of weights, it can also be concluded that, while the Mix-CC-GPCM3 might perform comparably to the Mix-CC-GRM2, it requires an additional class to do so.
The previous considerations make it clear that predictive model evaluation can yield unique insights beyond conventional fit indices. These methods provide diagnostic information not only on the predictive criteria themselves but also on the stability of estimation and on conspicuous observational units. Next, the
This has important practical implications for educational researchers. First, it enables the systematic testing and comparison not only of measurement invariance assumptions but also of fundamentally different classes of models that assume distinct underlying measurement processes. For example, in an educational assessment context, a researcher might find that a model including response-style effects outperforms one without such effects in predictive accuracy, yet both yield similar ability estimates. This result would indicate the presence of response styles while also supporting the unbiasedness of ability estimates. Conversely, two models assuming different response processes might share predictive weight in the stacking procedure but produce conflicting ability estimates, implying noninvariance in how the ability parameter should be interpreted.
Second, it is straightforward to extend these models by including covariates to predict teaching ability. Estimating the degree of parameter pooling that leads to optimal predictive performance can provide valuable information for policymakers, helping to distinguish, for example, between school-specific and district-level dependencies in teaching effectiveness.
Lastly, predictive model selection can guide future research by revealing whether theoretical assumptions about data structures lead to optimal predictions. Discrepancies may point to problems in model formulation or the interpretation of parameters. A promising direction for future work lies in exploring why certain observational units are better predicted by particular models. The diagnostic information provided by LOO-CV offers a first line of investigation: Pareto-
Hierarchical stacking (Yao et al., 2022) offers another, more advanced, approach: unit-level log-likelihoods from multiple models are analyzed jointly within a hierarchical mixture framework, possibly with covariates. This allows each unit to receive its own set of model weights, and the resulting distribution of weights can illuminate interactions between model structure and unit characteristics.
Overall, a key advantage of cross-validation and stacking is that they allow inferences that extend beyond the mere selection of a single “best” model from a candidate set. As demonstrated in this case study, these approaches can be successfully applied even to complex, non-normal models with nested random effects using the standard Bayesian modeling toolkit.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work was funded by the German Research Foundation (DFG project number: KO 4770/2-1).
Authors
RICHARD MAXIMILIAN BEE is a PhD student at the Psychological Methods division, Friedrich-Schiller-Universität Jena, Am Steiger 3, 07743 Jena, email:
TOBIAS KOCH is a full professor of Psychological Methods at the Friedrich-Schiller-Universität Jena, Am Steiger 3, 07743 Jena, email:
References