Abstract

Research on academic performance typically revolves around average achievement scores of students or schools. Focusing solely on averages can miss important aspects of the learning experience. The recent development of mixed-effects location scale models (MELSM) has provided a modeling technique that incorporates a scale model that captures and explains the consistency of academic achievement within the cluster of interest. Here, we formally introduce an extension to the MELSM, a Spike-and-Slab Mixed-Effects Location Scale Model (SS-MELSM), for simultaneously modeling location and scale parameters while incorporating a spike-and-slab prior to select or shrink random effects. Our approach identifies clusters with unusually large or small within-cluster variance in academic achievement, which can indicate overly inconsistent or consistent outcomes. To assess the performance of the proposed method, we conducted a simulation study, followed by an application to a dataset of 160 schools from the Brazilian Evaluation System of Elementary Education (Saeb) to illustrate its use in educational data analysis. Moreover, we show how to compare models with varying parameters regarding expected predictive accuracy. The results demonstrate that the SS-MELSM successfully identifies schools with unusually high and low consistency in mathematics achievement and that school- and student-level SES were relevant covariates when modeling the location and scale components. The methods presented in this paper are implemented in the R package ivd.

Keywords

Mixed-Effects Location Scale Model, intra-cluster variability, spike-and-slab, variable selection

Research on academic performance typically focuses on the average academic achievement of a student or a school (or other clustering units such as classrooms, districts, counties, etc.). While average performance is indubitably an important metric for academic achievement, it provides an incomplete picture. Specifically, it does not provide information on the variability or the consistency of academic achievement over time or within a cluster. To illustrate the distinction between average academic achievement and consistency, we can concoct a very simple example consisting of two hypothetical schools in which students take a standardized exam: Both schools achieve the same average score of 75% among their students, suggesting a similar academic performance. However, the consistency of student performance within each school could vary greatly. In one school, individual student scores may range widely from 50% to 100%, while the other school’s student scores cluster around 75%. Although the two schools have the same average score, their students’ experiences and levels of mastery of different topics are almost certainly different.

Capturing cluster-level variability alongside the average score offers a more comprehensive view of academic achievement, with both metrics providing unique insights that can guide more targeted student- or school-specific evaluations and support. The idea that average performance and variability convey distinct information is by no means novel, and it has been discussed at the level of students as well as at the level of classrooms and schools.

At the school level, Raudenbush and Bryk (1987) developed a statistical framework to study how organizational characteristics predict variability in academic achievement. Their work showed that dispersion in outcomes is not just statistical noise. Instead, these differences can reveal whether schools help reduce or increase inequality. Socioeconomic status (SES) is a key factor that affects differences both within and between schools. A student’s family SES influences their access to resources and learning opportunities, which affects their achievement and persistence (Lurie et al., 2021; Tompsett & Knoester, 2023; von Stumm et al., 2022). At the same time, the school-level SES, which often reflects patterns of residential and financial segregation, shapes the availability of resources, teacher quality, and curriculum offerings (Perry et al., 2022; Sirin, 2005). These factors lead to large differences between schools in both average achievement and the extent of achievement variation, with lower-quality schools often worsening socioeconomic inequality, while higher-quality schools can sometimes help reduce it (Borgen et al., 2025). Teacher biases against students from lower-SES backgrounds and differences in how schools are organized, such as school size or the range of math courses offered, can also affect how much achievement varies within schools (Doyle et al., 2023).

The idea of using dispersion as a meaningful metric appears in other achievement research as well. For instance, research by Brunner et al. (2013) shows that gender differences in academic achievement are complex. While boys and girls often have similar average achievement, boys tend to show greater variability. As a result, boys are more likely to be overrepresented at both the highest and lowest achievement levels. These results highlight why it is important to look at both average performance and consistency to better understand achievement and guide interventions.

A related, though conceptually distinct, line of research has examined variability at the individual level. For example, Connell and Wellborn (1991) found that consistency in academic performance is associated with better psychological engagement and motivation in students. Similarly, Gottfried et al. (2008) showed that students with more consistent academic performance achieve higher test scores and are more likely to enroll in college. By contrast, other work, such as Wright and von Stumm (2022), questioned the predictive utility of within-person grade variability, showing in a twin study that it did not forecast later educational outcomes.

While these studies of within-person variability are informative, the focus of the present work returns to the within-cluster level. Overall, inconsistent performance may reflect unaccounted factors influencing learning. Detecting such variability can uncover potential learning barriers, optimize individualized teaching strategies, and possibly improve educational outcomes. Moreover, understanding the nature of inconsistency may shed light on systemic issues within the school environment that not only affect academic outcomes but also contribute to students’ social and emotional development (Spörlein & Schlueter, 2018).

In this work, we present an approach that identifies clustering units (students, classrooms, etc.) that exhibit either unusually large or unusually small within-cluster variance – indicating either consistent or inconsistent academic achievement. As such, the goal of this current paper is not to settle the discussion on whether within-cluster variance is predictive of future educational outcomes, but to present a tool that allows researchers to identify and isolate clusters (such as students, classrooms, schools, etc.) that display unusual amounts of residual variability. A similar idea has been put forth by Leckie et al. (2023), who used school-specific expected residual variance from a mixed-effects location scale model (MELSM) to identify variables that contribute to inconsistency in academic achievement and to refine school value-added models. The MELSM approach allowed them to rank order schools in terms of within-school variance (see also Brunton-Smith et al., 2017). The work presented here shares the same general approach of modeling residual variances in a MELSM. Our focus, however, is on the identification of clustering units that show unusually high or low consistency in academic achievement, rather than systematically studying predictors of variability.

In order to identify clusters (students, classes, schools, etc.) with unusual consistency in academic achievement, we present an adaptation of the MELSM by means of shrinking random effects to their fixed effect using the spike-and-slab regularization technique (George & McCulloch, 1993, 1997; Kuo & Mallick, 1998) that has been fruitfully applied in the context of the selection of random effects in classic multilevel models (Frühwirth-Schnatter & Wagner, 2011; Rodriguez et al., 2022; Williams et al., 2021). The spike-and-slab prior serves as the Bayesian analog to lasso regularization, but it allows one to combine it with Bayes factors that provide a decision boundary for the identification of clusters with unusually large or small residual variability.

The remainder of the manuscript is organized as follows. First, we review and formally describe the MELSM for multilevel-type data. Second, we introduce the novel Spike-and-Slab MELSM (SS-MELSM), which identifies clusters that are unusually consistent or inconsistent in their academic achievement, which will also be referred to as atypical schools for ease of exposition throughout this manuscript. Next, we conduct a simulation study to evaluate the SS-MELSM accuracy in finding clusters with unusual residual variability, and compare it to a two-stage hierarchical linear model (HLM). Finally, we provide an example involving empirical data from Brazil’s Elementary Education Evaluation System (Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira, 2021). We conclude by discussing the findings in the context of educational research and pointing out possible limitations and extensions of the SS-MELSM.

MELSM for Educational Data

Educational data are typically analyzed using multilevel or mixed-effects models (MLM) that partition between- and within-cluster variance. In the examples presented throughout this paper, the clustering level 2 refers to schools, while the level-1 units are students within schools.

MLMs capture the expected mean response conditioned on clustering levels and predictors situated at different levels of analysis. The standard assumption in the MLM is one of constant error variance, and consequently, the residual variance, or scale part of the MLM, is left unmodeled (Raudenbush & Bryk, 2002). This assumption can be relaxed using so-called MELSM that include a submodel to address potential differences in the residual variance that lead to cluster-specific heteroskedasticity (Hedeker et al., 2008; Leckie et al., 2014; Rast et al., 2012). The idea behind these models is that the unexplained residual variance is not merely Gaussian white noise but that it still contains some degree of information about the student or, more generally, the clustering unit that can be modeled and explained. Specifically, the nature of the variability itself might provide insights over and above the expected academic achievement captured in the conditional means or location. MELSMs are relatively new in the context of educational research (Goldstein et al., 2018; Leckie et al., 2014, 2023), but the idea that the nature of the residual variance structure itself contains behavioral information has been entertained for a long time (see Rausch, 1948; Woodrow, 1932).

The MELSM allows the simultaneous estimation of a model for the means (location) and a model for the residual variance (scale). Both these submodels allow the inclusion of specific predictors and are conceptualized as mixed-effect models. As with any multilevel model, the MELSM addresses the nested structure of typical educational data by estimating fixed and random effects for the location and the scale while including predictors at the respective levels. A distinguishing feature of the MELSM is the simultaneous estimation of the means and the residual variance part of the model, allowing random effects of both location and scale to correlate. In other words, the model specifies a classic multilevel model for the observed values, y_ij for school j and student i, and a multilevel model for the within-cluster residual variances $σ_{ϵ_{ij}}^{2}$ .

To keep the model description in line with our examples, we show the models with the clustering unit School at Level 2 and the achievement score for Students at Level 1. The starting point is the standard linear MLM for $i = 1, 2, \dots, n_{j}$ students, and $j = 1, 2, \dots, S$ schools, specified as:

\begin{matrix} Level 1 : y_{ij} = β_{0 j} + β_{1 j} X_{ij} + e_{ij}, \\ Level 2 : β_{0 j} = γ_{00} + γ_{01} W_{1 j} + u_{0 j}, \\ β_{1 j} = γ_{10} + γ_{11} W_{2 j} + u_{1 j}, \\ Combined model : y_{ij} = γ_{00} + γ_{01} W_{1 j} + γ_{10} X_{ij} + γ_{11} W_{2 j} X_{ij} + u_{0 j} + u_{1 j} X_{ij} + e_{ij}, \end{matrix}

(1)

with [\begin{matrix} u_{0} \\ u_{1} \end{matrix}] ~ N ([\begin{matrix} 0 \\ 0 \end{matrix}], [\begin{matrix} τ_{u_{0}}^{2} \\ τ_{u_{0} u_{1}} & τ_{u_{1}}^{2} \end{matrix}]),

(2)

and e_{ij} ~ N (0, σ_{ϵ}^{2}) .

(3)

The combined model is given in Equation (1), where y_ij captures the achievement score for student i within school j, and X_ij is a predictor that can contain school or student-level information. The fixed effects coefficients are captured by the $γ$ ’s and the random effects by u’s. These fixed and random effects characterize a student’s mean response and are referred to as the model for the location. Furthermore, the between-school variance is captured by the $τ_{u_{0}}^{2}$ and $τ_{u_{1}}^{2}$ , the diagonal elements of the covariance matrix in Equation (2), while the within-school variance is represented by $σ_{ϵ}^{2}$ . The covariance between the random intercepts and random slopes, $τ_{u_{0} u_{1}}$ , represents the relationship between a school’s average achievement and the strength of X_ij’s effect within that school.

It’s important to note that in this standard model, the within-school or residual variance $σ_{ϵ}^{2}$ defined in Equation (3) is assumed to be constant. The MELSM relaxes the assumption of constant residual variance by introducing a model for the corresponding standard deviation (SD), the scale. That is, to allow the within-school standard deviation to differ among schools, we add the subscript j to the respective term $σ_{j}$ . Moreover, we also allow the residual standard deviation to differ among i-students within each school to obtain $σ_{ϵ_{ij}}$ (Hedeker et al., 2008). Changes in the within-school standard deviation $σ_{ϵ_{ij}}$ can now be explained by differences in school-level covariates P and/or by student-level covariates captured in M.

This relation is shown in the scale model

\begin{array}{l} Level 1 : σ_{ϵ_{i j}} = \exp (α_{0 j} + α_{1 j} M_{i j}), \\ Level 2 : α_{0 j} = η_{00} + η_{01} P_{1 j} + t_{0 j}, \\ α_{1 j} = η_{10} + η_{11} P_{2 j} + t_{1 j}, \end{array}

(4)

\begin{matrix} Combined model : σ_{ϵ_{ij}} = \exp (η_{00} + η_{01} P_{1 j} + η_{10} M_{ij} + η_{11} P_{2 j} M_{ij} + t_{0 j} + t_{1 j} M_{ij}), \end{matrix}

(5)

where $σ_{ϵ_{ij}}$ is the predicted residual standard deviation for school j and student i within that school.

Comparable to the regression weights in Equation (1), $η_{0}$ ’s define the fixed effect for the average within-cluster standard deviation, and $η_{1}$ ’s are the fixed slope effects of the predictor on the standard deviation. The individual departures from the fixed effects are captured in the random effects t_0j and t_1j. Note that the Level 1 and Level 2 predictors (M and P) may or may not be the same across the scale and location part of the model. Given that Equation (5) pertains to the residual standard deviation, it is essential to ensure that the estimates result in positive real values. This is accomplished via the exponential function, which guarantees positivity. Consequently, $σ_{ϵ_{ij}}$ is assumed to be log-normally distributed.
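As a minimal sketch of the combined scale model in Equation (5), the following Python function evaluates the log-linear predictor and applies the exponential link. It is not the implementation used in the ivd package; the function name `residual_sd` and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def residual_sd(eta00, eta01, eta10, eta11, P1_j, P2_j, M_ij, t0_j=0.0, t1_j=0.0):
    """Combined scale model (Equation 5): predicted residual SD for student i in school j."""
    log_sd = (eta00 + eta01 * P1_j + eta10 * M_ij
              + eta11 * P2_j * M_ij + t0_j + t1_j * M_ij)
    return math.exp(log_sd)  # the exponential link keeps the SD strictly positive

# Hypothetical values: a school without and with a random scale intercept of 0.3.
baseline = residual_sd(-0.25, 0.0, 0.0, 0.0, P1_j=0.0, P2_j=0.0, M_ij=0.0)
shifted = residual_sd(-0.25, 0.0, 0.0, 0.0, P1_j=0.0, P2_j=0.0, M_ij=0.0, t0_j=0.3)
ratio = shifted / baseline  # equals exp(0.3): effects act multiplicatively on the SD
```

Because the model is linear on the log scale, a random effect of t_0j multiplies the residual SD by exp(t_0j) rather than adding to it.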

We now have random effects u_0j and u_1j from the location of the model (the means structure) and random effects t_0j and t_1j from the scale of the model (the within-cluster standard deviation structure). All these random effects are assumed to come from the same multivariate normal distribution with zero means and covariance matrix $Σ$ . By stacking the vectors $u_{j} = [u_{0 j}, u_{1 j}]$ and $t_{j} = [t_{0 j}, t_{1 j}]$ into $v_{j} = [u_{j}, t_{j}]$ , we obtain for our example:

v_{j} = [\begin{matrix} u_{0 j} \\ u_{1 j} \\ t_{0 j} \\ t_{1 j} \end{matrix}] ~ N (0 = [\begin{matrix} 0 \\ 0 \\ 0 \\ 0 \end{matrix}], Σ = [\begin{matrix} τ_{u_{0}}^{2} \\ τ_{u_{0} u_{1}} & τ_{u_{1}}^{2} \\ τ_{u_{0} t_{0}} & τ_{u_{1} t_{0}} & τ_{t_{0}}^{2} \\ τ_{u_{0} t_{1}} & τ_{u_{1} t_{1}} & τ_{t_{0} t_{1}} & τ_{t_{1}}^{2} \end{matrix}]) .

(6)

Here, $Σ$ is a k×k covariance matrix (with k the number of random effects) that contains all random effect variances for both location and scale on its diagonal. The off-diagonal elements encode the covariances both within and across the location and scale random effects. The MELSM can also introduce a submodel for the between-cluster variance (diagonal elements of $Σ$ ), but we omit it here for the sake of simplicity (but see Hedeker et al., 2008; Leckie et al., 2014; Rast & Ferrer, 2018).

Alternatively, we can express the relation among the random effects on the location and scale models via the Cholesky parameterization (Hedeker et al., 2008). This parameterization facilitates the inclusion of a variable selection mechanism presented later in the Spike and Slab MELSM section, ensures positive-definiteness, and improves the efficiency of Hamiltonian Monte Carlo (HMC) estimation. That is, $Σ$ is first decomposed into $Σ = τ Ω τ^{'}$ , where $τ$ is a diagonal matrix holding the random-effect standard deviations and $Ω$ is the correlation matrix that contains the correlations among all random effects. Next, we decompose $Ω$ into its lower-triangular Cholesky factor L, such that $Ω = L L^{'}$ . In scalar notation, our example with four random effects results in

L = [\begin{matrix} 1 & 0 & 0 & 0 \\ ρ_{u_{0} u_{1}} & \sqrt{1 - l_{21}^{2}} & 0 & 0 \\ ρ_{u_{0} t_{0}} & \frac{ρ_{u_{1} t_{0}} - l_{21} l_{31}}{l_{22}} & \sqrt{1 - l_{31}^{2} - l_{32}^{2}} & 0 \\ ρ_{u_{0} t_{1}} & \frac{ρ_{u_{1} t_{1}} - l_{21} l_{41}}{l_{22}} & \frac{ρ_{t_{0} t_{1}} - l_{31} l_{41} - l_{32} l_{42}}{l_{33}} & \sqrt{1 - l_{41}^{2} - l_{42}^{2} - l_{43}^{2}} \end{matrix}],

(7)

where ρ is the correlation between the subscripted random effects and l_nm is the element in the nth row and mth column of L. Equivalently to the representation in Equation (6), we can obtain the random effects vector v_j by multiplying the standard deviations $τ$ with L and a vector of standard normally distributed variables z_j:

v_{j} = τ L z_{j} .

(8)

The individual elements of v_j, for our example with two location and two scale random effects, are defined as

\begin{matrix} u_{0 j} = τ_{u_{0}} z_{j u_{0}}, \\ u_{1 j} = τ_{u_{1}} (ρ_{u_{0} u_{1}} z_{j u_{0}} + z_{j u_{1}} \sqrt{1 - l_{21}^{2}}), \\ t_{0 j} = τ_{t_{0}} (ρ_{u_{0} t_{0}} z_{j u_{0}} + l_{32} z_{j u_{1}} + z_{j t_{0}} \sqrt{1 - l_{31}^{2} - l_{32}^{2}}), \\ t_{1 j} = τ_{t_{1}} (ρ_{u_{0} t_{1}} z_{j u_{0}} + l_{42} z_{j u_{1}} + l_{43} z_{j t_{0}} + z_{j t_{1}} \sqrt{1 - l_{41}^{2} - l_{42}^{2} - l_{43}^{2}}) . \end{matrix}

(9)

Equation (9) demonstrates how the Cholesky decomposition allows us to express the four random effects in our example model in terms of the standard deviations and correlations. This approach is particularly useful in complex models like MELSM, where both the mean and variance structures are modeled jointly.
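The construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the correlation matrix, the standard deviations, and the z draw are hypothetical values, and the helper names `cholesky_lower` and `random_effects` are ours. The sketch computes a lower-triangular factor L satisfying L L' = Ω and then forms v_j = τ L z_j as in Equation (8).

```python
import math

def cholesky_lower(omega):
    """Lower-triangular L with L L' = omega (Cholesky-Banachiewicz algorithm)."""
    k = len(omega)
    L = [[0.0] * k for _ in range(k)]
    for n in range(k):
        for m in range(n + 1):
            s = sum(L[n][p] * L[m][p] for p in range(m))
            if n == m:
                L[n][m] = math.sqrt(omega[n][n] - s)      # diagonal element
            else:
                L[n][m] = (omega[n][m] - s) / L[m][m]     # below-diagonal element
    return L

def random_effects(tau, L, z):
    """v_j = tau L z_j, with tau given as the vector of random-effect SDs."""
    return [tau[n] * sum(L[n][m] * z[m] for m in range(n + 1)) for n in range(len(tau))]

# Hypothetical correlations among (u0, u1, t0, t1); illustrative values only.
omega = [
    [1.00, 0.30, 0.20, 0.10],
    [0.30, 1.00, 0.25, 0.15],
    [0.20, 0.25, 1.00, 0.40],
    [0.10, 0.15, 0.40, 1.00],
]
L = cholesky_lower(omega)
tau = [1.0, 0.5, 0.4, 0.2]       # hypothetical random-effect SDs
z_j = [0.5, -1.0, 0.8, 0.1]      # one hypothetical standard-normal draw
v_j = random_effects(tau, L, z_j)
```

Because Ω has a unit diagonal, the first column of L reproduces the correlations with u_0, and each diagonal element equals the square root of one minus the squared elements to its left in the same row, matching the pattern in Equation (7).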

From a Bayesian modeling perspective, so far, we have defined the likelihood part of the MELSM in Equations (1) and (5), which can be written concisely in matrix notation

\begin{matrix} y_{j} ~ N (μ_{j}, φ_{j}), \\ μ_{j} = X_{j} γ + Z_{j} u_{j}, \\ φ_{j} = \exp (W_{j} α + V_{j} t_{j}) . \end{matrix}

Here, X_j and W_j contain the fixed within- and between-cluster predictors while Z_j and V_j contain the corresponding predictors governing the random effects.

When specifying a Bayesian model, we assign prior probability distributions to the parameters of interest. The priors for the fixed effects parameters are given as

γ ~ N (μ_{g}, σ_{g}^{2} I_{k}),

(10)

α ~ N (μ_{a}, σ_{a}^{2} I_{k}) .

(11)

The priors for the random effect parameters in $[\begin{matrix} u_{j} \\ t_{j} \end{matrix}] = v_{j} = τ L z_{j}$ are defined as

τ ~ half-t (0, I_{k}, ν = 3),

(12)

L ~ LKJcorr-Cholesky (κ),

(13)

z_{j} ~ N (0, I_{k}),

(14)

where I_k is a k×k identity matrix. The half-t prior for the standard deviations $τ$ is a Student-t distribution truncated at 0, with a scale of 1 and ν = 3 degrees of freedom. This prior is commonly used because it accommodates heavy tails, allowing for larger-than-expected standard deviations while imposing minimal restrictions. The Cholesky factor L is derived from the Cholesky parameterization of the Lewandowski-Kurowicka-Joe (LKJ-correlation) prior (Lewandowski et al., 2009), controlled by the shape parameter κ. This parameterization ensures valid correlation matrices and introduces flexibility by controlling the concentration around the identity matrix.
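To give a feel for the heavy tails of the half-t prior in Equation (12), the sketch below draws from a half-t distribution with ν = 3 and unit scale using only Python's standard library. The construction (a standard normal divided by the square root of a scaled chi-square) is the textbook representation of a Student-t variate; the helper name `half_t_draw`, the seed, and the number of draws are arbitrary choices of ours.

```python
import math
import random

def half_t_draw(rng, nu=3.0, scale=1.0):
    """One draw from a half-t: |scale * Z / sqrt(chi2_nu / nu)|."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi-square with nu degrees of freedom
    return abs(scale * z / math.sqrt(chi2 / nu))

rng = random.Random(1)                       # fixed seed for reproducibility
draws = sorted(half_t_draw(rng) for _ in range(20000))
median = draws[len(draws) // 2]
share_above_3 = sum(d > 3.0 for d in draws) / len(draws)
```

A few percent of the prior mass lies above 3, far more than a half-normal prior with unit scale would allow, which is what lets the data support large random-effect standard deviations when warranted.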

To summarize, by estimating a joint model for both the location and the scale components simultaneously, we are able to account for possible correlations that arise among location and scale effects, ensuring that we can make valid inferences about parameter estimates (Leckie, 2014; Verbeke & Davidian, 2009). Hence, in the MELSM, we do not need to assume that the residual standard deviation is homogeneous across all Level-1 units. Rather, the variability can be conditional on predictor variables, such as school-level or student-level variables, and it effectively models heteroskedastic processes. For example, the variability of student performance within a school can be modeled as a function of parental SES, with the assumption that low SES leads to more variability in student achievement. In other words, a student from a lower socioeconomic background might be associated with a larger residual standard deviation and less consistent achievement. Consequently, that student's performance will be more difficult to estimate reliably while controlling for school-level effects (for different MELSM examples, see Brunton-Smith et al., 2017; Martin & Rast, 2022; Williams et al., 2021).

The MELSM allows one to estimate residual standard deviations that vary both at the school- and the student level. This distinction forms the foundation for the next steps, where we introduce a Bayesian variable selection method capable of identifying schools (clusters) whose student-level residual standard deviations are not captured well by scale fixed effects. This approach helps pinpoint schools that produce students with inconsistent academic achievement. It aligns with the idea presented by Leckie et al. (2023) that focuses on estimating different school-level standard deviations. However, our method differs by employing a Bayesian model selection approach to identify key schools (clusters). Specifically, we use the spike-and-slab approach, allowing the model to switch between two assumptions for each school: one that assigns a high probability to a common error standard deviation ( $σ$ ) and another that allows for a broader distribution of standard deviations capturing school- and student-specific error variability $σ_{ij}$ . With the model structure established, we now incorporate the spike-and-slab regularization technique.

Spike and Slab MELSM

With the Cholesky parameterization of the covariance matrix in Equation (7) at hand, we can now introduce a selection mechanism to the MELSM based on the Bayesian spike-and-slab regularization technique (Frühwirth-Schnatter & Wagner, 2011; Kuo & Mallick, 1998; Mitchell & Beauchamp, 1988). This addition gives rise to the SS-MELSM. That is, Equation (8) can be expanded to include an indicator vector $δ_{j}$ of length k (one element for each of the $1, \dots, k$ random effects to be subjected to shrinkage), which multiplies the random effects element-wise ( $\odot$ ):

v_{j} = (τ L z_{j}) \odot δ_{j} .

(15)

Accordingly, we can include the individual elements ${[δ_{j u_{0}}, δ_{j u_{1}}, δ_{j t_{0}}, δ_{j t_{1}}]}^{'}$ of indicator $δ_{j}$ in Equation (9) and obtain:

\begin{matrix} u_{0 j} = τ_{u_{0}} z_{j u_{0}} δ_{j u_{0}}, \\ u_{1 j} = τ_{u_{1}} (ρ_{u_{0} u_{1}} z_{j u_{0}} + z_{j u_{1}} \sqrt{1 - l_{21}^{2}}) δ_{j u_{1}}, \\ t_{0 j} = τ_{t_{0}} (ρ_{u_{0} t_{0}} z_{j u_{0}} + l_{32} z_{j u_{1}} + z_{j t_{0}} \sqrt{1 - l_{31}^{2} - l_{32}^{2}}) δ_{j t_{0}}, \\ t_{1 j} = τ_{t_{1}} (ρ_{u_{0} t_{1}} z_{j u_{0}} + l_{42} z_{j u_{1}} + l_{43} z_{j t_{0}} + z_{j t_{1}} \sqrt{1 - l_{41}^{2} - l_{42}^{2} - l_{43}^{2}}) δ_{j t_{1}} . \end{matrix}

(16)

Each element of $δ_{j}$ takes a value in $\{0, 1\}$ and is distributed as $δ_{jk} ~ Bernoulli (π)$ . Depending on $δ$ ’s value, the computations in Equation (16) will either retain the random effect or shrink it to exactly zero. For example, the school-level effect for the scale intercept $α_{0 j} = η_{00} + η_{01} P_{1 j} + t_{0 j}$ from Equation (4) either retains the random effect or reduces to the fixed effect only, according to $δ$ ’s value:

α_{0 j} = \begin{cases} η_{00} + η_{01} P_{1 j}, & \text{if } δ_{j t_{0}} = 0, \\ η_{00} + η_{01} P_{1 j} + [τ_{t_{0}} (ρ_{u_{0} t_{0}} z_{j u_{0}} + l_{32} z_{j u_{1}} + z_{j t_{0}} \sqrt{1 - l_{31}^{2} - l_{32}^{2}})], & \text{if } δ_{j t_{0}} = 1 . \end{cases}

In this context, the indicator vector $δ_{j}$ acts as the selection mechanism, determining whether each corresponding random effect is included in the model or shrunk to zero. Because $δ_{j}$ is specific to each school j, the decision to include a random effect or shrink it to zero is made individually for every school in the sample.
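The selection step can be illustrated with a short Python sketch. It assumes the indicators act element-wise on the school's random effects, as the scalar expressions in Equation (16) suggest; the random-effect values, the fixed-effect values, and the helper name `apply_indicators` are hypothetical.

```python
def apply_indicators(v_j, delta_j):
    """Keep v_jk when delta_jk = 1 and set it to exactly zero when delta_jk = 0."""
    return [v * d for v, d in zip(v_j, delta_j)]

# Hypothetical random effects (u0, u1, t0, t1) for one school before selection.
v_j = [0.50, -0.35, 0.12, -0.04]

spike = apply_indicators(v_j, [1, 1, 0, 1])   # scale intercept t0 shrunk to zero
slab = apply_indicators(v_j, [1, 1, 1, 1])    # scale intercept t0 retained

# Hypothetical fixed part of the scale intercept: eta00 + eta01 * P1_j.
fixed_part = -0.25 + 0.10 * 0.8
alpha_spike = fixed_part + spike[2]           # reduces to the fixed effect only
alpha_slab = fixed_part + slab[2]             # fixed effect plus school deviation
```

With the indicator at zero, the school's scale intercept is indistinguishable from the value implied by the fixed effects, which is the sense in which that school is treated as typical.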

The prior probability of retaining the kth random effect is given by the parameter $π$ in the Bernoulli distribution. The value of $π$ can either be fixed a priori or estimated from the data. Two popular choices are fixing $π$ to 0.5 (i.e., the indifference prior; George & McCulloch, 1993) or endowing it with the prior $π ~ Beta (a, b)$ (Cui & George, 2008; Scott & Berger, 2010). Here, we apply the former approach and set $π = 0.5$ to indicate equal odds for including or excluding the random effect.

Combining the prior Bernoulli distribution with the standard normal prior defined for z_j in Equation (14) results in the spike-and-slab approach that governs the shape of the combined density distribution. If $δ = 0$ , the density “spikes” at the zero point mass. Conversely, if $δ = 1$ , the standard normal prior from z_jk is retained and scaled by $τ_{k}$ , thereby introducing the “slab.”

Note that Equation (15) and the example in Equation (16) introduce the spike-and-slab as one indicator vector per school j, each of length k, that can potentially shrink all random effect deviations from the fixed effects in both the location and the scale part of the model. Given that the focus of this paper is on identifying clusters with large departures from the average residual standard deviation, we limit ourselves in the remainder of this work to including the spike-and-slab on the scale model only. This is easily achieved by fixing the elements of $δ_{j}$ that refer to the location random effects to 1, that is, $δ_{j u_{0}} = δ_{j u_{1}} = 1$ , instead of estimating them.

Posterior Inclusion Probability

In the context of spike-and-slab models, the posterior inclusion probability (PIP) quantifies the probability that a given random effect is included in the model, conditional on the observed data, Y. Specifically, for each random effect k within cluster j, the PIP is defined as

\Pr (δ_{jk} = 1 | Y) = \frac{\Pr (Y | δ_{jk} = 1) \Pr (δ_{jk} = 1)}{\Pr (Y)},

(17)

where $δ_{jk}$ is the binary indicator variable for inclusion of the kth random effect in cluster j. PIPs provide a probabilistic measure of whether a random effect should be included in the model or not. A high PIP indicates strong evidence that the random effect is necessary to explain the data, while a low PIP suggests that the effect can be excluded.

In practice, the PIP can be directly estimated from the Markov Chain Monte Carlo (MCMC) sampling approach. For each iteration (s) of the MCMC chain, the indicator variable $δ_{jk}$ is sampled from a Bernoulli distribution with prior probability $π$ . Once the model has reached the typical set, the MCMC chains will visit the two different states of $δ_{jk}$ with a frequency that is proportional to the probability of including the random effect. Consequently, the estimated PIP of any random effect k is determined by the proportion of MCMC samples where $δ_{jk} = 1$ :

\Pr (δ_{jk} = 1 | Y) = \frac{1}{S} \sum_{s = 1}^{S} δ_{jks},

(18)

where S is the total number of posterior samples. The subscript j on the indicator variable ( $δ_{jk}$ ) indicates that PIPs are computed separately for each school.
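Equation (18) amounts to averaging the sampled indicators. The sketch below shows this for three schools; the indicator draws are made-up values standing in for MCMC output, and the function name `posterior_inclusion_probability` is ours.

```python
def posterior_inclusion_probability(delta_draws):
    """Proportion of posterior draws in which delta_jk = 1 (Equation 18)."""
    return sum(delta_draws) / len(delta_draws)

# Hypothetical indicator draws for the scale intercept in three schools.
draws_by_school = {
    "school_A": [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    "school_B": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
    "school_C": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
}
pips = {school: posterior_inclusion_probability(draws)
        for school, draws in draws_by_school.items()}
```

In a real analysis the number of draws would be in the thousands; ten draws per school are used here only to keep the arithmetic visible.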

While the PIP gives us a probabilistic measure of whether a random effect should be included or not, it does not perform automatic variable selection. To determine whether a given random effect is warranted, we follow Williams et al. (2021) and Rodriguez et al. (2022) by estimating the strength of evidence through Bayes factors (Rouder et al., 2018)

\underbrace{\frac{\Pr (δ_{jk} = 1 | Y)}{\Pr (δ_{jk} = 0 | Y)}}_{\text{Posterior Odds}} = \underbrace{\frac{\Pr (δ_{jk} = 1)}{\Pr (δ_{jk} = 0)}}_{\text{Prior Odds}} \times \underbrace{\frac{\Pr (Y | δ_{jk} = 1)}{\Pr (Y | δ_{jk} = 0)}}_{\text{Bayes Factor}} .

Setting the prior probability of $π$ to 0.5 implies equal prior odds, $\Pr (δ_{jk} = 1) / \Pr (δ_{jk} = 0) = 1$ , indicating a lack of prior information about whether a random effect is non-zero. Under this assumption, the Bayes factor for including the kth random effect for the jth school simplifies to:

B F_{10 j} = \frac{\Pr (δ_{jk} = 1 | Y)}{1 - \Pr (δ_{jk} = 1 | Y)} .

(19)

By calculating PIPs and Bayes factors, we can quantify the evidence supporting whether a school’s standard deviation differs from the average within-school standard deviation. It is common practice to interpret a Bayes factor greater than three as substantial evidence in favor of the target hypothesis (Kass & Raftery, 1995).
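The conversion from a PIP to a Bayes factor can be sketched as follows. The function divides the posterior odds by the prior odds, so it reduces to Equation (19) when the prior inclusion probability is 0.5; the function names, the handling of a PIP of exactly 1, and the example values are our own assumptions.

```python
def bayes_factor_from_pip(pip, prior_pi=0.5):
    """Posterior odds divided by prior odds; Equation (19) when prior_pi = 0.5."""
    if not 0.0 < prior_pi < 1.0:
        raise ValueError("prior_pi must lie strictly between 0 and 1")
    if pip >= 1.0:
        return float("inf")  # every draw included the effect; odds are unbounded
    posterior_odds = pip / (1.0 - pip)
    prior_odds = prior_pi / (1.0 - prior_pi)
    return posterior_odds / prior_odds

def pip_threshold(bf, prior_pi=0.5):
    """PIP at which the Bayes factor equals bf."""
    odds = bf * prior_pi / (1.0 - prior_pi)
    return odds / (1.0 + odds)

bf_school = bayes_factor_from_pip(0.9)   # hypothetical school with PIP = 0.9
cutoff = pip_threshold(3.0)              # PIP that corresponds to BF = 3
```

Under equal prior odds, a Bayes factor of three corresponds to a PIP of 0.75, so the decision rule can be applied directly on the PIP scale.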

Simulation Study

In this section, we present a simulation study to evaluate the SS-MELSM’s performance in correctly identifying clusters (schools) with unusual within-cluster variability. We aim to evaluate the method’s classification accuracy as well as its ability to recover the data-generating parameters under various conditions. In addition, we compare SS-MELSM to a two-stage V-known HLM, as proposed by Raudenbush and Bryk (1987). This is a framework where the sampling variance for each school’s dispersion estimate is treated as a known quantity in the second stage of the analysis. This method first calculates a transformed measure of dispersion for each school and then, in a second stage, models these measures to produce shrunken estimates. A school is identified as atypical if its standardized residual falls outside the 95% confidence intervals of the expected normal distribution.

Data for this study were generated from the following model:

\begin{array}{l} y_{i j} ~ N (μ_{j}, σ_{j}), \\ μ_{j} = γ_{0} + u_{0 j}, \\ u_{0 j} ~ N (0, τ_{u_{0}}^{2}), \\ σ_{j} = \exp (η_{0} + t_{0 j}), \\ t_{0 j} ~ \begin{cases} N (0, τ_{t_{0}}^{2}), & \text{if school } j \text{ has unusual variability (slab)}, \\ 0, & \text{otherwise (spike)} . \end{cases} \end{array}

(20)

This model defines the outcome, y_ij, for student i in school j via the fixed intercept parameter $γ_{0}$ and the random intercept u_0j, capturing the deviation of the j-th school from the fixed effect. Each school’s residual standard deviation is a function of the fixed effect $η_{0}$ and the school-specific deviation t_0j, both defined on the log scale. The slab group contains schools with non-zero scale effects, whereas the spike group consists of schools with no additional residual variability beyond the baseline. It’s important to note that t_0j and u_0j are sampled independently because we intentionally set the covariance between the location and scale random effects to zero, to simplify the simulation design.
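The data-generating process in Equation (20) can be sketched in Python as follows. This is an illustration rather than the simulation code used for the study: the function name `simulate_schools`, the seed, and every parameter value passed in the example call are placeholders.

```python
import math
import random

def simulate_schools(n_schools, n_students, gamma0, tau_u0, eta0, tau_t0,
                     slab_prob, seed=0):
    """Generate student scores per school following Equation (20)."""
    rng = random.Random(seed)
    data = []
    for j in range(n_schools):
        u0 = rng.gauss(0.0, tau_u0)                 # location random intercept
        in_slab = rng.random() < slab_prob          # does school j get a scale effect?
        t0 = rng.gauss(0.0, tau_t0) if in_slab else 0.0
        mu_j = gamma0 + u0
        sigma_j = math.exp(eta0 + t0)               # school-specific residual SD
        scores = [rng.gauss(mu_j, sigma_j) for _ in range(n_students)]
        data.append({"school": j, "slab": in_slab, "sigma": sigma_j, "y": scores})
    return data

# Placeholder settings for one illustrative data set.
schools = simulate_schools(n_schools=50, n_students=30, gamma0=0.0, tau_u0=0.5,
                           eta0=-0.25, tau_t0=0.44, slab_prob=0.10, seed=42)
```

Drawing u_0j and t_0j in separate, independent calls mirrors the zero covariance between location and scale random effects described above.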

We fixed $η_{0}$ to −0.25, matching the estimated scale fixed effect from our Illustrative Example. The number of students per school was set to $n_{j} \in \{50, 75, 150\}$ , and the number of schools was set to $S \in \{50, 150, 300\}$ . We varied the probability that a school was assigned to the slab (i.e. unusual variability) across $π \in \{0, 0.05, 0.10, 0.25\}$ . Note that when this probability is zero, the data-generating mechanism reduces to an MLM, given that $σ$ is fixed for all $j = 1, 2, \dots, S$ schools. Because our model specification assumes a fixed $π = 0.5$ , there is a deliberate mismatch between the data-generating process and the fitted model, allowing us to evaluate how robust the estimation procedure is when the true prevalence of schools with unusual variability deviates from the assumed prior.

We defined the effect size for the slab standard deviation ( $τ_{t_{0}}$ ) as the multiplicative increase in $σ_{j}$ due to the random effects’ variance. That is, the expected value of $σ_{j}$ over schools in the slab is $E [σ_{j}] = \exp (η_{0} + τ_{t_{0}}^{2} / 2)$ , which is the mean of a log-normal distribution. The multiplicative increase in $σ_{j}$ relative to the baseline $\exp (η_{0})$ is therefore $\exp (τ_{t_{0}}^{2} / 2)$ , and solving for $τ_{t_{0}}$ gives $τ_{t_{0}} = \sqrt{2 \log (s)}$ , where s is the multiplicative increase. Setting $s = (1.05, 1.10, 1.25, 1.50)$ resulted in $τ_{t_{0}} = (0.31, 0.44, 0.67, 0.90)$ , which corresponds to increases in the residual SD of 5, 10, 25, and 50%, respectively. Importantly, these effect size conditions only applied to scenarios where $π \neq 0$ . When the slab probability was zero, $τ_{t_{0}}$ was also set to zero.
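The mapping from a multiplicative increase s to the slab standard deviation can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `slab_sd_from_increase` is ours and is not part of the ivd package.

```python
import math

def slab_sd_from_increase(s):
    """Return tau_t0 such that exp(tau_t0**2 / 2) equals the multiplicative increase s."""
    if s < 1.0:
        raise ValueError("the multiplicative increase s must be at least 1")
    return math.sqrt(2.0 * math.log(s))

# Multiplicative increases used in the primary simulation
increases = [1.05, 1.10, 1.25, 1.50]
taus = [round(slab_sd_from_increase(s), 2) for s in increases]
print(taus)  # [0.31, 0.44, 0.67, 0.9]
```

The rounded values agree with the effect sizes reported above.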

To assess performance in more challenging small-sample scenarios where the benefits of shrinkage in the HLM are particularly relevant, we conducted a targeted secondary simulation. In this study, we varied the average number of students per school to $n_{j} \in \{10, 30\}$ and the number of schools to $S \in \{20, 50\}$ . To create unbalanced designs, the number of students in each school was drawn from a Poisson distribution, ensuring a minimum of 2 students per school. The slab assignment probability was set to $π \in \{0, 0.1\}$ , and the effect size was held constant at $τ_{t_{0}} = 0.3$ for $π = 0.1$ .

In each simulated dataset, atypical schools were randomly selected with probability $π$ , and random effects for both the location and scale submodels were drawn from a bivariate normal distribution with specified standard deviations and zero correlation. The outcome, y_ij, was generated based on the resulting location and scale parameters. Each of the 125 conditions was replicated 100 times. For each replication, we fit both the SS-MELSM and a two-stage HLM.
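To make the data-generating mechanism of Equation (20) concrete, the sketch below simulates one balanced dataset with the Python standard library. It is an illustration rather than the simulation code from the OSF repository: the function name `simulate_schools`, the seed, and the default values of `gamma0` and `tau_u0` are assumptions, since the text does not report them; only `eta0 = -0.25` is taken from the design.

```python
import math
import random

def simulate_schools(n_schools, n_students, pi_slab, tau_t0,
                     gamma0=0.0, eta0=-0.25, tau_u0=0.5, seed=1):
    """Simulate student scores following Equation (20).

    gamma0 and tau_u0 are placeholder values (assumptions).
    """
    rng = random.Random(seed)
    schools = []
    for j in range(n_schools):
        in_slab = rng.random() < pi_slab            # atypical with probability pi
        u0 = rng.gauss(0.0, tau_u0)                 # location random effect
        t0 = rng.gauss(0.0, tau_t0) if in_slab else 0.0  # scale effect (0 in the spike)
        mu_j = gamma0 + u0
        sigma_j = math.exp(eta0 + t0)               # residual SD on the natural scale
        y = [rng.gauss(mu_j, sigma_j) for _ in range(n_students)]
        schools.append({"school": j, "slab": in_slab, "sigma": sigma_j, "y": y})
    return schools

data = simulate_schools(n_schools=150, n_students=50, pi_slab=0.10, tau_t0=0.67)
print(sum(s["slab"] for s in data), "schools assigned to the slab")
```

Because the location and scale effects are drawn independently, this corresponds to the zero-correlation case described above.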

Software and Estimation

The SS-MELSM relies on Bayesian estimation, as computing PIPs through the spike-and-slab method is inherently Bayesian. To obtain the elements of $δ_{j}$ from the Bernoulli distribution, the estimation algorithm must be able to sample from a discrete distribution. The Gibbs sampler is well-suited for this task, in contrast to other popular samplers such as HMC, which require smooth and continuous functions. Given that the coding of these models can be challenging, we implemented the SS-MELSM in the ivd¹ R package (Rast & Carmo, 2025), which serves as a frontend to the nimble package (de Valpine et al., 2017). The code and results for the simulation studies can be found at https://osf.io/w2r98 under the GitHub tab.

Spike-and-Slab MELSM

The fitted MELSM follows the data-generating model shown in Equation (20). This formulation contains only two random effects: one for the location and another for the scale part. In this case, the spike-and-slab indicator $δ_{j}$ operates only on the scale random effect. Following Equation (8), the random effects are estimated as

\begin{matrix} u_{0 j} = τ_{u_{0}} z_{j u_{0}} \\ t_{0 j} = τ_{t_{0}} (z_{j u_{0}} ρ_{u_{0} t_{0}} + z_{j t_{0}} \sqrt{1 - ρ_{u_{0} t_{0}}^{2}}) δ_{j t_{0}} . \end{matrix}

(21)

We define the priors for the parameters in Equation (21) as

z_{j u_{0}}, z_{j t_{0}} \sim N (0, 1)

(22)

γ_{0}, η_{0} \sim N (0, 1000)

(23)

\begin{matrix} τ_{u_{0}}, τ_{t_{0}} \sim \text{half-}t (μ = 0, σ = 1, ν = 3) \\ L \sim \text{LKJcorr-Cholesky} (κ = 1) \\ δ_{j t_{0}} \sim \text{Bernoulli} (π = 0.5) . \end{matrix}

(24)

We assign uninformative normal priors to the fixed effects, as indicated by the large prior variances in Equation (23), and heavy-tailed half-t priors with ν = 3 degrees of freedom to the random-effect standard deviations in Equation (24). For the Cholesky factor L of $Ω$ , we set the shape parameter κ = 1 to assign a uniform probability density over the space of correlation matrices (LKJ; Lewandowski et al., 2009). Note, however, that the LKJ correlation prior is not invariant to its dimension, in the sense that larger correlation matrices will tend to concentrate the probability mass around zero even when κ is small. As discussed in the PIP section, the indicator variable in the random effect for the scale intercept is given a prior probability of $π = 0.5$ . Consequently, in each MCMC sample, the scale intercept for the j-th school is given by:

\begin{cases} η_{0}, & \text{if } δ_{j t_{0}} = 0, \\ η_{0} + τ_{t_{0}} (z_{j u_{0}} ρ_{u_{0} t_{0}} + z_{j t_{0}} \sqrt{1 - ρ_{u_{0} t_{0}}^{2}}), & \text{if } δ_{j t_{0}} = 1 . \end{cases}
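The construction in Equation (21) and the resulting school-specific scale intercept can be sketched for a single MCMC draw as follows. This is only an illustration of the algebra, not the nimble code used by ivd; the function names and all numeric inputs are hypothetical.

```python
import math

def random_effects(z_u0, z_t0, tau_u0, tau_t0, rho, delta):
    """Build (u_0j, t_0j) from standard-normal draws as in Equation (21)."""
    u0 = tau_u0 * z_u0
    # Correlated scale effect, switched on (delta = 1) or off (delta = 0)
    t0 = tau_t0 * (z_u0 * rho + z_t0 * math.sqrt(1.0 - rho ** 2)) * delta
    return u0, t0

def scale_intercept(eta0, t0):
    """School-specific scale intercept on the log scale."""
    return eta0 + t0

# Hypothetical draw for one school, once in the slab and once in the spike
_, t0_slab = random_effects(0.4, -1.2, tau_u0=0.5, tau_t0=0.67, rho=-0.3, delta=1)
_, t0_spike = random_effects(0.4, -1.2, tau_u0=0.5, tau_t0=0.67, rho=-0.3, delta=0)
print(scale_intercept(-0.25, t0_slab), scale_intercept(-0.25, t0_spike))
```

With the indicator at zero, the school's scale intercept collapses to the fixed effect, which is the spike case.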

As shown in Equation (18), this setup allows us to compute the PIP of the random effects and assess the relative evidence for its inclusion using the Bayes factor (Equation 19), assuming equal prior odds for the spike and slab components. We consider $\Pr (δ_{j} = 1 | Y) \geq 0.75$ as evidence for the slab, which corresponds to a Bayes factor of at least 3-to-1 in favor of including a random effect rather than maintaining only the fixed effects. In other words, a PIP larger than 0.75 suggests evidence for unusually large (or unusually small) within-cluster variance.
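The correspondence between the PIP threshold and the Bayes factor can be verified with a few lines of Python. The sketch assumes that the PIP is estimated as the proportion of posterior draws in which the indicator equals one; the function names and the toy draws are ours.

```python
import math

def posterior_inclusion_probability(delta_draws):
    """Proportion of MCMC draws in which the indicator delta equals 1."""
    return sum(delta_draws) / len(delta_draws)

def bayes_factor_from_pip(pip, prior_pi=0.5):
    """Posterior odds of inclusion divided by prior odds of inclusion."""
    if pip >= 1.0:
        return math.inf
    posterior_odds = pip / (1.0 - pip)
    prior_odds = prior_pi / (1.0 - prior_pi)
    return posterior_odds / prior_odds

# Toy indicator draws for one school (hypothetical)
draws = [1, 1, 1, 0]
pip = posterior_inclusion_probability(draws)
print(pip, bayes_factor_from_pip(pip))  # 0.75 3.0
```

With equal prior odds, the Bayes factor equals the posterior odds, so a PIP of 0.75 maps to a Bayes factor of 3.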

All models were fitted with three chains of 2,000 iterations and 1,000 warm-up samples. The number of iterations was chosen to ensure good quality of the parameter estimates, with the models converging as indicated by potential scale reduction factors $\hat{R}$ generally below 1.1 (cf. Gelman, 2006).

Two-Stage HLM

As a baseline for comparison, we also fitted a two-stage HLM, following the large-sample approach detailed by Raudenbush and Bryk (1987). Similar to the MELSM, this framework also defines a model for the residual variance as an outcome, but the location and scale models are estimated separately in two stages.

First, a standard random-intercept MLM is fitted to the student-level data. From this model, we calculate a bias-corrected, log-transformed measure of dispersion for each school j:

d_{j} = \ln (σ_{j}) + \frac{1}{2 {df}_{j}}

where $σ_{j}$ is the standard deviation of the residuals in school j, and ${df}_{j} = n_{j} - p$ are the corresponding degrees of freedom. Key to this transformation is that the sampling variance of d_j is known to be approximately $ν_{j} = 1 / (2 {df}_{j})$ , which accounts for the fact that dispersion estimates from larger schools are more precise.
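As an illustration of the first stage, the sketch below computes d_j and its approximate sampling variance from one school's residuals. It assumes that the residual standard deviation is obtained by dividing the residual sum of squares by df_j and that p = 1; both choices, like the function name, are assumptions made for the example.

```python
import math

def log_dispersion(residuals, n_fixed=1):
    """Return (d_j, v_j): bias-corrected log SD and its approximate sampling variance."""
    df = len(residuals) - n_fixed
    if df < 1:
        raise ValueError("need more residuals than fixed parameters")
    sd = math.sqrt(sum(r * r for r in residuals) / df)  # residual SD in school j
    d_j = math.log(sd) + 1.0 / (2.0 * df)
    v_j = 1.0 / (2.0 * df)
    return d_j, v_j

# Hypothetical residuals for a school with five students
d_j, v_j = log_dispersion([1.0, -1.0, 1.0, -1.0, 0.0])
print(d_j, v_j)  # 0.125 0.125
```

The variance v_j shrinks as the school grows, which is how the second stage gives more weight to larger schools.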

In the second stage, these d_j values are treated as the outcome in a hierarchical model with known variance. This model yields empirical Bayes estimates ( $δ_{j}^{*}$ ) of each school’s true dispersion. The empirical Bayes residual, $δ_{j}^{*} - \hat{Δ}$ , represents the shrunken school-level deviation from the mean, which is analogous to the MELSM’s t_0j.

To identify schools with unusual variability, we standardized these empirical Bayes residuals to obtain Z_j and constructed a normal Q-Q plot of the resulting values. A school was flagged as having an extreme level of dispersion if its Z_j value fell outside the 95% confidence bands of the Q-Q plot. This approach identifies only those schools whose variability is statistically inconsistent with the overall distribution.

Performance Metrics

For each simulated dataset, we computed the set of schools truly belonging to the slab group, as determined by the data-generating mechanism. We then compared this to the set of schools flagged by each method as having unusual within-cluster variance.

Using the flagged and true slab schools, we computed the sensitivity, specificity, and precision for each method:

Sensitivity = \frac{TP}{TP + FN}, Specificity = \frac{TN}{TN + FP}, Precision = \frac{TP}{TP + FP} .

TP and FP are the number of true and false positives, respectively, whereas TN and FN are the number of true and false negatives, respectively. We used these rates to compute the F1 score (Powers, 2020), which is the harmonic mean of precision and sensitivity. It is defined as

F 1 = 2 \times \frac{Precision \times Sensitivity}{Precision + Sensitivity} .
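These metrics can be computed directly from the set of true slab schools and the set of flagged schools. The sketch below is a minimal Python version; the function name and the example school indices are hypothetical, and undefined ratios are returned as NaN.

```python
def classification_metrics(true_slab, flagged, n_schools):
    """Sensitivity, specificity, precision, and F1 from sets of school indices."""
    true_slab, flagged = set(true_slab), set(flagged)
    tp = len(true_slab & flagged)
    fp = len(flagged - true_slab)
    fn = len(true_slab - flagged)
    tn = n_schools - tp - fp - fn

    def ratio(num, den):
        return num / den if den > 0 else float("nan")

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}

# Hypothetical example: 20 schools, 4 truly atypical, 4 flagged (3 correctly)
metrics = classification_metrics({0, 1, 2, 3}, {0, 1, 2, 7}, n_schools=20)
print(metrics)
```

In this toy case one false positive lowers precision to 0.75 while specificity remains high (15 of 16 typical schools are correctly left unflagged), which is why the F1 score is more informative than specificity alone when atypical schools are rare.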

We focus on the F1 score because, while we hope to maximize the identification of true positives, a false positive in an educational context can have relevant consequences. For example, it can initiate unnecessary and expensive investigations, diverting finite administrative time, funding, and support personnel away from schools that genuinely require intervention.

We also evaluated the quality of parameter recovery for the standard deviation of the scale random effect, $τ_{t_{0}}$ , under the SS-MELSM. For each condition, we computed the average posterior mean estimate of $τ_{t_{0}}$ across replications and compared it to the true generating value. These estimates were used to assess bias and the root mean squared error (RMSE) in recovering the magnitude of residual variability heterogeneity.
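Bias and RMSE across replications follow their usual definitions, sketched below in Python. The replication estimates in the example are invented for illustration and are not results from the study.

```python
import math

def bias_and_rmse(estimates, true_value):
    """Mean error and root mean squared error of estimates around the true value."""
    errors = [est - true_value for est in estimates]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(err ** 2 for err in errors) / len(errors))
    return bias, rmse

# Hypothetical posterior means of tau_t0 from three replications, true value 0.67
bias, rmse = bias_and_rmse([0.70, 0.60, 0.80], true_value=0.67)
print(round(bias, 3), round(rmse, 3))
```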

Results

Primary Simulation Study

Parameter Recovery

Our primary simulation evaluated the SS-MELSM’s ability to recover the standard deviation of the scale random effect, $τ_{t_{0}}$ . We found that the direction of bias was a function of the slab probability ( $π$ ) and that larger samples helped to decrease it. For instance, with no atypical schools, the $τ_{t_{0}}$ estimates were too large by 0.06 on average, with a minimum of 0.02 and a maximum of 0.15. When the probability of atypical schools was low (5%), the model showed a small negative bias on average: ${\hat{τ}}_{t_{0}}$ underestimated $τ_{t_{0}}$ by 0.15 (average bias of −0.15, ranging from −0.85 to 1.47). Conversely, when this proportion was higher (10% and 25%), the bias became positive, averaging 0.24 and ranging from −0.77 to 1.69. To account for the wider spread in school-level variability, the model inflates ${\hat{τ}}_{t_{0}}$ , leading to a positive bias that grows with the magnitude of the true $τ_{t_{0}}$ . As expected, larger effect sizes increased the average RMSE of the estimate, as larger parameters have more room for absolute error.

Interestingly, increasing the number of students per school did not uniformly reduce bias but rather amplified some existing patterns. In a condition prone to overestimation ( $τ_{t_{0}} = 0.9, π = 0.25, S = 150$ ), increasing the number of students from 50 to 150 caused the positive bias to grow from 0.59 to 0.72. With more data per school, the estimates of the individual school effects were more precise, amplifying the over- or underestimation of $τ_{t_{0}}$ . These differences, however, were minimal. On average, the bias difference between 50 and 150 students was 0.05 units, ranging from −0.04 to 0.19.

Classification Performance

To evaluate the ability of each method to correctly classify schools as belonging to the “spike” or “slab” group, we assessed sensitivity, specificity, and the F1 score. Visual inspection of these metrics revealed that their distribution was primarily driven by the number of students and the effect size, while being roughly invariant to the number of schools. Therefore, we focus our illustrative results on the conditions with 150 schools.

Both methods demonstrated reasonably good sensitivity, which improved with larger effect sizes. However, as shown in Table 1, the two-stage HLM was often slightly more sensitive, particularly in conditions with more students and a higher proportion of atypical clusters. For example, with 150 schools of 150 students each, $τ_{t_{0}} = 0.90$ , and $π = 0.25$ , the HLM’s sensitivity reached 0.90, slightly outperforming the SS-MELSM’s 0.86.

Table 1.

Average Values of Sensitivity and Specificity for SS-MELSM and Two-Stage HLM, Including only Conditions of 150 Schools

		Sensitivity						Specificity
		SS-MELSM			Two-stage HLM			SS-MELSM			Two-stage HLM
$n_{j}$	$τ_{t_{0}}$	0.05	0.10	0.25	0.05	0.10	0.25	0.05	0.10	0.25	0.05	0.10	0.25
50	0.31	$0.40$	0.46	$0.50$	0.39	0.46	0.47	$0.99$	$0.97$	0.97	0.96	0.94	0.96
	0.44	$0.58$	$0.61$	0.61	0.54	0.60	$0.64$	$0.98$	$0.97$	$0.98$	0.96	0.94	0.94
	0.67	$0.76$	0.73	0.71	0.68	$0.76$	$0.76$	$0.97$	$0.98$	$0.98$	0.95	0.94	0.93
	0.90	0.77	$0.81$	0.78	$0.80$	0.80	$0.84$	$0.98$	$0.98$	$0.99$	0.96	0.94	0.92
75	0.31	$0.53$	$0.57$	0.56	0.50	0.56	$0.59$	$0.98$	$0.97$	$0.98$	0.96	0.96	0.94
	0.44	$0.68$	$0.68$	0.66	0.58	0.67	$0.69$	0.97	$0.97$	$0.98$	0.97	0.95	0.94
	0.67	$0.77$	0.76	0.76	0.76	$0.79$	$0.81$	$0.98$	$0.98$	$0.99$	0.95	0.94	0.92
	0.90	0.82	0.83	0.81	$0.83$	$0.85$	$0.86$	$0.98$	$0.99$	$0.99$	0.95	0.93	0.92
150	0.31	$0.67$	0.67	0.67	0.65	$0.68$	$0.70$	$0.97$	$0.98$	$0.98$	0.96	0.94	0.94
	0.44	0.74	0.75	0.74	$0.75$	$0.79$	$0.80$	$0.97$	$0.98$	$0.99$	0.96	0.94	0.93
	0.67	$0.86$	0.81	0.83	0.83	$0.84$	$0.88$	$0.98$	$0.99$	$0.99$	0.95	0.93	0.92
	0.90	0.86	0.86	0.86	$0.87$	$0.89$	$0.90$	$0.98$	$0.99$	$0.99$	0.96	0.93	0.93

Note. The column headers 0.05, 0.10, and 0.25 refer to values of the slab probability, $π$ . Bold indicates the higher value between methods for each metric and the $π$ condition. SS-MELSM = Spike-and-Slab Mixed-Effects Location Scale Model; HLM = Hierarchical linear model.

On the other hand, the SS-MELSM was consistently better at avoiding false positives, maintaining a high and stable specificity across all conditions. For example, for SS-MELSM, specificity was never below 0.91, while for HLM, it degraded in more complex scenarios, dropping as low as 0.58 as the effect size and the proportion of “slab” schools increased. Though not shown in Table 1, when $τ_{t_{0}} = 0$ and $π = 0$ , specificity was 1.00 for SS-MELSM and 0.99 for HLM across all n_j. It is also relevant to observe Figure 1, which shows that SS-MELSM’s false positive rate was frequently lower and varied much less than the two-stage HLM across simulations.

Figure 1.

False Positive Rate values across simulations, including 150 schools.

The trade-off between the two methods is best resolved by the F1 score, which balances sensitivity and precision. As summarized in Figure 2, the results favor the SS-MELSM, which demonstrated a superior F1 score in challenging conditions with small effect sizes (e.g. scores of 0.55 vs. 0.48) as well as favorable conditions with larger samples and effect sizes.

Figure 2.

F1-score values across simulations, including 150 schools.

Finally, we also addressed one reviewer’s concern about differences between HLM and SS-MELSM being driven by the direction of the variability (i.e. unusually high or low). A small-scale simulation study, included in our OSF repository, shows that both methods’ sensitivity performance generally does not depend on the direction of residual deviations. Additionally, we found a greater tendency for two-stage HLM to produce false positives when classifying schools as unusually consistent, compared to a more balanced specificity from SS-MELSM in both directions of the effect.

Secondary Simulation: Small and Unbalanced Samples

An important contribution of the Raudenbush and Bryk (1987) transformation is its ability to account for degrees of freedom, which is especially relevant in small-sample contexts. To assess performance in these more challenging scenarios, we conducted a secondary simulation with 20 and 50 schools and smaller, unbalanced school sizes of 10 and 30 students on average.

In terms of classification, both methods performed similarly with respect to sensitivity. However, mirroring the results of the primary study, the two-stage HLM had significantly higher and more variable false positive rates. Regarding parameter recovery, the two-stage HLM showed near-zero bias when no true effects existed. However, when there were atypical schools in the sample, the SS-MELSM’s estimates were, on average, much closer to the true value, demonstrating a clear advantage when effects are present. Detailed results of the secondary simulation study can be found in https://osf.io/w2r98/files/wfnsk.

Summary of Findings

Our simulations showed that SS-MELSM can recover true signals with good specificity. Across conditions, SS-MELSM was especially effective at finding true signals while erring on the side of caution. The two-stage HLM sometimes showed higher sensitivity, but this was accompanied by lower specificity and more false positives. The SS-MELSM approach may miss some true positives, but it is substantially better and more consistent at avoiding misclassification. Overall, we found support for using SS-MELSM as a cautious yet reliable tool for identifying clusters with unusual variability.

Illustrative Example

Building on the results of the simulation study, we next demonstrate the model’s practical utility by applying it to a real-world education dataset. Our goal in this section is to show how SS-MELSM can be used to identify schools whose students are unusually inconsistent (or consistent) in their math achievement, even after accounting for school- and student-level predictors. The proposed model estimates fixed and random intercepts for both location and scale and adds SES as a covariate. We also use this empirical data to briefly illustrate the model comparison approach employed by ivd.

Subjects and Procedure

We use data from the Elementary Education Evaluation System (Saeb), an assessment program conducted by the Brazilian government (Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira, 2021). Saeb is designed to evaluate the quality of elementary education across Brazil by administering standardized tests and questionnaires to students in public schools and a representative sample of private schools every 2 years. Student scores are reported on the Saeb unified scale, standardized in the reference population to have a mean of 0 and an SD of 1. For this study, we analyze a subset of 11,386 11th- and 12th-grade students from 160 randomly selected schools in the state of Rio de Janeiro who took the 2021 exam. We limited the sample to a single state to avoid potential differences arising from variations in educational systems across states. Additionally, the reduced sample size allows the inclusion of this dataset in the ivd package and facilitates the reproducibility of our analyses on modest hardware.

In this subsample, school sizes ranged from 1 to 493 students, with a mean of 71.2 (SD = 74.5). The schools were nearly all public (99.4%) and located in urban areas (95.2%). Mathematics achievement had a mean of 0.15 (SD = 0.85), ranging from −1.82 to 3.23. Student SES averaged 4.94 (SD = 0.81), ranging from 2.20 to 7.62. Saeb computes SES based on parental education, household assets, and purchased services, scaling it in the reference population to a mean of 5 and SD of 1.

Statistical Analyses

To identify schools with unusual achievement consistency, we specified and compared three SS-MELSMs of increasing complexity. Model 1 served as an intercept-only baseline, Model 2 added fixed effects for SES, and Model 3 extended the model with a random slope for student-level SES.

Model 1: Intercept-Only Model

This baseline model estimates the average math achievement and the average within-school variability across all schools, including random intercepts for both the location (u_0j) and scale (t_0j) components to account for school-level differences. The model is specified as shown in Equation (20) from the simulation study.

Model 2: Covariate Model

This model extends the baseline by adding student- and school-level SES as predictors in both the location and scale submodels. The likelihood is defined as:

\begin{array}{l} μ_{i j} = γ_{0} + γ_{1} {SES}_{i j}^{(w)} + γ_{2} {SES}_{j}^{(b)} + γ_{3} ({SES}_{i j}^{(w)} \times {SES}_{j}^{(b)}) + u_{0 j} \\ σ_{i j}^{2} = \exp [η_{0} + η_{1} {SES}_{i j}^{(w)} + η_{2} {SES}_{j}^{(b)} + η_{3} ({SES}_{i j}^{(w)} \times {SES}_{j}^{(b)}) + t_{0 j}] . \end{array}

(25)

The fixed effects $γ_{1}$ and $η_{1}$ represent the influence of student-level SES, while $γ_{2}$ and $η_{2}$ capture the effects of school-level SES in the location and scale models, respectively. The cross-level interactions are represented by $γ_{3}$ and $η_{3}$ .

Model 3: Random-Slope Model

The final model builds on Model 2 by adding a random slope for student-level SES (t_1j) to the scale component. This allows the effect of a student’s relative SES on their achievement consistency to vary from school to school. The scale submodel is thus defined as:

\begin{matrix} σ_{ij}^{2} = \exp [η_{0} + η_{1} {SES}_{ij}^{(w)} + η_{2} {SES}_{j}^{(b)} + η_{3} ({SES}_{ij}^{(w)} \times {SES}_{j}^{(b)}) + t_{0 j} + t_{1 j} {SES}_{ij}^{(w)}] . \end{matrix}

Across Models 2 and 3, the SES predictor was partitioned to disentangle the within-school effect ( ${SES}_{ij}^{(w)}$ ), representing a student’s SES relative to their school’s average, from the between-school effect ( ${SES}_{j}^{(b)}$ ), representing the school’s average SES. In the scale model, the school-level predictor addresses whether the socioeconomic context of a school relates to the overall consistency of its students’ achievement. In contrast, the student-level predictor tests whether a student’s SES, in relation to their peers, influences the predictability of their achievement score. In other words, the student-level SES is used to account for the spread of the error variance.
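The partition of SES into within- and between-school components can be sketched as follows. The example uses the raw school mean as the between-school component; whether that mean is further centered (for example, at the grand mean) is not stated in the text, so this detail, the function name, and the toy data are assumptions.

```python
from collections import defaultdict

def partition_ses(records):
    """Split SES into within- and between-school parts.

    records: list of (school_id, ses) tuples.
    Returns a list of (school_id, ses_within, ses_between) tuples.
    """
    by_school = defaultdict(list)
    for school, ses in records:
        by_school[school].append(ses)
    school_mean = {s: sum(v) / len(v) for s, v in by_school.items()}
    # Within part: student's deviation from the school mean; between part: the school mean
    return [(s, ses - school_mean[s], school_mean[s]) for s, ses in records]

# Hypothetical students in two schools
students = [("A", 4.0), ("A", 6.0), ("B", 5.5)]
parts = partition_ses(students)
print(parts)  # [('A', -1.0, 5.0), ('A', 1.0, 5.0), ('B', 0.0, 5.5)]
```

By construction, the within-school component averages to zero in every school, so the two components carry non-overlapping information.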

All models included random intercepts for location (u_0j) and scale (t_0j), with a spike-and-slab prior placed on the scale random effect. Priors were consistent with those defined in the Simulation Study.

Model Fit and Model Comparison

All models were fit with six chains of 3,000 iterations and 12,000 warm-up samples (example code is provided in https://github.com/consistentlyBetter/ivd). The number of iterations was chosen to ensure good quality of the parameter estimates, with the models converging as indicated by potential scale reduction factors $\hat{R}$ generally below 1.1 (cf. Gelman, 2006). However, $\hat{R}$ values were slightly above 1.1 for the correlations between the location and scale random effects, indicating that they have not been sampled from the typical set yet. Additionally, we computed the effective sample size (ESS) following the approach described in Vehtari et al. (2021). The ESS measures estimation efficiency by quantifying how many independent samples would be needed to achieve the same precision for the posterior mean. However, it is important to note that ESS applies specifically to the posterior mean’s efficiency, and the number of samples required for other posterior functionals, such as quantiles, may differ significantly.

We compared the predictive accuracy of the estimated models using Pareto smoothed importance sampling leave-one-out cross-validation (PSIS-LOO), reporting both the expected log pointwise predictive density ( ${\hat{elpd}}_{loo}$ ) for each model and the difference in their expected predictive accuracy ( $Δ {\hat{elpd}}_{loo}$ ; Vehtari et al., 2017). Following Sivula et al. (2022), meaningful model comparisons can be made as long as the absolute difference in $Δ {\hat{elpd}}_{loo}$ is greater than four. If it is smaller, the models can be assumed similar. Model comparisons are made via the standard error of the $Δ {\hat{elpd}}_{loo}$ in order to obtain information on the similarity of the models. Computing ${\hat{elpd}}_{loo}$ is a fully Bayesian approach to assessing the predictive fit of each model and is asymptotically equivalent to the widely applicable information criterion (Watanabe, 2010), which in turn asymptotically converges to the Akaike information criterion (Akaike, 1998).

Note that the SS-MELSM is a single joint model in which all random effects are estimated simultaneously under the spike-and-slab prior. Therefore, the model fit statistics reported here, such as PSIS-LOO, assess the overall predictive accuracy of this single, integrated model structure.

Results

We fit three competing SS-MELSMs to the data and compared their predictive accuracy using PSIS-LOO to determine the best-fitting model. The models included a baseline intercept-only model (Model 1), a model with SES covariates (Model 2), and a model that added a random slope for student-level SES to the scale component (Model 3).

The model comparison results showed that Model 2 provided a substantially better fit than the intercept-only Model 1 ( $Δ {\hat{elpd}}_{loo} = - 43.6, se = 10.5$ ); the absolute difference was approximately four times its standard error. Adding a random slope in Model 3 did not meaningfully improve predictive accuracy over Model 2 ( $Δ {\hat{elpd}}_{loo} = - 1.3, se = 0.6$ ), as the absolute difference was well below four. Furthermore, in Model 3, the PIPs for the random slope were all below the 0.75 threshold, indicating insufficient evidence for its inclusion. Therefore, we selected Model 2 as the best-fitting model and focus the remainder of our interpretation on its results.
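The decision rule applied to these comparisons can be written out explicitly. The sketch below only encodes the threshold of four on the absolute difference and the ratio of the difference to its standard error; it does not compute PSIS-LOO itself, and the function name is ours.

```python
def compare_elpd(delta_elpd, se, threshold=4.0):
    """Summarize an elpd difference: magnitude, ratio to its SE, and threshold check."""
    abs_diff = abs(delta_elpd)
    return {
        "abs_diff": abs_diff,
        "ratio_to_se": abs_diff / se,
        "meaningful": abs_diff > threshold,
    }

# Values reported for Model 1 vs. Model 2 and Model 2 vs. Model 3
m1_vs_m2 = compare_elpd(-43.6, 10.5)
m2_vs_m3 = compare_elpd(-1.3, 0.6)
print(m1_vs_m2)
print(m2_vs_m3)
```

For the first comparison, the difference is large relative to both the threshold and its standard error; for the second, the absolute difference of 1.3 falls below four, so the two models are treated as similar in predictive accuracy.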

In Model 2, both student-level ( $γ_{1} = 0.08$ , 95% credible interval (CrI) [0.06, 0.10]) and school-level ( $γ_{2} = 0.68$ , 95% CrI [0.51, 0.85]) SES were positively associated with average math achievement. In the scale model, we found modest but positive associations between SES and the residual standard deviation, indicating that students and schools with higher SES tended to have more variable math scores: $η_{1} = 0.031$ , $\exp (η_{1}) = 1.031$ (95% CrI [1.014, 1.049]); $η_{2} = 0.114$ , $\exp (η_{2}) = 1.121$ (95% CrI [1.048, 1.198]); and $η_{3} = 0.075$ , $\exp (η_{3}) = 1.078$ (95% CrI [1.002, 1.160]). A key finding from the joint model was the strong negative correlation between the location and scale random effects ( $ρ_{u_{0} t_{0}} =$ −0.76), indicating that schools with higher-than-average achievement tended to have more consistent student performance (i.e. lower within-school variability). Detailed parameter estimates for Models 1, 2, and 3 are provided in the Supplemental Materials (available in the online version of this article).
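Because the scale submodel is specified on the log scale, each scale coefficient is read as a multiplicative change per unit of the predictor. The short check below exponentiates the reported point estimates; it is plain arithmetic on the values quoted above, and the dictionary keys are our labels.

```python
import math

# Reported posterior point estimates for the scale fixed effects in Model 2
eta = {"eta1_student_ses": 0.031, "eta2_school_ses": 0.114, "eta3_interaction": 0.075}

# Multiplicative change in dispersion per one-unit increase in each predictor
multipliers = {name: round(math.exp(value), 3) for name, value in eta.items()}
print(multipliers)
```

For example, a multiplier of 1.121 for school-level SES corresponds to roughly a 12% increase in dispersion for each additional unit of school-average SES, holding the other terms fixed.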

The primary goal of the analysis was to identify schools with atypical achievement consistency. Using the PIP > 0.75 criterion, the SS-MELSM flagged eight schools in Model 2. As shown in Figure 3, Panel A, these schools represent both extremes of the consistency spectrum. Five flagged schools exhibited unusually high consistency (a smaller-than-average residual SD), while three exhibited unusually low consistency (a larger-than-average residual SD). Panel B of Figure 3 shows that some of the most consistent schools were also among the highest achieving: school 46 was not only highly consistent in math achievement but also performed better than the average in our sample. We can also observe a mix of more inconsistent and consistent schools within 0.5 standard deviations of the estimated mean. These findings are not unexpected, as there is a negative correlation between mean and variance (high-achieving schools tend to have smaller variance). Still, some schools closer to the average showed both high and low within-cluster variance. This lends some support to the idea that average performance and consistency of performance can be distinct features of a school. The SS-MELSM can identify schools that are outliers on this latter dimension, even after accounting for covariates such as SES.

Figure 3.

Scatter plots of PIP for the scale random intercept.

It appears that the model’s decision to flag a school was driven by both the school’s size and the weight of SES. A modest deviation from the average in a large school, like 115 ( $n =$ 329), yielded a high PIP because the estimate was precise, while a more extreme deviation in a smaller school, like 124 ( $n =$ 27), was shrunk toward the mean and dismissed as sampling noise. Similarly, the model evaluates a school’s dispersion relative to the expectation set by its covariates. For example, Schools 153 and 95 had similar estimated within-school SDs (0.76 vs. 0.74), but very different SES values (5.15 vs. 4.47). The first was flagged as an unexpected case relative to the SES-dispersion association. This means a low-SES school with low within-cluster SD is consistent with model expectations and unlikely to be flagged, given the estimated positive association between school-level SES and cluster variance ( $η_{2} = 0.114$ ). These results support the added value of the SS-MELSM over a naive inspection of the raw dispersion.

Discussion

In this paper, we introduced the SS-MELSM, an extension of the standard MELSM, to identify clusters with unusual within-cluster variability. The motivation for this work is rooted in the growing body of literature suggesting that variability in academic achievement reflects elements that go unaccounted for when focusing on the mean performance. As such, the modeling framework presented here could serve policymakers and other stakeholders in identifying schools or school classes that are subject to factors such as SES or resource use, which exacerbate or mitigate differences in student achievement. From a policy perspective, reducing variability in achievement does not usually mean lowering high performance and raising low performance to meet in the middle. More often, the goal is to raise all students’ achievement levels, thereby reducing disparities while increasing the mean.

The statistical framework presented in this study incorporates both mean and variance, enabling policymakers to evaluate interventions with greater nuance. For example, interventions that provide more resources or support to students who need it most can reduce differences within a school and also raise overall performance. Conversely, recognizing situations where low variability corresponds to uniformly low performance is equally important for informed policy decisions.

Methodologically, the SS-MELSM builds upon the foundation laid by Hedeker et al. (2008), who introduced a framework for simultaneously modeling cluster-specific means and variances, with variance submodels depending on covariates. This one-stage approach stands in contrast to two-stage methods for modeling dispersion (Raudenbush & Bryk, 1987). While two-stage models are well established, they introduce an implicit assumption about the independence of location and scale random effects that is not present in our approach. Specifically, the main difference between the MELSM and the two-stage approach is that in the MELSM, the covariance matrix (and thus the correlation matrix) of all random effects across location and scale is estimated jointly within the likelihood (see Equation 6). In two-stage approaches, these correlations do not exist during estimation, implying perfect orthogonality among location and scale random effects. While it is possible to compute their empirical correlation after estimating each stage, this correlation plays no role in model fitting. In contrast, given the joint model of the MELSM, dependencies between location and scale can influence all other parameters during estimation; this is a general advantage of joint models (Wolfinger & Tobias, 1998).

As noted by Leckie et al. (2014), this approach yields more precise variance estimates in the scale model. Simulation results available in the OSF repository further demonstrate this benefit, as the correlation between schools’ deviations from the mean and their variability is, on average, underestimated when employing a two-stage approach.

By incorporating a spike-and-slab prior into the scale component, the SS-MELSM provides probabilistic measures of random-effect inclusion via PIPs and Bayes factors. In contrast to traditional MLM, which assumes all random effects are relevant, this method selectively differentiates between meaningful and negligible random components. This helps make informed decisions about whether a school demonstrates unusually consistent or inconsistent achievement, thereby highlighting clusters where differences in variability may be due to factors such as teaching quality, student engagement, or available resources.
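As a rough illustration of how these quantities relate, the sketch below computes a posterior inclusion probability (PIP) as the share of MCMC draws in which a school's inclusion indicator equals 1, and converts it to a Bayes factor as posterior odds divided by prior odds. The function names, the example draws, and the default prior inclusion probability of 0.5 are hypothetical choices for illustration and do not describe the ivd package interface.

```python
def pip(indicator_draws):
    """Posterior inclusion probability: proportion of draws with indicator == 1."""
    if not indicator_draws:
        raise ValueError("indicator_draws must not be empty")
    return sum(indicator_draws) / len(indicator_draws)


def inclusion_bayes_factor(pip_value, prior_prob=0.5):
    """Bayes factor for inclusion: posterior odds divided by prior odds.

    prior_prob is the assumed prior inclusion probability (hypothetical default).
    """
    if not 0.0 < prior_prob < 1.0:
        raise ValueError("prior_prob must lie strictly between 0 and 1")
    if pip_value >= 1.0:
        # Every draw included the random effect, so the odds are unbounded.
        return float("inf")
    posterior_odds = pip_value / (1.0 - pip_value)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds


# Hypothetical indicator draws for one school's scale random effect.
draws = [1] * 9 + [0] * 1
school_pip = pip(draws)                        # 0.9
school_bf = inclusion_bayes_factor(school_pip)  # approximately 9
print(school_pip, school_bf)
```

With a prior inclusion probability of 0.5 the prior odds equal 1, so the Bayes factor reduces to the posterior odds; other prior choices rescale it accordingly.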

Our simulation study provided further support for the SS-MELSM. Compared with the two-stage V-known HLM approach (Raudenbush & Bryk, 1987), the SS-MELSM had higher specificity and better control of false positives, while matching or exceeding its overall accuracy as measured by the F1 score. In cases where no schools differed in variability, the two-stage method still identified some schools as atypical, whereas the SS-MELSM showed almost perfect specificity. This means the SS-MELSM is less susceptible to over-interpreting natural extremes and is more reliable at finding truly unusual patterns. It is important to remember that while screening often focuses on sensitivity to avoid missing cases, this approach only works well when follow-up costs are low. In education policy, a model with high sensitivity but low precision could lead to too many referrals for further study, potentially overwhelming the system it is intended to support (VanMeveren et al., 2020).
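The trade-off between sensitivity, specificity, and precision can be illustrated with a small sketch. The data below are invented for illustration (10 schools, 2 of them truly atypical) and are not results from the simulation study; the function name is likewise hypothetical.

```python
import math


def classification_metrics(truth, flagged):
    """Return specificity, precision, recall, and F1 for binary flags."""
    tp = sum(1 for t, f in zip(truth, flagged) if t and f)
    fp = sum(1 for t, f in zip(truth, flagged) if not t and f)
    fn = sum(1 for t, f in zip(truth, flagged) if t and not f)
    tn = sum(1 for t, f in zip(truth, flagged) if not t and not f)

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN.
        return num / den if den > 0 else math.nan

    return {
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical truth: the first two of ten schools are truly atypical.
truth = [True, True] + [False] * 8
# A liberal method flags both true cases plus three typical schools.
liberal = [True, True, True, True, True] + [False] * 5
# A conservative method flags one true case and no typical schools.
conservative = [True, False] + [False] * 8

print(classification_metrics(truth, liberal))
print(classification_metrics(truth, conservative))
```

In this toy case the liberal method has perfect recall but a specificity of 0.625 and a precision of 0.4, while the conservative method misses one school yet attains a higher F1 because it makes no false referrals.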

Having established the SS-MELSM’s performance in simulation, we next applied it to data from the Brazilian Evaluation System of Elementary Education (Saeb), a large-scale educational assessment. The Saeb application illustrated how the SS-MELSM provides insights beyond mean scores, showing that higher-achieving schools were also more consistent. The model identified 8 of 160 schools with unusual variability, even after accounting for SES differences. The inclusion of SES improved the model’s predictive accuracy, as evidenced by the PSIS-LOO model comparison, which favored Model 2 over Model 1. This finding aligns with existing research highlighting the significant role of SES in academic achievement (Davis-Kean, 2005; Davis-Kean et al., 2021; Sirin, 2005).

Despite the promising results of the SS-MELSM, several limitations should be acknowledged. First, estimating the SS-MELSM can be computationally demanding, particularly for models with many random effects. While the ivd package facilitates estimation, high-dimensional random-effects structures may require significant computational resources. Future work could explore more efficient estimation algorithms. Second, as with any model of variance, interpretation can be affected by mean-variance dependencies, such as ceiling or floor effects on standardized tests (Baird et al., 2006; Eid & Diener, 1999; Mestdagh et al., 2018). For example, a very high-achieving school may appear unusually consistent simply because most students are hitting the maximum possible score. Third, the current model does not disentangle true performance variability from measurement error, which are confounded in the residual variance. However, when measurement error varies, including an additional term in the variance model to capture measurement error could enhance the model (Vansteelandt & Verbeke, 2016).
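The ceiling-effect concern can be illustrated with a small simulation. The sketch below is an assumption-laden toy example: two hypothetical schools share the same latent standard deviation (10 points), scores are drawn from a normal distribution, and any score above a maximum of 100 is recorded as 100. None of these values come from the Saeb data.

```python
import random
import statistics


def observed_sd(true_mean, true_sd, n_students, max_score=100.0, seed=0):
    """Standard deviation of simulated scores after capping at max_score."""
    rng = random.Random(seed)
    scores = [min(rng.gauss(true_mean, true_sd), max_score)
              for _ in range(n_students)]
    return statistics.stdev(scores)


# Same latent spread in both schools; only the mean differs.
sd_mid = observed_sd(true_mean=70, true_sd=10, n_students=5000, seed=1)
sd_high = observed_sd(true_mean=95, true_sd=10, n_students=5000, seed=2)

# The high-mean school shows a smaller observed SD purely because of the cap.
print(round(sd_mid, 2), round(sd_high, 2))
```

Although both schools are equally variable on the latent scale, the capped scores make the high-achieving school look more consistent, which is the kind of mean-variance dependency that can distort interpretation of the scale model.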

The models discussed, especially those with multiple random effects, involve many parameters, raising questions about necessary sample sizes and observations per cluster. While simpler MELSMs can be estimated with relatively few data points (Leckie et al., 2014; Walters et al., 2018), increased complexity demands more data. Although broad in scope, our simulation was designed to compare the relative performance of the models rather than to serve as a formal power analysis establishing minimum sample size requirements for a given effect. A proper investigation of data needs is still lacking.

In our simulation, the fitted model correctly matched the data-generating process, which is often not the case in applied research. A key limitation is that we did not test the consequences of omitting an important covariate that influences both the mean and the variance. Such a model misspecification could bias the estimates of the random effects and potentially inflate the perceived heterogeneity. Furthermore, we generated data from a normal distribution, and the performance of the tested methods might be different if the underlying student-level data were skewed or heavy-tailed.

Finally, we used a standard set of priors and a common threshold for classification. While these are justified by convention, a formal sensitivity analysis was not conducted to determine how different prior specifications might affect the results.

In conclusion, the SS-MELSM offers a valuable extension to the MELSM framework. By jointly modeling the mean and variance components and incorporating a principled method for random effect selection, it provides a deeper understanding of academic achievement. By moving beyond simple averages, researchers and educators can use the SS-MELSM to identify and support educational environments that help students achieve well and do so consistently.

Supplemental Material

sj-pdf-1-jeb-10.3102_10769986261426004 – Supplemental material for Beyond Average Scores: Identification of Consistent and Inconsistent Academic Achievement in Grouping Units

Supplemental material, sj-pdf-1-jeb-10.3102_10769986261426004 for Beyond Average Scores: Identification of Consistent and Inconsistent Academic Achievement in Grouping Units by Marwin Carmo, Donald Williams and Philippe Rast in Journal of Educational and Behavioral Statistics

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work reported in this publication was partially supported by an award to PR from the Learning Engineering Tools Competition, organized by The Learning Agency LLC. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.


Authors

MARWIN CARMO is a PhD student at the University of California, Davis, 135 Young Hall One Shields Avenue, Davis, CA 95616; e-mail: mmcarmo@ucdavis.edu. His research interests are mixed-effects location-scale models, Bayesian inference, and Gaussian graphical models.

DONALD WILLIAMS holds a PhD in psychology from the University of California, Davis, 135 Young Hall One Shields Avenue, Davis, CA 95616; e-mail: drwwilliams@ucdavis.edu. His research interests are Gaussian graphical models, multilevel models with heterogeneous variance components, meta-analysis, Bayesian inference, measurement reliability, predictive modeling, and regularization.

PHILIPPE RAST is a professor at the University of California, Davis, 135 Young Hall One Shields Avenue, Davis, CA 95616; e-mail: prast@ucdavis.edu. His research interests are developing quantitative methods for examining change over time and how individuals differ in these changes.

References

1. Akaike (1998). Information theory and an extension of the maximum likelihood principle. In Parzen, Tanabe, & Kitagawa (Eds.), Selected papers of Hirotugu Akaike. Springer Series in statistics (Perspectives in statistics) (pp. 199–213). Springer. https://doi.org/10.1007/978-1-4612-1694-0_15

2. Baird, B. M., & Lucas, R. E. (2006). On the nature of intraindividual personality variability: Reliability, validity, and associations with well-being. Journal of Personality and Social Psychology, 90(3), 512–527. https://doi.org/10.1037/0022-3514.90.3.512

3. Borgen, N. T., Zachrisson, H. D., & Sandsør, A. M. J. (2025). Do schools equalize or exacerbate inequality? Between-school variability in the relationship between socioeconomic background and academic achievement. https://doi.org/10.31235/osf.io/tw9r4_v2

4. Brunner, Gogol, K. M., Sonnleitner, Keller, Krauss, & Preckel (2013). Gender differences in the mean level, variability, and profile shape of student achievement: Results from 41 countries. Intelligence, 41(5), 378–395. https://doi.org/10.1016/j.intell.2013.05.009

5. Brunton-Smith, Sturgis, & Leckie (2017). Detecting and understanding interviewer effects on survey data by using a cross-classified mixed effects location-scale model. Journal of the Royal Statistical Society: Series A (Statistics in Society), 180(2), 551–568. https://doi.org/10.1111/rssa.12205

6. Connell, J. P., & Wellborn, J. G. (1991). Competence, autonomy, and relatedness: A motivational analysis of self-system processes. In M. R. Gunnar & L. A. Sroufe (Eds.), Self processes and development (pp. 43–77). Lawrence Erlbaum Associates.

7. Cui & George, E. I. (2008). Empirical Bayes vs. fully Bayes variable selection. Journal of Statistical Planning and Inference, 138(4), 888–900. https://doi.org/10.1016/j.jspi.2007.02.011

8. Davis-Kean, P. E. (2005). The influence of parent education and family income on child achievement: The indirect role of parental expectations and the home environment. Journal of Family Psychology, 19(2), 294–304. https://doi.org/10.1037/0893-3200.19.2.294

9. Davis-Kean, P. E., Tighe, L. A., & Waters, N. E. (2021). The role of parent educational attainment in parenting and children’s development. Current Directions in Psychological Science, 30(2), 186–192. https://doi.org/10.1177/0963721421993116

10. de Valpine, Turek, Paciorek, C. J., Anderson-Bergman, Lang, D. T., & Bodik (2017). Programming with models: Writing statistical algorithms for general model structures with NIMBLE. Journal of Computational and Graphical Statistics, 26(2), 403–413. https://doi.org/10.1080/10618600.2016.1172487

11. Doyle, Easterbrook, M. J., & Harris, P. R. (2023). Roles of socioeconomic status, ethnicity and teacher beliefs in academic grading. The British Journal of Educational Psychology, 93(1), 91–112. https://doi.org/10.1111/bjep.12541

12. Eid & Diener (1999). Intraindividual variability in affect: Reliability, validity, and personality correlates. Journal of Personality and Social Psychology, 76(4), 662–676. https://doi.org/10.1037/0022-3514.76.4.662

13. Frühwirth-Schnatter & Wagner (2011). Bayesian variable selection for random intercept modeling of Gaussian and non-Gaussian data. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, Heckerman, A. F. M. Smith, & West (Eds.), Bayesian statistics 9 (pp. 165–200). Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199694587.003.0006

14. Gelman (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3), 515–534. https://doi.org/10.1214/06-BA117A

15. George, E. I., & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423), 881–889. https://doi.org/10.1080/01621459.1993.10476353

16. George, E. I., & McCulloch, R. E. (1997). Approaches for Bayesian variable selection. Statistica Sinica, 7(2), 339–373. https://www.jstor.org/stable/24306083

17. Goldstein, Leckie, Charlton, Tilling, & Browne, W. J. (2018). Multilevel growth curve models that incorporate a random coefficient model for the level 1 variance function. Statistical Methods in Medical Research, 27(11), 3478–3491. https://doi.org/10.1177/0962280217706728

18. Gottfried, A. E., Gottfried, A. W., Morris, P. E., & Cook, C. R. (2008). Low academic intrinsic motivation as a risk factor for adverse educational outcomes: A longitudinal study from early childhood through early adulthood. In Hudley & A. E. Gottfried (Eds.), Academic motivation and the culture of school in childhood and adolescence (pp. 36–69). Oxford University Press. https://doi.org/10.1093/acprof:oso/9780195326819.003.0003

19. Hedeker, Mermelstein, R. J., & Demirtas (2008). An application of a mixed-effects location scale model for analysis of ecological momentary assessment (EMA) data. Biometrics, 64(2), 627–634. https://doi.org/10.1111/j.1541-0420.2007.00924.x

20. Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira. (2021). Sistema de Avaliação da Educação Básica (Saeb). https://www.gov.br/inep/pt-br/areas-de-atuacao/avaliacao-e-exames-educacionais/saeb

21. Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795.

22. Kuo & Mallick (1998). Variable selection for regression models. Sankhyā: The Indian Journal of Statistics, Series B (1960–2002), 60(1), 65–81. https://www.jstor.org/stable/25053023

23. Leckie (2014). Runmixregls: A program to run the MIXREGLS mixed-effects location scale software from within Stata. Journal of Statistical Software, 59, 1–39.

24. Leckie, French, Charlton, & Browne (2014). Modeling heterogeneous variance-covariance components in two-level models. Journal of Educational and Behavioral Statistics, 39(5), 307–332. https://doi.org/10.3102/1076998614546494

25. Leckie, Parker, Goldstein, & Tilling (2023). Mixed-effects location scale models for joint modeling school value-added effects on the mean and variance of student achievement. Journal of Educational and Behavioral Statistics, 49, 879–911. https://doi.org/10.3102/10769986231210808

26. Lewandowski, Kurowicka, & Joe (2009). Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis, 100(9), 1989–2001. https://doi.org/10.1016/j.jmva.2009.04.008

27. Lurie, L. A., Hagen, M. P., McLaughlin, K. A., Sheridan, M. A., Meltzoff, A. N., & Rosen, M. L. (2021). Mechanisms linking socioeconomic status and academic achievement in early childhood: Cognitive stimulation and language. Cognitive Development, 58, Article 101045. https://doi.org/10.1016/j.cogdev.2021.101045

28. Martin, S. R., & Rast (2022). The reliability factor: Modeling individual reliability with multiple items from a single assessment. Psychometrika, 87(4), 1318–1342. https://doi.org/10.1007/s11336-022-09847-9

29. Mestdagh, Pestman, Verdonck, Kuppens, & Tuerlinckx (2018). Sidelining the mean: The relative variability index as a generic mean-corrected variability measure for bounded variables. Psychological Methods, 23, 690–707. https://doi.org/10.1037/met0000153

30. Mitchell, T. J., & Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404), 1023–1032. https://doi.org/10.1080/01621459.1988.10478694

31. Perry, L. B., Saatcioglu, & Mickelson, R. A. (2022). Does school SES matter less for high-performing students than for their lower-performing peers? A quantile regression analysis of PISA 2018 Australia. Large-Scale Assessments in Education, 10(1), Article 17. https://doi.org/10.1186/s40536-022-00137-5

32. Powers, D. M. W. (2020). Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. https://arxiv.org/abs/2010.16061

33. Rast & Carmo (2025). ivd: Individual variance detection [R package version 0.1.0]. https://cran.r-project.org/package=ivd

34. Rast & Ferrer (2018). A mixed-effects location scale model for dyadic interactions. Multivariate Behavioral Research, 53(5), 756–775. https://doi.org/10.1080/00273171.2018.1477577

35. Rast, Hofer, S. S., & Sparks (2012). Modeling individual differences in within-person variation of negative and positive affect in a mixed effects location scale model using BUGS/JAGS. Multivariate Behavioral Research, 47(2), 177–200. https://doi.org/10.1080/00273171.2012.658328

36. Raudenbush, S. W., & Bryk, A. S. (1987). Examining correlates of diversity. Journal of Educational Statistics, 12(3), 241–269. https://doi.org/10.3102/10769986012003241

37. Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (Vol. 1). Sage Publications.

38. Rausch, O. P. (1948). The effects of individual variability on achievement. Journal of Educational Psychology, 39(8), 469–478. https://doi.org/10.1037/h0059307

39. Rodriguez, J. E., Williams, D. R., & Rast (2022). Who is and is not “average”? Random effects selection with spike-and-slab priors. Psychological Methods, 29, 117–136. https://doi.org/10.1037/met0000535

40. Rouder, J. N., Haaf, J. M., & Vandekerckhove (2018). Bayesian inference for psychology, part IV: Parameter estimation and Bayes factors. Psychonomic Bulletin & Review, 25(1), 102–113. https://doi.org/10.3758/s13423-017-1420-7

41. Scott, J. G., & Berger, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics, 38(5), 2587–2619. https://doi.org/10.1214/10-AOS792

42. Sirin, S. R. (2005). Socioeconomic status and academic achievement: A meta-analytic review of research. Review of Educational Research, 75(3), 417–453. https://doi.org/10.3102/00346543075003417

43. Sivula, Magnusson, Matamoros, A. A., & Vehtari (2022, March). Uncertainty in Bayesian leave-one-out cross-validation based model comparison. arXiv:2008.10296 [stat]. http://arxiv.org/abs/2008.10296

44. Spörlein & Schlueter (2018). How education systems shape cross-national ethnic inequality in math competence scores: Moving beyond mean differences. PLOS ONE, 13(3), e0193738. https://doi.org/10.1371/journal.pone.0193738

45. Tompsett & Knoester (2023). Family socioeconomic status and college attendance: A consideration of individual-level and school-level pathways. PLOS ONE, 18(4), e0284188. https://doi.org/10.1371/journal.pone.0284188

46. VanMeveren, Hulac, & Wollersheim-Shervey (2020). Universal screening methods and models: Diagnostic accuracy of reading assessments. Assessment for Effective Intervention, 45(4), 255–265. https://doi.org/10.1177/1534508418819797

47. Vansteelandt & Verbeke (2016). A mixed model to disentangle variance and serial autocorrelation in affective instability using ecological momentary assessment data. Multivariate Behavioral Research, 51(4), 446–465. https://doi.org/10.1080/00273171.2016.1159177

48. Vehtari, Gelman, & Gabry (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413–1432. https://doi.org/10.1007/s11222-016-9696-4

49. Vehtari, Gelman, Simpson, Carpenter, & Bürkner, P.-C. (2021). Rank-normalization, folding, and localization: An improved R̂ for assessing convergence of MCMC (with discussion). Bayesian Analysis, 16(2), 677–718. https://doi.org/10.1214/20-BA1221

50. Verbeke & Davidian (2009). Joint models for longitudinal data: Introduction and overview. In Fitzmaurice, Davidian, Verbeke, & Molenberghs (Eds.), Longitudinal data analysis (pp. 319–326). Chapman & Hall/CRC.

51. von Stumm, Cave, S. N., & Wakeling (2022). Persistent association between family socioeconomic status and primary school performance in Britain over 95 years. npj Science of Learning, 7(1), Article 4. https://doi.org/10.1038/s41539-022-00120-3

52. Walters, R. W., Hoffman, & Templin (2018). The power to detect and predict individual differences in intra-individual variability using the mixed-effects location-scale model. Multivariate Behavioral Research, 53(3), 360–374. https://doi.org/10.1080/00273171.2018.1449628

53. Watanabe (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571–3594. http://jmlr.csail.mit.edu/papers/volume11/watanabe10a/watanabe10a.pdf

54. Williams, D. R., Martin, S. R., & Rast (2021). Putting the individual into reliability: Bayesian testing of homogeneous within-person variance in hierarchical models. Behavior Research Methods, 54, 1272–1290. https://doi.org/10.3758/s13428-021-01646-x

55. Wolfinger, R. D., & Tobias, R. D. (1998). Joint estimation of location, dispersion, and random effects in robust design. Technometrics, 40(1), 62–71.

56. Woodrow (1932). Quotidian variability. Psychological Review, 39(3), 245–256. https://doi.org/10.1037/h0073076

57. Wright & von Stumm (2022). Within-person variability in performance across school subjects. Learning and Individual Differences, 93, Article 102091. https://doi.org/10.1016/j.lindif.2021.102091
