
Abstract

Normative studies are needed to obtain norms for comparing individuals with the reference population on relevant clinical or educational measures. Norms can be obtained in an efficient way by regressing the test score on relevant predictors, such as age and sex. When several measures are normed with the same sample, a multivariate regression-based approach must be adopted for at least two reasons: (1) to take into account the correlations between the measures of the same subject, in order to test certain scientific hypotheses and to reduce misclassification of subjects in clinical practice, and (2) to reduce the number of significance tests involved in selecting predictors for the purpose of norming, thus preventing the inflation of the type I error rate. A new multivariate regression-based approach is proposed that combines all measures for an individual through the Mahalanobis distance, thus providing an indicator of the individual’s overall performance. Furthermore, optimal designs for the normative study are derived under five multivariate polynomial regression models, assuming multivariate normality and homoscedasticity of the residuals, and efficient robust designs are presented in case of uncertainty about the correct model for the analysis of the normative sample. Sample size calculation formulas are provided for the new Mahalanobis distance-based approach. The results are illustrated with data from the Maastricht Aging Study (MAAS).

Keywords

Mahalanobis distance, normative data, optimal design, sample size calculation

To classify individuals’ performance on relevant clinical or educational measures, such as neuropsychological tests or language tests, psychologists and educators need reference values or norms. Three examples of studies that provide normative data are Van der Elst et al. (2006), who derived norms for the three subtasks of the Stroop Color-Word Test, which is a general measure of cognitive flexibility and control, Van der Elst et al. (2011), who normed the Animal Verbal Fluency and the Design Fluency tests in school-aged children, which are aimed to assess vocabulary size and motor planning, respectively, and Dujardin et al. (2021), who provided reference values for several tests to assess vocabulary, reading, and spelling skills in university students.

Norms allow practitioners to directly compare a subject’s performance on a test with the reference population, that is, a group of individuals with the same characteristics deemed relevant for the norming (e.g., the same age and sex). Which predictors make the comparison fair or, in other words, what determines the reference population, depends on the purpose of the test (e.g., diagnosis of dementia versus job selection), which is the first crucial choice in a normative study. The second important choice is that of the norming approach. In the traditional approach, after drawing a sample of subjects (i.e., the normative sample), the predictors defining the reference population are categorized and norms are derived within subgroups based on these categories. This approach has been methodologically surpassed by regression-based norming (Oosterhuis et al., 2016; Timmerman et al., 2021; Van Breukelen & Vlaeyen, 2005), which allows test developers to obtain more precise norms for a number of reasons. First, the regression-based approach is more efficient than traditional norming (Lenhard & Lenhard, 2021; Oosterhuis et al., 2016; Zhu & Chen, 2011) because the test score is modeled as a function of the predictors defining the reference population and the norms are estimated from the cumulative distribution of the test score conditional on the fitted norming model, without splitting the normative sample (Timmerman et al., 2021; Van Breukelen & Vlaeyen, 2005). Second, continuous predictors (e.g., age) do not need to be categorized, unlike in the traditional approach, which yields smoother and more granular norms (see Figure 1 in Timmerman et al., 2021; Van Breukelen & Vlaeyen, 2005); that is, norms can be derived for any combination of predictor levels within their range in the normative sample. 
Third, test developers can identify which predictors are actually related to the test score and provide norms based only on these predictors (Van Breukelen & Vlaeyen, 2005), in order to ensure efficient estimations of the norms (how to choose the candidate predictors in the norming context will be discussed in the next paragraph). For these three reasons, the regression-based approach is here adopted.

Figure 1.

The Mahalanobis distance $Δ_{0}$ is shown as a function of $Z_{01}$ and $Z_{02}$ for different values of the correlation $ρ$ (0 in the top row, 0.5 in the second row, and −0.5 in the third row). In the rightmost column, $Δ_{0}$ increases when moving away from the center (i.e., $Z_{01} = Z_{02} = 0$), which is marked by a cross, so the darker a region, the larger $Δ_{0}$. In each panel, a profile is classified as “abnormal” under the $Δ_{0}$ rule if it falls outside the ellipse/circle (i.e., $Δ_{0} > 2.15$), under the disjunctive rule if it falls outside the dashed square in the center of the plot (i.e., if $| Z_{01} | > 1.65$ and/or $| Z_{02} | > 1.65$), and under the conjunctive rule if it falls inside any of the dotted-dashed disjoint squares in the four corners (i.e., if $| Z_{01} | > 1.65$ and $| Z_{02} | > 1.65$).

In the specification of a model for norming purposes, one should include, in the list of candidate predictors, only those variables that define the reference population as following from the purpose of the test (Timmerman et al., 2021). Adjusting for all possible predictors that are related to the outcome variable may alter the interpretation of the norms. For example, if performance on the Stroop test is known to decline with age and to differ between sexes and educational levels in the cognitively intact population (see, for instance, Van der Elst et al., 2006), then the question of whether to control for all these predictors or only for some of them, or even for none, depends on the intended use of the Stroop test. If the purpose of the Stroop test is to diagnose dementia, then patients’ performance on this test should be compared with the performance of subjects of the same age, sex, and education. But if the Stroop test is used for selecting job applicants, the question becomes which score is more predictive of future job performance: the unadjusted test score or the score adjusted for one or more of the above predictors. In the initial selection of the candidate predictors, any available prior knowledge about similar tests can be valuable, such as the meta-analytic evidence about the relationship between several neuropsychological tests and demographic factors provided by Mitrushina et al. (2005). Furthermore, one should in general not adjust for predictors that can be affected by the trait measured by the test, such as educational level in intelligence testing. There are exceptions to this, for instance, if an intelligence test is used to diagnose cognitive decline instead of to establish a person’s IQ. Finally, one can control for higher order terms (i.e., interactions, nonlinear effects) obtained as a combination of the candidate predictors. 
Model selection strategies can then be used to identify which predictors, among those defining the reference population and their higher order terms, are significantly related to the outcome variable to ensure efficient estimation of the norms.

Now, normative studies often provide reference values for several tests and questionnaires. In a literature review of 65 regression-based normative studies, Innocenti et al. (2023) found that 54 studies (83%) derived norms for at least two tests or subscales of the same questionnaire and that norms were derived from separate univariate analyses based on the same sample. However, fitting a regression model for each test or subscale has (at least) three weaknesses (Van der Elst et al., 2017). First, the correlation between test scores of the same subject is taken into account neither in the model selection, which entails that regression coefficients cannot be tested across outcome variables (Johnson & Wichern, 2007), nor in norming because the subject’s performance on each test is assessed independently of the other tests. The possible consequence of this limitation is twofold: (1) researchers cannot test some relevant scientific hypotheses, such as the presence of a trial by demographics interaction in multitrial memory studies (Espenes et al., 2023; Van der Elst et al., 2017), and (2) incorrect classification of subjects in clinical practice. Indeed, normative comparisons using the correlation between outcome variables have been shown to yield higher sensitivity in detecting cognitive impairment in HIV infection (Su et al., 2015) and higher sensitivity and specificity in predicting progression to Parkinson’s disease dementia (Agelink van Rentergem et al., 2019). A second limitation of the univariate approach is that the number of significance tests of the predictor effects increases with the number of outcome variables. This increases the risk of type I errors in deciding whether or not to include, say, an interaction of age and sex, into the norming. 
A third disadvantage of fitting as many regression models as outcomes is that this approach becomes practically more cumbersome as the number of tests to be normed increases and may lead to a different model per outcome, which can be confusing for the application of norms.

To avoid these issues, Van der Elst et al. (2017) proposed to use multivariate multiple regression, which allows several tests to be normed with a single model under the assumption of multivariate normality and homoscedasticity of the residuals. However, in Van der Elst et al.’s (2017) multivariate regression-based approach, once the parameters of the multivariate regression model have been estimated, each test is normed separately, just like in univariate regression-based norming. Within the framework of traditional norming, Huizenga et al. (2007) have proposed two multivariate approaches: (1) applying a Bonferroni correction when determining the cutoff for classifying subjects’ performance on each test separately or (2) summarizing subjects’ multivariate performance with Hotelling’s $T^{2}$ statistic, which makes use of the covariance between each pair of tests. To facilitate the application of multivariate norming by practitioners, de Vent et al. (2016) have developed the Advanced Neuropsychological Diagnostics Infrastructure (ANDI), a database storing scores on several neuropsychological tests of healthy participants from several studies conducted in the Netherlands and Flemish Belgium. A multivariate regression-based approach using the ANDI database has been developed by Agelink van Rentergem et al. (2017), who combined Hotelling’s $T^{2}$, proposed by Huizenga et al. (2007), with a multivariate random intercept model to account for differences in test scores between the different studies in the database. The first goal of this article is to propose a new multivariate regression-based approach that is based on multivariate linear regression, as in Van der Elst et al. (2017), and on a multivariate norm statistic for classifying subjects’ performance, similarly to Huizenga et al. (2007) and Agelink van Rentergem et al. (2017). However, here the Mahalanobis distance is used as the multivariate norm statistic. 
The Mahalanobis distance can be seen as a multivariate Z-score and Hotelling’s $T^{2}$ is a multivariate version of Student's t-statistic, so Hotelling’s $T^{2}$ is expected to outperform the Mahalanobis distance in terms of subjects’ classification in small samples, but no substantial differences are expected for large sample sizes, which are required for ensuring the stability of the norms anyway. Furthermore, deriving the optimal design of a normative study is easier if norms are based on the Mahalanobis distance than if they are based on Hotelling’s $T^{2}$ because the variance of the Mahalanobis distance depends on the sample size and the predictors for norming in a much less complicated way than the variance of Hotelling’s $T^{2}$, as has been derived in section 1.2 of the Online Supplement A.

The quality of test norms depends not only on the statistical methods used for analyzing the test data from the normative sample but also on the size and composition of the normative sample. To the best of our knowledge, there are no guidelines on how to determine the required sample size for multivariate norming. Indeed, the literature on sample size calculation for normative studies has focused only on univariate norming (Innocenti et al., 2023; Oosterhuis et al., 2016, 2017). Specifically, Oosterhuis et al. (2016) obtained sample size guidelines for percentile estimation under the traditional and the regression-based approach, assuming a quantitative and a qualitative predictor in their simulations. Sample size calculation for several norm statistics (e.g., Z-score, percentile rank score, stanines) can also be made under the traditional approach, by using the standard error formulas derived by Oosterhuis et al. (2017). This latter approach leads to the sample size per subgroup (e.g., per age group per sex), as the formulas do not allow for covariates. Innocenti et al. (2023) have derived the optimal design of the normative study for five univariate linear regression models with a qualitative and a quantitative predictor and proposed a sample size calculation procedure, such that individuals’ positions relative to the derived norms (i.e., univariate Z-scores and percentile rank scores) can be assessed with prespecified power and precision. In Innocenti et al. (2023), the optimal design was defined as the joint distribution of the predictors included in the norming model (i.e., the design of the normative study) that maximizes the precision of estimation of the desired norm statistics. The second goal of this article is to derive sample size calculation and optimal design for multivariate regression-based norming, thus extending Innocenti et al. (2023) to the multivariate case.

This article is structured as follows. First, a new multivariate regression-based approach, based on the Mahalanobis distance, is proposed and compared with that of Van der Elst et al. (2017) and that based on Hotelling’s $T^{2}$ statistic. Second, optimal and robust designs are derived both for Van der Elst et al.’s (2017) approach and for the present approach. Third, a sample size calculation is developed for the proposed Mahalanobis distance-based approach. The obtained results are illustrated through Van der Elst et al.’s (2006) normative study of the Stroop Color-Word Test in the application section, where the proposed multivariate approach is compared with that of Van der Elst et al. (2017) and that based on Hotelling’s $T^{2}$ statistic. Finally, some concluding remarks are made. Online Supplement A presents the mathematical derivations of the results given in this article and additional figures related to the application section. The R code (R Core Team, 2021) to find efficient designs that are robust against misspecification of the norming model, and to compute the required sample size, is given in Online Supplement B (https://github.com/FInnocenti-Stat/Multivariate-Norming).

Multivariate Regression-Based Norming

Multivariate Regression Models for Norming

To derive the norms for P tests (e.g., the scores on the three Stroop subtasks), a normative sample of N individuals is drawn from the reference population. The raw scores of the N individuals ( $i = 1, \dots, N$ ) on test p ( $p = 1, \dots, P$ ), represented by the $N \times 1$ vector $y_{p}$ , can be expressed as a function of the individuals’ scores on $k_{p}$ predictors (e.g., age and sex) with a standard univariate multiple linear regression model

$$y_{p} = X_{p} β_{p} + ε_{p}, \tag{1}$$

where $X_{p}$ is the $N \times (k_{p} + 1)$ design matrix, $β_{p}$ is the $(k_{p} + 1) \times 1$ vector of regression coefficients, and $ε_{p}$ is the $N \times 1$ vector of normally distributed, homoscedastic, and uncorrelated residual errors on test p (i.e., $ε_{p} \sim N (0, σ_{p}^{2} I_{N \times N})$ ). However, the P tests can be simultaneously normed with a standard multivariate regression model (Johnson & Wichern, 2007)

$$Y_{N \times P} = X_{N \times (k + 1)} B_{(k + 1) \times P} + E_{N \times P}, \tag{2}$$

which assumes the following:

(i) The design matrix is the same for all P test scores, that is, $X_{1} = X_{2} = \dots = X_{P} = X$ and $k_{1} = k_{2} = \dots = k_{P} = k$ , which is a realistic assumption at the design stage of the study if all P tests are administered to the same normative sample at the same time point and the same reference population applies.

(ii) Individuals are sampled independently of each other (i.e., the P residual errors for individual i, $ε_{i}$ , and individual j, $ε_{j}$ , are independent random vectors, where $ε_{i}$ and $ε_{j}$ are the ith and jth row of E, respectively).

(iii) Multivariate homoscedasticity, that is, the variance-covariance matrix of the residual errors is the same for all individuals (i.e., $\mathrm{Cov} (ε_{i}) = Σ_{P \times P}$ for any $i = 1, \dots, N$ ).

(iv) Multivariate normality of the residual errors (i.e., $ε_{i}' \sim N (E (ε_{i}') = 0, \mathrm{Cov} (ε_{i}) = Σ)$ ).

Note that, under model (2), test scores of different individuals are uncorrelated, like under model (1), but scores on different tests of the same individual can be related with correlation $ρ_{p q}$ . Thus, a multivariate approach based on model (2) has the advantage, relative to a univariate approach based on P separate analyses with model (1), of properly taking into account $ρ_{p q}$ , for instance, when testing hypotheses on the regression coefficients (Johnson & Wichern, 2007; Van der Elst et al., 2017). However, this correlation can also play an important role when comparing an individual to the reference population, as will be seen later.

Having estimated B and $Σ$ in the normative sample with their least squares estimators $\hat{B}$ and $\hat{Σ}$ (for details, see Online Supplement A, section 1), one can use these estimates to compare a new individual (not included in the normative study and for that reason denoted by the subscript “0”) with the reference population. This is done by converting the P test scores of the individual into a norm statistic of interest. For this new individual, denote by $y_{0}$ the $P \times 1$ vector of test scores, by $x_{0}$ the $(k + 1) \times 1$ vector containing the scores on the predictors, and by $ε_{0}$ the $P \times 1$ vector of residuals, such that $ε_{0} \sim N (0, Σ)$ , thus $y_{0} = B' x_{0} + ε_{0}$ . Under model (2), there are (at least) two possible approaches to convert the P test scores of the individual into a norm statistic, and they are described in the next section. The notation of this article is summarized in Table A1 in the Appendix.
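As a minimal sketch of this estimation step, the Python code below fits model (2) with an intercept and a single predictor (k = 1). It fits each outcome by ordinary least squares, which coincides column by column with the multivariate least squares estimator of B. The function name `fit_multivariate_ols`, the data values, and the residual degrees of freedom N − k − 1 used as denominator of $\hat{Σ}$ are assumptions made for illustration; the estimators actually used are detailed in Online Supplement A.

```python
def fit_multivariate_ols(x, Y):
    """Fit Y = [1, x] B + E for one predictor x and P outcomes.

    x: list of N predictor values; Y: list of N rows, each with P scores.
    Returns (B_hat, Sigma_hat): B_hat[0] holds the P intercepts, B_hat[1]
    the P slopes; Sigma_hat is the P x P residual covariance matrix.
    """
    n, p = len(x), len(Y[0])
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    intercepts, slopes = [], []
    for j in range(p):
        y_bar = sum(row[j] for row in Y) / n
        s_xy = sum((xi - x_bar) * (row[j] - y_bar) for xi, row in zip(x, Y))
        slope = s_xy / s_xx
        slopes.append(slope)
        intercepts.append(y_bar - slope * x_bar)
    # Residual matrix (N x P): observed minus fitted scores.
    resid = [[row[j] - intercepts[j] - slopes[j] * xi for j in range(p)]
             for xi, row in zip(x, Y)]
    df = n - 2  # assumed denominator N - k - 1 with k = 1
    sigma = [[sum(r[a] * r[b] for r in resid) / df for b in range(p)]
             for a in range(p)]
    return [intercepts, slopes], sigma


# Hypothetical normative sample: age and scores on P = 2 tests.
age = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
scores = [[52.0, 31.0], [50.0, 30.0], [47.0, 27.0],
          [46.0, 27.0], [42.0, 24.0], [41.0, 22.0]]
B_hat, Sigma_hat = fit_multivariate_ols(age, scores)
```

The off-diagonal element of `Sigma_hat` is what distinguishes the multivariate fit from P separate univariate fits: the regression coefficients are identical, but the residual covariance between tests is estimated as well.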

Competing Approaches to Multivariate Regression-Based Norming

To translate the P test scores, $y_{0}$ , of a new individual into a vector of P Z-scores and, if needed, into a vector of P percentile rank scores, Van der Elst et al. (2017) have proposed the following approach:

(1) Fit the multivariate model in Equation 2, thus obtaining $\hat{B}$ and $\hat{Σ}$ from the normative sample.

For the new individual, not part of the normative sample, compute the P-dimensional vectors of

(2) predicted scores: ${\hat{y}}_{0} = \hat{B}' x_{0}$ ,

(3) residuals: ${\hat{ε}}_{0} = y_{0} - {\hat{y}}_{0}$ ,

(4) Z-scores: ${\hat{z}}_{0} = \mathrm{diag} {(\hat{Σ})}^{- \frac{1}{2}} {\hat{ε}}_{0}$ , where the pth element is ${\hat{Z}}_{0 p} = \frac{{\hat{ε}}_{0 p}}{{\hat{σ}}_{p}} = \frac{Y_{0 p} - x_{0}' {\hat{β}}_{p}}{{\hat{σ}}_{p}}$ , where ${\hat{β}}_{p}$ and ${\hat{σ}}_{p}$ are obtained from the normative sample in Step 1, whereas $Y_{0 p}$ and $x_{0}$ are the raw score on test p and the predictor values, respectively, of the new individual. Note that ${\hat{z}}_{0}$ estimates the individual’s $z_{0}$ that would follow from Steps 2–4 if B and $Σ$ were known instead of being estimated on the normative sample.

(5) Convert the P Z-scores into percentile rank scores using the standard normal cumulative distribution.
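Steps 2–5 can be sketched in a few lines of Python. The estimates in `B_hat` and `Sigma_hat`, the new individual's values, and the function name `z_scores_and_percentiles` are hypothetical and serve only to illustrate the computation; they are not taken from any normative study.

```python
from math import erf, sqrt


def z_scores_and_percentiles(y0, x0, B_hat, Sigma_hat):
    """Steps 2-5: predicted scores, residuals, Z-scores, percentile ranks."""
    p = len(y0)
    # Step 2: predicted scores, y0_hat = B_hat' x0
    y_hat = [sum(B_hat[r][j] * x0[r] for r in range(len(x0))) for j in range(p)]
    # Step 3: residuals
    resid = [y0[j] - y_hat[j] for j in range(p)]
    # Step 4: divide each residual by its residual standard deviation
    z = [resid[j] / sqrt(Sigma_hat[j][j]) for j in range(p)]
    # Step 5: standard normal CDF, expressed as a percentage
    pct = [100.0 * 0.5 * (1.0 + erf(zj / sqrt(2.0))) for zj in z]
    return z, pct


# Hypothetical estimates for P = 2 tests and predictors (1, age).
B_hat = [[56.0, 34.0],    # intercepts
         [-0.2, -0.15]]   # age slopes
Sigma_hat = [[16.0, 6.0],
             [6.0, 9.0]]
x0 = [1.0, 50.0]          # intercept term and age of the new individual
y0 = [40.0, 29.5]         # raw scores of the new individual
z0_hat, pct0 = z_scores_and_percentiles(y0, x0, B_hat, Sigma_hat)
```

With these made-up numbers the two Z-scores are −1.5 and +1.0. Only the diagonal of `Sigma_hat` enters the computation, so the covariance between the two tests plays no role in this approach.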

So, Van der Elst et al.’s (2017) approach yields a separate set of norms for each test, like in the univariate approach, and a psychologist or an educator can use these norms to determine on which test a subject has an “abnormal” performance. However, unless the subject’s performance is fairly consistent across all administered tests (e.g., always close to the average or always below the average), it is not straightforward how to combine their P Z-scores to evaluate their overall performance. For instance, can the overall performance on the Stroop Color-Word test be classified as “abnormal,” if a subject has Z-score $= - 0.2$ on the first subtask, Z-score $= - 1.2$ on the second subtask, and Z-score $= + 1.3$ on the third and most difficult subtask (a real-life example of such profile is given in the application section)? To address this issue, we propose an alternative approach based on the Mahalanobis distance

$$Δ_{0} = \sqrt{(y_{0} - B' x_{0})' {(Σ)}^{- 1} (y_{0} - B' x_{0})} = \sqrt{ε_{0}' {(Σ)}^{- 1} ε_{0}}, \tag{3}$$

which can be seen as a multivariate Z-score (note that, for $P = 1$ , $Δ_{0} = | Z_{0 p} | = | \frac{ε_{0 p}}{σ_{p}} |$ ). The Mahalanobis distance thus can be used as an indicator of the individual’s overall performance across all P tests. The first three steps of our approach are the same as in Van der Elst et al.’s (2017) approach, while Steps 4 and 5 differ, that is:

(4) Compute the Mahalanobis distance value for the new subject not included in the normative study: ${\hat{Δ}}_{0} = \sqrt{(y_{0} - {\hat{y}}_{0})' {(\hat{Σ})}^{- 1} (y_{0} - {\hat{y}}_{0})} = \sqrt{{\hat{ε}}_{0}' {(\hat{Σ})}^{- 1} {\hat{ε}}_{0}}$ .

(5) Convert the Mahalanobis distance $Δ_{0}$ into a percentile rank score using the $χ_{P}^{2}$ distribution with P degrees of freedom, given that $Δ_{0}^{2} = ε_{0}' {(Σ)}^{- 1} ε_{0} \sim χ_{P}^{2}$ (Johnson & Wichern, 2007, p. 163).

To distinguish between “abnormal” and “normal” performance based on $Δ_{0}$ , the cutoff points can be chosen as the square roots of extreme percentiles of the $χ_{P}^{2}$ distribution, such as the 90th, 95th, and 99th percentiles (for $P = 2$ , these give $Δ_{0} = 2.15$ , $2.45$ , and $3.03$ , respectively). A complication, dealt with in the next section, is that $Δ_{0}$ is estimated using $\hat{B}$ and $\hat{Σ}$ , which introduces sampling error in the norm statistic value; this error should be properly acknowledged by reporting the standard error of the norm statistic. The same applies to $Z_{0 p}$ as a univariate norm statistic, a problem for which a solution is given in Innocenti et al. (2023).
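For $P = 2$ , Steps 4 and 5 and the cutoffs can be sketched without any statistical library, because the $χ_{2}^{2}$ cumulative distribution function has the closed form $1 - e^{-x/2}$ . The residuals and the matrix `Sigma_hat` below are hypothetical illustration values, and the function names are not part of the proposed method.

```python
from math import exp, log, sqrt


def mahalanobis_2d(resid, Sigma):
    """Mahalanobis distance for P = 2 using the explicit 2 x 2 inverse."""
    (s11, s12), (s21, s22) = Sigma
    det = s11 * s22 - s12 * s21
    e1, e2 = resid
    d2 = (s22 * e1 ** 2 - (s12 + s21) * e1 * e2 + s11 * e2 ** 2) / det
    return sqrt(d2)


def chi2_2_percentile_rank(delta):
    """Percentile rank of delta: chi-square CDF with 2 df at delta**2."""
    return 100.0 * (1.0 - exp(-delta ** 2 / 2.0))


def delta_cutoff(q):
    """Square root of the q-th quantile (0 < q < 1) of chi-square with 2 df."""
    return sqrt(-2.0 * log(1.0 - q))


# Hypothetical residuals of a new individual and residual covariance matrix.
resid0 = [-6.0, 3.0]
Sigma_hat = [[16.0, 6.0],
             [6.0, 9.0]]
delta0_hat = mahalanobis_2d(resid0, Sigma_hat)
rank0 = chi2_2_percentile_rank(delta0_hat)
cutoffs = [round(delta_cutoff(q), 2) for q in (0.90, 0.95, 0.99)]
```

In this made-up example the univariate Z-scores are −1.5 and +1.0, neither of which is extreme on its own, yet the estimated distance is about 2.52, which lies above the 95th-percentile cutoff. For $P > 2$ a general matrix inverse and $χ_{P}^{2}$ routine would be needed instead of these closed forms.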

While Van der Elst et al.’s (2017) approach leads to as many norm statistic values as tests for a given person tested, the Mahalanobis distance-based approach provides a single norm statistic value for all P tests, thus yielding a general overview of an individual’s condition (i.e., performance/symptoms). Likewise, the approach of Huizenga et al. (2007) and Agelink van Rentergem et al. (2017) is based on a multivariate statistic, that is, Hotelling’s $T^{2}$ , which is closely related to the Mahalanobis distance, because $T_{0}^{2} \approx \frac{{\hat{Δ}}_{0}^{2}}{P}$ for N sufficiently large (for the exact relation between Hotelling’s $T^{2}$ and the Mahalanobis distance, see section 1.2, Online Supplement A). Both $Δ_{0}$ and $T_{0}^{2}$ are intrinsically two-tailed criteria, that is, they cannot discriminate between profiles scoring high and profiles scoring low on all tests, but in practice, one is often interested in one-tailed comparisons (e.g., detecting a “too low” performance). To obtain a one-tailed comparison, one can first lower the percentile used as cutoff for decision-making (e.g., take the 90th percentile instead of the 95th percentile) and then examine the signs of the univariate Z-scores, as obtained from Van der Elst et al.’s (2017) approach. A second solution to this problem could be to norm the (weighted) sum of the P test scores instead of using a multivariate statistic and then to apply a univariate regression-based approach that is naturally one-tailed. However, this latter approach ignores differences between tests in score variance and the sign of the pairwise correlations between tests, which can lead to misclassification of atypical profiles. 
If $P = 2$ , for instance, profiles scoring low on one test and high on the other are atypical or not depending on whether the correlation between tests is positive or negative (as will be explained in the next section), and summing the two scores incorrectly suggests that these profiles are average, while their overall performance can be more extreme than that of profiles scoring high (or low) on both tests (as will be illustrated in the application). In the next section, the Mahalanobis distance-based approach is compared with two classification rules for combining univariate norms as obtained by Van der Elst et al.’s (2017) approach.

Classification Rules to Combine P Z-Scores

To compare the Mahalanobis distance-based approach with Van der Elst et al.’s (2017) approach in terms of the assessment of the overall performance, we propose to apply a classification rule to the P Z-scores obtained with Van der Elst et al.’s (2017) approach. At least two classification rules can be considered: An individual’s performance/symptoms could be classified as “abnormal” when

Disjunctive rule: At least one Z-score is extreme, (4)

Conjunctive rule: All Z-scores are extreme. (5)

An example of the disjunctive rule is the Frascati criterion that defines a cognitive domain as abnormal if at least one test is 1 standard deviation below the mean (Su et al., 2015). The conjunctive rule can be used, for instance, as a criterion to recommend knee surgery to elderly persons if they score too low on a physical function scale (e.g., the Short Physical Performance Battery, see Bergland & Strand, 2019) and too high on a fear of falling questionnaire (e.g., the Falls Efficacy Scale-International, see Kempen et al., 2007).

Unlike the disjunctive and conjunctive rules, the Mahalanobis distance is a classification rule that takes into account not only the P Z-scores but also their pairwise covariances/correlations through $Σ$ (see Equation 3). For instance, the Mahalanobis distance $Δ_{0}$ for $P = 2$ tests reduces to $Δ_{0} = \sqrt{\frac{Z_{01}^{2} + Z_{02}^{2} - 2 Z_{01} Z_{02} ρ}{1 - ρ^{2}}}$ (this result follows from plugging $ε_{0} = \mathrm{diag} {(Σ)}^{\frac{1}{2}} z_{0}$ into Equation 3 and some rewriting). Here, $Z_{01}$ and $Z_{02}$ are the Z-scores corresponding to the first and second test score, respectively, and $ρ$ is the correlation between $ε_{01}$ and $ε_{02}$ . For $P = 2$ , Figure 1 (left column) shows $Δ_{0}$ as a function of $Z_{01}$ and $Z_{02}$ (y- and x-axes), for three values of $ρ$ (rows). As shown in Figure 1 (left column), extreme $Δ_{0}$ values result from extreme Z-scores, but whether profiles with Z-scores of the same sign, or profiles with Z-scores of opposite signs, yield more extreme $Δ_{0}$ values, depends on the sign of the correlation between Z-scores (Figure 1, second and third rows).
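The rewriting behind the two-test formula can be sketched as follows, where $D = \mathrm{diag} {(Σ)}^{\frac{1}{2}}$ and the $2 \times 2$ residual correlation matrix $R$ are symbols introduced here for illustration only. Since $Σ = D R D$ and $ε_{0} = D z_{0}$ ,

$$Δ_{0}^{2} = ε_{0}' Σ^{-1} ε_{0} = z_{0}' D {(D R D)}^{-1} D z_{0} = z_{0}' R^{-1} z_{0}, \qquad R^{-1} = \frac{1}{1 - ρ^{2}} \begin{pmatrix} 1 & -ρ \\ -ρ & 1 \end{pmatrix},$$

so that

$$Δ_{0}^{2} = \frac{Z_{01}^{2} + Z_{02}^{2} - 2 ρ Z_{01} Z_{02}}{1 - ρ^{2}}.$$

For $ρ = 0$ this is the squared Euclidean length of the Z-score vector, $Z_{01}^{2} + Z_{02}^{2}$ . The cross-product term $- 2 ρ Z_{01} Z_{02}$ is what makes the sign pattern of the two Z-scores matter once $ρ \neq 0$ .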

The three classification rules (i.e., Equations 3–5) are also compared in Figure 1 (rightmost column), which is composed of three panels, one for each $ρ$ value, and with $Z_{01}$ and $Z_{02}$ on the x- and y-axes, respectively. Under the $Δ_{0}$ rule, a profile is classified as “abnormal” when the $Δ_{0}$ value is greater than the square root of the 90th percentile of the $χ_{2}^{2}$ distribution (i.e., $Δ_{0} > 2.15$ ), that is, when it falls outside the ellipse/circle in the panel ( $ρ$ value) at hand. Note that in Figure 1 (rightmost column), the darker a region is, the larger the value of $Δ_{0}$ is. “Abnormality” occurs, under the disjunctive rule, when at least one of the two individual’s Z-scores exceeds the 5th or 95th percentile of the standard normal distribution (i.e., $| Z_{01} | > 1.65$ and/or $| Z_{02} | > 1.65$ ), which is the region outside the dashed square in each panel in Figure 1 (rightmost column). Likewise, “abnormality” occurs under the conjunctive rule, when both Z-scores are beyond these cutoff points (i.e., $| Z_{01} | > 1.65$ and $| Z_{02} | > 1.65$ ), which is the area composed of the dotted-dashed disjoint squares in the four corners of each panel in Figure 1 (rightmost column). Note that $Δ_{0}$ depends on $Z_{01}$ , $Z_{02}$ , and $ρ$ , unlike the disjunctive and conjunctive rules that depend only on $Z_{01}$ and $Z_{02}$ . If $ρ = 0$ (Figure 1, first row), the patterns of $Δ_{0}$ as a function of $Z_{01}$ and $Z_{02}$ are symmetric, that is, for a given combination of $| Z_{01} |$ and $| Z_{02} |$ , the corresponding $Δ_{0}$ value is not affected by their signs. Furthermore, Figure 1 shows that, when $ρ = 0$ (top-right panel), then $Δ_{0}$ , as a classification rule (“normality” region represented by the circle), is in-between the conjunctive rule (“abnormality” region composed of the four dotted-dashed disjoint squares) and the disjunctive rule (“normality” region defined by a dashed square). 
As $ρ$ moves away from 0 (Figure 1, second and third rows), the sign of the two Z-scores plays a crucial role in determining whether the individual’s condition is classified as “abnormal” or “normal” based on $Δ_{0}$ . If $ρ > 0$ (Figure 1, second row), Z-scores with opposite signs (e.g., $Z_{01} = + 1.65$ and $Z_{02} = - 1.65$ ) give more extreme $Δ_{0}$ values than Z-scores with the same sign (e.g., $Z_{01} = Z_{02} = 1.65$ ), and the reverse occurs if $ρ < 0$ (Figure 1, third row). This can be explained by noting that if $ρ > 0$ , a profile with opposite signs for $Z_{01}$ and $Z_{02}$ is less likely (and thus more extreme) than a profile with the same sign and vice versa if $ρ < 0$ . Stated differently, by taking into account the correlation between tests, the Mahalanobis distance discriminates between an individual scoring high on both tests (or low on both tests) and an individual scoring high on one test but low on the other test. If the correlation is positive (respectively, negative), then the second individual is marked as more “abnormal” (respectively, less “abnormal”) than the first individual according to the Mahalanobis distance, but not according to the other two classification rules (see Figure 1, rightmost column, second and third rows). In other terms, the definition of “abnormality” according to the Mahalanobis distance is broader than that according to the conjunctive and disjunctive rule, as it includes profiles with extreme Z-scores and profiles with an unlikely combination of Z-scores given the correlation between tests. This result will be further illustrated in the application section.
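The comparison of the three rules can be reproduced numerically with a small Python sketch. The function names and the example profile with Z-scores of +1.5 and −1.5 are illustrative assumptions; the cutoffs 1.65 and 2.15 are those used in Figure 1.

```python
from math import sqrt


def delta_from_z(z1, z2, rho):
    """Mahalanobis distance for P = 2 written in terms of the two Z-scores."""
    return sqrt((z1 ** 2 + z2 ** 2 - 2.0 * z1 * z2 * rho) / (1.0 - rho ** 2))


def classify(z1, z2, rho, z_cut=1.65, delta_cut=2.15):
    """Return, for each rule, whether the profile is flagged as abnormal."""
    return {
        "disjunctive": abs(z1) > z_cut or abs(z2) > z_cut,
        "conjunctive": abs(z1) > z_cut and abs(z2) > z_cut,
        "mahalanobis": delta_from_z(z1, z2, rho) > delta_cut,
    }


# Hypothetical profile with Z-scores of opposite signs, at three correlations.
results = {rho: classify(1.5, -1.5, rho) for rho in (0.0, 0.5, -0.5)}
```

For this profile neither the disjunctive nor the conjunctive rule flags abnormality at any correlation, because both absolute Z-scores are below 1.65. The Mahalanobis rule flags it only for $ρ = 0.5$ , where $Δ_{0} = 3$ , and not for $ρ = 0$ or $ρ = - 0.5$ , where $Δ_{0}$ is about 2.12 and 1.73, respectively.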

Sampling Variances of the Norm Statistics

The two norming approaches presented in the previous section are based on two different norm statistics: Van der Elst et al.’s (2017) approach is based on the Z-score, while the approach proposed here is based on the Mahalanobis distance. In both cases, the norm statistic is estimated by using $\hat{B}$ and $\hat{Σ}$ from the normative sample, instead of the unknown B and $Σ$ . Thus, the sampling variances of the norm statistics, shown in Table 1, originate from the sampling error in $\hat{B}$ and $\hat{Σ}$ and not from the test scores of the subject to whom the norms are applied (i.e., the variances are conditional on $y_{0}$ ).

Using the delta method (Casella & Berger, 2002, p. 245), it can be shown that (see Online Supplement A, section 1)

${\hat{Z}}_{0 p}$ is (approximately) normally distributed with mean $Z_{0 p}$, sampling variance $V ({\hat{Z}}_{0 p})$ approximated by Equation 7, and sampling covariance $\operatorname{Cov} ({\hat{Z}}_{0 p}, {\hat{Z}}_{0 q})$ approximated by Equation 8. Estimators of this variance and covariance are given in Equations 9 and 10, respectively.

The sampling distribution of ${\hat{Δ}}_{0}$ is also (approximately) normal with mean $Δ_{0}$ and variance $V ({\hat{Δ}}_{0})$ approximated by Equation 12, which can be estimated with Equation 13.

The formulas in Table 1 are approximations, with the exception of Equations 6 and 11. Indeed, the delta method is a large-sample approximation, and thus, a simulation study was performed to investigate the accuracy of the formulas in Table 1 as a function of the sample size, with the final goal of determining the smallest sample size for which the accuracy was satisfactory (defined as a relative bias no larger than 5%). The factors considered in the simulation study were the sample size, the variance–covariance matrix $Σ$, the complexity of the multivariate regression model, and the correlation between tests of the same individual (for more details, see Online Supplement A, section 1.3). It turned out that, for $P \leq 2$ and a sample of size $N = 338$ (i.e., the smallest sample size considered in the simulations), ${\hat{Z}}_{0 p}$ and ${\hat{Δ}}_{0}$ are fairly unbiased, and $V ({\hat{Z}}_{0 p})$ and $\hat{V} ({\hat{Z}}_{0 p})$, as approximated in Table 1, are fairly accurate (i.e., relative bias $\leq 5\%$). The approximations $V ({\hat{Δ}}_{0})$ and $\hat{V} ({\hat{Δ}}_{0})$ in Table 1 are also accurate (i.e., relative bias $\leq 5\%$), but only for $Δ_{0} \geq 1.18$, that is, for values of $Δ_{0}$ above the square root of the median of the $χ_{2}^{2}$ distribution (for details, see Online Supplement A, section 1.3). However, values of $Δ_{0}$ below 1.18 are not relevant for distinguishing between “normal” and “abnormal” anyway.

As shown in Table 1, $V ({\hat{Z}}_{0 p})$ and $V ({\hat{Δ}}_{0})$ are functions of the sample size ($N$), the number of predictors ($k$), the true value of the norm statistic ($Z_{0 p}$, $Δ_{0}$), and the standardized prediction variance $d (X, ξ) = N x_{0}' {(X^{'} X)}^{- 1} x_{0}$, which in turn depends on the individual’s scores on the predictors ($x_{0}$) and the design matrix of the normative study ($X$). In the following sections, it will be shown first how to find the joint distribution of the predictors in the normative sample (i.e., the design $ξ$) that minimizes $V ({\hat{Z}}_{0 p})$ and $V ({\hat{Δ}}_{0})$, given $N$ and the norm statistic value. Next, it will be shown how to determine the required $N$, given the design $ξ$, for a desired power level for hypothesis testing or for a desired margin of estimation error.

Table 1.

Norm Statistics and Their Variances and Covariances

		Equation Number
Z-scores
Norm statistic	${\hat{Z}}_{0 p} = \frac{{\hat{ε}}_{0 p}}{{\hat{σ}}_{p}} = \frac{Y_{0 p} - x_{0}' {\hat{β}}_{p}}{{\hat{σ}}_{p}}$	6
Variance	$V ({\hat{Z}}_{0 p}) \approx \frac{d (X, ξ)}{N} + \frac{1}{2 (N - k - 1)} Z_{0 p}^{2}$	7
Covariance	$\operatorname{Cov} ({\hat{Z}}_{0 p}, {\hat{Z}}_{0 q}) \approx ρ_{p q} \frac{d (X, ξ)}{N} + ρ_{p q}^{2} \frac{Z_{0 p} Z_{0 q}}{2 (N - k - 1)}$	8
Variance estimator	$\hat{V} ({\hat{Z}}_{0 p}) \approx \frac{d (X, ξ)}{N} + \frac{1}{2 (N - k - 1)} {\hat{Z}}_{0 p}^{2}$	9
Covariance estimator	$\widehat{\operatorname{Cov}} ({\hat{Z}}_{0 p}, {\hat{Z}}_{0 q}) \approx {\hat{ρ}}_{p q} \frac{d (X, ξ)}{N} + {\hat{ρ}}_{p q}^{2} \frac{{\hat{Z}}_{0 p} {\hat{Z}}_{0 q}}{2 (N - k - 1)}$	10
Mahalanobis distance
Norm statistic	${\hat{Δ}}_{0} = \sqrt{(y_{0} - {\hat{y}}_{0})' {(\hat{Σ})}^{- 1} (y_{0} - {\hat{y}}_{0})} = \sqrt{{\hat{ε}}_{0}' {(\hat{Σ})}^{- 1} {\hat{ε}}_{0}}$	11
Variance	$V ({\hat{Δ}}_{0}) \approx \frac{d (X, ξ)}{N} + \frac{1}{2 (N - k - 1)} Δ_{0}^{2}$	12
Variance estimator	$\hat{V} ({\hat{Δ}}_{0}) \approx \frac{d (X, ξ)}{N} + \frac{1}{2 (N - k - 1)} {\hat{Δ}}_{0}^{2}$	13

Note. Derivations are given in section 1, Online Supplement A. Subscripts p and q refer to the pth and qth outcome variable, respectively. Hence, $ρ_{p q}$ is the correlation between two residual outcomes of an individual. Further, $ξ$ is the design of the normative study, that is, the joint distribution of the predictors in the normative sample given N, and $d (X, ξ) = N x_{0}' {(X^{'} X)}^{- 1} x_{0}$ , which is the univariate standardized prediction variance.
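As an illustration of how the variance estimators in Table 1 might be used, the sketch below evaluates the common form of Equations 9 and 13, $d (X, ξ) / N + \text{value}^{2} / [2 (N - k - 1)]$, and builds a normal-theory interval from it, relying on the approximate normality stated above. The function names and the input values ($d = 6$, $N = 977$, $k = 5$, ${\hat{Δ}}_{0} = 2.45$) are assumptions chosen for illustration only.

```python
import math
from statistics import NormalDist


def var_norm_statistic(value: float, d: float, n: int, k: int) -> float:
    """Approximate sampling variance of a Z-score or Mahalanobis distance.

    Implements d / N + value**2 / (2 * (N - k - 1)), the shared form of
    Equations 7, 9, 12, and 13 in Table 1.
    """
    if n <= k + 1:
        raise ValueError("n must exceed k + 1")
    return d / n + value ** 2 / (2.0 * (n - k - 1))


def normal_interval(estimate: float, d: float, n: int, k: int, level: float = 0.95):
    """Interval estimate +/- z * sqrt(V-hat), using the estimate itself in V-hat."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(var_norm_statistic(estimate, d, n, k))
    return estimate - half_width, estimate + half_width


# Illustrative inputs (assumed, not results from the article).
lower, upper = normal_interval(estimate=2.45, d=6.0, n=977, k=5)
print(f"approximate 95% interval for Delta_0: ({lower:.3f}, {upper:.3f})")
```

The same two functions apply to a single Z-score by passing ${\hat{Z}}_{0 p}$ as `estimate`.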

Optimal and Robust Design for Optimizing Precision of Norms Estimation

In this section, the optimal design that maximizes the precision of Z-score and $Δ_{0}$ estimation is derived, but first some important definitions are introduced. A support point of a design $ξ$ is any possible combination of the levels of the predictors in the model for norming (e.g., a 20-year-old woman), and the proportion of the total sample size N allocated to a support point is called design weight (w). A design $ξ$ is then defined as a combination of support points and associated design weights (e.g., if age and sex are the only predictors in model (2), a possible design can sample the same proportion of subjects at each age-sex combination). The optimal design $ξ^{*}$ is then defined as that joint distribution of the predictors in the normative sample that maximizes the precision of norms estimation, given N. For each norming approach presented in the previous section, there is an optimality criterion that can be used to optimize the precision of estimation of the corresponding norm statistic (Atkinson et al., 2007; Berger & Wong, 2009; Goos & Jones, 2011; Schwabe, 1996). Since Van der Elst et al.’s (2017) approach derives separate norms for the P test scores, the precision of ${\hat{z}}_{0}$ can be optimized by finding that joint distribution of the predictors that minimizes the sum of the variances of the P estimated Z-scores (i.e., the trace of the variance–covariance matrix of ${\hat{z}}_{0}$ ):

$$\operatorname{tr} [V ({\hat{z}}_{0})] = \sum_{p = 1}^{P} V ({\hat{Z}}_{0 p}) = P \frac{d (X, ξ)}{N} + \frac{\sum_{p = 1}^{P} Z_{0 p}^{2}}{2 (N - k - 1)}, \tag{14}$$

which follows from Equation 7. Instead, the precision of ${\hat{Δ}}_{0}$ can be optimized by minimizing $V ({\hat{Δ}}_{0})$ (i.e., Equation 12) over the joint distribution of all predictors. Note that $\operatorname{tr} [V ({\hat{z}}_{0})]$ and $V ({\hat{Δ}}_{0})$ depend on the design $ξ$ only through $d (X, ξ) = N x_{0}' {(X^{'} X)}^{- 1} x_{0}$, which depends on the individual’s predictor values $x_{0}$. A safe approach, which is adopted here, is to minimize $d (X, ξ)$, and thus $\operatorname{tr} [V ({\hat{z}}_{0})]$ and $V ({\hat{Δ}}_{0})$, for those values of $x_{0}$ for which $d (X, ξ)$, $\operatorname{tr} [V ({\hat{z}}_{0})]$, and $V ({\hat{Δ}}_{0})$ are maximum, given $N$, $z_{0}$, and $Δ_{0}$ (for more details, see section 2.1, Online Supplement A). Under model (2) and assumption (iii), this is equivalent to minimizing the determinant of ${(X^{'} X)}^{- 1}$ (Atkinson et al., 2007; Chang, 1994; Fedorov, 1972), so the resulting optimal design $ξ^{*}$ does not depend on $x_{0}$, although the resulting sampling variances $V ({\hat{Z}}_{0 p})$ and $V ({\hat{Δ}}_{0})$ of course still do. Specifically, $d (X, ξ)$ (and thereby $V ({\hat{Z}}_{0 p})$, $\operatorname{tr} [V ({\hat{z}}_{0})]$, and $V ({\hat{Δ}}_{0})$) is maximum at the support points of the optimal design (for more details, see section 2.1, Online Supplement A). This will have consequences for the required sample size for the normative study, as will be seen in the next section.

We restricted ourselves to polynomial regression models allowing at most for quadratic effects, because such models are commonly found in normative studies (for univariate normative studies, see Innocenti et al., 2023, Online Supplement A, and for multivariate normative studies, see Espenes et al., 2023, and Van der Elst et al., 2017). Hence, Table 2 shows the optimal designs for five multivariate polynomial regression models, which differ in whether they allow for interactions and/or quadratic effects but, being special cases of model (2), all make the same assumptions (i) through (iv). Denote by $X_{1}$ the quantitative predictor and by $X_{2}$ the qualitative predictor with $Q_{2}$ levels. Two literature reviews indicated that age and sex (and education) are the most frequent predictors in normative studies (Innocenti et al., 2023; Oosterhuis et al., 2016), so from now on $X_{1} =$ age, $X_{2} =$ sex, and $Q_{2} = 2$, but the optimal designs given in Table 2 are valid for any quantitative predictor and any qualitative predictor with $Q_{2} \geq 2$ levels (see Innocenti et al., 2023). Furthermore, age is rescaled to the interval $[- 1, 1]$ and sex is coded 0/1. The designs in Table 2 are known to be optimal for estimating $β_{p}$ (Schwabe, 1996), and $Z_{0 p}$ when $P = 1$ test score is normed (Innocenti et al., 2023). However, these designs are also optimal for estimating $z_{0}$ and $Δ_{0}$ under the multivariate regression models in Table 2 with $P > 1$ test scores (for proof, see section 2.1, Online Supplement A).

As can be seen in Table 2, the number of age levels in the optimal design depends on the degree of the polynomial effect of age: Two levels are required for estimating a linear effect and its interaction (i.e., first and third row), and three levels are needed for estimating a quadratic age effect (i.e., second, fourth, and fifth rows). Furthermore, the optimal design is balanced, that is, each support point (i.e., age–sex combination) has the same sample size (or design weight), the only exception being the optimal design in the fourth row. The optimal design in the fourth row gives equal weight to age levels −1 and 1 (thus allowing the estimation of the linear age effect and its interaction with sex) and a smaller weight to age level 0 (which is needed for estimating the quadratic effect). This can be understood by noting that the model in the fourth row is a combination of the models in the second and third rows, which leads to an optimal design which is a compromise between the optimal designs in the second and third rows.

A limitation of optimal designs, such as those in Table 2, is that they depend on the assumed regression model, but the “true” model (i.e., the best fitting polynomial) is often unknown at the design stage. To deal with model uncertainty at the design stage, there are (at least) two strategies, differing in the criterion by which a design is identified as the most robust against misspecification of the model. The first strategy is based on the maximin efficiency criterion, that is, the most robust design is the design that maximizes the minimum efficiency across all plausible “true” models (e.g., the five models in Table 2). Efficiency is defined as ${(\operatorname{tr} [V ({\hat{z}}_{0})])}^{- 1}$ for Van der Elst et al.’s (2017) approach and ${(V ({\hat{Δ}}_{0}))}^{- 1}$ for the Mahalanobis distance-based approach. The second strategy is instead based on the maximin relative efficiency (RE) criterion, that is, the most robust design is the design that yields, across all plausible “true” models, the highest worst-case RE, where RE is the efficiency of a design relative to the efficiency of the optimal design for the “true” model. The RE of a design $ξ$ versus the optimal design $ξ^{*}$ is defined as

$$RE_{z_{0}} (ξ \text{ vs. } ξ^{*}) = \frac{\operatorname{tr} {[V ({\hat{z}}_{0})]}^{*}}{\operatorname{tr} [V ({\hat{z}}_{0})]} \approx \frac{P d (X, ξ^{*}) + \frac{\sum_{p = 1}^{P} Z_{0 p}^{2}}{2}}{P d (X, ξ) + \frac{\sum_{p = 1}^{P} Z_{0 p}^{2}}{2}}, \tag{15}$$

under Van der Elst et al.’s (2017) approach, and as

$$RE_{Δ_{0}} (ξ \text{ vs. } ξ^{*}) = \frac{V {({\hat{Δ}}_{0})}^{*}}{V ({\hat{Δ}}_{0})} \approx \frac{d (X, ξ^{*}) + \frac{Δ_{0}^{2}}{2}}{d (X, ξ) + \frac{Δ_{0}^{2}}{2}}, \tag{16}$$

under the Mahalanobis distance-based approach. The interpretation of RE is the same for both definitions (Equations 15 and 16): Relative to a sample of $N^{*}$ individuals under the optimal design $ξ^{*}$, the sample size $N$ of the nonoptimal design $ξ$ must be increased by $(RE^{- 1} - 1) \times 100\%$ in order for $ξ$ to be as efficient as $ξ^{*}$. It turned out that the balanced design with three age levels (Table 2, second row) is the most robust design for both criteria, maximin efficiency and maximin RE (for proof, see section 2.2, Online Supplement A), under both norming approaches.
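To make the interpretation of RE concrete, the following sketch evaluates the approximations in Equations 15 and 16 and converts an RE value into the percentage increase in sample size. The function names and the numerical inputs (a standardized prediction variance of 3 under the optimal design and 4 under a competing design, with $Δ_{0} = 2.45$) are hypothetical and serve only as an example.

```python
def relative_efficiency_delta(d_design: float, d_optimal: float, delta0: float) -> float:
    """Approximate RE of a design versus the optimal design for Delta_0 (Equation 16)."""
    half_square = delta0 ** 2 / 2.0
    return (d_optimal + half_square) / (d_design + half_square)


def relative_efficiency_z(d_design: float, d_optimal: float, z_scores) -> float:
    """Approximate RE for the P separate Z-scores (Equation 15)."""
    p = len(z_scores)
    half_sum_squares = sum(z * z for z in z_scores) / 2.0
    return (p * d_optimal + half_sum_squares) / (p * d_design + half_sum_squares)


def extra_sample_percent(re: float) -> float:
    """Percentage by which N must grow for the design to match the optimal one."""
    return (1.0 / re - 1.0) * 100.0


# Hypothetical values of d(X, xi) for the two designs.
re = relative_efficiency_delta(d_design=4.0, d_optimal=3.0, delta0=2.45)
print(f"RE = {re:.3f}, required increase in N = {extra_sample_percent(re):.1f}%")
```

A design with RE equal to 1 needs no additional subjects, and the required increase grows as RE falls below 1.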

Table 2.

Models and Corresponding Optimal Designs That Maximize the Precision of $z_{0}$ and $Δ_{0}$ Estimation

Model for Test Score p of Individual i	Optimal Design $ξ^{*}$
$Y_{p i} = β_{p 0} + β_{p 1} X_{1 i} + β_{p 2} X_{2 i} + ε_{p i}$	Equal weight $w^{*} = \frac{1}{2} \frac{1}{Q_{2}} = \frac{1}{4}$ to $- 1$ and 1 of $X_{1}$ for each level of $X_{2}$
$Y_{p i} = β_{p 0} + β_{p 1} X_{1 i} + β_{p 2} X_{2 i} + β_{p 3} X_{1 i}^{2} + ε_{p i}$	Equal weight $w^{*} = \frac{1}{3} \frac{1}{Q_{2}} = \frac{1}{6}$ to $- 1$, 0, and 1 of $X_{1}$ for each level of $X_{2}$
$Y_{p i} = β_{p 0} + β_{p 1} X_{1 i} + β_{p 2} X_{2 i} + β_{p 4} X_{1 i} X_{2 i} + ε_{p i}$	Equal weight $w^{*} = \frac{1}{2} \frac{1}{Q_{2}} = \frac{1}{4}$ to $- 1$ and 1 of $X_{1}$ for each level of $X_{2}$
$Y_{p i} = β_{p 0} + β_{p 1} X_{1 i} + β_{p 2} X_{2 i} + β_{p 3} X_{1 i}^{2} + β_{p 4} X_{1 i} X_{2 i} + ε_{p i}$	For each level of $X_{2}$, equal weight $\frac{Q_{2} + 1}{2 (Q_{2} + 2)} \frac{1}{Q_{2}} = \frac{3}{16}$ to $- 1$ and 1 of $X_{1}$, and weight $\frac{1}{Q_{2} + 2} \frac{1}{Q_{2}} = \frac{1}{8}$ to 0 of $X_{1}$
$Y_{p i} = β_{p 0} + β_{p 1} X_{1 i} + β_{p 2} X_{2 i} + β_{p 3} X_{1 i}^{2} + β_{p 4} X_{1 i} X_{2 i} + β_{p 5} X_{1 i}^{2} X_{2 i} + ε_{p i}$	Equal weight $w^{*} = \frac{1}{3} \frac{1}{Q_{2}} = \frac{1}{6}$ to $- 1$, 0, and 1 of $X_{1}$ for each level of $X_{2}$

Note. Derivations of the optimal designs are given in Online Supplement A, section 2.1. Since the design matrix $X$ is the same for all $P$ test scores, the multivariate models can be expressed in the form of the univariate model for the pth outcome variable, where any variable and parameter that refers to the pth test has the index p. The optimal design weight $w^{*}$ holds for any $Q_{2} \geq 2$, but here it is assumed that $Q_{2} = 2$, where $Q_{2}$ is the number of levels of the categorical predictor $X_{2}$. $X_{1}$ is a continuous predictor bounded between −1 and 1.
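The standardized prediction variance of a design can be evaluated directly from its support points and weights. Because $X^{'} X = N M$ with $M = \sum_{i} w_{i} f (x_{i}) f (x_{i})'$, it follows that $d (X, ξ) = f (x_{0})' M^{- 1} f (x_{0})$, where $f$ maps the predictors to the regressors of the model. The sketch below applies this identity, in exact rational arithmetic, to the designs in the first and fourth rows of Table 2 with $Q_{2} = 2$. The helper names (`solve`, `standardized_prediction_variance`, `main_effects`, `quadratic_with_interaction`) are illustrative and not part of the article.

```python
from fractions import Fraction
from itertools import product


def solve(matrix, rhs):
    """Solve matrix @ x = rhs exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Pick a row with a nonzero pivot; StopIteration here means M is singular.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def standardized_prediction_variance(design, regressors, x0):
    """d(X, xi) = f(x0)' M^{-1} f(x0), with M = sum_i w_i f(x_i) f(x_i)'."""
    rows = [(regressors(*point), weight) for point, weight in design]
    size = len(rows[0][0])
    info = [[sum(w * f[a] * f[b] for f, w in rows) for b in range(size)]
            for a in range(size)]
    f0 = regressors(*x0)
    return sum(a * b for a, b in zip(f0, solve(info, f0)))


def main_effects(age, sex):
    """Regressors of the first model in Table 2 (k = 2)."""
    return [Fraction(1), Fraction(age), Fraction(sex)]


def quadratic_with_interaction(age, sex):
    """Regressors of the fourth model in Table 2 (k = 4)."""
    a, s = Fraction(age), Fraction(sex)
    return [Fraction(1), a, s, a * a, a * s]


# Designs as (support point, weight) pairs; age is in [-1, 1], sex is coded 0/1.
design_row1 = [((age, sex), Fraction(1, 4)) for age, sex in product((-1, 1), (0, 1))]
design_row4 = [((age, sex), Fraction(1, 8) if age == 0 else Fraction(3, 16))
               for age, sex in product((-1, 0, 1), (0, 1))]

for point, _ in design_row4:
    d = standardized_prediction_variance(design_row4, quadratic_with_interaction, point)
    print(point, d)
```

For both designs, the value at every support point equals $k + 1$, the number of regression parameters, which is the quantity that appears in place of $d (X, ξ)$ in the sample size formulas of the next section.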

Sample Size Calculation for the Mahalanobis Distance-Based Approach

In practice, norms are used to classify individuals’ performance or symptoms to make important decisions (e.g., the assignment of a student to remedial teaching). This classification problem can be formalized in a hypothesis testing framework, and a sample size calculation procedure can be developed to ensure that the size of the normative sample is sufficiently large to have enough power for properly classifying individuals (Innocenti et al., 2023). Suppose that $Δ_{t}$ is the Mahalanobis distance of the individual to whom the norms are applied, $Δ_{c}$ is the cutoff point chosen by the psychologist/educator to distinguish between “abnormal” and “normal” performance or symptoms (e.g., $Δ_{c} = 2.45$, which for $P = 2$ is the square root of the 95th percentile of the $χ_{2}^{2}$ distribution), and $δ_{Δ}$ is the smallest clinically relevant difference between $Δ_{t}$ and $Δ_{c}$. The null hypothesis $H_{0}$ and the alternative hypothesis $H_{1}$ are then $Δ_{t} = Δ_{c}$ and $Δ_{t} > Δ_{c}$, respectively. Thus, the required sample size for the normative sample is defined as the size $N^{*}$ that allows $δ_{Δ}$ to be detected, given the prespecified type I error rate $α$ and power $1 - γ$. The following procedure yields $N^{*}$ for an individual with predictor value $x_{0}$ at a support point of the optimal design (e.g., a combination of any sex level with the age levels given in the right column of Table 2), which is a safe approach because for any other $x_{0}$ values $V ({\hat{Δ}}_{0})$ is smaller (for details, see Online Supplement A, section 3.1):

(1) Choose the norming model, the cutoff point $Δ_{c}$ , the type I error rate $α$ , the desired power $1 - γ$ , and the smallest clinically relevant difference $δ_{Δ} > 0$ between $Δ_{c}$ and $Δ_{t}$ .

(2) Compute the required sample size with the following equation (for proof, see Online Supplement A, section 3.1):

$$N^{*} = {\left[\frac{z_{1 - α} {\left(k + 1 + \frac{Δ_{c}^{2}}{2}\right)}^{1 / 2} + z_{1 - γ} {\left(k + 1 + \frac{Δ_{t}^{2}}{2}\right)}^{1 / 2}}{δ_{Δ}}\right]}^{2}, \tag{17}$$

where $z_{1 - α}$ and $z_{1 - γ}$ are the $(1 - α)$ th and $(1 - γ)$ th percentiles of the standard normal distribution, respectively, $Δ_{t} = Δ_{c} + δ_{Δ}$ , and k = the number of predictors in the regression models (e.g., $k = 5$ in the model in the fifth row of Table 2).

As can be seen in Equation 17 and in Figure 2, $N^{*}$ is an increasing function of the number of predictors k in the model, the power $1 - γ$ (the larger $1 - γ$ , the larger $z_{1 - γ}$ ), and the cutoff point $Δ_{c}$ , and a decreasing function of the type I error rate $α$ (the larger $α$ , the smaller $z_{1 - α}$ ), and the smallest clinically relevant difference $δ_{Δ}$ between the individual’s Mahalanobis distance value $Δ_{t}$ and the cutoff point $Δ_{c}$ . Furthermore, note that Equation 17 is not restricted to a specific multivariate regression model (i.e., number of predictors or scale types) or a specific number of test scores P, because Equation 12 (used to derive Equation 17, see Online Supplement A, section 3.1) has neither of these restrictions (see Online Supplement A, section 1). However, Equation 17 is only applicable when the optimal design is employed, as explained in Online Supplement A (section 3.1). Furthermore, the cutoff points $Δ_{c}$ used in Figure 2 are based on $P = 2$ (see section Competing Approaches to Multivariate Regression-Based Norming, for details).
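Steps 1 and 2 can be sketched in a few lines. The function below is a direct transcription of Equation 17 using the standard normal quantile function from the Python standard library. The function name `required_n_power`, its default arguments, and the example inputs are illustrative assumptions, and the result is left unrounded so that the rounding convention stays with the user.

```python
from statistics import NormalDist


def required_n_power(k: int, delta_c: float, effect: float,
                     alpha: float = 0.05, power: float = 0.80) -> float:
    """Sample size N* from Equation 17 (valid under the optimal design).

    k       : number of predictors in the norming model
    delta_c : cutoff point on the Mahalanobis distance scale
    effect  : smallest clinically relevant difference delta_Delta (> 0)
    """
    if effect <= 0:
        raise ValueError("effect must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)   # z_{1 - alpha}
    z_power = std_normal.inv_cdf(power)         # z_{1 - gamma}
    delta_t = delta_c + effect
    numerator = (z_alpha * (k + 1 + delta_c ** 2 / 2.0) ** 0.5
                 + z_power * (k + 1 + delta_t ** 2 / 2.0) ** 0.5)
    return (numerator / effect) ** 2


# Illustrative inputs: two predictors, cutoff 2.45 (P = 2), effect size 0.5.
print(required_n_power(k=2, delta_c=2.45, effect=0.5))
```

Varying one argument at a time reproduces the monotone relations described above: $N^{*}$ grows with `k`, `power`, and `delta_c`, and shrinks as `alpha` or `effect` increases.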

Figure 2.

Required sample size $N^{*}$ for $Δ_{0}$ with $P = 2$ tests, as a function of the effect size $δ_{Δ}$ , for different cutoff points (curves), power levels (rows), and number of predictors k in the model (columns).

An alternative approach to sample size calculation is to focus on the precision of parameter estimation instead of power for hypothesis testing (see Maxwell et al., 2008 and references therein). This approach consists of first choosing the $Δ_{0}$ value of interest (e.g., $Δ_{0} = 2.45$ for $P = 2$) and then determining the size of the normative sample that yields the desired margin of estimation error for the $(1 - α) 100\%$ confidence interval for the chosen $Δ_{0}$ value. Specifically, the required sample size is the size of the normative sample such that half the confidence interval width (i.e., $z_{1 - α / 2} {(V ({\hat{Δ}}_{0}))}^{1 / 2}$) equals the desired margin of estimation error. For an individual whose predictor values lie at a support point of the optimal design (e.g., for any sex value, age is at the boundary or in the middle of its range, see Table 2), it can be computed with the following equation:

$$N^{*} = {\left[\frac{z_{1 - α / 2} {\left(k + 1 + \frac{Δ_{0}^{2}}{2}\right)}^{1 / 2}}{δ_{Δ}}\right]}^{2}, \tag{18}$$

which is obtained from Equation 17 by replacing $z_{1 - α}$ with $z_{1 - α / 2}$, setting $z_{1 - γ} = 0$ (i.e., 50% power), replacing $Δ_{c}$ with the $Δ_{0}$ value of interest, and redefining $δ_{Δ}$ as the desired margin of estimation error (instead of the smallest clinically relevant effect size). With these modifications, one can still follow Steps 1 and 2 to determine $N^{*}$, which is now the size of the normative sample that guarantees sufficient precision of $Δ_{0}$ estimation (instead of the size that allows the desired effect size to be detected) for the optimal design. For example, under the model with five predictors in Table 2 (i.e., $k = 5$), the required sample size $N^{*}$, obtained with Equation 18, for the 95% confidence interval for $Δ_{0} = 2.45$ (i.e., the square root of the 95th percentile of the $χ_{2}^{2}$ distribution) is 576, if the desired margin of estimation error is 10% of $Δ_{0}$ (i.e., $δ_{Δ} = 0.245$).
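The worked example can be reproduced with a short transcription of Equation 18. The function name `required_n_precision` and its `level` argument are illustrative; the inputs ($k = 5$, $Δ_{0} = 2.45$, $δ_{Δ} = 0.245$, 95% confidence) are those of the example above.

```python
from statistics import NormalDist


def required_n_precision(k: int, delta0: float, margin: float,
                         level: float = 0.95) -> float:
    """Sample size N* from Equation 18 (valid under the optimal design).

    margin is the desired half-width of the confidence interval for Delta_0.
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)  # z_{1 - alpha/2}
    return (z * (k + 1 + delta0 ** 2 / 2.0) ** 0.5 / margin) ** 2


n_star = required_n_precision(k=5, delta0=2.45, margin=0.245)
print(round(n_star))  # 576, as reported in the text
```

The function returns the unrounded value, so a more conservative user could round up instead of rounding to the nearest integer.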

Application

In this section, the results of this article are illustrated using Van der Elst et al.’s (2006) normative study of the Stroop Color-Word Test. This test is composed of three subtasks: A subject is asked to read 100 color words (i.e., red, blue, yellow, green) printed in black ink (Subtask 1), then to name the colors shown in 100 solid patches (Subtask 2), and finally to name the ink color of 100 color words printed in an incongruent ink color (e.g., the word “blue” printed in green; Subtask 3), as fast and accurately as possible. The outcome variable is the time (seconds) to complete each subtask. Additional outcome variables (not considered here) could be the number of errors that were not self-corrected and/or interference scores (for details, see Stroop-ANDI-Norms, 2020; Van der Elst et al., 2006). The rationale of this test is that the time to complete a subtask increases with the complexity of the task, that is, when moving from Subtask 1 toward Subtask 3 (Dyer, 1973). Poor performance on the Stroop test has been shown to be associated with several brain pathologies and disorders, such as discrete frontal lobe lesions and schizophrenia (Mitrushina et al., 2005). Furthermore, there is consistent evidence that performance on the Stroop test declines with age, but inconclusive evidence about the effect of sex and education (Mitrushina et al., 2005). Both Van der Elst et al. (2006) and Stroop-ANDI-Norms (2020) derived norms for the Dutch version of the Stroop test adjusting for age, sex, and education. Since the selection of the candidate predictors for a norming model should be dictated by the test’s purpose, we assume here that the Stroop test will be used to uncover brain pathologies. Thus, we will control for age, sex, and education, and their possible interactions and nonlinear effects.

The results of this article are illustrated with a subsample of Van der Elst et al.’s (2006) data (i.e., 1,000 subjects out of 1,856), which were derived from the Maastricht Aging Study (MAAS), a prospective study into the determinants of cognitive aging (Jolles et al., 1995):

Data cleaning: Of the 1,000 subjects randomly sampled from the original data set, 23 subjects were excluded because (1) the test was incompletely administered due to respondent’s physical or cognitive limitations or respondent’s nonadherence to the instructions ( $n = 18$ ) or (2) the subject scored extremely low or extremely high relative to the range of plausible scores for a cognitively intact individual (see Figure S.A.2, Online Supplement A; $n = 5$ ). The sample distributions, before and after data cleaning, of the test scores, age, sex, and educational level (low = at most primary education, average = junior vocational training, and high = senior vocational training or academic training, as in Van der Elst et al., 2006) are shown in Figure S.A.3 in Online Supplement A.

Model selection: The starting model was a multivariate regression of the three Stroop scores on sex (male = 1 and female = 0), age centered, (age centered)², educational level (dummy coded with two dummies and using low education as reference category), and all two-way interactions (except that between age centered and [age centered]²). A global test of all interactions (Harrell, 2015) with the Pillai–Bartlett (P–B) trace multivariate test was not statistically significant (P–B trace $= 0.033$, $F_{(24, 2889)} = 1.353$, $p = .117$), and thus, all interactions were dropped. No other predictors could be removed based on the P–B trace test. However, the residuals of the final model were nonnormally distributed and heteroscedastic (see the next paragraph on how assumptions were checked, and Tables S.A.4-5 in Online Supplement A); hence, the Stroop scores were transformed with the Box–Cox transformation (Velilla, 1993) and then standardized to improve the interpretation (see equation (S.M.5) and Figures S.A.3-4 in Online Supplement A). Table 3 shows the results for the final model with the transformed scores. As shown in Table 3, the performance on the three subtasks decreases with age in a nonlinear way, improves with level of education, and women outperform men, as in Van der Elst et al. (2006).

Assumption checks: The final model in Table 3 met the following assumptions:

Multivariate normality (see Table S.A.6, Online Supplement A): Following Mecklin and Mundfrom (2005), this assumption was checked by inspecting the univariate distributions of the residuals (with normal QQ-plots, univariate skewness and kurtosis, and Shapiro–Wilk tests), and the bivariate and trivariate distributions of the residuals (with χ² QQ-plots, Mardia’s [1970] measure of multivariate skewness and kurtosis, and Royston and Henze–Zirkler tests).

Multivariate homoscedasticity: Homogeneity of the variance–covariance matrix of the residuals, $Σ$ , was assessed by plotting the standardized residuals against the standardized predicted values and against the standardized residuals of each other Stroop subtask conditional on different levels of the predictors (see Table S.A.7, Online Supplement A). The assumption was also tested with the Box-M test (Johnson & Wichern, 2007), after dividing the sample into groups based on the quartiles of the Mahalanobis distance of the predicted scores relative to their averages.

Linearity: For sex and educational level, the assumption was satisfied since dummy variables were used, while it was relaxed for age by adding a quadratic term.

No influential outliers: Univariate outliers were identified by inspecting the studentized deleted residuals on each Stroop subtask, and multivariate outliers were identified by checking the Mahalanobis distance of the residuals (Johnson & Wichern, 2007). The influence of the outliers was assessed by checking Cook’s distances, but all observations had a Cook’s distance below 1.

No multicollinearity: Multicollinearity was assessed by computing variance inflation factors (Harrell, 2015), which were below 10 for all the predictors.

Planning of a new study: Suppose that a new normative study for the Stroop test is planned based on the final model in Table 3. The optimal design for this model is shown in Figure 3 (for proof, see section 4, Online Supplement A), where each dot represents a combination of age, sex, and educational level that should be included into the normative sample. As shown in Figure 3, the optimal design is balanced, so the same number of subjects should be sampled at each combination of the predictors. The required total sample size for the optimal design can be determined with Equation 18, targeting the 90th and 95th percentile of the $χ_{P = 3}^{2}$ distribution (of which the square root is $Δ_{0} =$ 2.5 and 2.8, respectively), in view of their importance for classifying “abnormal” multivariate performances. Since $V ({\hat{Δ}}_{0})$ increases with $Δ_{0}$ (see Equation 12), using $Δ_{0} = 2.8$ in the sample size calculation ensures at least the same precision level for $Δ_{0} = 2.5$ . Let us define the desired precision level as a margin of error $δ_{Δ} = 0.15$ , that is, half the distance between the (square root of) 90th and 95th percentile. Under the final model in Table 3 and the optimal design in Figure 3, the required total sample size is then

$$N^{*} = {\left[\frac{z_{1 - α / 2} {\left(k + 1 + \frac{Δ_{0}^{2}}{2}\right)}^{1 / 2}}{δ_{Δ}}\right]}^{2} = {\left[\frac{1.96 {\left(5 + 1 + \frac{{2.8}^{2}}{2}\right)}^{1 / 2}}{0.15}\right]}^{2} \approx 1694,$$

which means that $\lceil 1694 / 18 \rceil = 95$ subjects should be sampled at each of the 18 support points (i.e., at each age–sex–educational level combination) of the optimal design.
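The planning calculation can be sketched as follows: compute $N^{*}$ with Equation 18 and then allocate it to the support points according to the design weights, rounding each cell up. The helper `allocate` is a hypothetical name, and the reading of the 18 support points as equally weighted cells follows from the balanced design described above.

```python
import math
from statistics import NormalDist


def required_n_precision(k, delta0, margin, level=0.95):
    """Equation 18: total sample size for a given margin of error on Delta_0."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    return (z * (k + 1 + delta0 ** 2 / 2.0) ** 0.5 / margin) ** 2


def allocate(n_total, weights):
    """Subjects per support point: each share of n_total is rounded up."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("design weights must sum to 1")
    return [math.ceil(n_total * w) for w in weights]


n_total = required_n_precision(k=5, delta0=2.8, margin=0.15)
support_points = 18                        # balanced design: equal weights
cells = allocate(n_total, [1.0 / support_points] * support_points)
print(round(n_total), cells[0], sum(cells))
```

Rounding up within each cell makes the realized total slightly larger than $N^{*}$, which errs on the side of precision.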

Classification of multivariate performance: In Figure 4, the multivariate performance of 11 subjects on the Stroop test is classified based on the Mahalanobis distance (top-left panel, where the square root of the 90th and 95th percentile of the $χ_{P = 3}^{2}$ distribution are represented by a dashed and a solid vertical line, respectively), Hotelling’s $T^{2}$ statistic proposed by Huizenga et al. (2007; top-right panel, where the 90th and 95th percentile of the $F_{(P = 3, N - k - P = 969)}$ distribution are represented by a dashed and a solid vertical line, respectively), and the disjunctive and conjunctive rule (bottom panels, where the 2.5th and 97.5th percentile of the standard normal distribution are shown by solid horizontal lines). The 11 subjects correspond to the percentiles from 80th to 100th, with steps of 2, of the sample distribution of the Mahalanobis distance of the residuals. The multivariate performance is classified as “abnormal” for profiles 55, 23, and 226 based on the Mahalanobis distance and Hotelling’s $T^{2}$ (taking the 95th percentile as cutoff), for profile 226 based on the conjunctive rule (bottom-right panel), and for all profiles with the exception of 23, 118, and 825 based on the disjunctive rule. Interestingly, the Mahalanobis distance and Hotelling’s $T^{2}$ not only detect extreme profiles in terms of Z-scores (i.e., 226, and if the cutoff is lowered also 543 and 147), but also unlikely profiles given the pairwise correlations between tests (i.e., 55, 23, and 912, given the moderate-to-high positive correlations shown in Figure S.A.4, Online Supplement A).
Instead, the disjunctive and conjunctive rule, neglecting the information provided by the correlation, can misclassify atypical profiles such as 23, who had a performance close to the average on subtask 1 ( $Z_{1} = - 0.208$ ), below the average on subtask 2 ( $Z_{2} = - 1.231$ ), and above the average on the third (and most difficult) subtask ( $Z_{3} = 1.329$ , see Figure 4, bottom-right panel). This example also shows that the unweighted average of the three Z-scores of subject 23 misclassifies the performance of this profile as average ( $Z_{m e a n} = - 0.036$ ). On the other hand, the Mahalanobis distance and Hotelling’s T ² are two-tailed criteria that cannot distinguish between profile 543 (above average performance on all subtasks) and profile 147 (below average performance on all subtasks), unless the cutoff for decision making is lowered to the 90th percentile and the sign of each single Z-score is inspected.
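The contrast between the rules can be illustrated with the reported Z-scores of profile 23. The sketch below implements the disjunctive and conjunctive rules with the two-sided cutoff 1.96 and, for the Mahalanobis distance, assumes a common correlation $ρ$ between all pairs of tests, for which the inverse correlation matrix has a closed form. The value $ρ = 0.6$ is purely hypothetical (the actual correlations are in Figure S.A.4, Online Supplement A, and are not reproduced here), so the resulting distance is not the value plotted in Figure 4.

```python
import math


def disjunctive(z_scores, cutoff=1.96):
    """Abnormal if at least one Z-score is extreme."""
    return any(abs(z) > cutoff for z in z_scores)


def conjunctive(z_scores, cutoff=1.96):
    """Abnormal only if all Z-scores are extreme."""
    return all(abs(z) > cutoff for z in z_scores)


def mahalanobis_equicorrelated(z_scores, rho):
    """Mahalanobis distance when every pair of tests has correlation rho.

    Uses the closed-form inverse of R = (1 - rho) I + rho J.
    """
    p = len(z_scores)
    if not -1.0 / (p - 1) < rho < 1.0:
        raise ValueError("rho outside the range of a valid correlation matrix")
    sum_of_squares = sum(z * z for z in z_scores)
    square_of_sum = sum(z_scores) ** 2
    squared = (sum_of_squares - rho / (1.0 + (p - 1) * rho) * square_of_sum) / (1.0 - rho)
    return math.sqrt(squared)


profile_23 = [-0.208, -1.231, 1.329]       # Z-scores reported in the text
print("disjunctive:", disjunctive(profile_23))
print("conjunctive:", conjunctive(profile_23))
print("mean Z:", sum(profile_23) / len(profile_23))
print("Mahalanobis (assumed rho = 0.6):", mahalanobis_equicorrelated(profile_23, 0.6))
```

Neither Z-score rule flags this profile and the mean Z-score is close to zero, whereas under the assumed positive correlation the distance is driven up by the opposite signs on subtasks 2 and 3. Because the Z-scores above are rounded, the computed mean may differ from the reported one in the third decimal.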

Table 3.

Final Model After the Box–Cox Transformation

	Stroop 1 B (SE)	Stroop 2 B (SE)	Stroop 3 B (SE)	Pillai–Bartlett (P–B) Trace Test
Intercept	.2303 (.0640)	.0489 (.0624)	.1407 (.0547)	P–B trace $= 0.026$ , $F (3, 969) = 8.622$ , $p < .001$
Sex (male versus female)	.0924 (.0573)	.2433 (.0559)	.1922 (.0490)	P–B trace $= 0.026$ , $F (3, 969) =$ 8.562, $p < .001$
Age	.0176 (.0018)	.0207 (.0018)	.0290 (.0015)	P–B trace $= 0.268$ , $F (3, 969) = 118.530$ , $p < .001$
Age²	.0003 (.0001)	.0005 (.0001)	.0005 (.0001)	P–B trace $= 0.030$ , $F (3, 969) = 9.940$ , $p < .001$
Level of education (LE)				P–B trace $= 0.135$ , $F (6, 1940) = 23.406$ , $p < .001$
LE: Average versus low	−.4868 (.0676)	−.3799 (.0659)	−.5218 (.0578)
LE: High versus low	−.6832 (.0796)	−.6335 (.0776)	−.7050 (.0680)

Figure 3.

Optimal design for the norming model in Table 3.

Figure 4.

Classification of the multivariate performance of 11 profiles based on the Mahalanobis distance (top-left panel), Hotelling’s T² statistic (top-right panel), and the disjunctive and conjunctive rule (bottom panels). In the top panels, the 90th and 95th percentile of the $χ_{P = 3}^{2}$ distribution and the $F_{(P = 3, N - k - P = 969)}$ distribution, respectively, are represented by a dashed and a solid vertical line, respectively. In the bottom panels, the 2.5th and 97.5th percentile of the standard normal distribution are represented by solid horizontal lines.

Discussion

For multivariate normative studies, Van der Elst et al. (2017) proposed to use multivariate regression to take into account the correlation between the test scores of the same individual when testing hypotheses about the regression coefficients in the norming model, thus reducing the number of such tests and preventing the inflation of the type I error rate. However, this approach was multivariate only in the norming model because norms were derived separately for each test, as in the univariate regression-based approach, which makes it difficult to evaluate the overall performance of a subject across all administered tests. A new multivariate approach that targets the subject’s overall performance was therefore proposed in this article. This new approach relies on multivariate regression to identify which predictors (among those defining the reference population given the test’s purpose and their higher-order terms) are related to the outcomes, as in Van der Elst et al. (2017), but then summarizes all test scores of an individual with the Mahalanobis distance, which is a multivariate Z-score. Agelink van Rentergem et al. (2017) developed a third multivariate approach based on multilevel multivariate regression and Hotelling’s $T^{2}$, which is a multivariate t-statistic. It turned out that the Mahalanobis distance and Hotelling’s $T^{2}$ yield similar classifications for a sufficiently large normative sample (Figure 4). Thus, the main difference between the approach developed here and that of Agelink van Rentergem et al. (2017) is that the former guides researchers from the design of the normative study to the derivation of the norms, while the latter can be applied when data from different studies are already available and can be combined to establish the norms.

To compare these approaches, two classification rules combining the P Z-scores obtained with Van der Elst et al.’s (2017) approach were proposed, namely, the disjunctive rule, which classified as “abnormal” a profile with at least one extreme Z-score relative to the chosen cutoff for decision-making, and the conjunctive rule, for which “abnormality” was defined as having extreme Z-scores on all tests. It turned out that the conjunctive rule was the most conservative definition of “abnormality,” the disjunctive rule was the most liberal, and the Mahalanobis distance and Hotelling’s T² were in between these two extremes (Figure 4). Unlike the conjunctive and disjunctive rules, the Mahalanobis distance and Hotelling’s T² take the correlation between Z-scores of the same subject into account, thus identifying “abnormality” as an unlikely profile (given the correlation sign) and not merely as a profile with extreme Z-scores. However, the Mahalanobis distance and Hotelling’s T² are two-tailed criteria (i.e., they do not distinguish between scoring high and scoring low on all tests), while in practice, one is often interested in one-tailed comparisons (e.g., comparing “normal” with “too low”). A solution can be to take the 90th percentile instead of the 95th percentile as the cutoff for decision-making, and if a profile is classified as “abnormal,” to examine the signs of the Z-scores to decide if a person is “abnormally” low or “abnormally” high in performance.
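The contrast between the rules can be illustrated with a small sketch for P = 2 standardized scores. The Z-scores and the correlation of 0.8 are hypothetical, the function names are illustrative, and the chi-square percentile uses the closed form that holds only for 2 degrees of freedom. The first profile has no extreme Z-score, so neither rule flags it, yet it is an unlikely profile under a positive correlation and the Mahalanobis distance flags it.

```python
import math

Z_CUTOFF = 1.96  # 97.5th percentile of the standard normal distribution


def chi2_2df_percentile(p):
    """Percentile of the chi-square distribution with 2 df (closed form: -2 ln(1 - p))."""
    return -2.0 * math.log(1.0 - p)


def disjunctive(z, cutoff=Z_CUTOFF):
    """'Abnormal' if at least one Z-score is extreme."""
    return any(abs(v) > cutoff for v in z)


def conjunctive(z, cutoff=Z_CUTOFF):
    """'Abnormal' only if every Z-score is extreme."""
    return all(abs(v) > cutoff for v in z)


def mahalanobis_sq_2(z, rho):
    """Squared Mahalanobis distance of two Z-scores with correlation rho."""
    z1, z2 = z
    return (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)


def direction(z):
    """Sign check used to read a two-tailed criterion in a one-tailed way."""
    if all(v < 0 for v in z):
        return "low"
    if all(v > 0 for v in z):
        return "high"
    return "mixed"


rho = 0.8                      # assumed correlation between the two Z-scores
discordant = (1.8, -1.8)       # hypothetical profile: no single Z-score is extreme
concordant = (-2.5, -2.5)      # hypothetical profile: both Z-scores are extreme and low

for z in (discordant, concordant):
    d2 = mahalanobis_sq_2(z, rho)
    print(z,
          "disjunctive:", disjunctive(z),
          "conjunctive:", conjunctive(z),
          "Mahalanobis (95th pct):", d2 > chi2_2df_percentile(0.95),
          "Mahalanobis (90th pct):", d2 > chi2_2df_percentile(0.90),
          "direction:", direction(z))
```

Only the second profile would be read as “abnormally” low: it exceeds the 90th percentile cutoff and all of its Z-scores are negative.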

Sampling variance formulas were derived for the Z-score and the Mahalanobis distance using the delta method; hence, these formulas are approximations. A paper detailing a simulation study that examines the accuracy of these formulas, both for $P = 2$ and for $P > 2$, is in progress. For five multivariate polynomial regression models with a qualitative and a quantitative predictor, differing in whether interactions and nonlinearity were allowed, the optimal design was derived for both Van der Elst et al.’s (2017) and the Mahalanobis distance-based approach. It can also be shown that the sampling variance of Hotelling’s T², obtained with the delta method and approximated for a sufficiently large sample size, is proportional to the sampling variance of the Mahalanobis distance (see section 1.2, Online Supplement A). Thus, the optimal designs in Table 2 are also valid for maximizing the precision of estimation of Hotelling’s T² when subjects come from the same normative sample and the sample size is large. To deal with the uncertainty about the “true” model at the design stage, the most robust design for the five models in Table 2 was derived, using the efficiency and the RE criteria. It turned out that the balanced three age levels design in Table 2 was the most robust design for both criteria and for both Van der Elst et al.’s (2017) and the Mahalanobis distance-based approach. Thus, this design is recommended when all models in Table 2 are equally plausible at the design stage.

For the Mahalanobis distance-based approach, a sample size calculation procedure was proposed, such that individuals’ positions relative to the derived norms could be assessed with prespecified power, for hypothesis testing, or prespecified margin of estimation error, for interval estimation. This procedure allows one to determine the required sample size for the optimal design of the normative sample, assuming that the predictor values of the individual to whom the norms will be applied coincide with a support point of the optimal design. This is a safe approach because for other predictor values, the sampling variance of the Mahalanobis distance is smaller and hence the power is larger. When several cutoff values are of interest, for instance, because the psychologist/educator wants to distinguish between “normal,” “borderline-normal,” and “abnormal” performance, one should use the most extreme cutoff value in the sample size calculation, as illustrated in the application section, because the sampling variance of the Mahalanobis distance is an increasing function of the value of the norm statistic. A sample size calculation procedure for Van der Elst et al.’s (2017) approach is a topic for future research, because controlling the familywise type I and II error rates is complicated, except for the rather unrealistic case of independence between the P Z-scores for the tested person, which is covered in section 3.2 of Online Supplement A. Future research can also investigate how to determine the sample size for Agelink van Rentergem et al.’s (2017) approach regarding data from multiple studies, for which a good starting point could be methods to compute the sample size for meta-analyses (see Valentine et al., 2010).

The results of this article are restricted by the assumptions underlying model (2), namely, multivariate normality and homoscedasticity of the residuals, and linearity of the predictors’ effects. These assumptions also underlie the approaches of Van der Elst et al. (2017) and Agelink van Rentergem et al. (2017). Since the validity of the norms derived with these approaches depends on these assumptions, it is crucial that researchers check the assumptions and report the results of the diagnostic analyses. In the application, it was shown how these assumptions can be checked, and simple methods to repair their violations were illustrated. For instance, multivariate normality and homoscedasticity could be achieved after a transformation of the test scores, such as the Box–Cox transformation (de Vent et al., 2016; Johnson & Wichern, 2007; Velilla, 1993). However, future research is needed to develop a multivariate regression-based approach under nonnormality and heteroscedasticity. Starting points for this extension could be those approaches that gave promising results in the univariate case, such as generalized additive models for location, scale, and shape (Timmerman et al., 2021) or semi-parametric approaches, such as quantile regression (Sherwood et al., 2015) and cNORM (Lenhard et al., 2018; Lenhard & Lenhard, 2021). Nonlinearity was here handled by adding a quadratic term for the continuous predictor (e.g., age), but the optimal designs in Table 2 can be extended to more complex nonlinear trends by considering higher order polynomials for which optimal designs are known in the literature (see Berger & Wong, 2009). Although polynomial regression models are easy to implement, they can show undesirable nonlocal behaviors (Magee, 1998), that is, the predicted outcome value in a region of the predictor (e.g., 20 ≤ age ≤ 25) can be affected by the observed outcome value in a different region of the predictor (e.g., 45 ≤ age ≤ 50). This issue can be prevented by using more flexible methodologies such as restricted cubic splines (Harrell, 2015).
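The nonlocal behavior of polynomial regression can be made concrete with a small sketch. It fits a quadratic by ordinary least squares to hypothetical scores at seven ages, then raises the single observation at age 50 by 10 points and refits. The ages, scores, and function names are assumptions chosen for illustration, and age is centered and rescaled only to keep the normal equations well conditioned.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def fit_quadratic(ages, scores, center=35.0, scale=5.0):
    """Least-squares fit of score = b0 + b1*x + b2*x^2 with x = (age - center) / scale."""
    rows = [[1.0, x, x * x] for x in ((a - center) / scale for a in ages)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, scores)) for i in range(3)]
    b0, b1, b2 = solve(xtx, xty)  # normal equations (X'X) b = X'y

    def predict(age):
        x = (age - center) / scale
        return b0 + b1 * x + b2 * x * x

    return predict


# Hypothetical data: a linear decline of the test score with age.
ages = [20, 25, 30, 35, 40, 45, 50]
scores = [50.0 - 0.4 * (a - 20) for a in ages]

# Change only the observation at age 50 and refit.
perturbed = scores[:-1] + [scores[-1] + 10.0]

predict_original = fit_quadratic(ages, scores)
predict_perturbed = fit_quadratic(ages, perturbed)

# Change in the predicted score at age 20 caused by the observation at age 50.
shift = predict_perturbed(20) - predict_original(20)
print(round(shift, 3))
```

In this design, the prediction at age 20 moves by about 1.19 points even though no observation near age 20 changed, because every observation enters the global polynomial fit.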

In many practical situations, it is important to evaluate an individual’s performance/symptoms over time, for instance, to monitor the progress of a rehabilitation or tutorial program. When a test is repeatedly administered to an individual, the test scores are correlated. Treating repeated measures as multiple outcomes, normative data can be derived through a multivariate regression (Van der Elst et al., 2013, 2017). For a given number of measurements and a given spacing of those measurements, the designs in Table 2 are also optimal for repeated assessment of an individual’s performance/symptoms. Finding the optimal number of time points and their spacing could be a topic for future research (see, for instance, Winkens et al., 2005). Future research could also compare the Mahalanobis distance-based approach for norming repeated measures of the same test with the univariate approaches presented by Gu et al. (2021), both in terms of the consequences for norming and the required sample sizes. Furthermore, the regression-based approach could be extended to multilevel populations (e.g., to norm educational tests in children nested within schools). Finally, Voncken et al. (2020) and Wang et al. (2020) developed more sophisticated univariate regression-based approaches within the Bayesian inferential framework that future research could extend to the multivariate case.

Supplemental Material

Supplemental Material, sj-docx-1-jeb-10.3102_10769986231210807 for Sample Size Calculation and Optimal Design for Multivariate Regression-Based Norming by Francesco Innocenti, Math J. J. M. Candel, Frans E. S. Tan and Gerard J. P. van Breukelen in Journal of Educational and Behavioral Statistics

Supplemental Material, sj-docx-2-jeb-10.3102_10769986231210807 for Sample Size Calculation and Optimal Design for Multivariate Regression-Based Norming by Francesco Innocenti, Math J. J. M. Candel, Frans E. S. Tan and Gerard J. P. van Breukelen in Journal of Educational and Behavioral Statistics

Acknowledgments

The authors would like to thank Dr. Martin van Boxtel, School for Mental Health and Neuroscience of Maastricht University, for the use of the Maastricht Aging Study (MAAS) data.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Francesco Innocenti

References

Agelink van Rentergem, J. A., de Vent, N. R., Huizenga, H. M., Murre, J. M. J., ANDI Consortium, & Schmand, B. A. (2019). Predicting progression to Parkinson’s disease dementia using multivariate normative comparisons. Journal of the International Neuropsychological Society, 25(7), 678–687. https://doi.org/10.1017/S1355617719000298

Agelink van Rentergem, J. A., Murre, J. M., & Huizenga, H. M. (2017). Multivariate normative comparisons using an aggregated database. PLoS One, 12(3), e0173218. https://doi.org/10.1371/journal.pone.0173218

Atkinson, A. C., Donev, A. N., & Tobias, R. D. (2007). Optimum experimental designs, with SAS. Oxford University Press.

Berger, M. P. F., & Wong, W. K. (2009). An introduction to optimal designs for social and biomedical research. John Wiley.

Bergland, & Strand, B. H. (2019). Norwegian reference values for the Short Physical Performance Battery (SPPB): The Tromsø study. BMC Geriatrics, 19(1), 216. https://doi.org/10.1186/s12877-019-1234-8

Casella, & Berger, R. L. (2002). Statistical inference (2nd ed.). Duxbury.

Chang, S. I. (1994). Some properties of multiresponse D-optimal designs. Journal of Mathematical Analysis and Applications, 184, 256–262.

de Vent, N. R., Agelink van Rentergem, J. A., Schmand, B. A., Murre, J. M., ANDI Consortium, & Huizenga, H. M. (2016). Advanced Neuropsychological Diagnostics Infrastructure (ANDI): A normative database created from control datasets. Frontiers in Psychology, 7, 1601. https://doi.org/10.3389/fpsyg.2016.01601

Dujardin, Jobard, Vahine, & Mathey (2021). Norms of vocabulary, reading, and spelling tests in French university students. Behavior Research Methods, 54(4), 1611–1625. https://doi.org/10.3758/s13428-021-01684-5

Dyer, F. N. (1973). The Stroop phenomenon and its use in the study of perceptual, cognitive, and response processes. Memory & Cognition, 1(2), 106–120. https://doi.org/10.3758/BF03198078

Espenes, Eliassen, I. V., Öhman, Hessen, Waterloo, Eckerström, Lorentzen, I. M., Bergland, Halvari Niska, Timón-Reina, Wallin, Fladby, & Kirsebom, B. E. (2023). Regression-based normative data for the Rey Auditory Verbal Learning Test in Norwegian and Swedish adults aged 49-79 and comparison with published norms. The Clinical Neuropsychologist, 37(6), 1276–1301. https://doi.org/10.1080/13854046.2022.2106890

Fedorov, V. V. (1972). Theory of optimal experiments. Academic Press.

Goos, & Jones (2011). Optimal design of experiments. A case study approach. John Wiley.

Gu, Emons, & Sijtsma (2021). Precision and sample size requirements for regression-based norming methods for change scores. Assessment, 28(2), 503–517. https://doi.org/10.1177/1073191120913607

Harrell, F. E. (2015). Regression modeling strategies: With applications to linear models, logistic and ordinal regression, and survival analysis. Springer.

Huizenga, H. M., Smeding, Grasman, R. P., & Schmand (2007). Multivariate normative comparisons. Neuropsychologia, 45(11), 2534–2542. https://doi.org/10.1016/j.neuropsychologia.2007.03.011

Innocenti, Tan, F. E. S., Candel, M. J. J. M., & Van Breukelen, G. J. P. (2023). Sample size calculation and optimal design for regression-based norming of tests and questionnaires. Psychological Methods, 28(1), 89–106. https://doi.org/10.1037/met0000394

Johnson, R. A., & Wichern, D. W. (2007). Applied multivariate statistical analysis (6th ed.). Pearson Prentice Hall.

Jolles, Houx, P. J., Van Boxtel, M. P. J., & Ponds, R. W. H. M. (1995). Maastricht aging study: Determinants of cognitive aging. Neuropsych.

Kempen, G. I., Todd, C. J., Van Haastregt, J. C., Zijlstra, G. A., Beyer, Freiberger, Hauer, K. A., Piot-Ziegler, & Yardley (2007). Cross-cultural validation of the Falls Efficacy Scale International (FES-I) in older people: Results from Germany, the Netherlands and the UK were satisfactory. Disability and Rehabilitation, 29(2), 155–162. https://doi.org/10.1080/09638280600747637

Lenhard, Suggate, & Segerer (2018). A continuous solution to the norming problem. Assessment, 25(1), 112–125. https://doi.org/10.1177/1073191116656437

Lenhard, & Lenhard (2021). Improvement of norm score quality via regression-based continuous norming. Educational and Psychological Measurement, 81(2), 229–261. https://doi.org/10.1177/0013164420928457

Magee (1998). Nonlocal behavior in polynomial regressions. The American Statistician, 52(1), 20–22. https://doi.org/10.1080/00031305.1998.10480531

Mardia, K. V. (1970). Measures of multivariate skewness and kurtosis with applications. Biometrika, 57(3), 519–530. https://doi.org/10.1093/biomet/57.3.519

Maxwell, S. E., Kelley, & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537–563. https://doi.org/10.1146/annurev.psych.59.103006.093735

Mecklin, C. J., & Mundfrom, D. J. (2005). A Monte Carlo comparison of the type I and type II error rates of tests of multivariate normality. Journal of Statistical Computation and Simulation, 75(2), 93–107. https://doi.org/10.1080/0094965042000193233

Mitrushina, Boone, K. B., Razani, & D’Elia, L. F. (2005). Handbook of normative data for neuropsychological assessment (2nd ed.). Oxford University Press.

Oosterhuis, H. E. M., Van der Ark, L. A., & Sijtsma (2016). Sample size requirements for traditional and regression-based norms. Assessment, 23(2), 191–202. https://doi.org/10.1177/1073191115580638

Oosterhuis, H. E. M., Van der Ark, L. A., & Sijtsma (2017). Standard errors and confidence intervals of norm statistics for educational and psychological tests. Psychometrika, 82(3), 559–588. https://doi.org/10.1007/s11336-016-9535-8

R Core Team. (2021). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. https://www.r-project.org/

Schwabe (1996). Optimum designs for multi-factor models. Springer-Verlag.

Sherwood, Zhou, A. X., Weintraub, & Wang (2015). Using quantile regression to create baseline norms for neuropsychological tests. Alzheimer’s & Dementia, 2, 12–18. https://doi.org/10.1016/j.dadm.2015.11.005

Stroop-ANDI-Norms. (2020). ANDI-Norms. https://andi.nl/tests/aandacht-en-werkgeheugen/stroop/

Schouten, Geurtsen, G. J., Wit, F. W. N. M., Stolte, I. G., Prins, Portegies, Caan, M. W., Reiss, Majoie, C. B. L. M., & Schmand (2015). Multivariate normative comparison, a novel method for more reliably detecting cognitive impairment in HIV infection. AIDS, 29(5), 547–557. https://doi.org/10.1097/qad.0000000000000573

Timmerman, M. E., Voncken, & Albers, C. J. (2021). A tutorial on regression-based norming of psychological tests with GAMLSS. Psychological Methods, 26(3), 357–373. https://doi.org/10.1037/met0000348

Valentine, J. C., Pigott, T. D., & Rothstein, H. R. (2010). How many studies do you need?: A primer on statistical power for meta-analysis. Journal of Educational and Behavioral Statistics, 35(2), 215–247. https://doi.org/10.3102/1076998609346961

Van Breukelen, G. J. P., & Vlaeyen, J. W. S. (2005). Norming clinical questionnaires with multiple regression: The pain cognition list. Psychological Assessment, 17(3), 336–344. https://doi.org/10.1037/1040-3590.17.3.336

Van der Elst, Hurks, Wassenberg, Meijs, & Jolles (2011). Animal verbal fluency and design fluency in school-aged children: Effects of age, sex, and mean level of parental education, and regression-based normative data. Journal of Clinical and Experimental Neuropsychology, 33(9), 1005–1015. http://dx.doi.org/10.1080/13803395.2011.589509

Van der Elst, Molenberghs, Van Boxtel, M. P. J., & Jolles (2013). Establishing normative data for repeated cognitive assessment: A comparison of different statistical methods. Behavior Research Methods, 45, 1073–1086. https://doi.org/10.3758/s13428-012-0305-y

Van der Elst, Molenberghs, van Tetering, & Jolles (2017). Establishing normative data for multi-trial memory tests: The multivariate regression-based approach. The Clinical Neuropsychologist, 31, 1173–1187. https://doi.org/10.1080/13854046.2017.1294202

Van der Elst, Van Boxtel, M. P., Van Breukelen, G. J., & Jolles (2006). The Stroop color-word test: Influence of age, sex, and education; and normative data for a large sample across the adult age range. Assessment, 13(1), 62–79. https://doi.org/10.1177/1073191105283427

Velilla (1993). A note on the multivariate Box-Cox transformation to normality. Statistics & Probability Letters, 17(4), 259–263. https://doi.org/10.1016/0167-7152(93)90200-3

Voncken, Kneib, Albers, C. J., Umlauf, & Timmerman, M. E. (2020). Bayesian Gaussian distributional regression models for more efficient norm estimation. British Journal of Mathematical and Statistical Psychology, 74, 99–117. https://doi.org/10.1111/bmsp.12206

Wang, L. A. L., Herrington, J. D., Tunç, & Schultz, R. T. (2020). Bayesian regression-based developmental norms for the Benton facial recognition test in males and females. Behavior Research Methods, 52, 1516–1527. https://doi.org/10.3758/s13428-019-01331-0

Winkens, Schouten, H. J., van Breukelen, G. J. P., & Berger, M. P. (2005). Optimal time-points in clinical trials with linearly divergent treatment effects. Statistics in Medicine, 24(24), 3743–3756. https://doi.org/10.1002/sim.2385

Zhu, & Chen, H.-Y. (2011). Utility of inferential norming with smaller sample sizes. Journal of Psychoeducational Assessment, 29(6), 570–580. https://doi.org/10.1177/0734282910396323
