Abstract
Computerized neuropsychological test batteries (CNTs), such as Central Nervous System Vital Signs (CNS VS), are increasingly used for measuring cognitive functioning, but empirical evidence of how they measure cognition is scarce. We investigated the factor structure of CNS VS using exploratory factor analyses in four samples: healthy adults (n = 169) and patients with meningioma (n = 392), low-grade glioma (n = 99), and high-grade glioma (n = 247). We tested model fit and investigated measurement invariance. Differences in factor interpretation existed between healthy participants and patients. Factor structures among patient groups were approximately the same but differed in non-zero loadings. Overall, the factor structures largely did not support the “clinical domains” provided by CNS VS for clinical interpretation. Confirmatory models did not have a good fit, and measurement invariance could not be established. Our results indicate that (weighted) sum scores of CNS VS results may lack validity. We recommend that researchers and clinicians use scores on individual test measures.
Introduction
Computerized neuropsychological tests (CNTs) or test batteries are increasingly popular instruments to assess cognitive function in clinical practice and research (Gondar et al., 2021). CNTs allow for easy, standardized, and often remote test administration. In addition, they instantly provide accurate performance measures without going through manual scoring procedures that are labor-intensive and prone to errors. One popular and widely-used brief screening battery is “Central Nervous System Vital Signs” (CNS VS) (Oliveira & Brucki, 2014; Wojcik et al., 2019).
The CNS VS battery consists of seven core tests that are based on well-established pen-and-paper tasks (CNS Vital Signs, n.d.; C. Gualtieri & Johnson, 2006; Plourde et al., 2018): a visual memory recognition task (freely adapted from the recognition condition of the Rey Visual Design Learning Test), a verbal memory recognition task (freely adapted from the recognition condition of the Rey Auditory Verbal Learning Test), a symbol digit coding task (based on the Wechsler Digit Symbol Substitution Test), a shifting attention task (color-shape), a finger-tapping test, an adapted Stroop test, and a continuous performance test (C. Gualtieri & Johnson, 2006). A more detailed description of each task can be found in Supplemental Table S1. One of the main benefits of this relatively brief screening is that it is less strenuous than a traditional neuropsychological evaluation, which makes it especially suitable for implementation in busy clinical care trajectories and for screening of vulnerable patients with a variety of neurological illnesses that commonly present with cognitive deficits. The downside of CNS VS being a brief screening is that its measurements are less comprehensive and that some subtests likely have a lower signal-to-noise ratio (C. Gualtieri & Johnson, 2006; S. J. M. Rijnen et al., 2018). Large-scale screening of patients, however, can give us better insight into the cognitive problems of different patient groups.
The test–retest reliability of individual measures resulting from CNS VS was shown to range between 0.31 and 0.87 (median: 0.68; average test interval 62 days) (C. Gualtieri & Johnson, 2006) or between 0.18 and 0.88 (median: 0.47; test intervals of 3 and 9 months; Rijnen et al., 2018; Pearson’s r/Spearman’s ρ) in samples of healthy participants. Moreover, the concurrent validity of individual outcome measures was shown to range between 0.17 and 0.79 (Pearson’s r) in a sample comprising patients with various neuropsychiatric disorders and a small number of healthy participants. The exception is the continuous performance task, for which concurrent validity was very limited: the correlation between the correct responses of the CNS VS Continuous Performance Task and the NES2 Continuous Performance Task was 0.04 (C. Gualtieri & Johnson, 2006). The CNS VS battery has been shown to detect significant decreases in test scores relative to healthy participants for various patient groups, including patients with neuro-(onco-)logical illnesses (e.g., Keine et al., 2019; Langensee et al., 2022; Papathanasiou et al., 2014; Rijnen, 2019; Rijnen, Butterbrod, et al., 2020).
Most individual CNS VS tests provide multiple scores, such as the number of correct responses, errors, and reaction times. As a result, the seven core tests in the CNS VS battery provide a total of 30 different independent test measures. This makes the interpretation of the cognitive profile difficult and laborious, especially as the screening is intended to be brief and more global than a regular neuropsychological assessment. To improve clinical interpretability, the developers of CNS VS provide 10 different “clinical domain” scores (excluding two composites: the Neurocognition Index and Composite memory). These domain scores are uniformly weighted linear combinations of individual test scores (sum scores), four of which are based on scores from multiple tests in the battery (CNS Vital Signs, n.d.). These domain scores are based on theory but have not (yet) been empirically validated despite multiple attempts (Brooks et al., 2019; C. T. Gualtieri & Hervey, 2015) and have low divergent validity (Plourde et al., 2018). Moreover, an explanation of how these domain scores were derived from theory is missing. This is concerning given that they are commonly used (e.g., Lanou et al., 2023; Lewis et al., 2023; Pilipenko et al., 2022) and that they should be validated in the same manner as latent variable models (McNeish & Wolf, 2020).
Finding empirical evidence for the unobserved latent constructs (the factor structure) underlying the measured test scores resulting from a CNT allows us to better understand the different constructs it measures and to validate the use of composite scores in clinical practice. Two papers attempted to provide empirical evidence for the structure of the CNS VS test battery using exploratory or confirmatory factor analysis (CFA): one in a sample of youth with diverse neurological diseases (Brooks et al., 2019) and one in healthy adults (C. T. Gualtieri & Hervey, 2015). The factor structures in both investigations not only differed from the “clinical domain” scores but also partly differed between the two samples. Both studies found three factors, two of which were interpreted as “(processing) speed” and “memory,” with several differences between the speed factors. The third factor described “attention” in healthy adults, whereas it described “inhibition” in the sample of youth with neurological disease. This raises the question of what is measured by CNS VS, and whether the “clinical domains” are a valid representation of its structure. A further caveat is that both empirical studies assess only one group and, as such, cannot directly address whether the battery measures the same constructs across different (diagnostic) populations, which is the problem of measurement invariance (Leitgöb et al., 2022).
Establishing measurement invariance is necessary for the valid use of composite scores, such as the clinical domain scores that are often automatically computed by computerized batteries, across various (diagnostic) groups. Measurement invariance can be divided into four increasingly strict levels: configural, metric, scalar, and residual. Configural invariance requires the same number of factors and approximately the same pattern of non-zero loadings between groups, indicating that the same constructs are measured. Metric invariance requires the same magnitude of loadings, allowing for testing the predictive relationship of a factor and some dependent variable. Scalar invariance requires approximately the same magnitude of intercepts, allowing for the comparison of latent means. Finally, residual invariance requires similar residuals between groups, indicating similar precision across groups (Chen, 2008; Hirschfeld & von Brachel, 2014). Unfortunately, measurement invariance regularly cannot be established (Wicherts, 2016).
In the current study, measurement invariance of CNS VS was tested across patients with three types of primary brain tumors (meningioma, low-grade glioma, and high-grade glioma) and healthy participants. Ideally, clinicians can be presented with easily interpretable test measures in clinical practice that have a uniform interpretation for patients regardless of their diagnosis, which requires at least configural invariance between patient groups. Moreover, scores generally are normed relative to the performance of healthy participants to facilitate interpretation, thus requiring scalar invariance between all groups. Problems with cognition, however, are more common and more severe for patients with a primary brain tumor than for healthy participants, and neuropsychological functioning often is lower for patients with an isocitrate dehydrogenase (IDH) wild-type or high-grade glioma than for those with an IDH-mutant or low-grade glioma (Noll et al., 2015; van Kessel et al., 2017, 2019). Moreover, the cognitive profile is known to differ between patients and has been related to tumor grade (Butterbrod, 2021). This raises the question of whether CNS VS measures the same constructs across these groups.
In this work, we performed exploratory factor analyses (EFA) on datasets consisting of 169 healthy Dutch individuals, 392 meningioma, 99 low-grade glioma, and 247 high-grade glioma patients separately; interpreted the factor structures; and tested model fit using CFA. Furthermore, we tested levels of measurement invariance across groups using multiple-group confirmatory factor analysis (MG-CFA) to be able to subsequently compare the latent means between the different groups (if scalar invariance was found) and to evaluate the use of the obtained factor structures for future work and use in clinical practice.
Method
Participants
Participants were meningioma and low- or high-grade glioma patients who were scheduled for surgery at the Elisabeth-TweeSteden Hospital, Tilburg, the Netherlands, and underwent preoperative cognitive screening between 2010 and 2019. Patients were not included when they were under 18 years of age, had a progressive neurological disease, had a psychiatric or acute neurological disorder within the past 2 years, or had reduced testability for the neuropsychological assessment, for example, due to problems with vision, motor deficits such as paresis, or language deficits that complicated administration of the test battery. In addition, for comparison, data from healthy Dutch adults were collected using convenience sampling. Participants were considered healthy if they had no past or present psychiatric or neurologic disorder, had no other major medical illness in the year before participation (e.g., cancer, myocardial infarction), were free of any centrally acting psychotropic medication, and did not have a history of or current alcohol or drug abuse (Rijnen, Meskal et al., 2020).
The current samples are described (in part) in previous studies (Beele et al., 2024; Boelders et al., 2023, 2024; Butterbrod et al., 2019, 2020, 2021; De Baene, Rijnen, et al., 2019; De Baene, Rutten, et al., 2019; Lonkhuizen et al., 2019; Meskal et al., 2015; Rijnen, Butterbrod, et al., 2020; Rijnen, 2019; Rijnen, Kaya, et al., 2020; van der Linden et al., 2020; van Loenen et al., 2018).
Design
During the screening using CNS VS, participants were accompanied by a well-trained technician (neuropsychologist or neuropsychologist in training) who provided instructions before starting each test. The test battery administered in our clinical practice included the seven core tests and took approximately 30–40 minutes to complete. The technician reported on the potential confounding of test scores caused by problems during administration (e.g., external disturbances). Test scores affected by confounding factors were excluded from the current study on a test-by-test basis. A description of the seven core tests used in this work can be found in Supplemental Table S1.
For all participants, a standardized interview was performed to obtain demographic variables (age, sex, and education). Education was recorded on the Dutch Verhage scale (Verhage, 1965). Data on race and ethnicity were not collected. Participants signed informed consent during this interview.
Processing of Cognitive Measurements
Standardization of Raw Scores
We used 26 of the 30 raw test scores resulting from the seven tests in CNS VS, leaving out the reaction times of the memory tasks. These reaction times were left out because they make up a large number of variables while their interpretation is unclear; this choice is in line with the work of Brooks and colleagues (2019). Test scores representing errors or reaction times were inverted such that a higher score represented a better performance. For all datasets (healthy participants and patients), test scores were converted to sociodemographically adjusted z-scores. This was done in two steps. First, test scores were corrected for the effects of age, sex, and education as found in the dataset of the same healthy participants using a multiple regression approach (Rijnen, Meskal et al., 2020). This was done because age and education are known to strongly affect test scores, resulting in variance that is not informative of neuropsychological problems. Second, test scores of healthy participants were set to have a mean of zero and a standard deviation of one, after which all patient groups were scaled relative to the healthy participants to facilitate interpretation.
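To illustrate the two-step procedure, a minimal sketch in R, assuming hypothetical data frames `healthy` and `patients` with one raw test score `score` and the demographic predictors; the exact normative models of Rijnen, Meskal et al. (2020) may differ:

```r
# Step 1: regression-based correction estimated in the healthy sample
# (hypothetical column names; education as numeric Verhage level).
norm_model <- lm(score ~ age + sex + education, data = healthy)

resid_healthy <- healthy$score - predict(norm_model, newdata = healthy)
resid_patient <- patients$score - predict(norm_model, newdata = patients)

# Step 2: z-scores scaled relative to healthy participants, so healthy
# scores have mean 0 and SD 1 by construction.
z_healthy <- (resid_healthy - mean(resid_healthy)) / sd(resid_healthy)
z_patient <- (resid_patient - mean(resid_healthy)) / sd(resid_healthy)
```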
Outliers
Test scores of the patient groups were passed through a Winsor function to reduce the influential effect of univariate outliers on the analysis. The Winsor function sets the value of outliers beyond a certain percentile to the value at that percentile. This method was chosen because of its ability to maintain the information stored in most outliers while eliminating their influential effect on the analysis. The Winsor function was set to change only the most extreme outliers, with a threshold of 3% on each side. The Winsor function was not applied to ends of the distributions with a ceiling or floor effect, precluding it from increasing the ceiling or floor effects.
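A minimal sketch of such a Winsor function in R; the `lower`/`upper` flags, which allow skipping a side that shows a floor or ceiling effect, are our own illustrative addition:

```r
# Clamp values beyond the 3rd/97th percentile to the percentile value.
winsorize <- function(x, p = 0.03, lower = TRUE, upper = TRUE) {
  lo <- quantile(x, p, na.rm = TRUE)
  hi <- quantile(x, 1 - p, na.rm = TRUE)
  if (lower) x[x < lo] <- lo  # only if this side shows no floor effect
  if (upper) x[x > hi] <- hi  # only if this side shows no ceiling effect
  x
}

# Example: skip the lower side for a score with a floor effect.
# scores$cpt_commissions <- winsorize(scores$cpt_commissions, lower = FALSE)
```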
Missing Test Scores
Patients were removed from the analysis when three or more complete tests out of the seven tests in the test battery were missing. This threshold was chosen to ensure sufficient information for imputation. The remaining missing values were imputed using multiple imputation, estimating each test score from all the others, as recommended by Zygmont and Smith (2014). Multiple imputation was chosen a priori, as missing values in the screening are known to be missing at random. For example, test scores can be missing if patients do not understand or cannot execute the instruction of a (more complicated) task, which is reflected in poorer performance on other tasks. In addition, multiple imputation allows us to take the uncertainty of the imputations into account while performing the exploratory factor analysis and CFA. Details on imputation for the specific analyses follow below.
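A sketch of this imputation step using the mice package; the number of imputations (m = 20) is an assumption, as it is not reported here:

```r
library(mice)

# Impute each missing test score from all other test scores using
# predictive mean matching; returns m completed datasets.
imp <- mice(scores, m = 20, method = "pmm", seed = 1, printFlag = FALSE)
```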
Multicollinearity
Test scores were evaluated for high multicollinearity using the variance inflation factor (VIF), where a VIF > 5 was interpreted as high multicollinearity. Test scores with high multicollinearity were combined (Paul, 2006). For consistency, tests with high multicollinearity in one sample were also combined in the other samples.
The prerequisite of mild multicollinearity necessary for the EFA was demonstrated using a Kaiser-Meyer-Olkin (KMO) measure between 0.5 and 1 and a significant (p < .005) Bartlett’s test of sphericity (Zygmont & Smith, 2014).
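These checks can be sketched as follows, assuming a data frame `scores` holding the standardized test scores; the VIF loop simply applies the definition VIF_j = 1 / (1 − R_j²), while `KMO` and `cortest.bartlett` are the psych implementations of the two prerequisite tests:

```r
library(psych)

# VIF per test score: regress each score on all others.
vif <- sapply(names(scores), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(scores), v), response = v),
                   data = scores))$r.squared
  1 / (1 - r2)
})
which(vif > 5)  # candidates for combining

KMO(scores)                                      # should fall between .5 and 1
cortest.bartlett(cor(scores), n = nrow(scores))  # should be significant
```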
Analysis
Exploratory Factor Analysis
To identify a factor structure for each group, four EFAs were performed. To determine the number of factors to maintain in the EFAs, we used traditional parallel analysis with principal component analysis as the extraction method and the mean as the reference value (Horn, 1965). Furthermore, we used the Empirical Kaiser Criterion and the Hull method implemented with the comparative fit index (CFI). This set of methods was based on recommendations by Auerswald and Moshagen (2019) given non-normally distributed data, and the criteria were evaluated individually for each group.
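For illustration, all three retention criteria are available in the psych and EFAtools packages; the exact settings used here are not fully reported, so the arguments below are assumptions:

```r
library(psych)
library(EFAtools)

fa.parallel(scores, fa = "pc")            # parallel analysis, PCA extraction
EKC(cor(scores), N = nrow(scores))        # Empirical Kaiser Criterion
HULL(scores, method = "ML", gof = "CFI")  # Hull method with the CFI
```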
The EFAs were performed using ordinary least squares (OLS) as implemented in the R package psych (Revelle, 2017) (fa function with fm = “minres”), which aims to find the minimum residual solution (Harman & Jones, 1966). OLS was chosen as it is suitable for smaller samples while being robust when univariate and multivariate normality assumptions are violated (Zygmont & Smith, 2014). Oblimin rotation was used to further refine our factors, as cognitive functions are highly dependent on one another and oblimin rotation allows for correlations between latent factors (Browne, 2001). Missing test scores were imputed with multiple imputation with the R package mifa (Nassiri, 2018) using the Multivariate Imputation by Chained Equations (MICE) algorithm with predictive mean matching, as recommended by McNeish (2017).
Next, covariance matrices of the imputed datasets were combined using Rubin’s rules (Barnard & Rubin, 1999), after which the EFA was performed on the combined matrix. Factor interpretation was performed by the first author (S.B.) and two neuropsychologists (E.B., K.G.) with extensive experience with both the patient groups and the CNS VS test battery; interpretations were refined over a series of meetings and revision rounds until a consensus was reached.
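Combining the imputation, pooling, and extraction steps, a hedged sketch (pooling the covariance matrices by averaging the per-imputation estimates, which corresponds to Rubin’s rule for the point estimates; the mifa package wraps equivalent steps):

```r
library(mice)
library(psych)  # oblimin rotation additionally requires GPArotation

# Reuse the mids object `imp` from the imputation sketch above.
# Pool the covariance matrix by averaging over the m completed datasets.
covs   <- lapply(seq_len(imp$m), function(i) cov(complete(imp, i)))
pooled <- Reduce(`+`, covs) / length(covs)

# Minimum-residual (OLS) EFA with oblimin rotation on the pooled matrix.
efa <- fa(cov2cor(pooled), nfactors = 5, n.obs = nrow(scores),
          fm = "minres", rotate = "oblimin")
print(efa$loadings, cutoff = 0.30)  # blank out loadings below |0.30|
```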
Confirmatory Factor Analysis
To determine the validity of the resulting factor structures, the fit of the resulting models was tested using CFA in the same groups in which they were initially found. No separate validation sets were used due to the small sample sizes. Confirmatory models were defined using all non-zero factor loadings (loadings ≥ |0.30|). As no validation set was used, no changes were made to the models based on residuals, to prevent overfitting the models to the current sample. Models were considered to have a good fit when they had a root mean square error of approximation (RMSEA) < 0.06 and a standardized root mean square residual (SRMR) < 0.06 (Hu & Bentler, 1999; Schreiber et al., 2006; Zygmont & Smith, 2014). If a good-fitting model was not identified, the same confirmatory analyses were additionally performed while extracting different numbers of factors than during the initial EFAs. This was done to prevent missing a good-fitting model due to extracting too few or too many factors. The number of factors maintained was varied within the range indicated by the previously described factor-extraction metrics, plus and minus one.
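As an illustration of how such a confirmatory model can be specified, a lavaan sketch with hypothetical factor and indicator names (not the solutions reported in the tables):

```r
library(lavaan)

# Hypothetical specification: each factor is measured by the variables
# that loaded >= |0.30| on it in the corresponding EFA.
model <- '
  speed  =~ sdc_rt + stroop_congruent_rt + cpt_rt
  memory =~ vbm_hits_initial + vbm_hits_delayed + vim_hits_initial
  motor  =~ ftt_left + ftt_right
'

# ML with robust standard errors and a Satorra-Bentler scaled statistic.
fit <- cfa(model, data = scores, estimator = "MLM")
fitMeasures(fit, c("rmsea.robust", "srmr", "cfi.robust"))
```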
Measurement Invariance
To test if the found structure of CNS VS was invariant across different groups, measurement invariance between all—or a subset of groups—was tested using MG-CFA. To do so, the same patterns of non-zero factor loadings had to be found between EFA results, allowing for these results to be combined into one confirmatory model.
If patterns of non-zero loadings were the same between all—or a subset of—groups, configural invariance was tested by evaluating whether the MG-CFA fit measures for these groups were good (Leitgöb et al., 2022). The MG-CFA was evaluated on a balanced sample comprising an equivalent number of randomly selected individuals from each group to ensure that the results were not biased by differences in sample size (Chen, 2007). Metric, scalar, and residual invariance were tested by comparing the change in CFI, RMSEA, and SRMR between models. We adhered to the stringent criteria proposed by Chen (2007), where non-invariance of loadings (metric invariance) is indicated by a decrease in CFI of at least .010 combined with an increase in RMSEA of at least .015 and an increase in SRMR of at least .030, and non-invariance of intercepts (scalar invariance) or residuals (residual invariance) is indicated by a decrease in CFI of at least .010 combined with an increase in RMSEA of at least .015 and an increase in SRMR of at least .010. If scalar invariance held, differences in latent means were tested by additionally restricting the means to be equal across groups for the scalar or strict model (depending on the level of invariance) and inspecting the change in model fit. A decreased model fit due to restricting the latent means was interpreted as evidence for unequal latent means across groups. If this was the case, the intercepts for the latent variables resulting from the scalar or strict model were reported.
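A sketch of the increasingly constrained multiple-group models in lavaan, assuming the model string from the sketch above and a hypothetical grouping variable `diagnosis`; changes in fit between adjacent models are then compared against Chen’s (2007) cutoffs:

```r
library(lavaan)

configural <- cfa(model, data = scores, group = "diagnosis",
                  estimator = "MLM")
metric <- cfa(model, data = scores, group = "diagnosis",
              estimator = "MLM", group.equal = "loadings")
scalar <- cfa(model, data = scores, group = "diagnosis",
              estimator = "MLM",
              group.equal = c("loadings", "intercepts"))
residual <- cfa(model, data = scores, group = "diagnosis",
                estimator = "MLM",
                group.equal = c("loadings", "intercepts", "residuals"))

# Changes in these indices between adjacent models are evaluated against
# the cutoffs described above.
sapply(list(configural, metric, scalar, residual), fitMeasures,
       fit.measures = c("cfi.robust", "rmsea.robust", "srmr"))
```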
Alternatively, if the patterns of non-zero loadings resulting from the EFAs were only approximately, rather than exactly, the same between all—or a subset of—groups, measurement invariance was tested using EFA solutions resulting from a combined sample consisting of the groups with approximately the same patterns of non-zero loadings. This was done because the sample sizes for the individual groups in this study are relatively small (as few as 99 participants) when compared with sample size recommendations such as 200 (Guilford, 1954) to 500 (Cattell, 2012) participants, or between 3:1 and 6:1 (Cattell, 2012) up to 20:1 (Hair et al., 1979) participants per variable. Such small sample sizes may lead to sample-specific EFA solutions or poor model fit for the individual groups (Kyriazos, 2018; Taasoobshirazi & Wang, 2016).
The combined sample was created by randomly sampling an equivalent number of participants from each of the groups with approximately the same pattern of non-zero loadings. Multiple EFAs were performed on this combined sample, each maintaining a different number of factors. The resulting models were used to test measurement invariance between the groups that were part of the combined sample. The steps taken were the same as for the EFA and MG-CFA described previously (i.e., determine the number of factors to maintain, perform EFAs, maintain loadings ≥ |0.30|, and perform MG-CFAs). For both the EFAs and the MG-CFAs, a balanced sample was used to ensure that the results were not biased by differences in sample sizes (Chen, 2007). The number of factors maintained was again varied between the metrics for factor extraction plus and minus one to prevent missing an invariant model due to under- or over-extracting factors. If no configural invariance was found, model fit was explored for the individual groups using CFAs. This was done to investigate whether the fit for any group was particularly different.
The CFA and MG-CFA were performed using the R package lavaan (Rosseel, 2012) with the maximum likelihood estimator with robust standard errors and a Satorra-Bentler scaled test statistic. Missing test scores were again imputed with multiple imputation using the MICE algorithm with predictive mean matching. The runMI function from the semTools package was used to pool CFA parameter estimates, standard errors, and fit measures resulting from the different imputed datasets. Robust versions of the RMSEA (Brosseau-Liard et al., 2012) and CFI (Brosseau-Liard & Savalei, 2014) and the Satorra-Bentler corrected versions of the chi-square test statistic and SRMR (Satorra & Bentler, 2001) were used, as they are robust to violations of the multivariate normality assumptions.
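A hedged sketch of this pooling step; runMI is the semTools interface referenced here (newer semTools versions expose the same functionality as cfa.mi), and the arguments shown are assumptions rather than the exact call used:

```r
library(semTools)

# Fit the CFA on multiply imputed datasets and pool estimates, standard
# errors, and fit measures across imputations (Rubin's rules).
fit_mi <- runMI(model, data = scores, fun = "cfa", m = 20,
                miPackage = "mice", miArgs = list(method = "pmm"),
                estimator = "MLM")

summary(fit_mi)
fitMeasures(fit_mi, c("cfi.robust", "rmsea.robust", "srmr"))
```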
Factor Similarity
If no configural invariant model could be found, Tucker’s congruence coefficients were calculated between the initial EFA solutions of the individual samples instead, to aid the comparison of the different factor structures between groups. Tucker’s congruence coefficient is the cosine of the angle between two vectors representing the factors (Tucker, 1951) and is commonly used to quantify the similarity between the factors resulting from an EFA solution (Lorenzo-Seva & ten Berge, 2006). A congruence coefficient between 0.85 and 0.94 was interpreted as fair similarity, and a value higher than 0.95 was interpreted as good similarity (Lorenzo-Seva & ten Berge, 2006).
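For two loading vectors x and y, the coefficient is φ(x, y) = Σᵢ xᵢyᵢ / √(Σᵢ xᵢ² · Σᵢ yᵢ²). A sketch using the psych implementation, with hypothetical EFA objects for two of the groups:

```r
library(psych)

# Matrix of congruence coefficients between every pair of factors from
# two (hypothetical) EFA solutions, e.g., meningioma vs. low-grade glioma.
factor.congruence(efa_meningioma$loadings, efa_lgg$loadings)

# The same coefficient by hand for two loading vectors x and y:
congruence <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))
```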
Results
Sample Characteristics and Preprocessing
Data from 8 patients with a high-grade glioma, 12 patients with a meningioma, and 2 healthy participants were removed due to too many missing values. After these exclusions, data from 380 meningioma patients, 239 high-grade glioma patients, 99 low-grade glioma patients, and 167 healthy participants remained. Missing test scores comprised 2.94%, 0.11%, 3.71%, and 1.00% of the data for meningioma, low-grade glioma, high-grade glioma, and healthy participants, respectively. Sample characteristics of healthy participants and each patient group (after exclusions) can be found in Table 1. For a detailed description of the normative sample, we refer to Rijnen, Meskal et al. (2020).
Sample Characteristics
Note. Sample characteristics of healthy participants and patient groups (after exclusions). Education was recorded on the Dutch Verhage scale (ranging from I to VII).
The correct responses of the continuous performance test were removed from further analysis because the number of correct responses and the omission errors on this task were exact inverses of one another. High multicollinearity (VIF > 5) was found for the number of correct responses and errors in the shifting attention test. Therefore, we calculated a combined score of correct responses minus errors. All 24 remaining test scores had a VIF of at most 3.97. Descriptive statistics of the test results (count, mean, standard deviation, quantiles, skewness, kurtosis) after preprocessing are provided in Table 2. Moreover, this table describes the percentage of patients affected by the Winsor function, individually for each test score.
Descriptives of the Neuropsychological Test Scores as Collected Using CNS Vital Signs.
Note. Descriptive statistics of the test scores resulting from all healthy participants and patient groups together as used for the analysis (after preprocessing). Descriptive statistics for each individual group can be found as an online supplement to this study. Scores are scaled relative to healthy participants which were scaled to have a mean of zero and a standard deviation of one.
Visual inspection showed that the data were not normally distributed (several scores showed strong floor/ceiling effects). Kaiser-Meyer-Olkin scores were 0.82, 0.65, 0.77, and 0.67 for meningioma, low-grade glioma, high-grade glioma patients, and healthy participants, respectively. Bartlett’s test of sphericity was significant for all groups. The assumption of mild multicollinearity was met, and all datasets thus were suitable for the planned analyses.
Number of Factors
For healthy individuals, the Empirical Kaiser Criterion, the Hull method, and parallel analysis indicated 3, 7, and 5 factors as the optimal number to extract, respectively. For patients with a meningioma, these numbers were 3, 1, and 5 factors; for patients with a low-grade glioma, 3, 14, and 4; and for patients with a high-grade glioma, 3, 1, and 5 factors. Thus, the Empirical Kaiser Criterion indicated 3 factors across all samples, the Hull method ranged from 1 to 14 factors depending on the sample, and parallel analysis indicated 4 or 5 factors. Based on these factor retention rules, we argue that the optimal number of factors is between 3 and 5. Given the desire to compare factor structures across groups and the interpretability of the resulting factor solutions, we continued with 5 factors for each group.
EFA and Factor Interpretation
Healthy Individuals
Table 3 describes the oblimin rotated five-factor solution of the EFA on the sample of healthy individuals. The total amount of variance explained by the exploratory model on the sample of healthy individuals was 34%. Based on our cutoff of 0.3 for factor loadings, 4 out of 24 variables did not load on any of the factors, and two variables loaded on two factors. Correlations between factors were small with one medium correlation of 0.35 between Factors I and IV (information-processing speed and motor speed).
Exploratory Results for the Sample of Healthy Participants.
Note. Results for the exploratory factor analysis on the sample consisting of healthy participants including the explained variance and correlations between factors. Loadings ≥|0.30| are presented in bold.
Interpretation
We interpret Factor I (Table 3) as found for the healthy individuals as describing “information-processing speed” (10% of the variance). This factor contains all reaction times. This factor additionally includes a negative loading for the commission errors of the Stroop incongruent condition, measuring inhibition of a dominant response. We argue that this may reflect a trade-off or strategy where the participant either presses quickly and risks a commission error or takes the time leading to a slower reaction time and fewer commission errors.
We interpret Factor II as describing “general cognitive performance” (8% of the variance). This factor includes the correct passes on both memory tasks (except for the initial correct passes of the verbal memory task). This factor additionally includes the correct minus errors and reaction time of the shifting attention task and the symbol digit coding task correct responses.
We interpret Factor III as describing “recognition of verbal material” (6% of the variance) as it includes the correct hits on the initial and delayed condition of the verbal memory recognition task, that is, correctly identifying target words among distractor words. This factor additionally includes, to a lesser extent, the correct responses of the congruent Stroop task, measuring color-word matching (with a loading of 0.30 versus loadings of over 0.75 and 0.87 for the initial and delayed correct hits, respectively).
We interpret Factor IV as describing “motor speed” (6% of the variance). This factor includes the left and right conditions of the finger-tapping task which are simple tests of motor control. Although the factor loading of the commission errors of the incongruent Stroop task was 0.32 on this factor, it loaded stronger on the information-processing speed factor (−0.44).
We interpret Factor V as describing “recognition of visual material” (5% of the variance) as it only includes the correct hits on the initial and delayed condition of the visual memory recognition task, that is, correctly remembering target figures among distractor figures.
Meningioma Patients
Table 4 describes the EFA solution for the sample of meningioma patients. The total amount of variance explained was 46%. All variables loaded on at least one factor, and two variables loaded on two factors. Correlations between factors were generally small, with one medium correlation of 0.46 between Factors I (attention) and V (motor speed) and one medium negative correlation between Factors II (memory correct passes/strategy) and III (recognition of visual and verbal material).
Exploratory Results for Patients With a Meningioma.
Note. Results for the exploratory factor analysis for the sample consisting of patients with a meningioma including the explained variance and correlations between factors. Loadings ≥|0.30| are presented in bold.
Interpretation
We interpret Factor I (Table 4) as found for the meningioma patients as describing “attention” (16% of the variance). The factor includes all reaction time scores, the number of correct responses for the symbol digit coding task, the congruent Stroop task, the incongruent Stroop task, the omission errors and reaction time of the continuous performance task, and the shifting attention task.
We interpret Factor II as describing “memory correct passes/strategy” (8% of variance). This factor includes the correct passes of all memory tasks, that is, correctly identifying distractors and not responding to these while under time pressure. It additionally loads negatively on the initial and delayed correct hits of the visual memory task, indicating a strategy component for this memory task. A participant may have a better memory, resulting in more correct hits and more correct passes. On the other hand, participants may use a strategy (e.g., not pressing when uncertain) or may have problems responding, resulting in more correct passes while also reducing the number of correct hits. The same argument can be made for the memory correct passes factors of the groups below.
We interpret Factor III as describing “recognition of visual and verbal material” (8% of variance). This factor only includes the correct hits of all memory tasks, that is, correctly identifying target words among distractor words.
We interpret Factor IV as describing “inhibition” (7% of variance) as this factor includes the commission errors and correct hits of the congruent Stroop task, the commission errors of the incongruent Stroop task, the commission errors of the continuous performance task, the correct passes of the verbal memory recognition task, and the errors of the symbol digit coding task. This factor additionally loads negatively on the reaction time of the shifting attention task, which may indicate a trade-off between speed and (commission) errors.
We interpret Factor V as describing “motor speed” (8% of the variance). This factor only includes the finger-tapping tests.
Low-Grade Glioma Patients
Table 5 describes the EFA solution for the complete sample of low-grade glioma patients. The total amount of variance explained was 47%. All variables loaded on at least one factor except the commission errors of the continuous performance test. Six variables loaded on multiple factors. Correlations between factors were small, with a maximum correlation of 0.43 between Factors I and II (attention and psychomotor speed). It is important to note that these results were obtained using a relatively small sample (N = 99), potentially leading to sample-specific results.
Exploratory Results for Patients With a Low-Grade Glioma.
Note. Results for the exploratory factor analysis for the sample consisting of patients with a low-grade glioma including the explained variance and correlations between factors. Loadings ≥|0.30| are presented in bold.
Interpretation
We interpret Factor I (Table 5) as found for the low-grade glioma patients as describing “attention” (12% of the variance). This factor includes all reaction times except the reaction time on the shifting attention task. This factor additionally includes the correct responses of the congruent and incongruent Stroop task. Note that these last two scores load stronger on Factor III (Inhibition).
We interpret Factor II as describing “psychomotor speed” (10% of variance) as this factor includes both finger-tapping tasks. It also includes the reaction time and performance of the shifting attention task and the correct responses and errors of the symbol digit coding task. These loadings can be explained by the demanding response requirements of these tasks (e.g., the targets that should be pressed on the keyboard switch and need to be continuously updated with each item, requiring both cognitive flexibility and switching of efferent motor commands). Last, this factor contains the omission errors of the continuous performance task, which we are unable to interpret and which load stronger on Factor III (inhibition).
We interpret Factor III as describing “inhibition” (9% of the variance). This factor includes the correct responses and commission errors of both Stroop tasks, the omission errors of the continuous performance task, and, to a lesser extent, both finger-tapping tasks, which also load on Factor II (psychomotor speed).
We interpret Factor IV as describing “recognition of visual and verbal material” (8% of variance) as this factor only includes the correct hits of all memory tasks, that is, correctly identifying target words among distractor words.
We interpret Factor V as describing “memory correct passes” (8% of variance). This factor includes the correct passes of all memory tasks. It additionally includes, to a lesser extent, a negative loading for the reaction time on the continuous performance task. We are unable to interpret this last loading, which is stronger on Factor I (attention).
High-Grade Glioma Patients
Table 6 describes the EFA solution for the sample of high-grade glioma patients. The total amount of variance explained was 44%. Two variables did not load on any factor, and one variable loaded on two factors. Correlations between factors were small, with one medium correlation of 0.42 between Factors I and V (complex attention and motor speed).
Exploratory Results for Patients With a High-Grade Glioma.
Note. Results for the exploratory factor analysis for the sample consisting of patients with a high-grade glioma including the explained variance and correlations between factors. Loadings ≥|0.30| are presented in bold.
Interpretation
We interpret Factor I (Table 6) as found for the high-grade glioma patients as describing “attention” (16% of variance). This factor includes all reaction time scores except for the reaction time of the shifting attention task. This factor additionally includes the correct responses (i.e., the accuracy) of the incongruent Stroop task, the shifting attention task, and the symbol digit coding test, all of which have a flexibility component. Last, this factor includes the correct hits of the visual memory task, although this score loads stronger on Factor II (memory correct passes).
We interpret Factor II as describing “memory correct passes/strategy” (8% of variance). This factor includes the correct passes of all memory tasks. It additionally includes negative loadings for the initial and delayed correct hits of the visual memory task, indicating a strategy component for this memory task.
We interpret Factor III as describing “inhibition” (7% of the variance). This factor includes the commission errors of both Stroop tasks, the correct hits of the congruent Stroop task, and the commission errors of the continuous performance task. This factor additionally includes the correct hits of the symbol digit coding task, which we are unable to explain and which load stronger on Factor I (attention).
We interpret Factor IV as describing “recognition of verbal material” (7% of variance). This factor only includes the verbal memory initial and delayed correct hits, that is, correctly identifying target words among distractor words.
We interpret Factor V as describing “motor speed” (6% of variance). This factor only includes the left and right finger-tapping tests.
Confirmatory Factor Analysis
To determine the validity of the factor structures described above, the fit measures for four CFA models based on the four EFA solutions previously described were inspected and are presented in Supplemental Table S2. Results show that none of the resulting models had a good fit.
Therefore, the same analysis was also performed while maintaining between two and six factors, resulting in a total of 20 different models. These results are also presented in Supplemental Table S2 and did not indicate good model fit for any of the patient groups regardless of the number of factors maintained. For healthy participants, the RMSEA indicated a good model fit while maintaining four to six factors, but not while maintaining two or three factors. The SRMR, however, did not show good model fit for healthy participants regardless of the number of factors maintained. Therefore, we conclude that no models with a good fit could be found for individual groups using the current modeling approach.
Measurement Invariance
To test measurement invariance, the same patterns of non-zero factor loadings resulting from the EFAs are required between all—or a subset of—groups. The factor structures resulting from the EFA for healthy participants, however, differed from those of the patient groups, with different factor interpretations for most factors. Factor structures among patient groups were more similar, with approximately the same factor interpretations. Despite the similarities in interpretation among patient groups, results showed between-group differences in non-zero loadings for almost every pair of factors, with up to six loadings differing (between the inhibition factors of meningioma and low-grade glioma patients). Therefore, we did not continue with the planned tests of measurement invariance using the models resulting from the individual groups or a subset thereof.
As the patterns of non-zero loadings for the patient groups were approximately the same, measurement invariance was tested using EFA solutions resulting from a combined sample consisting of 99 participants from each patient group. For the combined sample, Bartlett’s test of sphericity was significant and the Kaiser-Meyer-Olkin score was 0.81, indicating that the sample was suitable for the planned analysis. Parallel analysis, the Empirical Kaiser Criterion, and the Hull method indicated 4, 3, and 1 factors as the optimal number to extract, respectively. Therefore, measurement invariance was tested for models with one to five factors.
Supplemental Table S3 shows the fit measures of the MG-CFAs (based on the EFA solutions resulting from the combined sample, not shown) as evaluated on all patients (one MG-CFA per number of factors maintained). None of the MG-CFAs indicated good model fit; therefore, we conclude that no configural invariant model could be identified using the current modeling approach regardless of the number of factors maintained. Supplemental Table S3 further shows the fit of the same confirmatory models when applied to the individual groups. None of the individual patient groups had a notably different fit according to these CFAs. This indicates that the lack of good model fit of the MG-CFA is not due to any specific sample. As a model with good fit is a prerequisite for configural invariance, we did not continue the analysis of measurement invariance. As scalar invariance is required to compare latent means, we also did not continue with the planned comparison of latent means.
As no model with configural invariance could be identified, Tucker’s congruence coefficients between each pair of factors resulting from the initial EFAs on the individual groups were calculated instead (Supplemental Table S4). Comparing factors for patients with low-grade glioma and patients with meningioma, we found that one factor had fair similarity and that no factors had good similarity. For patients with high-grade glioma and patients with a meningioma, we found that all five factors had fair similarity while no factors had good similarity. Between patients with high-grade and low-grade glioma we found that one factor had fair similarity and no factors had good similarity. Finally, no factors showed fair or good similarity between healthy participants and any of the patient groups (for which factors also differed in interpretation).
Discussion
In this work, we performed four EFAs on the test scores of the CNS VS computerized neuropsychological screening in healthy participants and three samples of patients with primary brain tumors, evaluated model fit, and tested measurement invariance between groups. Results showed several differences in factor interpretation, especially when comparing healthy participants to patient groups. Factor structures among patient groups had approximately the same factor interpretation but differed in non-zero loadings for almost every pair of factors. No good fitting model could be found given the current modeling approach regardless of the group(s) used or the number of factors maintained. Therefore, configural invariance between groups could not be established.
Implications for the Use of Composite Scores Resulting From CNS VS
Given that no model with good fit could be identified, regardless of the group(s) and the number of factors maintained, it is likely that the structure of CNS VS cannot be adequately captured using linear components (cf. C. T. Gualtieri & Hervey, 2015). This, in turn, raises concerns about the validity of using (weighted) sum scores of the metrics resulting from CNS VS, including the “clinical domains” that are provided to aid intuitive clinical interpretation. We therefore recommend that researchers and clinicians using CNS VS rely on scores resulting from the individual tests rather than composite (e.g., domain) scores that combine scores from multiple tests in the battery.
Furthermore, due to the lack of a model with good fit, configural invariance across groups could not be established. This is in line with previous research evaluating measurement invariance in a variety of (non-)computerized test batteries. Mixed results have been found when comparing patient groups to healthy participants or patients without cognitive impairments (Haring et al., 2015; Ma et al., 2021; Yang et al., 2022). Furthermore, measurement invariance did not hold for half of the studies reviewed by Wicherts (2016) that compared IQ test batteries across different subgroups. Therefore, we recommend that users of neuropsychological test batteries consider measurement invariance in addition to other validity measures before comparing sum scores between groups.
There are multiple possible reasons why a good-fitting model could not be found using EFA. First, CNS VS is a brief neuropsychological screening, with some tests likely having a lower signal-to-noise ratio compared with more extensive tests, potentially complicating factor extraction (Watkins, 2018) and reducing fit measures (Niemand & Mai, 2018). Second, data were collected as part of a busy clinical care trajectory, which likely contributes to the amount of noise in the data. Third, the incidence of primary brain tumors is relatively low, causing the sample size to be low despite being large for its kind. Fourth, several measures resulting from CNS VS are not normally distributed and/or suffer from floor or ceiling effects, potentially complicating factor extraction (Watkins, 2018) and negatively affecting fit measures (Ainur et al., 2017; Ory & Mokhtarian, 2010). These effects are present for several test variables, including the omission and commission errors of the continuous performance test and the correct hits of the congruent Stroop task. As a post hoc analysis excluding the tasks with strong floor/ceiling effects did not result in good model fit (results not shown), it is less likely that these effects were the main driver of the poor fit. Fifth, a more general explanation may be that variables resulting from CNTs, like those from traditional cognitive tests (Agelink van Rentergem et al., 2020), may have more complex, non-linear interrelationships that cannot be sufficiently captured using linear factor analysis. This means that other representations of the test scores or other modeling approaches may result in better fitting models.
Similarities and Differences in the Factor Structures Found in This Study
Comparison of factor structures between patient groups on the one hand and healthy participants on the other showed both differences and similarities in interpretation. First, information-processing speed for healthy participants and attention in patient groups were largely comparable (congruence between 0.76 and 0.80). However, the attention factor in patients contained accuracy scores in addition to reaction times, whereas it did not for healthy individuals. The inclusion of accuracy scores jointly with reaction time scores in one factor among patients could be due to a joint attention-processing speed deficit as a result of brain damage (Lezak et al., 2004; Mathias & Wheaton, 2007), resulting in a stronger linkage of these two components in a factor than in healthy controls. Alternatively, it could be that mental slowness, one of the most common deficits in brain damage, leaves patients unable to process and use relevant information in time, resulting in more errors. Second, an Inhibition factor was found only for patient groups. This likely is explained by problems with inhibition being common in patients with brain lesions (Conway & Fthenaki, 2003; Picton et al., 2006). Problems with inhibition of a response may cause them to respond to stimuli where they should not, leading to a factor comprising most of the (commission) errors. Last, memory tasks were divided into visual memory recognition and verbal memory recognition for healthy participants, and into memory recognition and memory correct passes for patients. This finding for patient groups differs from most neuropsychological literature, where memory is separated into visual and verbal memory or short-term and long-term memory (Kraemer et al., 2009; Lezak et al., 2004). This finding may be explained by a more general impairment in memory encoding for patient groups. Such an impairment affects both visual and verbal memory, and long- and short-term memory. This, in turn, likely causes patients to rely on a strategy for these tasks where they either indicate that they recognized a stimulus in case of doubt, or do not indicate that they recognize the stimulus.
Factor structures across patient groups also showed both similarities and differences. Patient groups always showed a factor interpreted as attention (congruence between 0.84 and 0.93) that captured the most variance, indicating that deficits in attention play an important role in their test performance. Moreover, patient groups always had two similar memory-related factors (congruence between 0.70 and 0.92). Last, patient groups always had an Inhibition factor (congruence between 0.74 and 0.86). However, for patients with a low-grade glioma, psychomotor speed was found instead of motor speed, measuring the integration between cognitive and motor skills. Moreover, memory correct passes did not contain a strategy component for patients with low-grade glioma, potentially explained by patients with a low-grade glioma (known to have less severe cognitive impairments) being less reliant on strategies. Finally, there were many differences in non-zero loadings between patient groups that did not change our interpretation. It is important to note that some differences in factor structure may be due to the relatively small sample-to-variable ratios (with as few as 99 participants), possibly leading to sample-specific solutions. Moreover, it is important to consider that correlations of skewed variables may have been over- or under-estimated (Bishara & Hittner, 2015), potentially affecting the EFA results (Watkins, 2018).
Relating Factor Structures Found in This Study to Previous Studies
Comparing our results in the patient groups to results by Brooks and colleagues (Brooks et al., 2019) who performed an EFA in a sample of youth with diverse neurological diagnoses, we find that the structures were fairly similar, despite Brooks and colleagues only finding three factors. The memory factor by Brooks and colleagues is comparable to our recognition of visual and verbal material factors, indicating that CNS VS captures adequate recognition of (quick) responses to the target material across diagnoses. The Inhibition factor by Brooks and colleagues is comparable to the combination of our memory correct passes/strategy factors and our inhibition factors. This suggests that CNS VS may be sensitive to inhibition in different cognitive domains across diagnoses. Finally, the speed factor by Brooks and colleagues is comparable to the combination of our attention and (psycho-)motor speed factor. The rough similarities between the factor structures of patient groups in our work and the work by Brooks and colleagues suggest that factor structures in other patient groups may also be somewhat similar while likely not being invariant.
The CFA by C. T. Gualtieri and Hervey (2015) on healthy individuals also resulted in three factors. Overall, their memory factor is similar to the recognition of visual material and recognition of verbal material factors found in our healthy sample, with the addition of the memory correct passes that we found in our general cognitive performance factor. Test scores in their processing speed or attention factor are either included in our information-processing speed factor or our general cognitive performance factor.
Comparing the Factor Structures to the CNS VS “Clinical Domain” Scores
CNS VS provides 10 different, but partly overlapping, domain scores based on neuropsychological theory (CNS Vital Signs, n.d.). Our empirical approach, resulting in a five-factor solution, offered little support for this large number of domain scores. Both our results and previous studies indicate that it is difficult to distinguish between reaction time, processing speed, executive function, and simple attention using CNS VS, as suggested by the domain scores. At the same time, some support for some of the CNS VS domain scores was found. First, both the domain scores and our results distinguished aspects of memory and motor speed from all other functions. Furthermore, the variables in the CNS VS Reaction time and Executive function domains were part of the attention factors found in our patient groups. The discrepancy between the factors found using EFA and the clinical domains may, in part, be explained by the difference in methodology. The EFAs aim to find distinct sources of common variance present in multiple variables. This causes the factor analysis to ignore some smaller sources of variance or to combine overlapping sources of variance. This makes test measures less prone to be part of multiple factors, as is the case with several CNS VS domains. Furthermore, the standardized scores we used for our tests were computed using a different approach than that of CNS VS.
Similar to the findings of C. T. Gualtieri and Hervey (2015), our study leads us to conclude that CNS VS cannot be used to make fine-grained distinctions between cognitive domains. Finding few factors that mostly describe broad cognitive domains is in line with results for other brief (20–30 minute) neuropsychological tests such as the Computer-Administered Neuropsychological Screen for Mild Cognitive Impairment (CANS-MCI) (Memória et al., 2014; Tornatore et al., 2005) and the (computerized) Cognitive Stability Index (CSI) (Erlanger et al., 2002). A notable exception to this is a recent meta-factor analysis of the Repeatable Battery for the Assessment of Neuropsychological Status (RBANS) (Goette, 2020). These results differ from extensive batteries such as the Wechsler Adult Intelligence Scale—Fourth Edition (WAIS-IV), which measure more detailed (intelligence) functions (Benson et al., 2010; Canivez & Watkins, 2010).
It is important to note that the current study does not rule out that models with good model fit and measurement invariance can be found using different modeling approaches, nor do our results directly address the validity of CNS VS’ individual tests. Moreover, the current study does not invalidate previous studies utilizing composite or “clinical domain” scores resulting from CNS VS. However, caution is warranted when interpreting results based on such scores. While they likely provide some indication of cognitive (dys)function, they may not linearly relate to, nor precisely describe, the specific functions they are assumed to measure. Moreover, such scores may measure different functions in different (diagnostic) samples. Last, some caution is warranted when interpreting results that aim to distinguish between reaction time, processing speed, executive function, and simple attention, since these specific cognitive functions have not been identified in the current and previous factor analyses (Brooks et al., 2019; C. T. Gualtieri & Hervey, 2015).
Suggestions for Future Work
As brief neuropsychological screenings are becoming increasingly popular instruments, including in clinical trials, more research into the factor structures of these batteries is needed. This is especially the case for computerized batteries as studies regarding their validity are lacking (De Roeck et al., 2019; Zygouris & Tsolaki, 2015). This is further emphasized by the concerning results for CNS VS with respect to the use of composite scores both within a specific group and in comparisons between multiple groups, as found in the current study.
Future work should consider excluding variables that have strong floor/ceiling effects and could model general intelligence or processing speed using a bi-factor or higher-order model (Beaujean, 2015). In addition, bi-factor models can be used to model residual correlations between outcome types or scores resulting from the same test. Besides aiming to improve model fit, future work can develop models based on established theory, such as the Cattell–Horn–Carroll abilities (Schneider & McGrew, 2018), which may result in a model with a better fit, or can use non-linear factor analysis. Alternatively, computational models of cognitive processes, such as those by Agelink van Rentergem and colleagues (2021), can be considered. For example, drift-diffusion models can model tasks where the participant receives conflicting stimuli (Ratcliff & McKoon, 2008; White et al., 2018). To apply these models, however, more detailed measures, such as the response times for each stimulus, are needed, which are not provided by CNS VS. Finally, future research should continue to study the reliability and validity of individual tests in CNS VS for healthy participants and different patient populations and could study other sources of non-invariance, such as tumor location.
Conclusion
In this work, we explored the factor structures of the CNS VS neuropsychological test battery in patients with a meningioma, low-grade glioma, or high-grade glioma and in healthy participants, tested model fit, and investigated measurement invariance. Results showed several differences in factor interpretation between healthy participants and patient groups. Factor structures among patient groups were more similar than those between patients and healthy participants, but differed in non-zero loadings for almost every pair of factors. Factor structures largely did not support the “clinical domains” proposed by CNS VS. No good fitting model could be found using the current modeling approach, raising concerns about the validity of composite (including “clinical domain”) scores resulting from CNS VS. Therefore, we recommend that both clinicians and researchers use test scores resulting from individual tests instead of combining scores resulting from multiple tests within the battery. Configural invariance could not be established due to the inability to identify a model with good fit. Given that measurement invariance often is not found in neuropsychological test batteries, we recommend that users be mindful of measurement invariance in addition to other validity measures when comparing sum scores between groups.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was part of a study protocol registered at the Medical Ethics Committee Brabant (file number NW2020-32) and was funded by ZonMw, a Dutch national organization for Health Research and Development (project numbers: 10070012010006 and 824003007).
Preregistration
This study was not preregistered.
Data Availability Statement
To protect the privacy of patients, data described in this work are not publicly available. Instead, imputed covariance matrices for all samples presented in the current study are provided on GitHub to facilitate reproduction of our results. Moreover, all code used for the analysis and a ScatterMatrixPlot are provided on GitHub.
Supplemental Material
Supplemental material for this article is available online.
References