Abstract
Research on assessing English as a foreign language (EFL) development has been growing recently. However, empirical evidence from longitudinal analyses based on substantial samples is still needed. In such settings, tests for measuring language development must meet high standards of test quality such as validity, reliability, and objectivity, as well as allow for valid interpretations of change scores, requiring longitudinal measurement invariance. The current study has a methodological focus and aims to examine the measurement invariance of a C-test used to assess EFL development in monolingual and bilingual secondary school students (N = 1956). Drawing on four waves of data from the German panel study MEZ, we estimated a series of nested confirmatory factor analysis models to test longitudinal measurement invariance within and between two starting cohorts. The results support the C-test's applicability for assessing EFL development across secondary education.
Introduction
Most research on literacy development deals with its foundation phase in monolingual contexts of teaching and learning. Interest in investigating and assessing language development in a second language has been growing recently (Barkaoui & Hadidi, 2021, p. 1). However, compared with other language domains, robust empirical evidence on the process of proficiency development in foreign languages is still needed (Pae & O’Brien, 2018; Schoonen et al., 2011; van Gelderen et al., 2007). Findings from longitudinal analyses based on substantial samples, in particular, are rare (Barkaoui, 2014; Schrauf, 2009).
One challenge in measuring language development arises from the fact that tests used for cross-sectional measurements are not necessarily applicable to longitudinal designs. In settings with repeated measurements of theoretical constructs, such as language proficiency, tests must meet two requirements: (1) standards of test quality such as validity, reliability, and objectivity (American Educational Research Association [AERA], 2014) and (2) longitudinal measurement invariance (MI), which allows for valid interpretations of the development in test scores over time (Barkaoui, 2014; Nagle, 2022). Development is defined as change over time (Little, 2013). In order to capture language development, the assessment should measure the change in language proficiency. Interpreting change as reflecting development in the skills being assessed is only possible when the assessment measures the same abilities or constructs over time (Llosa, 2011). Establishing the constructs’ reliability and validity at each point of measurement is crucial because both properties may change over time. The assessments’ MI is therefore required, because only under this condition can a change in a test score be attributed to true development in the targeted language proficiency rather than to variation in the measurement conditions (Lord & Novick, 1968; Millsap, 2011).
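In classical test theory terms (Lord & Novick, 1968), and anticipating the CFA notation introduced below, this requirement can be sketched as follows (our illustration, not a quotation from the cited sources): let an observed score $X_t$ at occasion $t$ decompose as

$$X_t = \nu_t + \lambda_t \eta_t + \varepsilon_t ,$$

so that the expected change score satisfies

$$E[X_2 - X_1] = \lambda \big( E[\eta_2] - E[\eta_1] \big) \quad \text{only if} \quad \lambda_1 = \lambda_2 = \lambda \ \text{and} \ \nu_1 = \nu_2 .$$

Otherwise, shifts in loadings or intercepts are confounded with true development in $\eta$.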
Confirmatory factor analysis (CFA) has proved to be the method of choice for MI analysis because it allows the explicit modeling of a test’s psychometric properties (Millsap, 2011). CFA is a widely used method for modeling theoretical constructs in educational measurement (Brown, 2015), especially in longitudinal designs (Little, 2013). In applied linguistics, however, there is a high demand for applying this method to ensure the comparability of language test scores over time before recommending instruments for use in longitudinal research designs (Nagle, 2022).
The current study has a methodological focus and aims to examine the longitudinal MI of a C-test used to assess the development of foreign language skills in English, the first foreign language of secondary school students in Germany. In doing so, we investigate the applicability of C-tests for the longitudinal measurement of foreign language skills. The current study intends to extend the growing body of research on the assessment of foreign language development by addressing challenges unique to longitudinal measurements.
C-test
The C-test is a popular language testing instrument in which a respondent has to fill gaps in short texts. The linguistic principle underlying C-test construction is the redundancy of any natural language, which enables a competent speaker to decode distorted information. The reconstruction of systematically deleted parts of words in a text indicates general language proficiency (Eckes & Grotjahn, 2006). In a C-test, the second half of every second or third word is deleted, in contrast to cloze tests, in which whole words are missing. A C-test consists of 4–6 independent texts with 20–25 gaps each (Grotjahn, 2002), representing a testlet structure (or item bundles; Rosenbaum, 1988; Wainer & Kiely, 1987). The texts’ first and last sentences are left intact to provide a semantic context for the task (Grotjahn, 2002, 2004).
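As a toy illustration of this construction rule, the following Python sketch gaps the second half of every second word between an intact first and last sentence. It is not the MEZ implementation; actual C-test construction involves further conventions (e.g., treatment of very short words), and the rounding rule used here is only one common choice.

```python
import re

def make_c_test(text: str) -> str:
    """Toy C-test generator: delete the second half of every second
    word, leaving the first and last sentences intact."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(sentences) < 3:
        raise ValueError("need at least three sentences for intact context")
    words = " ".join(sentences[1:-1]).split()
    gapped = []
    for i, word in enumerate(words, start=1):
        if i % 2 == 0 and len(word) > 1:   # gap every second word
            keep = (len(word) + 1) // 2    # keep the first half (round up)
            gapped.append(word[:keep] + "_" * (len(word) - keep))
        else:
            gapped.append(word)
    return " ".join([sentences[0], " ".join(gapped), sentences[-1]])
```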
Owing to this testlet structure, a C-test brings several benefits to language assessment (Alpizar et al., 2022): (1) it increases test time efficiency (Bradlow et al., 1999) by providing a quick performance estimation (Norris, 2018) based on much information gathered in a short time (Min & He, 2014); (2) it provides a more in-depth estimation than many traditional test formats (DeMars, 2006; Hamdi et al., 2018); and (3) it can be aligned to the Common European Framework of Reference for Languages (Council of Europe, 2020; see Eckes, 2012).
Eckes and Grotjahn (2006) conclude that this holistic measure of general language ability encompasses receptive and productive language skills, offering insights into students’ literacy. C-tests provide reliable estimates of learners’ general language proficiency, especially in foreign language assessment (Grotjahn, 2002).
Concerning construct validity, previous research has shown that C-tests correlate highly with other measures such as language tests, self-assessments, and students’ school grades (Daller et al., 2021; Eckes, 2012; Grotjahn, 2002; Harsch & Hartig, 2016; Hastings, 2002). In theory, C-tests should also be suitable for testing language development (Aguado et al., 2007). However, to the best of our knowledge, the applicability of C-tests in a longitudinal design has not yet been empirically substantiated.
The C-test is used for various high-stakes educational assessments like screenings and placement testing (e.g., Harsch & Hartig, 2016; Mozgalina & Ryshina-Pankova, 2015; Norris, 2006) as well as a proficiency indicator in second-language acquisition (SLA) research (e.g., Lee-Ellis, 2009; Norris, 2018; for details, see Alpizar et al., 2022).
Measurement invariance (MI)
If a construct is measured in multiple groups or at multiple time points, comparing these measurements requires MI. MI is the notion that a measurement instrument works in the same way under varied conditions, such as different data collection waves and groups of participants, given these conditions are irrelevant to the construct itself (see Millsap, 2011). Under MI, a test score reflects a person’s true score on the targeted construct plus some random error. If, in turn, measurements are systematically affected by concomitant conditions, this is referred to as measurement bias, and the comparability of the data is questionable. Latent variable models such as those in the structural equation modeling (SEM) or item-response theory (IRT) frameworks provide measurement models that allow for evaluating an instrument’s psychometric properties supporting the comparability of data. In the SEM framework followed here, MI is evaluated by CFA and is a part of the model’s factorial invariance (e.g., Horn & McArdle, 1992; Little, 1997; Meredith, 1993). Rooted in the common factor model (Thurstone, 1947), CFA (Bollen, 1989; Brown, 2015) uses observed test scores as indicators of unobserved theoretical constructs. One of the various virtues of this latent modeling approach is that psychometric properties are explicitly testable by imposing parameter restrictions on the measurement model’s factorial structure. Recalling the CFA measurement model,

$$y = \nu + \Lambda\eta + \varepsilon \qquad (1)$$

where $y$ is the vector of observed test scores, $\nu$ the vector of intercepts, $\Lambda$ the matrix of factor loadings, $\eta$ the vector of latent factors, and $\varepsilon$ the vector of residuals, the model-implied covariance structure is

$$\Sigma = \Lambda\Psi\Lambda' + \Theta \qquad (2)$$

where $\Psi$ is the covariance matrix of the latent factors and $\Theta$ the covariance matrix of the residuals, and the model-implied mean structure is

$$\mu = \nu + \Lambda\alpha \qquad (3)$$

where $\alpha$ is the vector of latent factor means.
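To make Equations 2 and 3 concrete, here is a minimal numpy sketch with illustrative values (a toy one-factor model, not the MEZ measurement model):

```python
import numpy as np

# Toy one-factor model with three indicators (illustrative values only)
lam = np.array([[1.0], [0.8], [0.9]])   # factor loadings (Lambda)
psi = np.array([[0.5]])                 # factor variance (Psi)
theta = np.diag([0.3, 0.4, 0.35])       # residual variances (Theta)
nu = np.array([10.0, 12.0, 11.0])       # intercepts (nu)
alpha = np.array([0.2])                 # latent mean (alpha)

sigma = lam @ psi @ lam.T + theta       # model-implied covariances (Eq. 2)
mu = nu + lam @ alpha                   # model-implied means (Eq. 3)
print(sigma)
print(mu)
```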
A measurement model’s MI depends on whether the model parameters are invariant across time points or groups. To what extent this is the case can be validated by testing invariance hypotheses through restrictions imposed on the model’s covariance and mean structures (see Equations 2 and 3). Four levels of factorial invariance are distinguished, imposing increasingly strict equality assumptions across time or groups (for an overview, see Cheung & Rensvold, 2002; see Table 1): configural invariance (the same pattern of fixed and free loadings), metric or weak invariance (equal factor loadings), scalar or strong invariance (equal loadings and intercepts), and residual or strict invariance (equal loadings, intercepts, and residual variances).
Formal definitions of factorial invariance levels (see Little, 2013).
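Using the notation of Equations 1–3, with superscript $(t)$ indexing measurement occasions $t = 1, \dots, T$, these levels can be stated compactly (a restatement consistent with Little, 2013):

$$\begin{aligned}
&\text{Configural:} && \Lambda^{(1)}, \dots, \Lambda^{(T)} \text{ share the same pattern of fixed and free loadings},\\
&\text{Metric (weak):} && \Lambda^{(1)} = \dots = \Lambda^{(T)},\\
&\text{Scalar (strong):} && \text{metric holds, and } \nu^{(1)} = \dots = \nu^{(T)},\\
&\text{Residual (strict):} && \text{scalar holds, and } \Theta^{(1)} = \dots = \Theta^{(T)}.
\end{aligned}$$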
MI is always tested for the whole model. However, if a certain level cannot be confirmed, this does not necessarily mean that all parts of the model are affected. In most cases, models suffer from local invariance problems, with only some parameters affected. Therefore, in applied research, it is recommended to relax the invariance restrictions for the affected model parameters and estimate them freely. As a result, the tested level of MI can be assumed to be partially reached. If only a small part of the model is affected by this exception and no serious bias is to be expected, the comparison between groups and time points is acceptable (Bollen, 1989; Byrne et al., 1989; Cheung & Rensvold, 2002; Millsap, 2011; Wang & Wang, 2012).
The final level of MI is then identified by model comparisons: the most restrictive model that fits the data no worse than the preceding, less restrictive model sets the MI level (Millsap, 2011).
Research questions and analytical strategy
In this study, we ask whether a C-test is applicable in a longitudinal research design to assess the development of general language proficiency in English as a foreign language (EFL) in secondary students in Germany, where English is the first mandatory foreign language, mostly learned from the first grade onward. To this end, we used repeated measurements of an EFL C-test from a large-scale dataset on students’ multilingual development across secondary education in Germany. In our view, three major conditions have to be met to affirm the C-test’s applicability for longitudinal measurement:
The C-test’s longitudinal MI is at least at the scalar level, so that change in the latent means can be reliably interpreted as true construct development rather than variability in the measurement conditions (Hypothesis 1);
The C-test is invariant across age groups, so that test scores are comparable across secondary education (Hypothesis 2);
The C-test is sensitive enough to detect EFL development at this stage of schooling (Hypothesis 3).
As an analytical strategy, we estimate a series of nested CFA models to test MI hypotheses and evaluate the latent means’ development over time based on unbiased parameter estimates.
Method
Study design
We used data from the German panel study MEZ, funded by the German Federal Ministry of Education and Research (BMBF). Within an interdisciplinary approach, MEZ investigated the development of multilingualism in German secondary students. The longitudinal cohort-sequence design comprised two starting cohorts with four data collection waves (2016–2018). The MEZ panel incorporated 2103 students with Russian, Turkish, and monolingual German language backgrounds in 78 schools in eight German federal states (Baden-Wuerttemberg, Bremen, Hamburg, Hesse, Lower Saxony, North Rhine-Westphalia, Rhineland-Palatinate, and Schleswig-Holstein).
The MEZ sampling design targeted students with the following characteristics (Klinger et al., 2022):
Students with German–Russian, German–Turkish, or monolingual German background who learned English as their first foreign language at school and, where applicable, French or Russian as a second foreign language;
Students from seventh (age 13+) or ninth (age 15+) grades;
Students who had attended school in Germany at least since the third grade and had learned English as their first foreign language at school;
Students attending Gymnasium (at least 50% of the sample) to include a sufficient number of high-performing students with multilingual literacy skills.
A two-step sampling procedure was applied: First, schools with student populations that met the sampling criteria were selected. Second, students within these schools were selected based on the abovementioned criteria. Schools and students were not selected randomly. Sampling, ethics approval, and data collection were carried out by the International Association for the Evaluation of Educational Achievement (IEA).
All students were tested in the majority language (German) and the first foreign language (English). Students with Russian or Turkish language backgrounds were also tested in their heritage languages. Furthermore, those students who learned either French or Russian as a second foreign language at school were tested in these languages as well. In addition, data were collected on individual (e.g., motivation, cognitive ability), contextual (e.g., social and ethnic background), as well as linguistic (e.g., typological characteristics of languages) aspects influencing language development.
In order to measure general language ability in EFL, the MEZ team developed eight C-test texts based on the curricula for foreign language classes in secondary schooling; two further texts were taken from the KESS 8 study (Bos & Gröhlich, 2010). The C-test at each data collection wave consists of four texts. Training and motivation effects were to be minimized by alternating texts between waves. Thus, for the longitudinal C-test design applied in MEZ (see Table 2), 10 different texts were used and distributed across the four waves of data collection as follows: Wave 1 (Y1, Y2, Y3, Y4), Wave 2 (Y5, Y3, Y4, Y6), Wave 3 (Y7, Y3, Y6, Y8), and Wave 4 (Y6, Y8, Y9, Y10). The texts Y1 and Y2 were administered only once, in the first wave, whereas Y3 was repeated in the second and third waves and Y4 in the second wave. Two new texts were introduced in the second wave: Y5, administered only once, and Y6, repeated in the third and fourth waves. Y7 and Y8 were introduced in the third wave, with Y8 repeated in the fourth wave. Finally, Y9 and Y10 were added as new texts in the fourth wave.
Distribution of the 10 C-test texts over the four waves of the longitudinal data collection.
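The rotation design implies that each pair of consecutive waves shares exactly two anchor texts. A minimal Python sketch of the design in Table 2 makes this explicit:

```python
# Text-by-wave rotation design of the MEZ C-tests (see Table 2)
design = {
    1: {"Y1", "Y2", "Y3", "Y4"},
    2: {"Y5", "Y3", "Y4", "Y6"},
    3: {"Y7", "Y3", "Y6", "Y8"},
    4: {"Y6", "Y8", "Y9", "Y10"},
}

for wave in (1, 2, 3):
    anchors = sorted(design[wave] & design[wave + 1])
    print(f"Waves {wave} and {wave + 1}: anchors = {anchors}")
# Waves 1 and 2: anchors = ['Y3', 'Y4']
# Waves 2 and 3: anchors = ['Y3', 'Y6']
# Waves 3 and 4: anchors = ['Y6', 'Y8']
```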
In order to calibrate the different test versions for longitudinal measurement, we used the repeated texts as anchors to establish the measurement models’ longitudinal MI and thus produce comparable parameter estimates. The rotation of texts followed the rationale of repeating two texts from the previous wave and adding two new ones in each subsequent wave (see Table 2). Under this rationale, two texts (Y3 and Y6) were each repeated over three waves in a temporally lagged sequence to ensure longitudinal comparability.
Sample statistics
We used data from all waves, starting cohorts, and language groups in the current study. We only excluded cases without a valid C-test score (n = 147), yielding an analytic sample of N = 1956 students.
Students’ mean age and grade across four waves of data collection.
Table 4 shows the means and standard deviations of the raw C-test scores for the four waves of data collection. Each C-test consists of four texts with a maximum of 20 points each. The C-test overall score for each wave is the unweighted sum of the scores for the four texts, with a maximum of 80 points. Since texts differ in their number of gaps, all text scores can be transformed to a common 20-gap metric: (score/number of gaps) × 20. Cronbach’s alpha is excellent at each wave, ranging from
Sample statistics for the text scores at the four waves of data collection across both cohorts.
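As a minimal sketch of the rescaling just described, with hypothetical values (the gap counts of the actual MEZ texts are not reproduced here):

```python
def rescale_to_20_gaps(raw_score: float, n_gaps: int) -> float:
    """Transform a text score to the common 20-gap metric:
    (score / number of gaps) * 20."""
    return raw_score / n_gaps * 20

# Hypothetical example: 18 of 24 gaps solved correctly
print(rescale_to_20_gaps(18, 24))  # -> 15.0
```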
Statistical analyses
Figure 1 shows a graphical representation of the applied CFA model to test for the C-tests’ longitudinal MI. The model represents the development in general language proficiency in EFL. The C-test scores are the models’ observed variables (Y1–Y10) that serve as indicators for the latent construct of general language ability in EFL.

Longitudinal measurement model of general language skills in EFL.
Concerning latent variables, the model consists of four repeatedly measured factors (one per wave), representing general language proficiency in EFL at each measurement occasion.
We tested the measurement models’ MI in two steps. First, we tested the models’ longitudinal MI for both cohorts separately to establish the level of the measurements’ comparability over time. We estimated models with configural, metric, scalar, and residual longitudinal MI by restricting the related model parameters to be equal across repeated administrations of the same texts. After establishing the level of longitudinal MI for both starting cohorts, we tested for the measurement models’ MI between the starting cohorts to ensure comparability between the age groups. For this purpose, we used a multigroup CFA version of our measurement model and restricted the model parameters corresponding to the MI levels to be equal across cohorts. After establishing the models’ MI, we conclude with the students’ development in EFL by reporting the model-estimated means for both cohorts.
For model evaluation, we relied on the root mean square error of approximation (RMSEA; Steiger, 1998; Steiger & Lind, 1980) as a global measure (RMSEA < .01 “great fit;” < .05 “good/close fit;” < .08 “acceptable fit;” < .10 “mediocre fit;” > .10 “poor fit;” Little, 2013). It tests the degree of the model’s approximation to the data, thus accounting for models always being only approximations of the actual processes that gave rise to the data (Little, 2013, pp. 108 ff.). In contrast, the χ2-test, which tests the absolute fit of a model, is too strict for complex models with large samples due to its high power against the null hypothesis of perfect model fit; even trivial deviations from a perfect fit lead to model rejection (Brown, 2015; Little, 1997; Wang & Wang, 2012). We also used the standardized root mean square residual (SRMR; Bentler, 1995) as a global fit index that quantifies the deviation of the model-implied variance/covariance matrix from the empirical variance/covariance matrix in terms of residuals (SRMR < .08 “good fit;” < .10 “acceptable fit;” Hu & Bentler, 1999; Kline, 2016). Concerning relative model fit indices, we applied the comparative fit index (CFI; Bentler, 1995) and the Tucker–Lewis index (TLI; Tucker & Lewis, 1973), which compare the specified model with its null model (> .99 “outstanding fit;” .95 to .99 “very good fit;” .90 to .95 “acceptable fit;” .85 to .90 “mediocre fit;” < .85 “poor fit;” Little, 2013). As a measure to compare nested models with different levels of parameter constraints, the difference in CFI is applied: if ΔCFI > .01, a substantial decrease in model fit has to be assumed, and the less restrictive model defines the level of MI to be confirmed (Cheung & Rensvold, 2002).
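For reference, these global fit indices can be computed from a model’s likelihood-ratio χ², its degrees of freedom, the corresponding null-model values, and the sample size. The following Python sketch uses the standard formulas (Steiger, 1998; Bentler, 1995; Tucker & Lewis, 1973) and the ΔCFI decision rule; SRMR is omitted because it requires the full residual matrix, and the example values are hypothetical, not the MEZ results.

```python
from math import sqrt

def rmsea(chi2: float, df: int, n: int) -> float:
    """Root mean square error of approximation (Steiger, 1998)."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2: float, df: int, chi2_null: float, df_null: int) -> float:
    """Comparative fit index (Bentler, 1995)."""
    d_model = max(chi2 - df, 0.0)
    d_null = max(chi2_null - df_null, d_model)
    return 1.0 - d_model / d_null if d_null > 0 else 1.0

def tli(chi2: float, df: int, chi2_null: float, df_null: int) -> float:
    """Tucker-Lewis index (Tucker & Lewis, 1973)."""
    return ((chi2_null / df_null) - (chi2 / df)) / ((chi2_null / df_null) - 1.0)

def mi_level(cfis: "dict[str, float]") -> str:
    """Return the most restrictive MI level whose CFI drops by no more
    than .01 relative to the preceding, less restrictive model
    (Cheung & Rensvold, 2002). `cfis` maps level -> CFI, ordered from
    least to most restrictive."""
    levels = list(cfis)
    established = levels[0]
    for prev, curr in zip(levels, levels[1:]):
        if cfis[prev] - cfis[curr] > 0.01:
            break
        established = curr
    return established

# Hypothetical values, not the MEZ results:
fits = {"configural": .980, "metric": .978, "scalar": .975, "residual": .973}
print(mi_level(fits))  # -> 'residual'
```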
In order to deal with missing data, we used full information maximum likelihood estimation (FIML; Enders, 2010; B. O. Muthén et al., 1987) in the Mplus 8.2 software (L. K. Muthén & Muthén, 1998–2017) for model parameter estimation and used a maximum likelihood algorithm producing robust standard errors (MLR) to deal with nonnormality in the observed variables.
Results
We report the results of our analyses on general language proficiency in EFL as follows: longitudinal MI for starting Cohort 7 (Section 3.1) and starting Cohort 9 (3.2), MI between the starting cohorts (3.3), and latent mean development (3.4).
Longitudinal MI for starting Cohort 7
We estimated a series of model versions with increasing parameter restrictions corresponding to the levels of MI to test the models’ MI for starting Cohort 7. Table 5 contains the results of the model comparisons.
Invariance of the English C-tests’ longitudinal measurement model estimates (Cohort = 7, Waves = 1–4,
The null model served as a baseline revealing the maximum degrees of freedom for the parameter estimation provided by the observed data (
Figure 2 shows the structural model of the measurement model with longitudinal residual MI for the starting Cohort 7. The parameter estimates are reported in unstandardized metrics to demonstrate the imposed equality restrictions on factor loadings, intercepts, and residuals. For example, the text Y3 that was used in the first three waves has the same factor loading (

Longitudinal measurement model of general language development in EFL with longitudinal residual MI (starting Cohort 7, unstandardized coefficients).
Longitudinal MI for starting Cohort 9
Table 6 contains the model comparison results for the measurement models’ longitudinal MI testing for Cohort 9. The configural, metric, scalar, and residual models fit the observed data very well according to the thresholds referred to above. Compared with the configural model, neither the metric model (
Invariance of the English C-tests’ longitudinal measurement model estimates (Cohort = 9, Waves = 1–4,
Figure 3 shows the structural model for Cohort 9 with unstandardized parameter estimates demonstrating the imposed equality restrictions to meet longitudinal residual invariance. The standardized factor loadings (not included in Figure 3) range from

Longitudinal measurement model of general language development in EFL with longitudinal residual MI (starting Cohort 9, unstandardized coefficients).
MI between starting cohorts
To this point, we have ensured that the measurements of skills in EFL are comparable over time (across waves) in both starting cohorts. Next, we must clarify whether the measurement model is comparable between the starting cohorts. Table 7 contains the results of a multigroup comparison between the starting cohorts based on the previous measurement model with residual MI. The models for all MI levels fit the data very well, although the model comparison shows a significant decline in model fit for the residual model, that is,
Invariance of the English C-tests’ longitudinal measurement model estimates between the Cohorts (Cohort 7:
Development in EFL
Our previous investigation of the measurement models’ MI revealed that the psychometric quality of our measurement is sufficient to interpret students’ general language development in EFL in terms of interindividual differences, that is, the constructs’ longitudinal covariances, as well as the development of the constructs’ means over time. Furthermore, the measurements are comparable between the age groups due to MI between the starting cohorts.
Concerning
Concerning

Latent mean development in EFL (Cohort 7, Wave 1 as reference).
The between-cohorts mean differences are also significant with
Discussion
The growing interest in investigating language development in educational assessment has created a high demand for measurement instruments suitable for longitudinal research designs (Gogolin et al., 2022; Schissel et al., 2019). Yet, tests used for cross-sectional measurements are not necessarily applicable to longitudinal designs. In such settings, tests for measuring language development must meet high standards of test quality such as validity, reliability, and objectivity, as well as allow for valid interpretations of change scores, requiring longitudinal MI. That is, changes in test scores reflect the trait’s development only when a test measures the same construct over time rather than changes in the measurement conditions.
In applied linguistics, there is a high demand for ensuring test scores’ comparability over time before recommending instruments for use in longitudinal research designs (Nagle, 2022). CFA has proved to be the method of choice for longitudinal measurement modeling and for investigating longitudinal MI. The C-test is a widely used method for assessing general language proficiency because it offers considerable benefits: it is straightforward to construct, has an efficient test design, reduces test time, and provides a holistic measurement of language proficiency (for an overview, see Alpizar et al., 2022). However, there has been a lack of empirical evidence supporting C-tests’ valid use for assessing language development in longitudinal research designs.
Our study aimed to close this gap by investigating a C-test’s applicability for assessing EFL development in secondary students in Germany. We analyzed longitudinal data from the German panel study MEZ. The sample subjected to secondary analysis consists of 1956 secondary students from the seventh- and ninth-grade starting cohorts with Russian, Turkish, or monolingual German language backgrounds.
The C-tests used consist of four texts at each wave. Ten different texts were administered in MEZ, some of which were repeated (see Table 2). The C-test overall score for each wave is the unweighted sum of the scores for the four texts. These raw scores are not comparable over time because they are not based on a common metric that accounts for the different difficulty levels of the texts.
To address this issue, we used longitudinal CFA to build a four-wave longitudinal measurement model. We used the information from the repeated texts as anchors to establish longitudinal MI and produce comparable parameter estimates over time. The aim was to obtain estimates of general language proficiency in EFL that are comparable across time and age groups in secondary education in Germany. This scaling approach allows the C-test to be applied in a longitudinal research design. The differences between the trends in the raw scores and the estimated means, displayed in Figure 4, demonstrate the benefit of the approach: while the development is clearly visible in the model-based scores, it is obscured in the raw scores by differences in text difficulty.
We investigated the measurement models’ psychometric properties concerning three hypotheses: longitudinal MI at least at the scalar level (Hypothesis 1), MI across age groups (Hypothesis 2), and sensitivity to EFL development in secondary education (Hypothesis 3).
Overall, our analyses support the appropriateness of the alternating-texts longitudinal C-test design conducted in the MEZ study. This approach uses the information of the repeated texts to anchor measurements between waves, establishing comparability over time and allowing the change in test scores to be estimated. If used in such a design, a C-test provides reliable, valid, and efficient measures of EFL development in secondary education for bilingual and monolingual students in Germany. The information a C-test provides about EFL levels and progress can be used for language diagnostics and subsequent language support in school settings.
We used SEM in our approach because our study aimed at testing MI hypotheses concerning construct validity. For this purpose, we favored SEM because it provides some crucial benefits: First, we could refer to a sizeable body of literature on MI testing in the SEM framework when developing an approach for the C-test design of the MEZ study. Second, SEM provides a set of model fit indices well suited for comparing hierarchically nested models. Alternatively, the C-test data could have been scaled following IRT, which would be the preferred method for scoring a C-test when focusing on proficiency assessment. Since our main interest was in testing MI and less in the criterion-related interpretation of test scores, the SEM approach provided sufficient psychometric evidence supporting the applicability of the C-test in longitudinal settings. Our study assesses the C-test constructs’ factor structure and its comparability over time and across age groups, closing the gap underscored in previous research (Alpizar et al., 2022). However, additional evidence is needed to validate the C-test’s applicability for measuring language development in languages other than English and in contexts beyond school, to further generalize our findings to other language learning settings.
Our study has several limitations. First, the longitudinal C-test design of the MEZ study allows only for weak anchoring over time. The scheme was to repeat two texts from the previous wave and add two new ones in each subsequent wave. Under this rationale, two texts were repeated over three waves, and no text was administered at all four waves. The weak anchoring over time was a consequence of the longitudinal survey design, which attempted to avoid typical effects of repeated test administration. As a result, not all texts (i.e., items) achieved longitudinal MI due to lack of comparability. Thus, the measurement of change rests on the relatively narrow basis of the repeated items, while the other items contribute to the variance of the proficiencies per wave but not to the measurement of their development over time. In this way, information that could ideally put related statements on a more solid basis remains unused for the development analysis. On the other hand, the development of C-tests using nonrepeated but measurement-invariant comparable texts is highly time-consuming and costly, so the strategy adopted here can be considered a worthwhile alternative.
Second, the generalizability of the results is limited to our study’s sample of secondary students learning EFL in school. Our data do not allow us to substantiate more general conclusions on the MI of C-tests in other contexts and age groups. In order to generate statements and best-practice recommendations concerning general trends and procedures, further studies on different target groups would be necessary. One general finding nevertheless follows from the results: MI needs to be verified whenever proficiency tests are used longitudinally, to avoid erroneous conclusions in the interpretation of test scores.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the German Federal Ministry of Education and Research under Grant [MARE (01JG2101)].
