Abstract

While several test concordance tables have been published, the research underpinning such tables has rarely been examined in detail. This study aimed to survey the publicly available studies or documentation underpinning the test concordance tables of the providers of four major international language tests, all accepted by the Australian Department of Home Affairs for Australian visa purposes. To evaluate the concordance studies, we first identified the good practice principles in concordance research through a review of both the relevant literature and leading professional standards in the field of educational measurement and language assessment. Next, we reviewed the concordance studies against the identified good practice principles. Our findings revealed that the information supplied by test providers varied, with some making the full research papers available, whereas others provided little information about their underpinning research. None of the concordance studies fulfilled all the good practice principles. Based on the findings of this study, we offer recommendations for future concordance research in the field of language testing as well as suggestions for practice.

Keywords

Concordance, language testing policy, test score comparison, test users

Background

In language assessment, “linking” broadly refers to the practice of relating the scores or levels on two tests or aligning them to a language proficiency framework such as the Common European Framework of Reference (CEFR). When it comes to the former definition, linking can be categorized into three types: equating, concordance, and prediction. Equating, which has been the most discussed of the three types, involves relating scores on parallel or alternate forms of a test. Concordance, which is the focus of this paper, applies when linking scores from two different tests that measure related but different constructs (Kolen, 2004). Prediction, the third type of linking, involves relating the scores on two tests regardless of whether they measure related or different constructs, typically using regression techniques.

Language test providers are generally expected to link the scores or levels on their tests to those on other language tests because test users such as admissions officers or employers commonly accept results from multiple language tests as proof of an applicant’s language proficiency. Additionally, test takers may choose or be required to take different language tests due to various personal and situational reasons (Taylor, 2004) and may need to understand how the scores across tests are related. In consequence, there is a practical need to compare scores on different language tests.

While several concordance tables have been published to facilitate test score comparability (e.g., Clesham & Hughes, 2020; Educational Testing Service, 2010), the research underpinning these tables has rarely been examined by independent, qualified language testing experts and/or psychometricians. This, in consequence, has raised concerns over the rigor and usefulness of these concordance tables and, by extension, whether test users and policy-makers can rely on these concordance results for informed decision-making or policy formulation in areas such as admission into higher education, professional registration, or immigration. As noted by Cardwell et al. (2024), creating a concordance table between two language tests is a complicated process as different data sets and methodological choices might lead to substantially different concordance results. Therefore, it is essential to evaluate various aspects of the concordance process to ensure that the concordance results are justifiable, stable, and generalizable. Furthermore, it is also important to explore the information published by test providers for score users (e.g., test takers, admissions officers, and policy-makers), who should be provided not only with the concordance tables but also with (a) information about the methodology underpinning the concordance studies, (b) instructions on how to interpret the concordance results, and (c) any caution they need to exercise when using these results.

As a case study for exploring the quality of the concordance studies underpinning concordance tables as well as the information published for test users on test provider websites, we selected four tests (described further in the “Methodology” section) accepted for Australian visa purposes by the Australian Department of Home Affairs (n.d.). This scenario was selected to exemplify a policy context where several language tests are used for the same purpose. Our study followed three steps. First, we identified a set of good practice principles in conducting a concordance study through a literature review. Next, we applied these principles to evaluate the concordance practices of the test providers based on the studies and/or documentation on their websites. Finally, we offered recommendations for future concordance research in the field of language testing as well as suggestions for practice.

Good practice in conducting concordance studies

To identify good practice in concordance studies, we reviewed relevant research literature in the field of educational evaluation and language assessment (e.g., Dorans, 2004; Elliot et al., 2021; Pommerich, 2007) and consulted relevant guidelines in leading professional standards in the two fields (e.g., American Educational Research Association [AERA] et al., 2014; International Language Testing Association [ILTA], 2020). The good practice principles that we identified through the literature review span three stages of a concordance study: (a) preliminary investigation, (b) study methodology, and (c) publication and use of concordance results. Table 1 lists the good practice principles at each stage of a concordance study.

Table 1.

Good practice principles in conducting a concordance study.

Stage	Good practice principle	Example reference
Preliminary investigation	• Verify that the two tests measure closely related constructs	Dorans (2004); Elliot et al. (2021)
	• Establish strong correlations between the two tests
	• Ensure that test administration conditions are similar
	• Maintain similar levels of reliability for the overall test and subsections
Study methodology	• Ensure that the participant sample mirrors the test population of interest	Dorans & Holland (2000); Kolen & Brennan (2014); Pommerich et al. (2004)
	• Collect data on the participants’ reasons for taking the tests
	• Ensure that the participants have comparable levels of test preparation and familiarity with the tests
	• Ensure that the data are based on official test score reports
	• Ensure that the sample size is adequate
	• Implement a counter-balanced design
	• Ensure that the interval between test takers attempting both tests is sufficiently short
	• Conduct a population invariance study
Publication and use of results	• Make the study publicly available	AERA et al. (2014); Elliot et al. (2021); Pommerich (2007)
	• Include descriptive statistics and correlation coefficients in the report
	• Describe statistical methods and procedures in sufficient detail
	• Report concordance results for both the overall test and subsections
	• Report the sample size and standard error at each score level
	• Alert test users to exercise caution in interpreting and using concordance results
	• Provide test user-focused guidelines and recommendations

As Table 1 indicates, before initiating a concordance study, a systematic comparison of the two tests in terms of test content, method, and constructs is necessary to ascertain whether concordance is the appropriate linking method. Concordance is deemed appropriate only when the tests measure related constructs, their content is judged similar, and a strong correlation exists between their test scores. Additionally, an evaluation of the administration conditions and reliability of the tests is required. Both tests should be properly administered and exhibit similar reliability indices at both the overall test and subsection level (e.g., listening and reading) to be suitable for a concordance study.
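The preliminary checks described above can be illustrated with a minimal sketch. It assumes paired scores from test takers who sat both tests, plus a reliability estimate for each test. The function names and the cut-offs `min_r` and `max_rel_gap` are hypothetical placeholders for illustration only; they are not thresholds taken from the reviewed studies or the cited standards. The "reduction in uncertainty" value is computed here as 1 minus the coefficient of alienation, one common way of expressing how much a correlation narrows prediction error.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of paired scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def preliminary_check(x, y, rel_x, rel_y, min_r=0.85, max_rel_gap=0.05):
    """Screen a test pair before a concordance study.

    min_r and max_rel_gap are illustrative assumptions, not published cut-offs.
    """
    r = pearson_r(x, y)
    # 1 - sqrt(1 - r^2); max() guards against tiny floating-point overshoot.
    reduction = 1 - sqrt(max(0.0, 1 - r * r))
    return {
        "r": r,
        "reduction_in_uncertainty": reduction,
        "strong_correlation": r >= min_r,
        "similar_reliability": abs(rel_x - rel_y) <= max_rel_gap,
    }
```

A check of this kind would be run for the overall scores and again for each subsection (e.g., listening and reading), since a pair of tests may pass at the overall level and fail at the subsection level.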

During a concordance study, it is crucial to ensure that the participant sample mirrors the test population of interest, thus making it possible to generalize the concordance results. Using a truncated sample, for example, by focusing only on test takers within a particular score range may undermine the claims made using concordance results. It is also necessary for researchers to collect data on the participants’ reasons for taking the tests, as well as their level of test preparation and familiarity, as these factors can significantly influence their test results. Ensuring that the data are based on official test score reports rather than self-reported data is also important. The size of the sample needs to be sufficient to create robust and stable score equivalences at different score levels. If achieving a sufficient participant number is challenging, a cumulative approach can be considered where an initial equivalence is determined and subject to continuous monitoring through ongoing data collection. A counter-balanced design is required to mitigate the potential order effect on test scores, and the interval between test takers attempting both tests needs to be sufficiently short (e.g., less than 3 months). Once the concordance table has been established, a population invariance study needs to be implemented to ascertain that the concordance results are invariant across different subpopulations defined by key attributes such as gender, proficiency level, and ethnic background relevant to the testing contexts (Pommerich, 2007).
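To make the idea of a population invariance check concrete, the following is a simplified sketch. It assumes a single-group design (each participant has a score on both tests) and uses linear linking (matching means and standard deviations) rather than the equipercentile method, purely to keep the example short. The function names and the grouping variable are hypothetical. The sketch reports, for each subgroup, the largest gap between the subgroup-specific linking and the total-group linking across a set of score points.

```python
from statistics import mean, pstdev


def linear_link(x_scores, y_scores):
    """Return a function mapping a score on test X to the scale of test Y."""
    mean_x, mean_y = mean(x_scores), mean(y_scores)
    slope = pstdev(y_scores) / pstdev(x_scores)
    return lambda x: mean_y + slope * (x - mean_x)


def max_subgroup_gap(pairs_by_group, score_points):
    """Largest absolute difference between subgroup and total-group linking.

    pairs_by_group maps a subgroup label to a list of (x, y) score pairs.
    """
    all_pairs = [pair for pairs in pairs_by_group.values() for pair in pairs]
    total = linear_link(*zip(*all_pairs))
    gaps = {}
    for group, pairs in pairs_by_group.items():
        sub = linear_link(*zip(*pairs))
        gaps[group] = max(abs(sub(x) - total(x)) for x in score_points)
    return gaps
```

Gaps that are large relative to the reporting scale of the target test would suggest that the concordance is not invariant across the subgroups examined. What counts as "large" is a judgment that the sketch does not settle.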

When publishing and using concordance results, test agencies should ensure that the concordance report is easily available to the public. Descriptive statistics of test scores for both the overall test and subsections, including mean and standard deviation, need to be included in the concordance report. Where possible, these statistics should be compared with the population of interest. The report should also include the correlation coefficients between the two tests. In addition, the statistical methods and procedures (e.g., equipercentile equating, with pre- or post-smoothing) employed for creating the concordance table need to be described in sufficient detail to allow for replication. The concordance results should be presented for the overall test and each subsection. It is essential to detail the number of observations and standard error at each score point or level, as a limited number of observations can lead to unstable concordance results (AERA et al., 2014; Pommerich, 2007). Testing agencies also have the responsibility to alert test users to exercise caution when interpreting the results at the score points or levels with a small number of observations. Finally, to assist test users in interpreting the concordance results accurately and using them responsibly, it is incumbent on testing agencies to publish the studies underpinning the concordance tables along with clear, test user-focused guidelines and recommendations.
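The reporting requirements above can be illustrated with a minimal, unsmoothed equipercentile sketch. It is not the procedure used in any of the reviewed studies: operational implementations typically continuize the discrete score scales and apply pre- or post-smoothing, both of which are omitted here. The function names and the `min_n` cut-off for flagging sparse score points are hypothetical assumptions. Each score on test X is mapped to the score on test Y that has the same percentile rank, and the number of observations at each score point is reported alongside it.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def percentile_rank(sorted_scores, x):
    """Mid-percentile rank: share below x plus half the share equal to x."""
    below = bisect_left(sorted_scores, x)
    at = bisect_right(sorted_scores, x) - below
    return (below + 0.5 * at) / len(sorted_scores)


def quantile(sorted_scores, p):
    """Empirical quantile with linear interpolation between order statistics."""
    pos = p * (len(sorted_scores) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_scores) - 1)
    return sorted_scores[lo] + (pos - lo) * (sorted_scores[hi] - sorted_scores[lo])


def concordance_table(x_scores, y_scores, min_n=30):
    """Map each observed X score to a Y equivalent and report observation counts.

    min_n is an illustrative threshold for flagging sparse score points.
    """
    xs, ys = sorted(x_scores), sorted(y_scores)
    counts = Counter(xs)
    rows = []
    for x in sorted(counts):
        y_equiv = quantile(ys, percentile_rank(xs, x))
        rows.append(
            {"x": x, "y_equiv": y_equiv, "n": counts[x], "sparse": counts[x] < min_n}
        )
    return rows
```

Rows flagged as sparse correspond to the score points at which, following the principles above, test users should be alerted to interpret the equivalence with caution. Standard errors at each score point, for example from bootstrap resampling, would be a natural addition but are not shown.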

Our study aimed to explore the following research questions:

1. What information do the four test providers publish on their websites for test users interested in score comparisons? Are the concordance studies underpinning the concordance tables mentioned, and if so, are they publicly available?

2. How well do the concordance studies align with the good practice principles suggested by the relevant literature?

Methodology

Test selection

Four large-scale English language tests were included in the study: Cambridge C1 Advanced (C1), International English Language Testing System (IELTS), Pearson Test of English Academic (PTE-A), and Test of English as a Foreign Language Internet-Based Test (TOEFL iBT). The tests were chosen because they are all accepted by the Australian Department of Home Affairs for Australian visa purposes, including shorter-term visas (e.g., for study purposes) and permanent visas (i.e., migration to Australia; see Australian Department of Home Affairs, n.d.). Given that these tests are generally well known globally, and given the limited space in this brief report, we do not provide detailed information about the design of each test.

Procedures

We first canvassed the websites of the four test providers to ascertain the information they provide regarding concordance tables, including (a) which tests have available concordance tables with other tests and (b) whether a concordance study or other information about the concordance table (e.g., user-friendly advice on how to use the table and cautions around use) is available to test users. We then carefully read the concordance studies provided or mentioned on the websites and coded the relevant sections based on the good practice principles (see Table 1). Our coding process involved identifying whether each good practice principle was fulfilled or not (indicated by a tick or cross, see Appendix 1). Both authors independently coded the research studies, with an inter-coder reliability above .90.
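The report does not state which index of inter-coder reliability was used. As an illustration only, the sketch below computes two common options for binary "fulfilled / not fulfilled" codes: simple percent agreement and Cohen's kappa, which corrects agreement for chance. The function names and the example labels are hypothetical.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of coding decisions on which the two coders agree."""
    return sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)


def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders over the same items."""
    n = len(coder_a)
    observed = percent_agreement(coder_a, coder_b)
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    categories = set(coder_a) | set(coder_b)
    # Expected agreement if both coders assigned categories independently.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical category throughout
    return (observed - expected) / (1 - expected)
```

For example, with codes `["tick", "tick", "cross", "cross"]` and `["tick", "tick", "cross", "tick"]`, percent agreement is 0.75 while kappa is 0.5, which shows why the choice of index matters when a single reliability figure is reported.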

Results

To investigate research question (RQ) one, we summarized the information (see Table 2) that the four test providers include on their respective websites. The first column denotes the test provider websites that we examined (referred to as focal test), followed by the names of the tests to which concordance tables are presented on the focal test website (second column). The third column indicates the level of detail that is provided for score comparisons (i.e., whether only overall scores are compared or whether subsection comparisons are available). The final column provides information about what underpinning research supporting the concordance tables is provided on the websites. As Table 2 indicates, the information provided to test users about test concordance differs. While some provide comparisons between overall scores only, others include comparisons between overall scores and subsection scores. Three of the test providers provide concordance tables to two other tests, while C1 only provides score comparisons with IELTS. Some providers publish the full reports detailing the research underpinning concordance tables, while others either provide no information or the information is fairly vague.

Table 2.

Summary of review of test provider websites.

Focal test	Test scores compared with focal test	Overall and/or sub-section score comparison	Availability of underpinning research
IELTS^a	PTE-A	Overall + subsections	Full report available for download
IELTS^a	C1	Overall	A study linking IELTS to C1 is referred to in the link below, but no reference is provided and the study is not available for free download^d
PTE-A^b	IELTS	Overall	Full report available for download
PTE-A^b	TOEFL iBT	Overall	No information available about underpinning research
TOEFL iBT^c	IELTS	Overall + subsections	Full report available for download
TOEFL iBT^c	TOEFL Essentials	Overall + subsections	No information available about underpinning research
C1	IELTS	Overall	A study linking IELTS to C1 is referred to in the link, but no reference is provided and the study is not available for download^d

Note: IELTS, International English Language Testing System; PTE-A, Pearson Test of English Academic; C1, Cambridge C1 Advanced; TOEFL iBT, Test of English as a Foreign Language Internet-Based Test.

^a https://www.ielts.org/for-organizations/comparing-ielts-to-other-tests.

^b https://www.pearsonpte.com/research/scoring.

^c https://www.ets.org/toefl/score-users/ibt/compare-scores.html.

^d https://www.cambridgeenglish.org/Images/461626-cambridge-english-qualifications-comparing-scores-to-ielts.pdf.

To investigate RQ2, we carefully reviewed the research reports available on the test provider websites (see Table 3) to determine whether these met the good practice principles for concordance research. We included all freely available concordance studies as well as the study linking IELTS to C1, despite this only being available in a peer-reviewed journal behind a paywall.

Table 3.

Concordance studies reviewed.

Test provider	Comparison	Type of study
• IELTS (Elliot et al., 2021)	IELTS to PTE-A	Concordance study (Study A^b)
• PTE-A (Clesham & Hughes, 2020)	PTE-A to IELTS	Concordance study (Study B)
• TOEFL iBT (Educational Testing Service, 2010)	TOEFL iBT to IELTS	Concordance study (Study C)
• Cambridge (Lim et al., 2013)^a	IELTS to C1	Concordance study conducted as external validation for the CEFR linking study (Study D)

^a Not freely available.

^b We used the labels in brackets (e.g., Study A) in our summary of results below due to space limitations.

The full evaluation of the studies against the good practice principles is detailed in Appendix 1, which is broken down into tables for each of the three stages of a concordance study: preliminary investigation, study methodology, and publication and use of results (see Table 1). The results in the tables in Appendix 1 clearly show that, based on the information available in the reports, few of the good practice principles were fulfilled by the four studies under investigation.

In terms of the preliminary investigation, only Study A reported a comparison of constructs (published separately). The correlations for most test pairs were low, particularly for some of the subsections, arguably too low to even proceed with the concordance study. Although correlations were reported as part of the main study, none of the reviewed studies considered them as a preliminary investigation to ascertain whether concordance was the appropriate linking method. Furthermore, no test providers compared the test reliability statistics before proceeding to concordance, except for Study B, which noted that both tests seemed sufficiently reliable.

The study methodologies we examined also varied greatly, and most good practice principles were either not fulfilled or only partially fulfilled. All studies, apart from Study C, had relatively small sample sizes and, where mentioned, these samples rarely fully represented the test taker population of interest and typically lacked a sufficient number of low-scoring students for meaningful comparisons. Only Study B claimed to have captured a representative sample of the overall test taker population, although its sample size fell short of the good practice principle. None of the studies reported collecting data on the participants’ reasons for taking the tests. They either did not mention whether the scores used in the analysis were drawn from official score reports (Studies A, C, and D) or noted that only some scores were verified (Study B). The interval between the participants attempting the two tests was mostly within 3 months; however, two studies (Studies C and D) did not mention this. Additionally, only Study D reported full counter-balancing of the order of testing. Study B controlled this for half the sample, while the other two studies (Studies A and C) did not address this aspect. Three out of the four reports did not mention participants’ test preparation or familiarity with the tests (Studies A, C, and D), while Study B included this information for half of its sample.

The reporting of descriptive statistics and correlations for overall and subsection scores varied across studies. For example, Studies A and D provided correlations, but no descriptive statistics, for both overall and subsection scores, while Study B reported descriptive statistics for overall scores but not for subsection scores. Study C was the only one that reported descriptive statistics and correlations for both overall and subsection scores. While all studies reported the statistical method used for concordance (i.e., equipercentile equating), the details of the concordance procedures were not fully transparent. For example, it was unclear whether Study C implemented any pre- or post-smoothing procedures in its concordance process. None of the studies included a population invariance study, possibly due to the small sample sizes in their main studies.

The quality of reporting and use of the concordance results also varied. Study B reported the concordance results for overall scores only, while Studies A and C also reported the results for subsection scores. Study B included the number of observations at score levels without reporting standard errors. Conversely, Study A reported standard errors without mentioning the number of observations. Studies C and D did not provide either. Among the reviewed studies, only Study C advised caution around the use of concordance results, both in its research report and on its website, highlighting that the concordance results at the outer levels (e.g., at IELTS Levels 5 and 8) should be used with caution. This was also visually indicated by the shaded sections in the concordance table. Notably, none of the four studies presented their findings in accessible ways for non-specialist test users, such as policy-makers.

Discussion and conclusion

In this study, we evaluated the current concordance practices of four major providers of English tests against the good practice principles in concordance research. Our findings indicate that the information provided on the test provider websites about concordance tables is often vague or insufficient. Test users are not always provided with the research underpinning these concordance tables. When such research is provided, it tends not to fulfill the good practice principles and is usually presented in formats not easily accessible to non-specialist test users.

Our review of the concordance studies suggests that preliminary investigations are often insufficient, and the methodologies for data collection tend to fail to adhere to the good practice criteria. For example, the sample sizes are generally too small to provide robust score comparisons. Basic information is often not provided, such as concordance results for subsection scores (which are crucial for Australian migration requirements and for other policy-makers), the number of observations at different score levels, and their standard errors. Test users are not usually informed about the potential limitations of using published concordance tables. The findings are concerning, as these results may be used to inform high-stakes decisions that significantly impact test takers.

One possible reason for the lack of information and rigor in this area is that these studies and the creation of the concordance tables are essentially driven by the test providers themselves rather than by an independent body. It is therefore likely that test providers draw on existing data from their test taker databases rather than collecting new data specifically for this purpose, as this is costly and time-consuming. Hence, many analyses are based on convenience samples. At the moment, there is little motivation to invest in more robust concordance studies due to the absence of regulatory oversight and minimal demand for high-quality work from test users. It is also important to note that concordance tables are one site in which competition manifests between test providers, who may have a commercial interest in lowering their test scores to make it easier for applicants to achieve certain test score requirements. Test score users also encounter the challenge of having to reconcile different concordance tables, a situation exemplified by the recent IELTS–PTE-A comparison studies (i.e., Studies A and B in our study). The resulting concordance tables showed significant discrepancies at certain score levels, likely causing confusion for test users.

Based on our findings, we make the following recommendations for future concordance research in language testing. First, testing agencies should make the complete results of their concordance studies openly available. This includes the technical details required for a comprehensive evaluation by experts (e.g., language testers and psychometricians), and an accessible summary for test users (e.g., policy-makers). Such information should be easily available on the test provider websites, eliminating the need for users to navigate through multiple additional links or to purchase materials behind paywalls.

When planning a concordance study, preliminary investigations should determine whether concordance is the appropriate linking method. Test providers should alert test users to possible cautions around the use of concordance results. The concordance study methodologies should adhere to the good practice principles set out in this paper. It is important for the test provider to include concordance tables for both overall and subsection scores in the concordance report, and provide clear guidelines to test users around the level of confidence in the concordance results at different score levels. Finally, it is important for language testing researchers to consider how to develop resources and activities to enhance policy-makers’ understanding and ability to make better-informed decisions when using test concordance results.

Footnotes

Appendix 1

Author contributions

Ute Knoch: Conceptualization; Data curation; Formal analysis; Methodology; Writing – original draft; Writing – review & editing.

Jason Fan: Investigation; Methodology; Writing – original draft; Writing – review & editing.

Declaration of conflicting interests

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: During the last five years, the first author, Ute Knoch, conducted assessment-related research or consultancy work for the following organisations: Educational Testing Service (ETS), IELTS, Pearson, Cambridge Boxhill Language Assessments, Australian Department of Defense, Australian Civil Aviation Safety Authority, Australian Health Practitioner Regulation Authority, Benesse Corporation, Australian Department of Home Affairs. She served, until 2021, on the Pearson Technical Advisory Board and is the current test review editor of Language Testing. The second author, Jason Fan, conducted assessment-related research, advisory, or consultancy work for the following organisations: British Council, Pearson Education, PeopleCert, Cambridge Boxhill Language Assessment, Educational Testing Service, and Language Training and Testing Centre (LTTC).

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing.

Australian Department of Home Affairs. (n.d.). English language visa requirements. https://immi.homeaffairs.gov.au/help-support/meeting-our-requirements/english-language

Cardwell, R. L., Nydick, S. W., Lockwood, J. R., & von Davier, A. A. (2024). Practical considerations when building concordances between English tests. Language Testing, 41(1), 192–202. https://doi.org/10.1177/02655322231195027

Clesham, R., & Hughes, S. R. (2020). 2020 concordance report PTE Academic and IELTS Academic. https://www.pearson.com/content/dam/one-dot-com/one-dot-com/english/SupportingDocs/concordance-report.pdf

Dorans, N. J. (2004). Equating, concordance, and expectation. Applied Psychological Measurement, 28(4), 227–246. https://doi.org/10.1177/0146621604265031

Dorans, N. J., & Holland, P. W. (2000). Population invariance and the equatability of tests: Basic theory and the linear case. Journal of Educational Measurement, 37(4), 281–306. https://doi.org/10.1111/j.1745-3984.2000.tb01088.x

Educational Testing Service. (2010). Linking TOEFL iBT® scores to IELTS® scores: A research report. https://www.ets.org/s/toefl/pdf/linking_toefl_ibt_scores_to_ielts_scores.pdf

Elliot, M., Blackhurst, A., O’Sullivan, B., Clark, T., Dunlea, J., & Saville, N. (2021). Aligning IELTS and PTE-Academic: A measurement study. In N. Saville, B. O’Sullivan, & T. Clark (Eds.), IELTS partnership research papers: Studies in test comparability series (No. 2, pp. 42–64). IELTS Partners: British Council, Cambridge Assessment English and IDP: IELTS Australia.

International Language Testing Association. (2020). International Language Testing Association guidelines for practice. https://www.iltaonline.com/page/ILTAGuidelinesforPractice

Kolen, M. J. (2004). Population invariance in equating and linking: Concept and history. Journal of Educational Measurement, 41(1), 3–14. https://doi.org/10.1111/j.1745-3984.2004.tb01155.x

Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking: Methods and practices (3rd ed.). Springer. https://doi.org/10.1007/978-1-4939-0317-7

Lim, G. S., Geranpayeh, A., Khalifa, H., & Buckendahl, C. W. (2013). Standard setting to an international framework: Implications for theory and practice. International Journal of Testing, 13(1), 32–49. https://doi.org/10.1080/15305058.2012.678526

Pommerich, M. (2007). Concordance: The good, the bad, and the ugly. In N. J. Dorans, M. Pommerich, & P. W. Holland (Eds.), Linking and aligning scores and scales (pp. 200–216). Springer. https://doi.org/10.1007/978-0-387-49771-6_11

Pommerich, M., Hanson, B. A., Harris, D. J., & Sconing, J. A. (2004). Issues in conducting linkages between distinct tests. Applied Psychological Measurement, 28(4), 247–273. https://doi.org/10.1177/0146621604265033

Taylor, L. (2004). Issues of test comparability. Research Notes, 15(2), 2–5. https://www.cambridgeenglish.org/images/23131-research-notes-15.pdf

Test score comparison tables: How well are they serving test users?
