Abstract

Applicants must often demonstrate adequate English proficiency when applying to postsecondary institutions by taking an English language proficiency test, such as the TOEFL iBT, IELTS Academic, or Duolingo English Test (DET). Concordance tables aim to provide equivalent scores across multiple assessments, helping admissions officers to make fair decisions regardless of the test that an applicant took. We present our approaches to addressing practical (i.e., data collection and analysis) challenges in the context of building concordance tables between overall scores from the DET and those from the TOEFL iBT and IELTS Academic tests. We summarize a novel method for combining self-reported and official scores to meet recommended minimum sample sizes for concordance studies. We also evaluate sensitivity of estimated concordances to choices about how to (a) weight the observed data to the target population; (b) define outliers; (c) select appropriate pairs of test scores for repeat test takers; and (d) compute equating functions between pairs of scores. We find that estimated concordance functions are largely robust to different combinations of these choices in the regions of the proficiency distribution most relevant to admissions decisions. We discuss implications of our results for both test users and language testers.

Keywords

achievement tests, admissions testing, concordance, English proficiency testing, equating, higher education, test validity

Inter-test concordances are one of many resources that support fair decision-making across different high-stakes assessments used for similar purposes. For example, the TOEFL iBT (ETS, 2023; henceforth “TOEFL”), IELTS Academic (IELTS, 2023; henceforth “IELTS”), and the Duolingo English Test (DET, 2023; henceforth “DET”) are widely accepted by postsecondary institutions to satisfy English language proficiency (ELP) admissions requirements (Isbell & Kremmel, 2020). Concordance tables among TOEFL, IELTS, and DET help institutions make more equitable admissions decisions among international applicants who may be equally qualified but take different ELP tests for logistical or economic reasons. Given the recent emergence of new ELP tests and the introduction of at-home versions of legacy center-based tests, it is increasingly important for test score users to understand concordances and test comparability, and for test developers to have scalable processes for producing and updating concordance tables.

Because concordance tables influence high-stakes decisions, they must be robust to methodological decisions; how the data set is constructed and concordance tables computed may substantively impact the results. Prior ELP concordance studies (e.g., Clesham & Hughes, 2020; ETS, 2010) contain limited information about the methodological choices made, such as adjustments to the data, the particular form of log-linear presmoothing, or justification for the chosen equating methods. This paper presents the primary data-gathering and methodological challenges, and our approaches to addressing them, encountered when building concordance tables between DET overall scores and those of TOEFL and IELTS.

Context

While the IELTS, TOEFL, and DET differ in task design and test administration, all three tests purport to measure ELP and to be suitable for use in postsecondary admissions (Isbell & Kremmel, 2020; LaFlair et al., 2022; Powers et al., 2017). Furthermore, previous studies have demonstrated strong total-score correlations for TOEFL–IELTS ($r = .73$; ETS, 2010), DET–TOEFL ($r = .77$; LaFlair & Settles, 2019), and DET–IELTS ($r = .78$; LaFlair & Settles, 2019). Their concurrent use for admissions purposes and the observed inter-test correlations motivate the estimation of concordances. Concordance does not require that the tests be built to the same specifications, and thus does not imply that corresponding scores are wholly interchangeable (Pommerich, 2007). A concordance is therefore not sufficient evidence of test comparability, and score users are encouraged to consider additional evidence of construct similarity (Bachman et al., 1988). Many sources (e.g., Pommerich & Dorans, 2004) have discussed the suitability and limitations of using concordance to compare scores across tests built to different specifications.

The DET previously produced total-score concordances with TOEFL ($n = 2{,}319$) and IELTS ($n = 991$) using self-reported data (LaFlair & Settles, 2019). Use of DET for admissions has since increased dramatically worldwide (see https://englishtest.duolingo.com/institutions), including increased adoption for graduate admissions, contributing to substantial changes in the composition of the test-taker population (e.g., regarding age and country of origin). In addition, in response to stakeholder needs during the COVID-19 pandemic, as well as enduring preferences of “half of the [TOEFL] test-taking population” for an at-home testing option (Stacey, 2020), Educational Testing Service (ETS) began offering an at-home TOEFL version (Papageorgiou & Manna, 2021). The International English Language Testing System (IELTS) also produced a temporary at-home version during the pandemic (Clark et al., 2021) and later announced plans for a permanent at-home offering (IELTS, 2021). (This study includes data from only the center-based IELTS because the at-home version was not widely available during data collection.) These changes in test offerings, accepting institutions, and the DET test-taker population motivate a re-evaluation of concordance tables because concordances quantify relationships between tests in a specific target population (von Davier et al., 2004, p. 4) and within a particular time frame.

Methods

A concordance requires collecting data, specifying the target population, and choosing equating methods.

Data collection

Obtaining individuals’ scores on tests developed by different organizations is challenging because few people attempt multiple tests, and data from other organizations are not readily available. To obtain official DET–TOEFL and DET–IELTS paired score data, we contacted DET test takers who had taken the test since late March 2022 and invited them to submit official IELTS or TOEFL score reports (from tests taken within 90 days of the DET) in exchange for compensation. This provided sample sizes of 1,643 (IELTS) and 328 (TOEFL).

To meet recommended minimum sample sizes (e.g., 1,500 suggested by Kolen & Brennan, 2014, p. 304), we also included self-reported IELTS/TOEFL scores in analyses. These scores are requested during the DET exit survey. Potential reporting error was investigated using a subset of test takers who both submitted an official score report and self-reported their scores (the “paired sample”). Figure 1 shows that average self-reported scores are slightly higher than average official scores for both TOEFL and IELTS across the DET score range.

Figure 1.

Estimated average self-reporting bias for test takers with different DET scaled scores. The sample sizes for paired score data are $n = 1{,}228$ (IELTS) and $n = 294$ (TOEFL).

We adjusted the self-reported scores to account for the mean reporting bias as follows. For each of IELTS and TOEFL at a given DET scale score point (10–160 in 5-point increments), we identified the test takers in the paired sample whose DET score equaled the score or one scale point above or below.¹ We computed the average reporting bias $B \geq 0$ for these test takers. We then adjusted the conditional distribution of all self-reported scores at the target DET score so that its mean was reduced by $B$ using minimum discriminant information adjustment (MDIA; Haberman, 1984), a method for minimally adjusting distributions to meet constraints. These procedures added 3,806 IELTS records and 1,276 TOEFL records to the analysis sample.
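The mean-shift adjustment described above can be illustrated with a minimal sketch. It assumes a single discrete conditional distribution and a single mean constraint, in which case the minimum-discriminant-information solution takes the exponential-tilting form. The function name `tilt_to_mean`, the search bracket for `theta`, and the example numbers are hypothetical and are not taken from the study.

```python
import math

def tilt_to_mean(scores, probs, target_mean, tol=1e-10):
    """Return the distribution closest to `probs` (in Kullback-Leibler
    divergence) whose mean over `scores` equals `target_mean`.

    The solution is proportional to p_i * exp(theta * x_i); theta is found
    by bisection because the tilted mean increases with theta.
    """
    if not min(scores) < target_mean < max(scores):
        raise ValueError("target_mean must lie strictly inside the score range")

    def tilted(theta):
        exponents = [theta * x for x in scores]
        shift = max(exponents)  # subtract the max so exp() cannot overflow
        weights = [p * math.exp(e - shift) for p, e in zip(probs, exponents)]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(dist):
        return sum(x * q for x, q in zip(scores, dist))

    lo, hi = -50.0, 50.0  # assumed search bracket for theta
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean(tilted(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return tilted((lo + hi) / 2.0)

# Hypothetical self-reported IELTS distribution at one DET score point (mean 6.0)
ielts = [5.0, 5.5, 6.0, 6.5, 7.0]
self_reported = [0.10, 0.20, 0.40, 0.20, 0.10]
bias = 0.1  # hypothetical reporting bias B
adjusted = tilt_to_mean(ielts, self_reported, 6.0 - bias)
```

The adjusted distribution keeps the same support and moves just enough probability toward lower scores to reduce the mean by the assumed bias.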

Weighting to target population

We define the target population as the DET test-taker population. However, not all DET test takers also take TOEFL or IELTS, and few of those who do submitted score reports. We thus used weighting to adjust our analysis sample to match the marginal distribution of DET scores in the target population. Specifically, we adjusted the joint distribution of DET and IELTS/TOEFL scores to weight data points more (or less) if the DET score was underrepresented (or overrepresented) for those who reported TOEFL/IELTS scores. This process ensured that statistics of the weighted distribution (e.g., marginal mean and variance) matched statistics of the target DET population.
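As a rough illustration of this reweighting idea, the sketch below computes simple post-stratification weights on the DET score alone. It is an assumption-laden simplification, not the procedure used in the study: the function `det_weights`, the toy sample, and the target proportions are all hypothetical.

```python
from collections import Counter

def det_weights(sample_det, target_props):
    """Weight each record by (target share) / (sample share) of its DET score,
    so that the weighted DET marginal matches `target_props`."""
    n = len(sample_det)
    counts = Counter(sample_det)
    return [target_props[s] / (counts[s] / n) for s in sample_det]

# Hypothetical sample in which DET 120 is underrepresented relative to the target
sample_det = [100, 100, 100, 120]
target_props = {100: 0.5, 120: 0.5}
weights = det_weights(sample_det, target_props)
```

Records at an underrepresented DET score receive weights above 1 and records at an overrepresented score receive weights below 1, and the weights sum to the sample size.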

Concordance process

Although equating and concordance differ in assumptions (e.g., Pommerich, 2007), they employ the same methods. An equating analysis entails multiple decisions, which could impact the resulting concordance. We conducted sensitivity analyses to guide selection of the final concordance. Sensitivity analysis entails repeating a data analysis multiple times while varying aspects of the data and/or method to see how these decisions impact results. The five varied factors were (1) disambiguating multiple scores for individual test takers (highest or most recent); (2) removing bivariate outliers ($z$-score difference of 2 or 3); (3) combining official and self-report data (all self-report, equally weighted, all official); (4) performing log-linear smoothing or population weighting first; and (5) using equipercentile or kernel equating. Fully crossing all varied factors ($2 \times 2 \times 3 \times 2 \times 2$) produced 48 candidate concordances for each of TOEFL and IELTS.
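The full crossing of the five factors can be enumerated with a short sketch. The dictionary keys and level labels below are illustrative names chosen for this example; only the factor levels themselves come from the text.

```python
from itertools import product

# Factor levels as described in the text; key names are illustrative
factors = {
    "disambiguation": ["highest", "recent"],
    "outlier_z": [2, 3],
    "data_source": ["self-reported", "combined", "official"],
    "smooth_before": ["equating", "weighting"],
    "equating": ["equipercentile", "kernel"],
}

# One dict per candidate concordance
conditions = [dict(zip(factors, combo)) for combo in product(*factors.values())]
kernel_only = [c for c in conditions if c["equating"] == "kernel"]
```

Restricting to a single equating method leaves 24 conditions per test.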

We evaluated results by comparing candidate concordances and examining the conditional standard errors of equating (SEE). Given that the concordance process included adjusting self-reported data, reweighting the analysis sample, and smoothing, analytical SEE formulas (e.g., von Davier et al., 2004) potentially underestimate the uncertainty in this analysis. We thus computed SEEs by bootstrapping the entire estimation procedure (see Online Supplemental for details).
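A minimal sketch of bootstrapped standard errors follows. The key idea is that the `estimator` argument should wrap every step of the pipeline so that each resample repeats the whole procedure. Here a simple linear linking function stands in for the full procedure purely for illustration; the function names, the synthetic data, and the number of replications are assumptions.

```python
import random
import statistics

def linear_link(pairs):
    """Stand-in estimator: linear linking that matches means and SDs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    return lambda x: my + (sy / sx) * (x - mx)

def bootstrap_see(pairs, estimator, score_points, n_boot=200, seed=0):
    """Standard deviation of the estimated conversion at each score point
    across bootstrap resamples of the score pairs."""
    rng = random.Random(seed)
    draws = {x: [] for x in score_points}
    for _ in range(n_boot):
        resample = rng.choices(pairs, k=len(pairs))  # sample pairs with replacement
        convert = estimator(resample)                # rerun the whole procedure
        for x in score_points:
            draws[x].append(convert(x))
    return {x: statistics.stdev(values) for x, values in draws.items()}

# Synthetic (DET, IELTS-like) pairs, for illustration only
data_rng = random.Random(1)
pairs = [(x, 0.05 * x + data_rng.gauss(0, 0.5))
         for x in range(60, 161, 5) for _ in range(10)]
see = bootstrap_see(pairs, linear_link, [80, 110, 140])
```

In a real analysis the stand-in estimator would be replaced by a function that performs bias adjustment, weighting, smoothing, and equating on each resample.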

Score disambiguation

An individual might take the DET multiple times and/or self-report multiple TOEFL/IELTS scores. To uniquely identify DET scores, we first kept only the first seven DET attempts (including sessions not certified due to equipment error or minor rule violations); only 1.1% of test takers attempted the DET more than seven times between January 2019 and July 2022. We then excluded all attempts prior to March 28, 2022, coinciding with the launch of a new item type on the test. We finally chose a test taker’s highest or most recent DET score, depending on the condition in the sensitivity analysis (see Table 1). These disambiguation rules arguably produce scores that are representative of the scores that institutions receive from applicants. To uniquely identify the IELTS/TOEFL self-report score, we chose a test taker’s highest or most recent score depending on condition, and then ensured that the corresponding assessment date was at most 4 months² before or after the DET assessment date.
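The selection rules above can be sketched as follows. The `Attempt` class, the function names, and the 122-day approximation of a 4-month window are assumptions made for this example, and the earlier filtering steps (attempt cap and launch-date cutoff) are omitted.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Attempt:
    score: float
    taken: date

def pick(attempts, rule):
    """Select one attempt using the 'highest' or 'recent' rule."""
    if rule == "highest":
        # Ties on score are broken in favor of the later date (an assumption)
        return max(attempts, key=lambda a: (a.score, a.taken))
    if rule == "recent":
        return max(attempts, key=lambda a: a.taken)
    raise ValueError(f"unknown rule: {rule}")

def pair_scores(det_attempts, other_attempts, rule, max_gap_days=122):
    """Return the selected (DET, other-test) pair, or None if the two test
    dates are further apart than the allowed window."""
    det = pick(det_attempts, rule)
    other = pick(other_attempts, rule)
    if abs((det.taken - other.taken).days) > max_gap_days:
        return None
    return det, other

# Hypothetical test taker with three DET attempts and one IELTS score
det = [Attempt(110, date(2022, 4, 10)),
       Attempt(120, date(2022, 5, 1)),
       Attempt(115, date(2022, 6, 15))]
ielts = [Attempt(6.5, date(2022, 5, 20))]
```

For this test taker the two rules select different DET scores, which is the kind of difference the sensitivity analysis is designed to examine.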

Table 1.

Conditions of the Concordance Study Sensitivity Analysis.

Condition	Disambiguation	Outlier threshold ($z$-score difference)	Data source	Smooth before
1	Highest	2	Self-reported	Equating
2	Recent	2	Self-reported	Equating
3	Highest	3	Self-reported	Equating
4	Recent	3	Self-reported	Equating
5	Highest	2	Combined	Equating
6	Recent	2	Combined	Equating
7	Highest	3	Combined	Equating
8	Recent	3	Combined	Equating
9	Highest	2	Official	Equating
10	Recent	2	Official	Equating
11	Highest	3	Official	Equating
12	Recent	3	Official	Equating
13	Highest	2	Self-reported	Weighting
14	Recent	2	Self-reported	Weighting
15	Highest	3	Self-reported	Weighting
16	Recent	3	Self-reported	Weighting
17	Highest	2	Combined	Weighting
18	Recent	2	Combined	Weighting
19	Highest	3	Combined	Weighting
20	Recent	3	Combined	Weighting
21	Highest	2	Official	Weighting
22	Recent	2	Official	Weighting
23	Highest	3	Official	Weighting
24	Recent	3	Official	Weighting

Outlier removal

It is reasonable to expect that a small proportion of test scores reflect a large measurement error (e.g., due to illness), and therefore do not accurately represent a test taker’s ability. As shown in Figure 2, some data points appear to deviate noticeably from the bivariate relationship. We wanted to remove such score pairs that might unduly influence the final concordance, making it less accurate for the majority of test takers. We eliminated pairs of scores if the DET $z$-score was at least 2 or 3 units away from the corresponding TOEFL or IELTS $z$-score, depending on the sensitivity analysis condition. These conditions were chosen to balance the elimination of potentially influential data points while keeping nearly all data in the analysis.
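The outlier rule can be sketched in a few lines. This version standardizes each test using the unweighted sample mean and population standard deviation, which is an assumption; the function name and toy data are hypothetical.

```python
import statistics

def remove_outliers(pairs, threshold):
    """Drop (DET, other) pairs whose z-scores differ by at least `threshold`."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, sx = statistics.fmean(xs), statistics.pstdev(xs)
    my, sy = statistics.fmean(ys), statistics.pstdev(ys)
    return [(x, y) for x, y in pairs
            if abs((x - mx) / sx - (y - my) / sy) < threshold]

# Toy data: a perfectly linear relationship plus one discrepant pair
pairs = [(x, x / 20.0) for x in range(60, 161, 5)] + [(150, 3.0)]
kept_at_2 = remove_outliers(pairs, 2)
kept_at_3 = remove_outliers(pairs, 3)
```

In this toy data set the discrepant pair has a $z$-score difference of roughly 2.8, so it is removed under the threshold of 2 but retained under the threshold of 3.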

Figure 2.

Scatterplots between DET scaled scores and official TOEFL/IELTS scores.

Data source

Even after adjusting for reporting bias, official and self-report data sources might yield different concordance outcomes. We thus estimated concordances using only official data, only self-report data, and the combined data.

Smoothing order

Log-linear pre-smoothing (von Davier et al., 2004) could be applied before or after the data-weighting step. The order of weighting and pre-smoothing could impact the results. We thus evaluated both orders (weighting first and pre-smoothing first).

Equating method

We compared two standard equating methods: equipercentile (Kolen & Brennan, 2014) and kernel equating (von Davier et al., 2004). Due to highly similar results from both methods, only kernel-based results are reported. (See the Online Supplemental for details on equipercentile and kernel equating and the rationale for choosing kernel.)
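For readers unfamiliar with equipercentile methods, the sketch below shows the core idea in a deliberately simplified form: a score on one test is mapped to the score on the other test that has the same percentile rank. It ignores weighting, presmoothing, and continuization, so it is not the procedure used in the study, and kernel equating is not shown. The function name and synthetic data are assumptions.

```python
import numpy as np

def equipercentile(x_scores, y_scores, x_points):
    """Map each value in x_points to the y score with the same percentile rank."""
    x = np.sort(np.asarray(x_scores, dtype=float))
    y = np.asarray(y_scores, dtype=float)
    # Mid-percentile rank: P(X < x) + 0.5 * P(X == x)
    below = np.searchsorted(x, x_points, side="left")
    at_or_below = np.searchsorted(x, x_points, side="right")
    ranks = (below + at_or_below) / (2.0 * x.size)
    # Empirical quantile of y at the same rank (linear interpolation)
    return np.quantile(y, ranks)

# Synthetic score samples, for illustration only
rng = np.random.default_rng(0)
det = rng.normal(110, 20, 2000).round()
toefl = rng.normal(85, 18, 2000)
mapped = equipercentile(det, toefl, [90, 110, 130])
```

Because both steps are monotone, higher input scores never map to lower output scores.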

Table 1 summarizes the 24 conditions of the sensitivity analysis resulting from fully crossing all manipulated factors except equating method.

Results

Figure 3 displays concordance results across studied conditions using kernel equating. The $x$-axis represents the condition (Table 1), and the horizontal lines represent DET scaled-score points. Notice that the largest discrepancy is between Conditions 12 and 13, which represents the change from weighting first to pre-smoothing first. For the final concordance, we chose to apply weighting before smoothing so that smoothing affects only the equating step. Reversing the order of these steps could substantially affect the final concordance, especially at the highest and lowest scaled scores.

Figure 3.

Results across all conditions, with the $x$-axis indicating the condition number (Table 1), the $y$-axis depicting the IELTS/TOEFL map, and each line corresponding to a particular DET scaled score.

Although there was evidence of a small bias in self-report data, combining the official and adjusted self-report data sources had minimal impact on concordance estimates, particularly at score points most relevant to postsecondary admissions decisions (e.g., IELTS 6–7.5 and TOEFL 80–110). At lower score points, especially for TOEFL given the smaller sample, the concordance results are less stable. Supplementing the official data with self-report data thus allowed for a more accurate concordance across a wider score range.

Concordance method selection

We chose a final concordance from the candidate concordances to ensure that it made logical sense given score usage. In addition to combining official and self-report data, we made the following choices. For the disambiguation rule, we picked the most recent score to ensure the data were as current as possible and to minimize the time between tests. For the outlier threshold, we chose a $z$-score difference of 3 to remove influential score pairs but still keep most data in the sample. For smoothing, we smoothed the data as the penultimate step so as to impact only the equating step. Table 2 presents the final concordance between the DET, IELTS Academic, and TOEFL iBT overall scores, corresponding to Condition 8 (Table 1, Figure 3). The chosen condition produced a concordance close to the median across all sensitivity analysis conditions, providing additional justification for using Condition 8 as the final concordance.

Table 2.

Concordance Tables From Selected Method (Condition 8 in Table 1 and Figure 3).

DET scaled	IELTS Academic scaled	TOEFL iBT scaled
160	8.5–9.0	120
155	8.0	119
150	8.0	117–118
145	7.5	113–116
140	7.5	109–112
135	7.0	104–108
130	7.0	98–103
125	6.5	93–97
120	6.5	87–92
115	6.0	82–86
110	6.0	76–81
105	6.0	70–75
100	5.5	65–69
95	5.5	59–64
90	5.0	53–58
85	5.0	47–52
80	5.0	41–46
75	4.5	35–40
70	4.5	30–34
65	4.5	24–29
10–60	0–4.0	0–23

Note. DET = Duolingo English Test; IELTS = International English Language Testing System; TOEFL = Test of English as a Foreign Language.
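As a usage illustration, Table 2 can be encoded as a simple lookup. The values below are transcribed from the table; the data structure and the function names `concord` and `det_for_toefl` are hypothetical conveniences, not part of the study.

```python
# DET score -> ((IELTS low, IELTS high), (TOEFL low, TOEFL high)), from Table 2
TABLE_2 = {
    160: ((8.5, 9.0), (120, 120)), 155: ((8.0, 8.0), (119, 119)),
    150: ((8.0, 8.0), (117, 118)), 145: ((7.5, 7.5), (113, 116)),
    140: ((7.5, 7.5), (109, 112)), 135: ((7.0, 7.0), (104, 108)),
    130: ((7.0, 7.0), (98, 103)),  125: ((6.5, 6.5), (93, 97)),
    120: ((6.5, 6.5), (87, 92)),   115: ((6.0, 6.0), (82, 86)),
    110: ((6.0, 6.0), (76, 81)),   105: ((6.0, 6.0), (70, 75)),
    100: ((5.5, 5.5), (65, 69)),   95: ((5.5, 5.5), (59, 64)),
    90: ((5.0, 5.0), (53, 58)),    85: ((5.0, 5.0), (47, 52)),
    80: ((5.0, 5.0), (41, 46)),    75: ((4.5, 4.5), (35, 40)),
    70: ((4.5, 4.5), (30, 34)),    65: ((4.5, 4.5), (24, 29)),
}
FLOOR_ROW = ((0.0, 4.0), (0, 23))  # single table row covering DET 10-60

def concord(det):
    """Return the IELTS and TOEFL ranges that Table 2 lists for a DET score."""
    if det % 5 != 0 or not 10 <= det <= 160:
        raise ValueError("DET scores run from 10 to 160 in 5-point increments")
    return TABLE_2.get(det, FLOOR_ROW)

def det_for_toefl(toefl):
    """Return the DET score whose TOEFL range contains `toefl`,
    or None when the score falls in the DET 10-60 row."""
    if not 0 <= toefl <= 120:
        raise ValueError("TOEFL iBT total scores run from 0 to 120")
    for det, (_, (low, high)) in TABLE_2.items():
        if low <= toefl <= high:
            return det
    return None
```

For example, a TOEFL iBT requirement of 100 falls in the row for a DET score of 130. As the paper notes, corresponding scores should not be treated as wholly interchangeable.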

Conclusion

Building concordances between high-stakes language tests is intrinsically difficult. Obtaining official data for the same individuals is challenging without institutional collaboration, and the pool of individuals who have taken both tests is often modest. Our data set of official scores was not large, particularly for TOEFL, but not atypical: the most recent TOEFL–IELTS concordance is based on a sample of 1,153 test takers (ETS, 2010). (Conversely, a concordance between ACT and SAT was based on approximately 600,000 test takers, made possible by collaboration between the test developers; The College Board & ACT, 2018.) These limitations commonly faced in the language testing context increase the likelihood that the methodological choices required to produce concordance tables will impact the results.

Our results provide two reasons for optimism. First, despite the intrinsic limitations of self-reported scores, they can usefully contribute to concordance studies, particularly if paired self-reported–official data are available. Our paired sample indicated that test takers tend to over-report their scores. But the magnitude of reporting bias was modest, and a standard adjustment can mitigate the impact of this bias. We found little sensitivity to whether the concordance tables were constructed using adjusted self-reported data alone, official data alone, or combined data. While this finding may not generalize, it warrants evaluation in other contexts because it mitigates the difficulty of collecting data for concordance studies. This methodological innovation is useful for language test developers, given that trends of institutions accepting multiple tests, at-home and center-based versions of tests, and frequent changes to test structure and content imply the need for more frequent concordance studies.

Another reason for optimism is the general robustness to different methodological choices, particularly in the scaled-score regions most relevant to institutional decision-making. This is somewhat surprising given the modest sample sizes, particularly DET–TOEFL, but is welcome given the role of concordance tables in supporting fair evaluations of applicants. This finding is useful for test score users and admissions policymakers, who must evaluate and interpret multiple tests used for the same purpose. However, it cannot be assumed that all concordance studies would be equally robust to methodological decisions. Thus, a clear accounting of decisions regarding data acquisition and analysis, and a demonstration of robustness to such decisions, should be best practice for concordance studies. Such methodological transparency would allow stakeholders to understand the origin and limitations of concordances.

A limitation of this study is that it considers only overall scores. The DET, TOEFL, and IELTS all report an overall score and four subscores. The DET reports integrated subscores (i.e., each subscore reflects a combination of two of speaking, writing, reading, and listening; Cardwell et al., 2023). There is thus no one-to-one correspondence between DET subscores and those of TOEFL/IELTS, and so the methods used here do not apply. As stakeholders use subscores in decision-making, future research should employ alternative methods to compare the tests’ subscores.

While there is no guarantee that our findings will generalize to other concordance analyses between language tests, we expect that the challenges in data acquisition and synthesis and methodological choices will exist in other applications. Thus, our inventory of practical issues and approaches to addressing them may provide useful guidance for conducting sensitivity analyses in concordance studies.

Supplemental Material

Supplemental material, sj-pdf-1-ltj-10.1177_02655322231195027 for Practical considerations when building concordances between English tests by Ramsey L. Cardwell, Steven W. Nydick, J.R. Lockwood and Alina A. von Davier in Language Testing

Footnotes

Acknowledgements

We are grateful to Rogelio Alvarez, Kevin Hao, Shawn Jones, and Anthony Verardi for their indispensable roles in implementing the study. We also thank Paula Winke, Dylan Burton, Ruslan Suvorov, and the Language Testing reviewers for their valuable comments on previous versions and editorial guidance.

Author contributions

R.L.C.: Conceptualization; Data curation; Methodology; Project administration; Writing—original draft; Writing—review & editing.

S.W.N.: Conceptualization; Data curation; Formal analysis; Methodology; Validation; Visualization; Writing—original draft; Writing—review & editing.

J.R.L.: Conceptualization; Formal analysis; Methodology; Supervision; Validation; Writing—original draft; Writing—review & editing.

A.A.v.D.: Methodology; Resources; Supervision; Writing—review & editing.

Declaration of conflicting interests

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: All authors are employees of Duolingo, the developer of the Duolingo English Test.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Ramsey L. Cardwell

Steven W. Nydick

J.R. Lockwood

Alina A. von Davier

Supplemental material

Supplemental material for this article is available online. The video abstract for this article is available at

Notes

References

Bachman, L. F., Kunnan, Vanniarajan, S., & Lynch (1988). Task and ability analysis as a basis for examining content and construct comparability in two EFL proficiency test batteries. Language Testing, 5(2), 128–159. https://doi.org/10.1177/026553228800500203

Cardwell, R. L., Naismith, LaFlair, G. T., & Nydick, S. W. (2023). Duolingo English Test: Technical manual (Duolingo Research Report). https://duolingo-papers.s3.amazonaws.com/other/technical_manual.pdf

Clark, Spiby, & Tasviri (2021). Crisis, collaboration, recovery: IELTS and COVID-19. Language Assessment Quarterly, 18(1), 17–25. https://doi.org/10.1080/15434303.2020.1866575

Clesham & Hughes, S. R. (2020). 2020 Concordance report: PTE Academic and IELTS Academic. Pearson. https://www.pearson.com/content/dam/one-dot-com/one-dot-com/pearson-languages/en-gb/pdfs/gse-resources/gse-research-reports/2020-concordance-report-pte-academic-and-ielts-academic.pdf

The College Board & ACT. (2018). Guide to the 2018 ACT/SAT concordance. https://satsuite.collegeboard.org/media/pdf/guide-2018-act-sat-concordance.pdf

DET. (2023). Duolingo English Test: Certify your English proficiency today. https://englishtest.duolingo.com/applicants

ETS. (2010). Linking TOEFL iBT scores to IELTS scores – a research report. https://www.ets.org/pdfs/toefl/linking-toefl-ibt-scores-to-ielts-scores.pdf

ETS. (2023). TOEFL iBT® Test: The premier test of academic English communication. https://www.ets.org/toefl/test-takers/ibt/about.html

Haberman, S. J. (1984). Adjustment by minimum discriminant information. The Annals of Statistics, 12(3), 971–988. https://doi.org/10.1214/aos/1176346715

IELTS. (2021, October 5). IELTS announces at-home testing option. https://www.ielts.org/en-us/news/2021/ielts-new-at-home-testing-option

IELTS. (2023). What can IELTS do for you? Discover a world of opportunity with IELTS. https://ielts.org/take-a-test/why-choose-ielts/what-can-ielts-do-for-you

Isbell, D. R., & Kremmel (2020). Test review: Current options in at-home language proficiency tests for making high-stakes decisions. Language Testing, 37(4), 600–619. https://doi.org/10.1177/0265532220943483

Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking. Springer.

LaFlair, G. T., Langenfeld, Baig, Horie, A. K., Attali, & von Davier, A. A. (2022). Digital-first assessments: A security framework. Journal of Computer Assisted Learning, 38, 1077–1086. https://doi.org/10.1111/jcal.12665

LaFlair, G. T., & Settles (2019). Duolingo English Test: Technical manual (Duolingo Research Report). https://s3.amazonaws.com/duolingo-papers/other/Duolingo%20English%20Test%20-%20Technical%20Manual%202019.pdf

Papageorgiou & Manna, V. F. (2021). Maintaining access to a large-scale test of academic language proficiency during the pandemic: The launch of TOEFL iBT Home Edition. Language Assessment Quarterly, 18(1), 36–619. https://doi.org/10.1080/15434303.2020.1864376

Pommerich (2007). Concordance: The good, the bad, and the ugly. In Dorans, N. J., Pommerich, & Holland, P. W. (Eds.), Linking and aligning scores and scales (pp. 200–216). Springer.

Pommerich & Dorans, N. J. (Eds.) (2004). Concordance [Special issue]. Applied Psychological Measurement, 28(4), 216–289. https://doi.org/10.1177/0146621604265028

Powers, Schedl, & Papageorgiou (2017). Facilitating the interpretation of English language proficiency scores: Combining scale anchoring and test score mapping methodologies. Language Testing, 34(2), 175–194. https://doi.org/10.1177/0265532215623582

Stacey (2020, December 10). ETS adds TOEFL Home Edition to product line. The PIE News. https://thepienews.com/news/testing/ets-adds-toefl-home-edition-to-product-line/

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The Kernel method of test equating. Springer.
