Abstract

Score linking is widely used to place scores from different assessments, or the same assessment under different conditions, onto a common scale. A central concern is whether the linking function is invariant across subpopulations, as violations may threaten fairness. However, evaluating subpopulation differences in linked scores is challenging because linking error is not independent of sampling and measurement error when the same data are used to estimate the linking function and to compare score distributions. We show that common approaches involving neglecting linking error or treating it as independent substantially overestimate the standard errors of subpopulation differences. We introduce new methods that account for linking error dependencies. Simulation results demonstrate the accuracy of the proposed methods, and a practical example with real data illustrates how improved standard error estimation enhances power for detecting subpopulation non-invariance.
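The dependence argument can be illustrated with a minimal sketch; this is a toy simulation, not the methods proposed in the article. It assumes a random-groups design with simple mean linking, normally distributed scores, a subgroup that makes up a fixed share of each sample, and invariance holding in the population. All function names, parameter names, and default values are hypothetical choices made for this illustration. The linked subgroup mean is compared with the subgroup mean on the target form, and the Monte Carlo standard error of that difference is set against three formulas: one that neglects linking error, one that adds it as if independent, and one that includes the covariance that arises because the subgroup is part of the sample used to estimate the link.

```python
import math
import random
import statistics


def simulate_difference(rng, n=400, n_g=120, sigma=10.0, shift=5.0):
    """One replication of the linked-minus-target subgroup mean difference."""
    # Sample A takes form X and sample B takes form Y; Y is X plus a constant shift.
    x = [rng.gauss(50.0, sigma) for _ in range(n)]
    y = [rng.gauss(50.0 + shift, sigma) for _ in range(n)]
    # The first n_g examinees of each sample form the subgroup (invariance holds).
    x_g, y_g = x[:n_g], y[:n_g]
    # Mean linking estimated from the SAME full samples: l(x) = x - mean(X) + mean(Y).
    link_shift = statistics.fmean(y) - statistics.fmean(x)
    return (statistics.fmean(x_g) + link_shift) - statistics.fmean(y_g)


def compare_standard_errors(reps=2000, n=400, n_g=120, sigma=10.0, seed=1):
    """Return (empirical, neglect, independent, dependent) standard errors."""
    rng = random.Random(seed)
    diffs = [simulate_difference(rng, n, n_g, sigma) for _ in range(reps)]
    empirical = statistics.stdev(diffs)
    var_subgroup = 2 * sigma**2 / n_g  # sampling variance of the two subgroup means
    var_link = 2 * sigma**2 / n        # variance of the estimated linking shift
    neglect = math.sqrt(var_subgroup)                 # ignores linking error
    independent = math.sqrt(var_subgroup + var_link)  # adds it as if independent
    # Each subgroup mean has covariance sigma^2 / n with its full-sample mean, so
    # the covariance terms contribute -2 * var_link in total to the variance.
    dependent = math.sqrt(var_subgroup + var_link - 2 * var_link)
    return empirical, neglect, independent, dependent


if __name__ == "__main__":
    emp, neg, ind, dep = compare_standard_errors()
    print(f"empirical={emp:.3f} neglect={neg:.3f} "
          f"independent={ind:.3f} dependent={dep:.3f}")
```

Under these assumptions, both naive formulas exceed the Monte Carlo standard error, while the covariance-adjusted formula tracks it closely. The closed-form covariance used here is specific to mean linking with simple random samples; measurement error, complex sampling, and nonlinear linking functions are outside the scope of this sketch.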

Keywords

linking, equating, standard error, subpopulation invariance, mode, bridging

Standard Error Estimation for Subpopulation Non-invariance
