Abstract

Researchers frequently evaluate rater judgments in performance assessments for evidence of differential rater functioning (DRF), which occurs when rater severity is systematically related to construct-irrelevant student characteristics after controlling for student achievement levels. However, researchers have observed that methods for detecting DRF may be limited in sparse rating designs, where it is not possible for every rater to score every student. In these designs, there is limited information with which to detect DRF. Sparse designs can also exacerbate the impact of artificial DRF, which occurs when raters are inaccurately flagged for DRF due to statistical artifacts. In this study, a sequential method is adapted from previous research on differential item functioning (DIF) that allows researchers to detect DRF more accurately and distinguish between true and artificial DRF. Analyses of data from a rater-mediated writing assessment and a simulation study demonstrate that the sequential approach results in different conclusions about which raters exhibit DRF. Moreover, the simulation study results suggest that the sequential procedure results in improved accuracy in DRF detection across a variety of rating design conditions. Practical implications for language testing research are discussed.

Keywords

Many-facet Rasch model, performance assessment, rater bias, rater effects, rater-mediated assessment

A sequential approach to detecting differential rater functioning in sparse rater-mediated assessment networks