Abstract

Halo effects are a persistent rater error that poses a significant threat to the validity and fairness of assessments involving human raters. This research introduces a novel approach to analyzing halo effects by considering rating times recorded automatically in technology-based, online, or onscreen assessments as an additional data source. For this purpose, we propose the mixture Rasch facets model for halo with rating time. Utilizing Bayesian parameter estimation methods, we applied the model to a real dataset from a large-scale Chinese writing assessment. We found that rating time predicted illusory halo, such that longer rating times were associated with a greater likelihood of observing halo effects. Compared to traditional models, a model that considered rating time as an additional variable showed better data–model fit and preserved the integrity of the latent scale, maintaining the informative value each rating criterion conveyed regarding the examinees’ performances. A simulation study confirmed the model’s superior parameter recovery across different levels of the impact that rating time had on the likelihood of halo error. Findings suggest that integrating rating time into measurement models can enhance the detection of illusory halo effects, enrich research on the psychological mechanisms underlying these effects, and improve the overall psychometric quality of assessments. The discussion focuses on implications for model development and future research into rater effects.
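
As a rough illustration of the idea summarized above, the following Python sketch shows one way rating time could enter a two-class mixture of Rasch facets models. It is not the authors' specification. The logistic link on log rating time, the coefficient names `gamma0` and `gamma1`, the function names, all numeric values, and the choice to represent halo by collapsing criterion difficulties to a common value are assumptions made only for illustration.

```python
import math

def halo_class_probability(rating_time_sec, gamma0=-3.0, gamma1=0.6):
    # Assumed logistic link on log rating time (hypothetical coefficients).
    # gamma1 > 0 encodes "longer rating time, higher chance of halo".
    z = gamma0 + gamma1 * math.log(rating_time_sec)
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(theta, severity, difficulty, thresholds):
    # Rating-scale Rasch facets form: the log-odds of category k over k-1
    # is theta - severity - difficulty - tau_k.
    cum = [0.0]
    for tau in thresholds:
        cum.append(cum[-1] + theta - severity - difficulty - tau)
    top = max(cum)  # subtract the maximum for numerical stability
    weights = [math.exp(c - top) for c in cum]
    total = sum(weights)
    return [w / total for w in weights]

def mixture_probs(theta, severity, difficulty, mean_difficulty,
                  thresholds, rating_time_sec):
    # Assumption: in the halo class the rater does not differentiate
    # criteria, so a common difficulty replaces the criterion-specific one.
    pi_halo = halo_class_probability(rating_time_sec)
    normal = category_probs(theta, severity, difficulty, thresholds)
    halo = category_probs(theta, severity, mean_difficulty, thresholds)
    return [pi_halo * h + (1.0 - pi_halo) * n for h, n in zip(halo, normal)]

# Hypothetical examinee, rater, and criterion values on the logit scale.
probs = mixture_probs(theta=0.5, severity=0.2, difficulty=0.8,
                      mean_difficulty=0.0, thresholds=[-1.0, 0.0, 1.0],
                      rating_time_sec=90.0)
```

Here `probs` holds the marginal probabilities of the four rating categories for a single rating. In a Bayesian analysis of the kind the abstract describes, the class membership and all parameters would be estimated from the data rather than fixed as they are in this sketch.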

Keywords

Halo effects, rating time, Rasch measurement, latent classes, Bayesian modeling

Does Rating Time Predict Illusory Halo? A Mixture Rasch Facets Analysis of Halo Effects in Onscreen Assessments
