Abstract
Rater comments are typically analyzed qualitatively to reveal how raters apply rating scales. This study applied natural language processing (NLP) techniques to quantify meaningful, behavioral information from a corpus of rater comments and triangulated that information with a many-facet Rasch measurement (MFRM) analysis of rater scores. The data consisted of ratings on 987 essays by 36 raters (a total of 3948 analytic scores and 1974 rater comments) on a post-admission English Placement Test (EPT) at a large US university. We computed a set of comment-based features based on the analytic components and evaluative language the raters used to infer whether raters were aligned to the scale. For data triangulation, we performed correlation analyses between the MFRM measures of rater performance and the comment-based measures. Although the EPT raters showed overall satisfactory performance, we found meaningful associations between rater comments and performance features. In particular, raters with higher precision and better fit to the Rasch model's predictions used more analytic components and used evaluative language more similar to the scale descriptors. These findings suggest that NLP techniques have the potential to help language testers analyze rater comments and understand rater behavior.
