Abstract
It is common practice for assessment programs to organize qualifying sessions during which the raters (often known as “markers” or “judges”) demonstrate their consistency before operational rating commences. Because of the high-stakes nature of many rating activities, the research community continually explores new methods for analyzing rating data. Using simulated and empirical data from two high-stakes language assessments, we propose a new approach, based on social network analysis and exponential graph models, to evaluate the readiness of a group of raters for operational rating. The results of this innovative approach are compared with the results of a Rasch analysis, a well-established approach for the analysis of such data. We also demonstrate how the new approach can be used in practice to investigate important research questions, such as whether rater severity is stable across rating tasks. The merits of the new approach and its consequences for practice are discussed.
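To make the network framing concrete, here is a minimal sketch (not the authors' implementation) of how a group of raters might be represented as an agreement network: raters become nodes, and an edge connects two raters when they give identical scores on a sufficient share of jointly rated responses. The rating data, rater names, and the 0.6 agreement threshold are all hypothetical, chosen purely for illustration; in the article's approach, such a network would then be analyzed with exponential graph models.

```python
# Hypothetical scores assigned by four raters to the same five responses.
ratings = {
    "R1": [3, 4, 2, 5, 3],
    "R2": [3, 4, 2, 5, 4],
    "R3": [1, 2, 5, 3, 1],
    "R4": [3, 4, 3, 5, 3],
}

def agreement(a, b):
    """Share of responses on which two raters give identical scores."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

THRESHOLD = 0.6  # arbitrary cut-off, for illustration only

raters = sorted(ratings)
# Undirected edges between raters whose exact-agreement rate meets the cut-off.
edges = [
    (r1, r2)
    for i, r1 in enumerate(raters)
    for r2 in raters[i + 1:]
    if agreement(ratings[r1], ratings[r2]) >= THRESHOLD
]
print(edges)  # R1, R2, and R4 form a connected cluster; R3 is isolated
```

In this toy network, the isolated rater (R3) stands out as a candidate for further qualification or retraining before operational rating, which is the kind of readiness judgment the proposed approach aims to support.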
