Abstract
We used multivariate generalizability theory to examine the reliability of hand-scoring and automated essay scoring (AES) and to identify how these scoring methods could be used in conjunction to optimize writing assessment. Students (n = 113) included subsamples of struggling and non-struggling writers in Grades 3–5 drawn from a larger study. Students wrote six essays across three genres. All essays were hand-scored by four raters and by an AES system called Project Essay Grade (PEG). Both scoring methods were highly reliable, but PEG was more reliable for non-struggling students, while hand-scoring was more reliable for struggling students. We provide recommendations for optimizing writing assessment by blending hand-scoring with AES.