Abstract
From their earliest origins, automated essay scoring systems have striven to emulate human essay scores, viewing them as their ultimate validity criterion. Consequently, the importance (or weight), and even the identity, of the computed essay features in the composite machine score were determined by statistical techniques that sought to optimally predict human scores from essay features. However, machine evaluation of essays is fundamentally different from human evaluation and is therefore unlikely to measure the same set of writing skills. As a result, the feature weights of human-prediction machine scores (which reflect the features' importance in the composite score) are bound to reflect statistical artifacts. This article suggests alternative feature-weighting schemes based on the premise of maximizing the reliability and internal consistency of the composite score. In the context of a large-scale writing assessment, the article shows that these alternative weighting schemes differ significantly from human-prediction weights and give rise to comparable or even superior reliability and validity coefficients.
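As a concrete illustration of the reliability-maximizing idea, the sketch below computes weights that maximize Cronbach's alpha of a weighted feature composite, using the standard result that such weights are obtained from the top eigenvector of the diagonally rescaled covariance matrix. The simulated data, feature count, and factor loadings are purely hypothetical and are not the article's actual features, data, or method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation: 500 essays, 4 machine-computed features that
# share a common writing-quality factor with differing loadings.
n, k = 500, 4
loadings = np.array([0.9, 0.7, 0.5, 0.3])
factor = rng.standard_normal(n)
X = factor[:, None] * loadings + rng.standard_normal((n, k))

def weighted_alpha(w, cov):
    """Cronbach's alpha of the composite sum_i w_i * x_i."""
    k = len(w)
    total_var = w @ cov @ w                    # variance of the composite
    item_vars = (w**2 * np.diag(cov)).sum()    # sum of weighted-item variances
    return k / (k - 1) * (1 - item_vars / total_var)

cov = np.cov(X, rowvar=False)

# Alpha-maximizing weights: maximizing w'Σw / w'Dw (D = diag Σ) is a
# Rayleigh-quotient problem solved by the top eigenvector of
# D^{-1/2} Σ D^{-1/2}, rescaled by D^{-1/2}.
d = np.sqrt(np.diag(cov))
M = cov / np.outer(d, d)
_, vecs = np.linalg.eigh(M)                    # eigenvalues in ascending order
w_alpha = vecs[:, -1] / d
w_alpha /= w_alpha.sum()                       # normalize (fixes sign, scale)

w_equal = np.full(k, 1 / k)
print("equal-weight alpha:", weighted_alpha(w_equal, cov))
print("alpha-maximizing  :", weighted_alpha(w_alpha, cov))
```

In this simulation the alpha-maximizing weights upweight high-loading features, and the resulting composite's alpha is at least as large as that of the equal-weight composite; human-prediction (regression) weights carry no such guarantee.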