Recurrent Issues and Recent Advances in Scoring Performance Assessments

Abstract

A conceptual framework is provided to guide the development of scoring procedures for performance assessments. Within this framework are four questions: (1) what aspects of the performance are to be scored, (2) what criteria are to be applied to evaluate the identified aspects or components of the performance, (3) how should the scoring criteria be developed, and (4) how should these criteria be applied? The historical limitations of performance assessments are reviewed and recent efforts to avoid those limitations are discussed in the context of these questions.
