Using Portfolios in Program Evaluation: An Investigation of Interrater Reliability

Abstract

Portfolios and other open-ended assessments are increasingly incorporated into evaluations and testing programs. However, questions about the reliability of such assessments continue to be raised. After reviewing forces that may be leading to increased interest in and use of portfolio assessment, we investigate the interrater reliability of a portfolio assessment used in a small-scale program evaluation. Three types of portfolio scores were investigated: analytic, combined analytic (formed by summing across analytic scores), and holistic. The interrater reliability coefficient was highest for summed analytic scores (r = .86). Results indicate that at least three raters are required to obtain acceptable levels of reliability for holistic and individual analytic scores.
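The abstract does not say how the required number of raters was derived. One common way to reason about such a claim is the Spearman-Brown prophecy formula, which projects the reliability of a score averaged over k raters from the reliability of a single rater. The sketch below is an illustration under that assumption, not the study's own procedure; the single-rater value of 0.50 and the 0.70 threshold are hypothetical and are not taken from the article.

```python
from typing import Optional


def spearman_brown(single_rater_r: float, n_raters: int) -> float:
    """Projected reliability of the mean of n_raters ratings (Spearman-Brown)."""
    if not 0.0 < single_rater_r < 1.0:
        raise ValueError("single_rater_r must lie strictly between 0 and 1")
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)


def min_raters(single_rater_r: float, target: float,
               max_raters: int = 20) -> Optional[int]:
    """Smallest rater count whose projected reliability reaches target.

    Returns None if the target is not reached within max_raters.
    """
    for k in range(1, max_raters + 1):
        if spearman_brown(single_rater_r, k) >= target:
            return k
    return None


if __name__ == "__main__":
    # Hypothetical inputs, chosen only for illustration (not study data).
    r_single = 0.50
    threshold = 0.70
    for k in (1, 2, 3):
        print(k, round(spearman_brown(r_single, k), 3))
    print("raters needed:", min_raters(r_single, threshold))
```

With these hypothetical inputs, two raters project to about .67 and three to .75, so three is the first count that clears a .70 threshold. A different single-rater reliability or threshold would give a different answer.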

