Estimating Interinterviewer Reliability for Interview Schedules used in Special Education Research

Abstract

Interinterviewer reliability was estimated for data from an individually administered interview schedule, the Consumer Satisfaction Inventory (CSI), using several different approaches: simple percentages of agreement, kappa and weighted kappa, Pearson correlations, t tests on interviewers' means, and generalizability (G) theory techniques. The reliability estimates varied, sometimes widely. Differences and similarities among the approaches are discussed, and some suggestions are offered to help researchers choose estimation techniques for particular situations.
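As a rough illustration of why such estimates can diverge, the sketch below applies four of the approaches named in the abstract (simple percentage of agreement, kappa, weighted kappa, and a Pearson correlation) to the same pair of ratings. The ratings are hypothetical and are not CSI data; the function names, the 3-point scale, and the choice of linear disagreement weights for weighted kappa are assumptions made only for this example.

```python
from collections import Counter
from math import sqrt

# Hypothetical ratings (NOT CSI data): two interviewers score the same
# 10 respondents on an assumed 3-point scale.
interviewer_a = [1, 1, 2, 2, 2, 3, 3, 3, 1, 2]
interviewer_b = [1, 2, 2, 2, 3, 3, 3, 2, 1, 2]
CATEGORIES = [1, 2, 3]


def percent_agreement(a, b):
    """Proportion of respondents given identical scores by both interviewers."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def kappa(a, b, categories, weight=lambda i, j: float(i != j)):
    """Chance-corrected agreement, written in terms of disagreement weights.

    The default 0/1 weight gives Cohen's (1960) unweighted kappa; passing a
    graded weight such as abs(i - j) gives a weighted kappa (Cohen, 1968).
    """
    n = len(a)
    observed = Counter(zip(a, b))          # joint counts of (score_a, score_b)
    marg_a, marg_b = Counter(a), Counter(b)  # marginal counts per interviewer
    d_obs = sum(weight(i, j) * observed[(i, j)] / n
                for i in categories for j in categories)
    d_exp = sum(weight(i, j) * marg_a[i] * marg_b[j] / n ** 2
                for i in categories for j in categories)
    return 1 - d_obs / d_exp


def pearson_r(a, b):
    """Pearson product-moment correlation between the two sets of scores."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cross / sqrt(ss_a * ss_b)


estimates = {
    "percent agreement": percent_agreement(interviewer_a, interviewer_b),
    "kappa": kappa(interviewer_a, interviewer_b, CATEGORIES),
    "weighted kappa (linear)": kappa(interviewer_a, interviewer_b, CATEGORIES,
                                     weight=lambda i, j: abs(i - j)),
    "Pearson r": pearson_r(interviewer_a, interviewer_b),
}

for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

The four numbers differ because each index answers a slightly different question: percentage agreement ignores chance, kappa corrects for it, weighted kappa gives partial credit for near misses, and the Pearson correlation reflects consistency of rank ordering rather than exact agreement. The t test and G theory approaches are omitted here because they need more design detail than the abstract provides.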
