Adjustments for Rater Effects in Performance Assessment

Abstract

Alternative methods to correct for rater leniency/stringency effects (i.e., rater bias) in performance ratings were investigated. Rater bias effects are of concern when candidates are evaluated by different raters. The three correction methods evaluated were ordinary least squares (OLS), weighted least squares (WLS), and imputation of the missing data (IMPUTE). In addition, the usual procedure of averaging the observed ratings was investigated. Data were simulated from an essentially τ-equivalent measurement model, with true scores and error scores normally distributed. The variables manipulated in the simulations were method of correction (OLS, WLS, IMPUTE, averaging the observed ratings), amount of missing data (50% missing, 75% missing), rater bias (low, high), and number of examinees or candidates (N = 50, N = 100). The accuracy of the methods in estimating true scores was assessed based on the square root of the average squared difference between the estimated and known true scores. The three correction methods consistently outperformed the procedure of averaging the observed ratings. IMPUTE was superior to the least squares methods.
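To make the comparison concrete, the following is a minimal sketch of one way an OLS correction and the accuracy criterion could be set up. It is not the authors' implementation. It assumes an additive model in which each observed rating equals a candidate's true score plus a rater bias plus error, with rater biases constrained to sum to zero. The function names (`simulate_ratings`, `ols_adjust`, `rmse`), the number of raters, and the standard deviations are all hypothetical choices made for illustration; the abstract specifies none of them.

```python
import numpy as np

def simulate_ratings(n_candidates=50, n_raters=4, frac_missing=0.5,
                     bias_sd=0.5, error_sd=0.5, seed=0):
    """Simulate an incomplete candidates-by-raters matrix (assumed additive model)."""
    rng = np.random.default_rng(seed)
    true = rng.normal(0.0, 1.0, n_candidates)       # known true scores
    bias = rng.normal(0.0, bias_sd, n_raters)       # leniency/stringency per rater
    bias -= bias.mean()                             # assumption: biases sum to zero
    error = rng.normal(0.0, error_sd, (n_candidates, n_raters))
    full = true[:, None] + bias[None, :] + error
    # Each candidate keeps ratings from a random subset of raters; the rest are missing.
    n_keep = max(1, round(n_raters * (1 - frac_missing)))
    observed = np.full_like(full, np.nan)
    for i in range(n_candidates):
        keep = rng.choice(n_raters, size=n_keep, replace=False)
        observed[i, keep] = full[i, keep]
    return true, observed

def ols_adjust(observed):
    """Least squares estimates of candidate scores and rater biases.

    Fits rating = candidate effect + rater effect using dummy variables.
    The fit is only determined up to a constant shift, so the rater
    effects are centered at zero afterwards.
    """
    n_c, n_r = observed.shape
    rows, cols = np.nonzero(~np.isnan(observed))
    idx = np.arange(rows.size)
    X = np.zeros((rows.size, n_c + n_r))
    X[idx, rows] = 1.0            # candidate dummies
    X[idx, n_c + cols] = 1.0      # rater dummies
    y = observed[rows, cols]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    scores, biases = coef[:n_c], coef[n_c:]
    shift = biases.mean()
    return scores + shift, biases - shift

def rmse(estimated, true):
    """Square root of the average squared difference from the true scores."""
    diff = np.asarray(estimated, dtype=float) - np.asarray(true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

true, observed = simulate_ratings()
raw_mean = np.nanmean(observed, axis=1)     # usual procedure: average observed ratings
adjusted, rater_bias = ols_adjust(observed)
print("RMSE, averaged ratings:", rmse(raw_mean, true))
print("RMSE, OLS-adjusted:    ", rmse(adjusted, true))
```

The sketch only separates candidate and rater effects when the rating design is connected, that is, when raters are linked through shared candidates. With a single rating per candidate the two effects are confounded, which is why the default here keeps two of four raters per candidate.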

Keywords

Index terms: EM algorithm, incomplete data, incomplete rating designs, least squares adjustments, performance assessment, rater calibration.
