IRT-Based Assessments of Rater Effects in Multiple-Source Feedback Instruments

Abstract

This study compares three item response theory–based models of assessing measurement equivalence in 360° feedback: the traditional differential item functioning (DIF) methodology, Muraki’s rater’s effect model, and Patz, Junker, and Johnson’s hierarchical rater model. Using data from 491 managers collected on the Benchmarks instrument, the authors found that the traditional DIF methodology provides the most information about the rater’s conception of the ratee’s ability, whereas the other two models provide explicit estimates of rater leniency/severity. The authors also found that rater source effects of leniency and severity, even though statistically significant, did not substantially affect the observed score at the item and scale levels. The different results and conclusions produced by each model are discussed.

References

Atwater, L. (1998). The advantages and pitfalls of self-assessment in organizations. In J. W. Smither (Ed.), Performance appraisal: State of the art in practice (pp. 331-369). San Francisco: Jossey-Bass.

Atwater, L. E., Ostroff, C., Yammarino, F. J., & Fleenor, J. W. (1998). Self-other agreement: Does it really matter? Personnel Psychology, 51, 577-598.

Baker, F. (1995). EQUATE Computer Program (Version 2.1) [Computer software]. Madison: University of Wisconsin.

Collins, W. C., Raju, N. S., & Edwards, J. E. (2000). Assessing differential functioning in a satisfaction scale. Journal of Applied Psychology, 85, 451-461.

Conway, J., & Huffcutt, A. (1997). Psychometric properties of multisource performance ratings: A meta-analysis of subordinate, supervisor, peer, and self-ratings. Human Performance, 10, 331-360.

Donoghue, J. R., & Hombo, C. M. (2000, April). A comparison of different model assumptions about rater effects. Paper presented at the 2000 annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Drasgow, F., & Kanfer, R. (1985). Equivalence of psychological measurement in heterogeneous populations. Journal of Applied Psychology, 70, 662-680.

Facteau, J. D., & Craig, S. B. (2001). Are performance appraisal ratings from different rating sources comparable? Journal of Applied Psychology, 86, 215-227.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Harris, M., & Schaubroeck, J. (1988). A meta-analysis of self-supervisor, self-peer, and peer-supervisor ratings. Personnel Psychology, 41, 43-62.

Jawahar, I. M., & Williams, C. R. (1997). Where all the children are above average: The performance appraisal purpose effect. Personnel Psychology, 50, 905-925.

Kenny, D. A. (1991). A general model of consensus and accuracy in interpersonal perception. Psychological Review, 98, 155-163.

Landy, F. J., & Farr, J. L. (1980). Performance rating. Psychological Bulletin, 87, 72-107.

Lombardo, M. M., & McCauley, C. D. (1990). Benchmarks: Developmental reference points for managers and executives. Greensboro, NC: Center for Creative Leadership.

London, M., & Smither, J. W. (1995). Can multi-source feedback change perceptions of goal accomplishment, self-evaluations, and performance-related outcomes: Theory-based applications and directions for research. Personnel Psychology, 48, 803-839.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

Maurer, T. J., Raju, N. S., & Collins, W. C. (1998). Peer and subordinate performance appraisal measurement equivalence. Journal of Applied Psychology, 83, 693-702.

Mislevy, R. J., & Bock, R. D. (1990). Bilog 3 user manual. Mooresville, IN: Scientific Software.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.

Muraki, E. (1993). Variations of polytomous item response models: Raters effect model, DIF model, and trend model. Paper presented at the annual meeting of the American Educational Research Association, Atlanta, GA.

Muraki, E., & Bock, R. D. (1996). PARSCALE (Version 3) [Computer program]. Chicago: Scientific Software International.

Muraki, E., & Bock, R. D. (2001). PARSCALE (Version 4) [Computer program]. Chicago: Scientific Software International.

Patz, R. J., & Junker, B. W. (1999a). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342-366.

Patz, R. J., & Junker, B. W. (1999b). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24, 146-178.

Patz, R. J., Junker, B. W., & Johnson, M. S. (1999, March). The hierarchical rater model for rated test items and its application to large-scale educational assessment data. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Quebec.

Raju, N. S. (2001a). DFITP6M: A FORTRAN program for calculating DIF/DTF using the GPCM [Computer program]. Chicago: Illinois Institute of Technology.

Raju, N. S. (2001b). DFITP6MR: A FORTRAN program for calculating DIF/DTF using rater shift parameters [Computer program]. Chicago: Illinois Institute of Technology.

Raju, N. S., Laffitte, L. J., & Byrne, B. M. (in press). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology.

Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19, 353-368.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores [Monograph]. Psychometrika, Monograph Supplement No. 17.

Scullen, S. E., Mount, M. K., & Goff, M. (2000). Understanding the latent structure of job performance ratings. Journal of Applied Psychology, 85, 956-970.

Van Velsor, E., Taylor, S., & Leslie, J. B. (1993). An examination of the relationships among self-perception accuracy, self-awareness, gender, and leader effectiveness. Human Resource Management, 32, 249-263.

Yammarino, F. J., & Atwater, L. E. (1993). Understanding self-perception accuracy: Implications for human resource management. Human Resource Management, 32, 231-247.

Yammarino, F. J., & Atwater, L. E. (1997). Do managers see themselves as others see them? Implications of self-other rating agreement for human resources management. Organizational Dynamics, 25, 35-44.