Rater fairness in music performance assessment: Evaluating model-data fit and differential rater functioning

Abstract

The purpose of this study was to investigate model-data fit and differential rater functioning in large group music performance assessment using the Many-Facet Rasch Partial Credit Measurement Model. In particular, we sought to identify whether expert raters’ (N = 24) severity was invariant across four school levels (middle school, high school, collegiate, professional). Interaction analyses suggested that differential rater functioning existed both for the group of raters and for some individual raters based on their expected locations on the logit scale. This indicates that expert raters did not demonstrate invariant levels of severity when rating subgroups of ensembles across the four school levels. Of the 92 potential pairwise interactions examined, 14 (15.2%) were statistically significant, indicating that 10 individual raters demonstrated differential severity for at least one school level. Meaningful systematic patterns emerged for some raters after investigating individual pairwise interactions. Implications for improving fairness and equity in large group music performance evaluations are discussed.
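To make the idea of differential rater severity concrete, the following is a minimal sketch of category probabilities under a generic adjacent-category (partial credit) many-facet formulation. It is not the authors' analysis code or their exact model specification. The function names, the `interaction` term standing in for a rater-by-school-level effect, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def category_probabilities(theta, severity, difficulty, thresholds, interaction=0.0):
    """Return P(X = k) for k = 0..K under an adjacent-category logit model.

    Assumed (generic) form: ln(P_k / P_{k-1}) = theta - severity - difficulty
    - interaction - tau_k, where tau_k are the category thresholds.
    """
    eta = theta - severity - difficulty - interaction
    # Cumulative sums of (eta - tau_k); the lowest category is fixed at 0.
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (eta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(probs):
    """Expected rating given the category probabilities."""
    return sum(k * p for k, p in enumerate(probs))


# Hypothetical values (in logits) for one ensemble, one rater, one item.
params = dict(theta=0.5, severity=0.2, difficulty=-0.1, thresholds=[-1.0, 0.0, 1.0])

baseline = category_probabilities(**params)
# Same rater, but assumed to be 0.6 logits more severe for one school level.
harsher = category_probabilities(**params, interaction=0.6)

print("baseline expected score:", round(expected_score(baseline), 3))
print("with interaction:       ", round(expected_score(harsher), 3))
```

In this sketch, a positive interaction value lowers the expected rating for the affected subgroup even though the ensemble's location is unchanged. This is the kind of non-invariant severity that the interaction analyses are designed to detect.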

Keywords

big band performance; differential rater functioning; invariant measurement theory; item response theory; Many-Facet Partial Credit Model; music performance assessment; music performance evaluation; Rasch model


