Methods for Evaluating Composite Reliability, Classification Consistency, and Classification Accuracy for Mixed-Format Licensure Tests

Abstract

The purpose of this study was to propose extensions of reliability estimation methods that could be used to determine the conditions under which single scoring for constructed-response (CR) items is as effective as double scoring in mixed-format licensure tests. Multivariate generalizability theory methods traditionally used to estimate overall composite score reliability were extended with simulations so that classification consistency and classification accuracy estimates could also be obtained. Composite score reliabilities, classification consistencies, and accuracies were estimated based on the double and single scoring of the CR items of three licensure tests. Composite score reliabilities, classification consistencies, and accuracies were also estimated in decision studies considering varied testing situations such as different numbers of CR items and different CR section weights.
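The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a crossed persons x items x raters design for the CR section, uncorrelated errors between the multiple-choice (MC) and CR sections, and normally distributed universe scores and errors. All variance components, section weights, the cut score, and the function names (`cr_error_variance`, `composite_reliability`, `simulate_classification`) are hypothetical values and names chosen for illustration.

```python
import numpy as np


def cr_error_variance(var_pi, var_pr, var_pir_e, n_items, n_raters):
    """Relative error variance of a CR section mean score.

    Assumes a crossed persons x items x raters design (an assumption
    made for this sketch, not a detail taken from the study).
    """
    return (var_pi / n_items
            + var_pr / n_raters
            + var_pir_e / (n_items * n_raters))


def composite_reliability(weights, universe_cov, error_cov):
    """Ratio of composite universe-score variance to observed variance."""
    w = np.asarray(weights, dtype=float)
    true_var = w @ universe_cov @ w
    err_var = w @ error_cov @ w
    return true_var / (true_var + err_var)


def simulate_classification(weights, universe_cov, error_cov, cut,
                            n=100_000, seed=0):
    """Simulate pass/fail consistency and accuracy for a composite score.

    Consistency: agreement between two independent replications.
    Accuracy: agreement between one replication and the universe score.
    Normality of scores and errors is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(weights, dtype=float)
    zeros = np.zeros(len(w))
    true = rng.multivariate_normal(zeros, universe_cov, size=n)
    err1 = rng.multivariate_normal(zeros, error_cov, size=n)
    err2 = rng.multivariate_normal(zeros, error_cov, size=n)
    t = true @ w                 # composite universe scores
    x1 = (true + err1) @ w       # first simulated administration
    x2 = (true + err2) @ w       # second simulated administration
    consistency = np.mean((x1 >= cut) == (x2 >= cut))
    accuracy = np.mean((x1 >= cut) == (t >= cut))
    return consistency, accuracy


# Hypothetical inputs: sections are ordered (MC, CR).
universe_cov = np.array([[1.0, 0.6],
                         [0.6, 1.0]])
weights = [0.6, 0.4]
mc_error = 0.15

for label, n_raters in (("double", 2), ("single", 1)):
    cr_error = cr_error_variance(0.8, 0.02, 0.6, n_items=4, n_raters=n_raters)
    error_cov = np.diag([mc_error, cr_error])
    rel = composite_reliability(weights, universe_cov, error_cov)
    cons, acc = simulate_classification(weights, universe_cov, error_cov,
                                        cut=-0.5)
    print(f"{label} scoring: reliability={rel:.3f}, "
          f"consistency={cons:.3f}, accuracy={acc:.3f}")
```

Changing `n_items` or `weights` in the loop gives a rough analogue of the decision studies described above, under the same assumptions.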

Keywords

generalizability theory, rater effects, performance assessment, classification, reliability
