The Internal and External Optimality of Decisions Based on Tests

Abstract

In applied measurement, test scores are usually transformed to decisions. Analogous to classical test theory, the reliability of decisions has been defined as the consistency of decisions on a test and a retest or on two parallel tests. Coefficient kappa (Cohen, 1960) is used for assessing the consistency of decisions. This coefficient has been developed for assessing agreement between nominal scales. It is argued that the coefficient is not suited for assessing consistency of decisions. Moreover, it is argued that the concept consistency of decisions is not appropriate for assessing the quality of a decision procedure. It is proposed that the concept consistency of decisions be replaced by the concept optimality of the decision procedure. Two types of optimality are distinguished. The internal optimality is the risk of the decision procedure with respect to the true score the test is measuring. The external optimality is the risk of the decision procedure with respect to an external criterion. For assessing the optimality of a decision procedure, coefficient delta (van der Linden & Mellenbergh, 1978), which can be considered a standardization of the Bayes risk or expected loss, can be used. Two loss functions are dealt with: the threshold and the linear loss functions. Assuming psychometric theory, coefficient delta for internal optimality can be computed from empirical data for both the threshold and the linear loss functions. The computation of coefficient delta for external optimality needs no assumption of psychometric theory. For six tests coefficient delta as an index for internal optimality is computed for both loss functions; the results are compared with coefficient kappa for assessing the consistency of decisions with the same tests.
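As a point of reference for the consistency index discussed above, the following is a minimal sketch of coefficient kappa (Cohen, 1960) computed from a table of test and retest decisions. The function name and the counts are hypothetical and serve only as illustration; they are not data from the article, and the sketch does not cover coefficient delta.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of joint classification counts.

    table[i][j] is the number of examinees placed in category i on the
    first administration and in category j on the second.
    """
    total = sum(sum(row) for row in table)
    k = len(table)
    # Observed proportion of agreement: the main diagonal of the table.
    p_observed = sum(table[i][i] for i in range(k)) / total
    # Chance agreement: sum of products of matching row and column marginals.
    row_marginals = [sum(row) / total for row in table]
    col_marginals = [sum(table[i][j] for i in range(k)) / total for j in range(k)]
    p_chance = sum(r * c for r, c in zip(row_marginals, col_marginals))
    # Kappa is undefined when p_chance equals 1 (all cases in one category).
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical counts: rows are decisions on the test, columns are decisions
# on the retest, both in the order (pass, fail).
decisions = [[40, 10],
             [5, 45]]
kappa = cohen_kappa(decisions)
```

With these illustrative counts the observed agreement is 0.85 and the chance agreement is 0.50, so kappa is 0.70.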

References

Algina, J., & Noe, M.J. A study of the accuracy of Subkoviak's single administration estimate of the coefficient of agreement using two true score estimates. Journal of Educational Measurement, 1978, 15, 101-110.

Bishop, Y.M.M., Fienberg, S.E., & Holland, P.W. Discrete multivariate analysis: Theory and practice. Cambridge, MA: The MIT Press, 1976.

Carver, R.P. Special problems in measuring change with psychometric devices. In Evaluative research: Strategies and methods. Pittsburgh: American Institutes for Research, 1970.

Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960, 20, 37-46.

Cohen, J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 1968, 70, 213-220.

Cohen, J. Weighted chi-square: An extension of the kappa method. Educational and Psychological Measurement, 1972, 32, 61-74.

de Bruyne, H.C.D. Blokken in het onderwijs. Groningen: Tjeenk Willink, 1976.

Emrick, J.A. An evaluation model for mastery testing. Journal of Educational Measurement, 1971, 8, 321-326.

Emrick, J.A., & Adams, F.N. An evaluation model for individualized instruction (Report RC2674). Yorktown Hts., NY: IBM, Thomas J. Watson Research Center, October 1969.

Everitt, B.S. Moments of the statistic kappa and weighted kappa. British Journal of Mathematical and Statistical Psychology, 1968, 21, 97-103.

Ferguson, T.S. Mathematical statistics: A decision theoretic approach. New York: Academic Press, 1967.

Fischer, G.H. Einführung in die Theorie psychologischer Tests. Bern: Verlag Hans Huber, 1974.

Fleiss, J.L. Measuring nominal scale agreement among many raters. Psychological Bulletin, 1971, 76, 378-382.

Fleiss, J.L., & Cohen, J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 1973, 33, 613-619.

Fleiss, J.L., Cohen, J., & Everitt, B.S. Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 1969, 72, 323-327.

Griffiths, D.A. Maximum likelihood estimation for the beta-binomial distribution and an application to the household distribution of the total number of cases of a disease. Biometrics, 1973, 29, 637-648.

Gross, A.L., & Su, W.H. Defining a "fair" or "unbiased" selection model: A question of utilities. Journal of Applied Psychology, 1975, 60, 345-351.

Gupta, S.S. Probability integrals of multivariate normal and multivariate t. Annals of Mathematical Statistics, 1963, 34, 792-828.

Hambleton, R.K., & Novick, M.R. Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 1973, 10, 159-170.

Hubert, L. Kappa revisited. Psychological Bulletin, 1977, 84, 289-297.

Huynh, H. Statistical considerations of mastery scores. Psychometrika, 1976, 41, 65-79. (a)

Huynh, H. On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 1976, 13, 253-264. (b)

Huynh, H. Reliability of multiple classifications. Paper presented at the spring meeting of the Psychometric Society, Murray Hill, NJ, April 1976. (c)

Huynh, H. The kappamax reliability index for decisions in domain-referenced testing. Paper presented at the annual meeting of the American Educational Research Association, New York, April 1977.

Jackson, R. Developing criterion-referenced tests (TM Report No. 1). Princeton, NJ: ERIC Clearinghouse on Tests, Measurement, and Evaluation, 1970.

Keats, J.A., & Lord, F.M. A theoretical distribution for mental test scores. Psychometrika, 1962, 27, 59-72.

Koppelaar, H., van der Linden, W.J., & Mellenbergh, G.J. A computer program for classification proportions in dichotomous decisions based on dichotomously scored items. Tijdschrift voor Onderwijsresearch, 1977, 2, 32-36.

Landis, J.R., & Koch, G.G. A review of statistical methods in the analysis of data arising from observer reliability studies (Part 1). Statistica Neerlandica, 1975, 29, 101-123. (a)

Landis, J.R., & Koch, G.G. A review of statistical methods in the analysis of data arising from observer reliability studies (Part 2). Statistica Neerlandica, 1975, 29, 151-161. (b)

Light, R.J. Measures of agreement for qualitative data: Some generalizations and alternatives. Psychological Bulletin, 1971, 76, 365-377.

Loevinger, J. A systematic approach to the construction and evaluation of tests of ability. Psychological Monographs, 1947, 61 (Whole No. 285).

Lord, F.M. Estimating true-score distributions in psychological testing (An empirical Bayes estimation problem). Psychometrika, 1969, 34, 259-299.

Lord, F.M., & Novick, M.R. Statistical theories of mental test scores. Reading, MA: Addison-Wesley, 1968.

Marshall, J.L. The mean split-half coefficient of agreement and its relation to other test indices: A study based on simulated data (Technical Report 350). Madison: University of Wisconsin, Research and Development Center for Cognitive Learning, 1975.

Marshall, J.L., & Haertel, E.H. A single-administration reliability index of criterion-referenced test: The mean split-half coefficient of agreement. Paper presented at the annual meeting of the American Educational Research Association, Washington, DC, April 1975.

Mellenbergh, G.J., Koppelaar, H., & van der Linden, W.J. Dichotomous decisions based on dichotomously scored items: A case study. Statistica Neerlandica, 1977, 31, 161-169.

Petersen, N.S. An expected utility model for "optimal" selection. Journal of Educational Statistics, 1976, 1, 333-358.

Spitzer, R.L., Cohen, J., Fleiss, J.L., & Endicott, J. Quantification of agreement in psychiatric diagnosis. Archives of General Psychiatry, 1967, 17, 83-87.

Subkoviak, M.J. Estimating reliability from a single administration of a criterion-referenced test. Journal of Educational Measurement, 1976, 13, 265-276.

Subkoviak, M.J. Empirical investigation of procedures for estimating reliability for mastery tests. Journal of Educational Measurement, 1978, 15, 111-116.

Subkoviak, M.J. Estimating reliability from a single administration of a mastery test (Occasional Paper No. 15). Madison: University of Wisconsin, Department of Educational Psychology, Laboratory of Experimental Design, undated.

Subkoviak, M.J., & Wilcox, R.R. Estimating the probability of correct classification in mastery testing. Paper presented at the annual meeting of the American Educational Research Association, Toronto, March 1978.

Swaminathan, H., Hambleton, R.K., & Algina, J. Reliability of criterion-referenced tests: A decision-theoretic formulation. Journal of Educational Measurement, 1974, 11, 263-267.

van der Linden, W.J., & Mellenbergh, G.J. Optimal cutting scores using a linear loss function. Applied Psychological Measurement, 1977, 1, 593-599.

van der Linden, W.J., & Mellenbergh, G.J. Coefficients for tests from a decision theoretic point of view. Applied Psychological Measurement, 1978, 2, 119-134.

Wingersky, M.S., Lees, D.M., Lennon, V., & Lord, F.M. A computer program for estimating true-score distributions and graduating observed-score distributions. Educational and Psychological Measurement, 1969, 29, 689-692.