
Abstract

A critical methodological challenge in standard setting arises in small-sample, high-dimensional contexts, where the number of items substantially exceeds the number of examinees. Under such conditions, conventional data-driven methods that rely on parametric models (e.g., item response theory) often become unstable or fail because parameter estimates are unreliable. This study investigates two families of data-driven methods that offer a potential solution to this challenge: information-theoretic methods and unsupervised clustering methods. Using a Monte Carlo simulation, we systematically evaluate 15 such methods to establish an evidence-based framework for practice. The simulation manipulated five factors: sample size, the item-to-examinee ratio, mixture proportions, item quality, and ability separation. Method performance was evaluated using multiple criteria, including Relative Error, Classification Accuracy, Sensitivity, Specificity, and Youden’s Index. Results indicated that no single method is universally superior; the optimal choice depends on the examinee mixture proportion. Specifically, the information-theoretic method QIR (quantile information ratio) excelled in scenarios with a dominant non-competent group, where high specificity was critical. Conversely, in highly selective contexts with balanced proficiency groups, the clustering methods CHI (Calinski-Harabasz index) and SSE (sum of squared error) demonstrated the highest classification effectiveness. Bayesian kernel density estimation (BKDE), by contrast, performed consistently as a robust, balanced method across conditions. These findings provide practitioners with a clear decision framework for selecting a defensible, data-driven standard-setting method when traditional approaches are infeasible.
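To make the evaluation criteria named in the abstract concrete, the following is a minimal sketch of how they could be computed for one candidate cut score in a simulation where each examinee's true status is known. The function names, the pass rule (score at or above the cut), the toy data, and the signed definition of relative error are illustrative assumptions, not the article's own implementation.

```python
def evaluate_cut_score(scores, competent, cut):
    """Classify examinees as pass (score >= cut) and compare with true status.

    scores    -- total test scores, one per examinee
    competent -- True if the examinee is truly competent (known in a simulation)
    cut       -- candidate cut score (assumed rule: pass if score >= cut)
    Assumes both competent and non-competent examinees are present.
    """
    tp = fp = tn = fn = 0
    for score, is_competent in zip(scores, competent):
        passed = score >= cut
        if passed and is_competent:
            tp += 1          # competent examinee correctly passed
        elif passed and not is_competent:
            fp += 1          # non-competent examinee wrongly passed
        elif not passed and is_competent:
            fn += 1          # competent examinee wrongly failed
        else:
            tn += 1          # non-competent examinee correctly failed
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / len(scores),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1,  # Youden's J
    }


def relative_error(estimated_cut, true_cut):
    """Signed relative error of an estimated cut score (assumed definition)."""
    return (estimated_cut - true_cut) / true_cut


# Toy data (hypothetical): eight examinees, three of them truly competent.
scores = [12, 15, 18, 22, 25, 27, 30, 33]
competent = [False, False, False, False, True, False, True, True]

result = evaluate_cut_score(scores, competent, cut=24)
error = relative_error(estimated_cut=24, true_cut=25)
print(result, error)
```

In this toy case every competent examinee passes, but one non-competent examinee also passes, so sensitivity is perfect while specificity is lower. This is the kind of trade-off that Youden's Index summarizes in a single number.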

Keywords

standard setting, cut score, information theory, clustering, small sample size




Empowering Expert Judgment: A Data-Driven Decision Framework for Standard Setting in High-Dimensional and Data-Scarce Assessments
