Abstract
Considerable thought is often put into designing randomized controlled trials (RCTs). From power analyses and complex sampling designs implemented preintervention to nuanced quasi-experimental models used to estimate treatment effects postintervention, RCT design can be quite complicated. Yet when psychological constructs measured using survey scales are the outcome of interest, measurement is often an afterthought, even in RCTs. The purpose of this study is to examine how choices about scoring and calibration of survey item responses affect recovery of true treatment effects. Specifically, simulation and empirical studies are used to compare the performance of sum scores, which are frequently used in RCTs in psychology and education, to that of approaches rooted in item response theory (IRT) that better account for the longitudinal, multigroup nature of the data. The results indicate that selecting an IRT model that matches the nature of the data can substantially reduce both bias in treatment effect estimates and their standard errors.
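To make the sum-score versus IRT contrast concrete, the sketch below simulates a simple two-group design under a 2PL IRT model and computes a standardized treatment effect from raw sum scores and from EAP (expected a posteriori) ability estimates. This is purely illustrative and not the study's actual models: it assumes item parameters are known rather than calibrated, uses a cross-sectional two-group design rather than the longitudinal, multigroup designs examined in the article, and all parameter values (12 items, a true effect of 0.5 SD) are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_responses(theta, a, b, rng):
    """Bernoulli item responses under a 2PL model: P(x=1) = logistic(a*(theta - b))."""
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
    return (rng.random(p.shape) < p).astype(int)

def eap_theta(resp, a, b, nodes=61):
    """EAP ability estimates via fixed-grid quadrature over an N(0, 1) prior."""
    q = np.linspace(-4, 4, nodes)
    prior = np.exp(-0.5 * q**2)
    # item response probabilities at each quadrature node: (nodes, items)
    p = 1.0 / (1.0 + np.exp(-a[None, :] * (q[:, None] - b[None, :])))
    # log-likelihood of each person's response pattern at each node: (persons, nodes)
    loglik = resp @ np.log(p).T + (1 - resp) @ np.log(1 - p).T
    post = np.exp(loglik) * prior
    return (post @ q) / post.sum(axis=1)

def std_effect(control, treat):
    """Standardized mean difference with a pooled SD (Cohen's d)."""
    pooled = np.sqrt((control.var(ddof=1) + treat.var(ddof=1)) / 2)
    return (treat.mean() - control.mean()) / pooled

n, true_effect = 2000, 0.5          # illustrative sample size and true effect (SD units)
a = rng.uniform(0.8, 2.0, 12)       # hypothetical discrimination parameters
b = rng.uniform(-1.5, 1.5, 12)      # hypothetical difficulty parameters
theta_c = rng.normal(0.0, 1.0, n)
theta_t = rng.normal(true_effect, 1.0, n)
resp_c = simulate_responses(theta_c, a, b, rng)
resp_t = simulate_responses(theta_t, a, b, rng)

sum_effect = std_effect(resp_c.sum(axis=1), resp_t.sum(axis=1))
irt_effect = std_effect(eap_theta(resp_c, a, b), eap_theta(resp_t, a, b))
print(f"sum-score effect: {sum_effect:.3f}, IRT (EAP) effect: {irt_effect:.3f}")
```

Repeating this over many replications (and letting item parameters be estimated rather than known) is the kind of comparison the simulation study formalizes; neither scoring rule here should be read as the article's recommended model.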