A Propensity Score Method for Investigating Differential Item Functioning in Performance Assessment

Abstract

This study introduces a novel differential item functioning (DIF) method based on propensity score matching that tackles two challenges in analyzing performance assessment data: continuous task scores and the lack of a reliable internal variable to serve as a proxy for ability or aptitude. The proposed DIF method consists of two main stages. First, propensity score matching is used to eliminate group differences that existed before the test, ideally creating equivalent groups as in a randomized experimental study. Then, linear mixed effects models are used to perform DIF analysis on the matched data set. We demonstrate this propensity DIF method using a high-stakes functional English language proficiency test. DIF due to education was investigated in the writing component, which consists of two continuously scored performance-based tasks. Although the proposed method is demonstrated in the context of language testing, it can be applied to other types of performance assessments.
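The first stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name a matching algorithm, so greedy 1:1 nearest-neighbor matching with a caliper is assumed here purely for illustration. The single covariate, the group coding (1 = focal, 0 = reference), the caliper value, and the function names `fit_propensity` and `greedy_match` are all hypothetical. The second stage (a linear mixed effects model fitted to the matched examinees) is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_propensity(covariates, group, lr=0.5, epochs=500):
    """Estimate propensity scores P(group = 1 | covariates) with a
    logistic regression fitted by plain batch gradient descent."""
    n, p = len(covariates), len(covariates[0])
    w = [0.0] * (p + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for x, g in zip(covariates, group):
            pred = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
            err = pred - g
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
            for x in covariates]


def greedy_match(scores, group, caliper=0.1):
    """Greedy 1:1 nearest-neighbor matching without replacement.
    Returns (focal_index, reference_index) pairs whose propensity
    scores differ by at most `caliper`."""
    available = [i for i, g in enumerate(group) if g == 0]
    pairs = []
    for i in [k for k, g in enumerate(group) if g == 1]:
        if not available:
            break
        j = min(available, key=lambda r: abs(scores[r] - scores[i]))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            available.remove(j)  # each reference examinee is used once
    return pairs


# Hypothetical data: one standardized background covariate per examinee.
covariates = [[-1.2], [-0.8], [-0.5], [-0.1], [0.0],
              [0.3], [0.4], [0.9], [1.1], [1.5]]
group = [0, 0, 1, 0, 0, 1, 0, 1, 0, 1]

scores = fit_propensity(covariates, group)
pairs = greedy_match(scores, group, caliper=0.1)
print(pairs)
```

Focal examinees without a reference examinee inside the caliper are dropped, so the matched data set can be smaller than the original sample. Only the matched examinees would be carried forward to the DIF model in the second stage.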

Keywords

differential item functioning (DIF), performance assessment, propensity score matching, mixed effects model, validation, writing assessment
