Evaluating subscore uses across multiple levels: A case of reading and listening subscores for young EFL learners

Abstract

Stakeholders in language tests are often interested in subscores. However, reporting a subscore is not always justified; to be worth reporting, a subscore should provide reliable and distinct information. When a subscore is used for decisions across multiple levels (e.g., individual test takers and schools), its reliability and distinctiveness must be established at every relevant level. In this study, we examined whether reporting seven Reading and Listening subscores of the TOEFL Primary® test, a standardized English proficiency test for young English as a foreign language learners, could be justified at the individual and school levels. We analyzed data collected in pilot administrations, in which 4776 students from 51 schools participated. We employed the classical test theory (CTT)-based approaches of Haberman (2008) and Haberman, Sinharay, and Puhan (2009) for the individual-level and school-level investigations, respectively. We also supplemented the CTT-based approaches with a factor analytic approach for the individual-level analysis and a multilevel modeling approach for the school-level analysis. The results differed across the two levels: we found little support for reporting the subscores at the individual level, but strong evidence supporting the added value of the school-level subscores when the sample size for each school exceeds 50.
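As a rough illustration of the individual-level CTT criterion associated with Haberman (2008), the sketch below compares two proportional reductions in mean squared error (PRMSE): one when the observed subscore predicts the true subscore, and one when the observed total score does. It is a minimal sketch under stated assumptions, not the authors' analysis code: the function names are hypothetical, Cronbach's alpha is assumed as the reliability estimate, the total score is assumed to include the subscore's items, and measurement errors are assumed uncorrelated.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a persons-by-items table (assumed reliability estimate)."""
    n_items = len(item_scores[0])
    # Variance of each item across persons.
    item_vars = [pvariance([row[j] for row in item_scores]) for j in range(n_items)]
    # Variance of the summed score across persons.
    total_var = pvariance([sum(row) for row in item_scores])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


def prmse_from_subscore(rel_s):
    """PRMSE when the observed subscore predicts the true subscore.

    Under linear prediction this equals the subscore reliability.
    """
    return rel_s


def prmse_from_total(var_s, var_x, cov_sx, rel_s):
    """PRMSE when the observed total score predicts the true subscore.

    Assumes the total score contains the subscore's items, so the subscore's
    error variance is subtracted from the observed covariance.
    """
    error_var_s = var_s * (1 - rel_s)   # error variance of the subscore
    true_var_s = var_s * rel_s          # true-score variance of the subscore
    cov_true_s_x = cov_sx - error_var_s # covariance of true subscore with total
    return cov_true_s_x ** 2 / (true_var_s * var_x)


def has_added_value(var_s, var_x, cov_sx, rel_s):
    """True if the subscore predicts its own true score better than the total does."""
    return prmse_from_subscore(rel_s) > prmse_from_total(var_s, var_x, cov_sx, rel_s)
```

With made-up summary statistics (not taken from the study), `has_added_value(4, 25, 8, 0.75)` compares a subscore PRMSE of 0.75 with a total-score PRMSE of 49/75 (about 0.653), so the subscore would count as having added value; lowering the reliability to 0.5 and raising the covariance to 9 reverses the comparison.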

Keywords

Assessing young learners, classical test theory, score use, subscores, unit of analysis

References

Alderson J. C. (1993). Judgements in language testing. In Douglas, Chapelle (Eds.), A new decade of language testing research (pp. 46–57). Alexandria, VA: TESOL.

Alderson J. C. (2005). Diagnosing foreign language proficiency: The interface between learning and assessment. London: Continuum. https://doi.org/10.5040/9781474212151

Alderson J. C. (2007). The challenge of (diagnostic) testing: Do we know what we are measuring? In Fox, Wesche (Eds.), Language testing reconsidered: Proceedings of the 27th Language Testing Research Colloquium (LTRC) (pp. 21–39). Ottawa: University of Ottawa Press.

Alderson J. C., Brunfaut, Harding (2015). Towards a theory of diagnosis in second and foreign language assessment: Insights from professional practice across diverse fields. Applied Linguistics, 36, 236–260. https://doi.org/10.1093/applin/amt046

Alderson J. C., Kremmel (2013). Re-examining the content validation of a grammar test: The (im)possibility of distinguishing vocabulary and structural knowledge. Language Testing, 30, 535–556. https://doi.org/10.1177/0265532213489568

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Bachman L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge University Press.

Bachman L. F., Palmer A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.

Beaton A. E., Allen N. L. (1992). Interpreting scales through scale anchoring. Journal of Educational Statistics, 17, 191–204. https://doi.org/10.3102/10769986017002191

Biancarosa, Kennedy P. C., Carlson S. E., Yoon H. J., Seipel, Liu, Davison M. L. (2019). Constructing subscores that add validity: A case study of identifying students at risk. Educational and Psychological Measurement, 79, 65–84. https://doi.org/10.1177/0013164418763255

Bridgeman, Cho, DiPietro (2016). Predicting grades from an English language assessment: The importance of peeling the onion. Language Testing, 33, 307–318. https://doi.org/10.1177/0265532215583066

Buck, Tatsuoka, Kostin (1997). The subskills of reading: Rule-space analysis of a multiple-choice test of second language reading comprehension. Language Learning, 47, 423–466. https://doi.org/10.1111/0023-8333.00016

Buck, Tatsuoka (1998). Application of the rule-space procedure to language testing: Examining attributes of a free response listening test. Language Testing, 15, 119–157. https://doi.org/10.1177/026553229801500201

Cai, Yang J. S., Hansen (2011). Generalized full-information item bifactor analysis. Psychological Methods, 16, 221–248. https://doi.org/10.1037/a0023350

Casella, Berger R. L. (2002). Statistical inference (2nd ed.). Pacific Grove, CA: Duxbury.

Chalmers R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. https://doi.org/10.18637/jss.v048.i06

Cho, Ginsburgh, Morgan, Moulder, Hauck M. C. (2016). Designing the TOEFL® Primary™ tests. ETS Research Memorandum (RM-16-02). Princeton, NJ: Educational Testing Service. https://www.ets.org/Media/Research/pdf/RM-16-02.pdf

(2018). A review of subscore estimation methods. ETS Research Report (RR-18-17). Princeton, NJ: Educational Testing Service. https://doi.org/10.1002/ets2.12203

Gibbons R. D., Hedeker D. R. (1992). Full-information item bi-factor analysis. Psychometrika, 57, 423–436. https://doi.org/10.1007/BF02295430

Gomez P. G., Noah, Schedl, Wright, Yolkurt (2007). Proficiency descriptors based on a scale-anchoring study of the new TOEFL iBT reading test. Language Testing, 24, 417–444. https://doi.org/10.1177/0265532207077209

Haberman S. J. (2008). When can subscores have value? Journal of Educational and Behavioral Statistics, 33, 204–229. https://doi.org/10.3102/1076998607302636

Haberman S. J., Sinharay (2010). Reporting of subscores using multidimensional item response theory. Psychometrika, 75, 209–227. https://doi.org/10.1007/s11336-010-9158-4

Haberman S. J., Sinharay, Puhan (2009). Reporting subscores for institutions. British Journal of Mathematical and Statistical Psychology, 62, 79–95. https://doi.org/10.1348/000711007X248875

Harding, Alderson J. C., Brunfaut (2015). Diagnostic assessment of reading and listening in a second or foreign language: Elaborating on diagnostic principles. Language Testing, 32, 317–336. https://doi.org/10.1177/0265532214564505

Harris D. J., Hanson B. A. (1991, March). Methods of examining the usefulness of subscores. Paper presented at the meeting of the National Council on Measurement in Education, Chicago, IL.

Hsieh C.-N., Ionescu, T.-H. (2018). Out of many, one: Challenges in teaching multilingual Kenyan primary students in English. Language, Culture, and Curriculum, 31, 199–213. https://doi.org/10.1080/07908318.2017.1378670

Jang E. E. (2009). Demystifying a Q-Matrix for making diagnostic inferences about L2 reading skills. Language Assessment Quarterly, 6, 210–238. https://doi.org/10.1080/15434300903071817

Kelley T. L. (1947). Fundamentals of statistics. Cambridge, MA: Harvard University Press.

Lee, Winke (2018). Young learners’ response processes when taking computerized tasks for speaking assessment. Language Testing, 35, 239–269. https://doi.org/10.1177/0265532217704009

Lee Y.-W., Sawaki (2009). Application of three cognitive diagnosis models to ESL reading and listening assessments. Language Assessment Quarterly, 6, 239–263. https://doi.org/10.1080/15434300903079562

Longabach, Peyton (2018). A comparison of reliability and precision of subscore reporting methods for a state English language proficiency assessment. Language Testing, 35, 297–317. https://doi.org/10.1177/0265532217689949

Longford N. T. (1990). Multivariate variance component analysis: An application in test development. Journal of Educational Statistics, 15, 91–112. https://doi.org/10.3102/10769986015002091

Lord F. M., Novick M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

McDonald R. P. (1999). Test theory: A unified approach. Mahwah, NJ: Lawrence Erlbaum.

McKay (2005). Assessing young language learners. Cambridge: Cambridge University Press.

Papageorgiou, Choi (2018). Adding value to second-language listening and reading subscores: Using a score augmentation approach. International Journal of Testing, 18, 207–230. https://doi.org/10.1080/15305058.2017.1407766

Pham (2018, April). Considering sampling errors in estimating value-added ratios of subscores: A bootstrap method. Paper presented at the annual meeting of the National Council on Measurement in Education, New York.

Pinheiro, Bates, DebRoy, Sarkar, & R Core Team. (2017). nlme: linear and nonlinear mixed effects models. R package version 3.1-131. https://CRAN.R-project.org/package=nlme

Plakans, Gebril (2012). A close investigation into source use in integrated second language writing tasks. Assessing Writing, 17, 18–34. https://doi.org/10.1016/j.asw.2011.09.002

Powers, Schedl, Papageorgiou (2017). Facilitating the interpretation of English language proficiency scores: Combining scale anchoring and test score mapping methodologies. Language Testing, 34, 175–195. https://doi.org/10.1177/0265532215623582

Puhan, Sinharay, Haberman S. J., Larkin (2010). The utility of augmented subscores in a licensure exam: An evaluation of methods using empirical data. Applied Measurement in Education, 23, 266–285. https://doi.org/10.1080/08957347.2010.486287

R Core Team. (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.r-project.org/

Reise S. P., Bonifay W. E., Haviland M. G. (2013). Scoring and modeling psychological measures in the presence of multidimensionality. Journal of Personality Assessment, 95, 129–140. https://doi.org/10.1080/00223891.2012.725437

Ren, Lai, Tong, Aminzadeh, Hou, Lai (2010). Nonparametric bootstrapping for hierarchical data. Journal of Applied Statistics, 37, 1487–1498. https://doi.org/10.1080/02664760903046102

Rodriguez, Reise S. P., Haviland M. G. (2016). Evaluating bifactor models: Calculating and interpreting statistical indices. Psychological Methods, 21, 137–150. https://doi.org/10.1037/met0000045

Sawaki, Kim H.-J., Gentile (2009). Q-Matrix construction: Defining the link between constructs and test items in large-scale reading and listening comprehension assessments. Language Assessment Quarterly, 6, 190–209. https://doi.org/10.1080/15434300902801917

Sawaki, Sinharay (2013). Investigating the value of section scores for the TOEFL iBT® test (TOEFL iBT Research Report No. 21). Princeton, NJ: Educational Testing Service. https://doi.org/10.1002/j.2333-8504.2013.tb02342.x

Sawaki, Stricker L. J., Oranje A. H. (2009). Factor structure of the TOEFL Internet-based test. Language Testing, 26, 5–30. https://doi.org/10.1177/0265532208097335

Sinharay (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47, 150–174. https://doi.org/10.1111/j.1745-3984.2010.00106.x

Sinharay (2013). A note on assessing the added value of subscores. Educational Measurement: Issues and Practice, 32(4), 38–42. https://doi.org/10.1111/emip.12021

Sinharay, Haberman S. J. (2008). Reporting subscores: A survey (ETS Research Memorandum RM-08-18). Princeton, NJ: Educational Testing Service. https://www.ets.org/Media/Research/pdf/RM-08-18.pdf

Sinharay, Haberman S. J., Puhan (2007). Subscores based on classical test theory: To report or not to report. Educational Measurement: Issues and Practice, 26(4), 21–28. https://doi.org/10.1111/j.1745-3992.2007.00105.x

Sinharay, Haberman S. J., Wainer (2011). Do adjusted subscores lack validity? Don’t blame the messenger. Educational and Psychological Measurement, 71, 789–797. https://doi.org/10.1177/0013164410391782

Sinharay, Puhan, Haberman (2011). An NCME instructional module on subscores. Educational Measurement: Issues and Practice, 30(3), 29–40. https://doi.org/10.1111/j.1745-3992.2011.00208.x

Stout (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52, 589–617. https://doi.org/10.1007/BF02294821

Wainer, Vevea J. L., Camacho, Reeve B. B., Rosa, Nelson, . . . Thissen (2001). Augmented scores – “borrowing strength” to compute scores based on small numbers of items. In Thissen, Wainer (Eds.), Test scoring (pp. 343–387). Mahwah, NJ: Lawrence Erlbaum Associates.

Weir (1993). Understanding and developing language tests. New York: Prentice Hall.

Winne P. H., Belfry M. J. (1982). Interpretive problems when correcting for attenuation. Journal of Educational Measurement, 19, 125–134. https://doi.org/10.1111/j.1745-3984.1982.tb00121.x

Wolf M. K., Butler Y. G. (2017). An overview of English language proficiency assessments for young learners. In Wolf M. K., Butler Y. G. (Eds.), English language proficiency assessments for young learners (pp. 3–21). New York: Routledge.

(2007). Validating TOEFL® iBT Speaking and setting score requirements for ITA screening. Language Assessment Quarterly, 4, 318–351. https://doi.org/10.1080/15434300701462796

Yao, Boughton K. A. (2007). A multidimensional item response modeling approach for improving subscale proficiency estimation and classification. Applied Psychological Measurement, 31, 83–105. https://doi.org/10.1177/0146621606291559

Zhang, Stout (1999). The theoretical DETECT index of dimensionality and its application to approximate simple structure. Psychometrika, 64, 213–249. https://doi.org/10.1007/BF02294536

Moulder, Morgan (2017). A field test study for the TOEFL Primary reading and listening tests. In Wolf M. K., Butler Y. G. (Eds.), English language proficiency assessments for young learners (pp. 99–117). New York: Routledge.

Zwick, Senturk, Wang, Loomis S. C. (2001). An investigation of alternative method for item mapping on the National Assessment of Educational Progress. Educational Measurement: Issues and Practice, 20(2), 15–25. https://doi.org/10.1111/j.1745-3992.2001.tb00059.x
