Abstract
Among the variety of selected-response formats used in L2 reading assessment, multiple-choice (MC) is the most commonly adopted, primarily due to its efficiency and objectivity. Given the impact of assessment results on teaching and learning, it is necessary to investigate the degree to which the MC format reliably measures learners’ L2 reading comprehension in the classroom context. While researchers have claimed that the longer the reading test (i.e., the more test items and passages it contains), the higher its overall reliability, few studies have investigated the optimal number of items and passages required for reliable classroom-based L2 reading assessment.
To address this research gap, I adopted generalizability (G) theory to investigate the score reliability of the MC format in classroom-based L2 reading tests. A total of 108 ESL students at an American college completed an English reading test that included four passages, each accompanied by five MC comprehension questions. The results showed that the score reliability of the L2 reading test was critically influenced by the number of items and passages: different combinations of passage and item counts altered the degree of reliability. Implications for practitioners and educational researchers are discussed.
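The abstract's core claim — that reliability depends jointly on the number of passages and the number of items per passage — follows from how G theory partitions error variance in a design where items are nested within passages. A minimal sketch of a D-study projection is below; the variance components used (`var_p` for persons, `var_ph` for the person-by-passage interaction, `var_pih` for the residual person-by-item-within-passage term) are hypothetical illustrative values, not figures from the study.

```python
def g_coefficient(var_p, var_ph, var_pih, n_passages, n_items):
    """Generalizability coefficient for a persons x (items : passages)
    D-study design, with passages and items treated as random facets.

    Relative error variance averages the person-by-passage interaction
    over passages and the residual term over all items, so adding
    passages reduces both error terms while adding items per passage
    reduces only the residual term.
    """
    rel_error = var_ph / n_passages + var_pih / (n_passages * n_items)
    return var_p / (var_p + rel_error)

# Hypothetical variance components (for illustration only).
var_p, var_ph, var_pih = 0.5, 0.2, 1.0

# The study's design: 4 passages x 5 items each.
base = g_coefficient(var_p, var_ph, var_pih, n_passages=4, n_items=5)

# Two ways to double test length to 40 items:
more_passages = g_coefficient(var_p, var_ph, var_pih, n_passages=8, n_items=5)
more_items = g_coefficient(var_p, var_ph, var_pih, n_passages=4, n_items=10)

print(f"4x5:  {base:.3f}")   # baseline design
print(f"8x5:  {more_passages:.3f}")
print(f"4x10: {more_items:.3f}")
```

Under these assumed components, doubling the number of passages raises the coefficient more than doubling items per passage does, illustrating why different combinations of passage and item counts can yield different reliabilities even at the same total test length.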