Abstract
Value-added models (VAMs) attempt to estimate the causal effects of teachers and schools on student test scores. We apply Generalizability Theory to show how estimated VA effects depend upon the selection of test items. Standard VAMs estimate causal effects on the items that are included on the test; generalizability demands consideration of how estimates would differ had the test included alternative items. We introduce a model that accurately estimates the magnitude of item-by-teacher/school variance, revealing that standard VAMs can overstate reliability and overestimate differences between units. Using 16 academic outcomes from 8 studies with item-level data, we show that standard VAMs overstate reliability by a median of 0.04 on the 0-to-1 reliability scale (mean = 0.09, SD = 0.10) and yield standard deviations of teacher/school effects that are a median of 3% too large (mean = 12%, SD = 23 percentage points). We discuss how imprecision due to heterogeneous VA effects across items attenuates effect sizes, complicates comparisons across studies, and contributes to temporal instability, though these effects are reduced when the number of items is large. Our results suggest that accurate estimation and interpretation of VAMs may be improved using item-level data, including qualitative data about how items represent the content domain.
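For intuition, a minimal Generalizability Theory sketch of the mechanism described above (illustrative notation for a simplified students-nested-in-teachers-by-items design, not the authors' exact estimator): the generalizability coefficient of a teacher-level score based on \(n_s\) students per teacher and \(n_i\) items is
\[
E\rho^2 \;=\; \frac{\sigma^2_{t}}{\;\sigma^2_{t} \;+\; \dfrac{\sigma^2_{t \times i}}{n_i} \;+\; \dfrac{\sigma^2_{s:t}}{n_s} \;+\; \dfrac{\sigma^2_{\mathrm{res}}}{n_s\, n_i}\;},
\]
where \(\sigma^2_{t}\) is the variance of teacher effects generalizing over items, \(\sigma^2_{t \times i}\) is the item-by-teacher interaction variance, \(\sigma^2_{s:t}\) is student-within-teacher variance, and \(\sigma^2_{\mathrm{res}}\) is residual variance. A model that does not separate the \(\sigma^2_{t \times i}/n_i\) term absorbs it into the estimated teacher variance, which inflates both the apparent reliability and the estimated spread of teacher effects; the inflation shrinks as \(n_i\) grows, consistent with the pattern reported in the abstract.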
