Abstract

Contemporary teacher evaluation systems use multiple measures of performance to construct ratings of teacher quality. While the properties of constituent measures have been studied, little is known about whether composite ratings themselves are sufficiently reliable to support high-stakes decision making. We address this gap by estimating the consistency of composite ratings of teacher quality from New Mexico’s teacher evaluation system from 2015 to 2016. We estimate that roughly 40% of teachers would receive a different composite rating if reevaluated in the same year; 97% of teachers would receive ratings within ±1 level of their original rating. We discuss mechanisms by which policymakers can improve rating consistency, and the implications of those changes for other properties of teacher evaluation systems.

Keywords

teacher accountability, performance assessment, evaluation, simulation, measurement


The Consistency of Composite Ratings of Teacher Effectiveness: Evidence From New Mexico
