Abstract
To ensure that new teachers enter classrooms poised for success, we need more evidence regarding supports that can expedite skill development during teacher education. In this study, we capitalize on mixed-reality simulations as a setting for evaluating such supports. Using a randomized controlled trial, we examine the extent to which coaching improves text-focused instruction compared to the more typical practice of self-reflection. Then, employing a mixed-methods sequential explanatory design, we qualitatively and quantitatively explore the extent to which candidates’ individual characteristics might influence the effectiveness of those supports. We find that, on average, coaching is the more effective support for skill development, but we also surface potential drivers of heterogeneous effects of both coaching and self-reflection, including self-efficacy, extraversion, and prior skills. We conclude with implications for teacher education programs and researchers.
Hundreds of thousands of new teachers enter classrooms each year, many of whom struggle to employ effective teaching skills. Preservice preparation could be an important site for developing these skills, but there is very little causal research detailing methods for promoting such development (Hill et al., 2024). Beyond identifying promising practices for supporting all preservice teachers (termed “candidates”), we also need more evidence about differential effects for different candidates, who have a range of observed needs and would likely benefit from different supports. Just as effective K–12 instruction differs across students, so, too, might effective instructional support differ across candidates. Unfortunately, most preparation programs rely on a “one size fits all” approach, providing a predetermined sequence of coursework and learning opportunities. More flexible and tailored approaches to preparation could provide candidates with targeted experiences to develop skills with which they most struggle and supports that benefit them most. However, we need more research that explores the degree to which and ways in which candidates respond differently to distinct types of support. This is the central goal of this paper.
New technologies provide opportunities for both supporting and understanding candidates’ skill development. For example, thousands of candidates from scores of programs use a mixed-reality simulation (MRS) interface developed by Mursion (Ade-Ojo et al., 2022; Dalinger et al., 2020; Mikeska et al., 2023). The platform is “mixed-reality” because a trained actor (a “simulation specialist”) remotely controls student avatars in a virtual classroom. This virtual classroom creates a space to repeatedly try challenging teaching skills without practicing on real children. It also affords teacher educators the opportunity to observe candidates and provide coaching before candidates “try again,” in ways that are logistically challenging in clinical placements. Simulations also provide a standardized setting for assessing skill development and determining whether different supports expedite that development (Cohen et al., 2020, 2024).
In this study, we use sequential explanatory mixed methods (Ivankova et al., 2006) to evaluate two potentially promising supports—self-reflection and coaching—for helping candidates develop their skills, using MRS as a practice space and a standardized assessment platform. Self-reflection is a low-cost, practical tool frequently used in teacher education to promote improvement (Goddard & Foster, 2001; Gorski & Dalton, 2020; Yost, 2006). Though some descriptive work suggests that it can reduce teacher biases and promote commitments to social justice (Lin & Lucey, 2010; Sleeter, 2018), there is no causal work to our knowledge that suggests that it enhances teachers’ skills. In contrast, research consistently demonstrates that coaching has very large causal effects on teachers’ skills and student outcomes (Kraft et al., 2018). However, coaching is resource intensive, making it more challenging to employ in resource-constrained preparatory programs. Given these tradeoffs, programs would benefit from insight into whether some candidates might need more resource-intensive support like coaching, while others could benefit from more cost-efficient interventions like self-reflection.
The foundation of the study is a randomized controlled trial (RCT) designed to assess whether coaching or self-reflection better increases, on average, candidate readiness in text-focused instruction (Reznitskaya et al., 2009). Candidates at a public preparation program were randomly assigned to receive either directive coaching or structured self-reflection between simulations. During the first phase of the study, we quantitatively analyze the results of the RCT, identifying the causal impact of coaching compared to self-reflection on candidates’ skill development. Though scores of preparation programs use simulation technology to prepare candidates, our study is one of few to experimentally evaluate the benefits of different supports for promoting candidates’ skill development (Cohen et al., 2020, 2024; Ireland, 2021). However, we do not just need to know what supports work, but also for whom they are most helpful and potentially why. Thus, in the second phase, we use both quantitative and qualitative analyses to begin building an empirical base about how and why coaching and self-reflection may have differential utility for candidates. We answer the following research questions:
1) What is the differential impact of coaching compared to self-reflection on the quality of feedback provided in a simulated text-based discussion?
2) How might candidate characteristics or baseline skills enhance or mitigate improvement within coaching and self-reflection conditions?
Our mixed-methods evaluation makes several contributions. First, numerous scholars have called on teacher education researchers to better delineate the experiences and supports that causally impact new teacher development (Grossman, 2008; Hill et al., 2024). This study answers this call, contributing experimental evidence about the utility of coaching compared to self-reflection in a field where there is little causal work. Though there is an ever-growing research base demonstrating the value of coaching teachers (Kraft et al., 2018; Pianta et al., 2021), there is little work in the preservice space, where there is an urgent need for evidence-based approaches to rapidly develop teaching skills. Second, we generate empirically based hypotheses about how candidate characteristics and skills might influence responses to coaching and self-reflection. Rather than assuming all candidates benefit equally from the same supports, we provide early evidence to suggest preparation programs may be well served to consider the differential utility of different scaffolds for candidates with distinct needs.
Background
Preparing Teachers to Orient Students to Texts When Making Arguments and Inferences
Our simulations focus on a key teaching practice used across grade levels: helping students make inferences from texts (Castles et al., 2018). A wealth of research foregrounds the importance of teachers helping students make text-based arguments (Dewitz & Graves, 2021). In less productive interactions, teachers may accept minimal or unclear responses or employ perfunctory feedback (e.g., “Great job!” or “Try again.”) that does not move students towards deep textual understanding (Cazden & Beck, 2003). In more productive interactions, teachers support students in closely reading texts by probing contributions (Snow & O’Connor, 2016), asking them to clarify and elaborate their responses (Nystrand, 2006), and providing descriptive feedback naming strong elements of responses (Tunstall & Gipps, 1996). Teachers who affirm the use of textual evidence have students who subsequently provide more text-based evidence (Jadallah et al., 2011).
In recent years, many have argued that preservice teacher preparation is a crucial—and underutilized—site for cultivating such knowledge and skills (Hudson et al., 2021). While preparation courses provide opportunities to learn how students learn how to read or about pedagogical methods, there are far fewer opportunities to try out methods (Hindman et al., 2020). Simulated practice might be especially useful as candidates develop these skills.
Coaching and Self-Reflection as Levers for Improvement
We evaluate the differential utility of two supports for improving the quality of text-based feedback in the MRS: self-reflection and coaching. As far back as Dewey (1933), scholars have advocated for reflection as a tool to help teachers analyze their beliefs and skills. The premise that reflection is a sufficiently powerful lever for improvement is central to much literature on teacher preparation (Calderhead & Gates, 2003; Hatton & Smith, 1995), and is the primary skill assessed in consequential teacher licensure exams such as edTPA (Sato, 2014). Though there are numerous, small-scale, qualitative studies that detail the evolution of candidates’ beliefs and commitments to social justice with self-reflection (Acquah & Commins, 2015; Çimer et al., 2013; Lin & Lucey, 2010), there is no causal work showing that reflection changes teachers’ knowledge, skills, or beliefs.
While self-reflection is an oft-used strategy, there is also increasingly robust evidence about the benefits of coaching teachers (Kraft et al., 2018). A coach can illuminate blind spots and provide concrete strategies for improvement (Coburn & Woulfin, 2012). Though candidates sometimes receive coaching from mentors, most clinical experiences rely on an apprenticeship model where candidates learn through observation (Matsko et al., 2020). Thus, there is limited evidence of the utility of coaching during preservice preparation.
Candidate Characteristics and the Development of Teaching Skills
Beyond testing the causal impact of coaching and self-reflection, we are also interested in the reasons why teachers differentially respond to such supports. Individuals respond differently to learning experiences based on perceptions, beliefs, and characteristics (Horn et al., 2008). For example, teacher self-efficacy is linked to increased persistence with challenging learning opportunities, as well as classroom innovation and risk-taking (Hoy et al., 2009). Teachers with higher self-efficacy tend to report less stress from classroom experiences and less job-related burnout (Yost, 2006). They are also more likely to apply a coach’s suggestions (Stewart et al., 2008). Similarly, measures of personality, including extraversion, conscientiousness, and openness to new experiences, have been shown to be consistent over time and across contexts, and also predict individuals’ engagement with and learning from professional experiences (Asendorpf & Wilpers, 1998; Hampson & Friedman, 2008). Extraverted individuals tend to thrive on input from others and prefer engaging in conversation to solve problems (Jensen-Campbell et al., 2002). Perhaps because coaching necessitates social interaction and external feedback, studies outside of education demonstrate a strong positive relationship between extraversion and responsiveness to coaching (Jackson et al., 2011; Jones et al., 2014).
Teachers’ prior knowledge and skill development should also, theoretically, play a role in responses to coaching and self-reflection. Candidates begin preparation programs with varying content knowledge in literacy (Hindman et al., 2020) and general pedagogy (McDiarmid & Clevinger-Bright, 2008). Candidates’ knowledge and skills also develop at different rates, even within the same preparation program (Boguslav & Cohen, 2024; Cohen et al., 2020). We theorize that these different trajectories influence candidates’ responses to reflection and coaching, though no studies to our knowledge have explored this empirically. A teacher with robust skills in text-focused instruction might be less motivated to take advantage of either support (Tschannen-Moran & Johnson, 2011). Alternatively, one who has struggled with text-focused instruction might be more inclined to leverage supports, particularly coaching, that provide insights from a more “expert” other (Gibson, 2006). Taken together, evidence suggests that teacher beliefs and personal characteristics, as well as existing knowledge and skill, might influence the development of instructional quality, although there is less evidence about the mechanisms underlying such relationships (Hill et al., 2008).
Study Design
Participants (n = 93) at a large, public university were enrolled in a joint Bachelor of Arts and Sciences/Master of Teaching (BAMT) program or a Master of Teaching (MT) program that required the same sequence of courses and clinical experiences. Participants were mostly white (80%), female (90%), and middle class (72%), like the broader teaching profession. Simulations were integrated into a methods course focused on principles of curriculum and instruction across content areas and were designed to help candidates apply methods discussed in the course. This course occurred in the first semester for both programs, and candidates from both programs took the course together. All candidates were in concurrent clinical experiences for five hours a week, intended to provide opportunities to apply concepts and instructional strategies.
Our simulation scenario was designed to approximate teaching experiences at the middle of the K–12 band (5th–6th grade), making it relevant to both elementary and secondary candidates, with a goal of improving candidate feedback during text-focused discussions. We selected a 6th grade text entitled “A Dangerous Game,” in which Lisa, the protagonist, applies to be a student intern at a technology company. The company is run by a mysterious genius, Pizmo, who makes prospective employees take a lie detector test. The text supports inferences that Lisa is not an intern, but a corporate spy collecting insider information.
Simulation Procedures
Candidates engaged in three parallel forms of the simulation. The first was a baseline measure (“Baseline”), early in the semester. Before the second simulation (“Second Simulation”), candidates were randomly assigned within course sections to coaching (n = 49) or self-reflection (n = 44). After receiving coaching or self-reflection, candidates immediately tried the scenario one more time (“Final Simulation”). The second and final simulations took place two months after baseline (for details, see Figure A1 in the Supplemental Appendix).
Each simulation was five minutes. Before practicing in the simulator, candidates were provided the text and two text-based questions: 1) How does Lisa most likely feel when Pizmo brings up her lie detector results? 2) Who do you think Lisa really is? What makes you think that? Candidates were prompted to generate their own responses, anticipate student responses, and describe how they might support students to develop inferences using textual evidence.
During each simulation, student avatars (termed “students”) provided five pre-planned responses that represented either a partial understanding (e.g., a literal interpretation in response to an inferential question: “Lisa is an intern”), a claim unsupported by text evidence (e.g., “Lisa is not afraid of being found out”), or a claim supported by text evidence (e.g., “Lisa’s heart is racing, so I think she’s afraid Pizmo will discover who she is”). If a candidate did not ask follow-up questions, simulation specialists were trained to have students not elaborate their thinking. If a candidate scaffolded development of a response, students provided additional information.
Self-Reflection and Coaching Protocols
Candidates assigned to self-reflection were given five minutes to respond to three prompts: 1) What are some ways you think you were able to support students in making text-based claims during this discussion? Please list at least two items; 2) In what ways could you improve your responses to students during this discussion so they would be better able to provide text-based claims? Please list at least two items; and 3) What are you going to try to do differently in your next session? Please explain what strategies you will use to support students in text-based instruction and why you think they will be helpful.
The other candidates received five minutes of directive coaching from a doctoral student in education with prior K–12 teaching and coaching experience. Coaches were trained and certified to use a protocol which prompted them to: 1) elicit candidate reflection (e.g., “How did you feel about that simulation?”), 2) label a strength (e.g., “I was impressed with how you probed student responses for text evidence”), 3) label a focus based on the skill progression (e.g., “I want you to focus on scaffolding student comprehension”), 4) practice the skill with a candidate-coach role-play (e.g., “You ask question one; I’ll pretend to be a student.”), and 5) remind the candidate to use the focal skill in the next session (See Figure A2 in Supplemental Appendix).
Data and Measures
Quantitative Data
Outcome Measure
Our primary outcome of interest is an observation-based measure of the quality of feedback candidates provided to students during simulated text-focused instruction. Scores range from 1 to 10, with a higher score indicating greater use of high-quality feedback (see Supplemental Appendix B). Put simply, lower scores indicate a higher prevalence of perfunctory feedback and non-text probing; high scores indicate a higher prevalence of text-based probing, descriptive feedback, and extending and re-voicing student contributions (Snow & O’Connor, 2016). Though these rubrics were researcher-developed, we have analyzed the relationship between scores on our rubrics and other related measures of how teachers engage with student contributions during academic discussions (Demszky et al., 2021). These include automated measures of teacher uptake and human-scored measures of how teachers respond to student ideas (Demszky et al., 2024). Our measure was significantly correlated with both measures, providing important convergent validity evidence (Cohen et al., 2024).
Candidate Characteristics
All students in this preparation program complete a battery of nine externally developed instruments on their experiences, attitudes, beliefs, characteristics, and practices. From this battery, we focus on a subset of instruments that the literature suggests would be most theoretically relevant to the current study on the utility of coaching and self-reflection for teachers (Asendorpf & Wilpers, 1998; Hampson & Friedman, 2008; Horn et al., 2008; Hoy et al., 2009; Stewart et al., 2008; Yost, 2006). Specifically, we focus our analyses on three available instruments and their relevant subscales: the efficacy for instructional strategies (TSES-IS) subscale of the Teacher Sense of Efficacy Scale (TSES) (Tschannen-Moran & Hoy, 2001), the Depression, Anxiety, and Stress Scale (Lovibond & Lovibond, 1995), and the NEO-5 Factor Inventory (Costa & McCrae, 1992), which provides a measure of five domains of personality: neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness. Additional demographic data are used to increase the precision of our experimental estimates.
Qualitative Data
We analyze several sources of qualitative data: transcribed simulation sessions (three per candidate, n = 279), transcribed coaching conversations (n = 49), and written self-reflections (n = 44). We also analyze candidates’ self-assessments after the second simulation to gain insight into whether candidate conceptions of their text-focused instruction aligned with the research team’s ratings. After the final simulation, candidates again self-assessed their performance and completed a survey about their experience in the simulator. These data provide insight into candidates’ experiences and any perceived improvements in performance.
Evaluation Methodology
We use a mixed-methods sequential explanatory design to evaluate simulation supports in such a way that we capture both trends (with quantitative estimation) and details (with in-depth qualitative analysis). Figure 1 displays the sequence of our approach, which proceeds in four stages: We 1) estimate the average causal impact of coaching versus self-reflection on improvement; 2) use an algorithmic machine learning approach (regression trees) to identify which characteristics best distinguish between candidates who are likely to grow more and less in each condition; 3) purposively sample one high-growth and one low-growth case in each condition for in-depth qualitative analysis; and 4) integrate the qualitative and quantitative findings.
Figure 1. Sequential Explanatory Design.
This study includes one prespecified confirmatory test: was the impact of coaching, compared to self-reflection, on overall quality of feedback in the simulator greater than zero? The remaining research questions and analyses are exploratory and designed for hypothesis generation. Thus, we interpret the findings with caution, underscoring the importance of future research with larger samples.
Confirmatory Quantitative Analysis
We use a blocked randomized design where candidates were randomly assigned to coaching or self-reflection within elementary or secondary course sections. Then, to estimate the causal effects of coaching after three simulations (baseline, second, final), we estimate the following model:

$$Y_{ib} = \beta_0 + \beta_1 \text{Coach}_{ib} + X_{ib}'\gamma + \delta_b + \varepsilon_{ib}$$

Here, $Y_{ib}$ is the overall quality of feedback score for candidate $i$ in randomization block $b$; $\text{Coach}_{ib}$ is an indicator equal to one if the candidate was assigned to coaching and zero if assigned to self-reflection; $X_{ib}$ is a vector of baseline covariates, including the baseline simulation score and demographic characteristics, included to increase precision; $\delta_b$ are fixed effects for the randomization blocks (course sections); and $\varepsilon_{ib}$ is a candidate-specific error term. The coefficient of interest, $\beta_1$, captures the average differential effect of coaching relative to self-reflection.
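To make the estimation concrete, the following is a minimal sketch of how such a blocked experimental contrast could be fit in Python; the data file and column names (candidates.csv, final_score, coach, baseline_score, block) are hypothetical placeholders rather than our analysis code.

```python
# Minimal sketch of the blocked experimental estimate.
# Assumes a hypothetical data set with one row per candidate and columns:
# final_score, coach (1 = coaching, 0 = self-reflection),
# baseline_score, and block (course section).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("candidates.csv")  # hypothetical file name

# Final score regressed on treatment assignment, with a baseline-score
# control and fixed effects for the randomization blocks;
# heteroskedasticity-robust (HC2) standard errors.
fit = smf.ols(
    "final_score ~ coach + baseline_score + C(block)", data=df
).fit(cov_type="HC2")

print(fit.params["coach"], fit.bse["coach"])  # beta_1 and its standard error
```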
Exploratory Quantitative Analysis
Following the causal analysis, we use exploratory quantitative analysis to elicit hypotheses about which candidates benefit most and least from coaching and self-reflection. The basis of our approach is a regression tree, which splits candidates into subsets based on characteristics that predict growth within treatment conditions. We use regression trees to answer: if we wanted to split candidates into two groups—one which we predict will experience more growth with coaching/self-reflection and one which we predict will experience less growth—which characteristic and cut-point should we use? To answer this question, we grow two separate trees: one among coached candidates and one among those who self-reflected. Each tree predicts candidate growth, $\Delta_i$, defined as the change in a candidate's overall feedback quality score between the second and final simulations.
The trees split candidates using the predictor and cut score that minimize the mean squared error (MSE) within the resulting subsets, termed leaves. Here, MSE is defined as the average squared difference between candidates' observed growth and their predicted growth given the leaf in which they are placed. Formally,

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\Delta_i - \bar{\Delta}_{\ell(i)}\right)^2,$$

where $\Delta_i$ is the observed growth of candidate $i$ and $\bar{\Delta}_{\ell(i)}$ is the mean growth in the leaf $\ell(i)$ to which candidate $i$ is assigned.
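As an illustration of this splitting rule, a one-split tree can be grown with scikit-learn; the sketch below uses synthetic placeholder data, and the feature names are hypothetical stand-ins for the instruments described above.

```python
# Sketch: a single-split regression tree predicting growth within one
# condition (synthetic placeholder data, not the study data).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
features = ["extraversion", "self_efficacy", "second_sim_score"]
X = rng.normal(size=(49, len(features)))  # one row per candidate
growth = rng.normal(size=49)              # final minus second sim score

# max_depth=1 forces one binary split (two leaves), chosen to minimize
# the within-leaf MSE defined above.
tree = DecisionTreeRegressor(max_depth=1, min_samples_leaf=4).fit(X, growth)

print("split on:", features[tree.tree_.feature[0]])
print("cut point:", tree.tree_.threshold[0])
print("leaf means:", tree.tree_.value[1:, 0, 0])  # mean growth per leaf
```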
The regression tree approach to identifying distinguishing characteristics offers a few advantages. First, tree-based classifiers are distinctly suited to optimally distinguishing between groups based on predictors (Loh et al., 2015). If our aim is to predict a subset of individuals who will experience the most growth (after coaching or self-reflection) based on one or more characteristics, a regression tree answers this question. Thus, if programs want to target coaching for a few candidates or recommend that some candidates engage in self-reflection, a regression tree can preemptively identify a subset of target candidates based on their characteristics (Athey & Imbens, 2016). Second, a regression tree is non-parametric and can identify non-linear relationships (Hastie et al., 2009). Thus, if we believe that a candidate characteristic has a non-linear relationship with growth—for example, perhaps there is a floor such that a candidate simply needs to be extraverted enough to benefit from coaching—a regression tree is well suited to identifying this relationship.
However, there are two key disadvantages of our regression tree approach. First, by design, the regression tree will answer which characteristics are most predictive, but it will not answer whether the relationship is substantial or statistically significant. In other words, the most predictive characteristic may or may not have a significant relationship with growth. Given our interest in exploration and hypothesis generation, we do not conduct null hypothesis significance testing on these findings. As a measure of variability, however, we do estimate the average root mean squared error (RMSE) using ten-fold cross-validation. Second, regression trees can be quite sensitive to small variations in a sample, particularly when they are allowed to continue splitting candidates into smaller and smaller subsets. For this reason, given our limited sample size, we only allow the tree to branch once, creating two leaves. We also test the robustness of the identified characteristics across random samples, again using ten-fold cross-validation (see Appendix D).
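A corresponding sketch of the ten-fold cross-validated RMSE, again with synthetic placeholder data:

```python
# Sketch: ten-fold cross-validated RMSE for the one-split tree
# (synthetic placeholder data, not the study data).
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(49, 3))  # placeholder candidate characteristics
growth = rng.normal(size=49)  # placeholder growth scores

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeRegressor(max_depth=1, min_samples_leaf=4),
    X, growth, cv=cv, scoring="neg_root_mean_squared_error",
)
print("average RMSE across folds:", -scores.mean())
```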
Explanatory Qualitative Analysis
In our qualitative analysis, we select four candidates whose characteristics were aligned with the exploratory quantitative trends. Our goal is to add nuance to the quantitative findings and build theory about how and why individual characteristics might predict responses to supports in ways that are impossible with quantitative variables and statistical relationships (Tashakkori & Teddlie, 2003). As with any mixed-methods work, the rationale is rooted in the idea that neither quantitative nor qualitative data are “sufficient . . . to capture [both] the trends and details of the situation,” and the integration of the two facilitates richer and more robust analyses (Ivankova et al., 2006, p.3).
We focus on four illustrative cases of two candidates in each condition (self-reflection and coaching): one who exhibits high levels of growth and one who exhibits stagnation or decline (for detailed characteristics, see Table 2). We intentionally selected extreme instantiations of growth, stagnation, and decline because research suggests that outliers can be especially helpful in bringing into relief mechanisms that might be at play, to a lesser degree, for all candidates (Hill et al., 2011). For each candidate, multiple researchers analyzed transcripts of simulations, coaching, and self-reflections (Miles & Huberman, 1994). We collaborated to develop detailed memos describing individual characteristics and shifts in teaching, with a focus on generating hypotheses about links between the two (Dyson & Genishi, 2005). These cases are not intended to be representative of all candidate experiences; rather, they provide nuanced insights and illustrative examples of how individual characteristics might influence variation in responses to coaching and reflection. In providing these qualitative descriptions, we hope to help teacher educators better understand what it actually “looks like” when a more efficacious candidate engages in self-reflection, or a less extraverted candidate receives coaching. We hope these more nuanced qualitative insights promote more targeted implementation of both self-reflection and coaching across the preparatory curriculum.
Results
Confirmatory Quantitative Results
Table 1 presents our experimental estimates of the differential impact of coaching compared to self-reflection. On average, we find that candidates who received coaching scored 1.39 points higher on the overall quality scale than peers who self-reflected. Candidates who engaged in self-reflection averaged about 4.9 points out of 10; candidates who were coached scored about 6.3 points. These effects are both substantial and statistically significant; for additional context on the interpretation of these values, see Appendix A.
Table 1. Coaching Effects on Teacher Candidates’ Performance Outcomes.
Note. Coefficients from separate regressions of the outcome on treatment. ***p < 0.001, **p < 0.01, *p < 0.05.
The final five outcomes in Table 1 pertain to the percentage of each response type provided. These may be thought of as substitutes, as more time spent on one type of response (e.g., descriptive feedback) should reduce the time available to engage in other responses (e.g., perfunctory feedback; see Table A1). We observe an increase in text-based probing and descriptive feedback alongside a decrease in the proportion of perfunctory feedback and non-text probing.
Exploratory Quantitative Results
Figure 2 presents the results of our exploratory regression tree analysis. Within the self-reflection condition, the regression tree identified the second simulation score (which immediately precedes self-reflection) as the best binary predictor of growth (a finding which is robust to variation across ten random folds; see Appendix D). The figure shows that candidates in self-reflection who scored below 4.25 on the second simulation grew by an average of 0.91 points, while those who scored above declined by an average of 0.70 points (RMSE = 1.26). Given variability in the sample (see the bootstrapped confidence intervals), we provide these means solely to demonstrate the direction of the observed relationship: in our sample, lower baseline skills are associated with growth in self-reflection, while higher baseline skills are associated with decline.
Figure 2. Regression Tree Results.
Appendix D tests the robustness of the regression tree finding to the minimum leaf size constraint; the second simulation score is the best binary predictor of growth within self-reflection when the minimum leaf size is as small as 4 or as large as 18. However, when we require that the tree split candidates into roughly equal-sized groups, efficacy for instructional strategies becomes the better predictor.
Within the coaching condition, the regression tree identifies extraversion as the single best binary predictor (a finding which is again robust across folds; Appendix D). In our sample, coached candidates who scored below 3.46 on the extraversion factor of the NEO-5 Factor Inventory grew an average of 0.72 points, while those who scored above grew by 1.5 points (though, again, there is uncertainty surrounding these means, as demonstrated by the confidence intervals; RMSE = 1.22). As with self-reflection, findings are relatively robust to the minimum leaf size. It is not until the minimum leaf size is set to 19 that extraversion is no longer the best predictor. At that point, candidates can best be split into low-growth and high-growth groups using their overall score on the Depression, Anxiety, and Stress Scale (see Appendix D for more details).
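The confidence intervals referenced above can be obtained with, for example, a percentile bootstrap of the within-leaf mean growth; the sketch below uses placeholder values rather than our data.

```python
# Sketch: percentile-bootstrap 95% CI for mean growth within one leaf
# (placeholder values, not the study data).
import numpy as np

rng = np.random.default_rng(0)
leaf_growth = rng.normal(loc=0.9, scale=1.2, size=22)  # placeholder leaf

boot_means = [
    rng.choice(leaf_growth, size=leaf_growth.size, replace=True).mean()
    for _ in range(5000)
]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {leaf_growth.mean():.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```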
Finally, to explore potential ceiling effects, we produce a scatter plot of second simulation performance and growth across both conditions, where gray dots indicate self-reflection candidates and black dots indicate coached candidates (see Figure 3). The plot reveals a key finding: while candidates with higher second simulation scores can improve with coaching, no candidate scoring above a five improves with self-reflection.
Figure 3. Scatter Plot of Second Simulation Score with Growth, by Treatment Status.
Explanatory Qualitative Results
Finally, to better understand why growth, stagnation, and decline occur after self-reflection and coaching, we present four illustrative cases (all names are pseudonyms), focusing on the characteristics identified in the exploratory quantitative findings. We present Kathleen to help explain potential mechanisms of decline for those in the self-reflection condition, Victoria to help explain potential mechanisms of growth for self-reflection, Alex to explain mechanisms of stagnation or decline with coaching, and Liz to explain mechanisms of substantial growth with coaching (see Table 2 for descriptive characteristics).
Table 2. Descriptive Characteristics of Illustrative Cases in Percentiles.
Self-Reflection and Decline: Kathleen
Kathleen is a white, female candidate in the elementary program whose score drops from 8 in the second simulation to 4 after self-reflection. Kathleen also scores far below average on the instructional strategies subscale of the teacher self-efficacy scale (3rd percentile) and below average on the extraversion dimension of the NEO-5 Factor Inventory (46th percentile).
In her second simulation, Kathleen successfully probes student responses for text evidence and provides descriptive feedback to students. When a student, Ethan, states Lisa is excited Pizmo brought up the lie detector test, Kathleen pushes Ethan for evidence: “Where do you see that in the text?” When Ethan highlights that he would be excited if he were working at a “fancy technology company like Lisa,” Kathleen seems to recognize the source of his emerging thinking—that he sees excitement in a different part of the text—and pushes him to consider more relevant textual evidence: “That makes sense you would be [excited] if you were a student intern, but what about when Pizmo brings up the lie detector results?” Ethan revises his thinking, and Kathleen provides additional descriptive feedback, “I’m really impressed you were able to go back to the text, reconsider your claim, and come to a different conclusion.” This exchange highlights her ability to support Ethan towards a more text-based response, as well as her recognition of his process for making a claim.
Yet, Kathleen rates her quality of feedback a four out of ten, dramatically underestimating her performance relative to the research team’s assessment (8/10). We take this as evidence of her general uncertainty about her skills, which is triangulated by her self-reflection transcript. When prompted to name what went well, Kathleen writes, “I think I tried to use some feedback loops, however I wasn’t always able to continue them for very long. I wanted it to be a longer conversation but wasn’t sure what to say.” Kathleen sets the goal of praising students more, struggling to identify strategies for improvement. She rates reflection as unhelpful, writing, “coaching would’ve helped me think of ways to engage students.”
Kathleen does provide more praise in her final simulation but does not consistently engage in practices known to support students during text-focused instruction, scoring a 4/10. Though Kathleen again helps Ethan revise his thinking, she remains neutral about his response, “Okay.” Rather than providing descriptive feedback, she solicits other students, “Ava or Savannah, do you have thoughts on the first question? No? Alright, let’s move on.” Throughout the final simulation, Kathleen abstains from engaging with student thinking as she had previously.
Despite evidence of high-quality text-focused instruction in the second simulation, Kathleen relies on lower-level supports following self-reflection. We take this case as suggestive evidence that candidates who enter a simulation with high baseline skills, yet lower self-efficacy, may exhibit uncertainty during structured self-reflection and can regress in performance, absent an external observer to support them in identifying their strengths and areas for growth. Her uncertainty is reflected in both her performance scores and written reflections, where she repeatedly returns to the idea that coaching would have been more helpful.
Self-Reflection and Improvement: Victoria
Victoria is a white, female candidate in the secondary education program who improves by 1.5 points following self-reflection (from a second simulation score of 5). Victoria scores above average on the teacher self-efficacy scale (82nd percentile) and above average on the extraversion dimension of the NEO-5 Factor Inventory (94th percentile).
In her second session, Victoria engages in a mix of high- and low-quality responses to students, resulting in a mid-range score (5/10). She uses text-based probing to elicit evidence, and in one instance provides descriptive feedback, “Ethan, I like how you tried to put yourself in Lisa’s shoes. Our prior experiences can help us make inferences.” However, Victoria also provides perfunctory responses to students, such as, “Yeah, that’s a good job.” She understands that the second simulation did not go as well as it could have, rating herself low (2/10) and citing generic responses as her main issue: “I would say things like ‘I appreciate how you used text evidence,’ rather than directing students specifically to the text.”
In her self-reflection, Victoria recognizes students need more support to refine their responses. She writes, “I noticed I wasn’t really saying tons to each student. Rather I was just saying a quick ‘I like how you used the text to support that statement,’ then moving on, instead of diving in deeper.” This highlights her understanding of the importance of students providing textual evidence, in addition to how she will engage with them. Here, Victoria rates her self-reflection experience positively, writing, “it was helpful to be able to reflect before teaching the same lesson.”
In the final simulation, Victoria engages more directly with student ideas as described in her self-reflection. When Ethan demonstrates a partial understanding about Lisa’s excitement related to her lie detector results, Victoria does not simply point him to the text. Instead, she asks, “she definitely seemed excited at the beginning. Do you think that excitement carried through, even when Pizmo brought up the lie detector results?” This specific scaffolding provides Ethan with more support in revising his thinking by pointing him to a particular section of the text. When he refines his response, she provides descriptive feedback to reinforce his process, “I like how you went back and rethought the question with specific text evidence.” Later, when Ava demonstrates a partial understanding about Lisa’s identity, Victoria probes, “I know you thought that because it says she is a new student intern. But, after reading about her lie detector results, it makes us wonder who she is, right?” Victoria notes this improvement in her self-evaluation, rating herself three points higher than before (5/10), albeit lower than the research team (7/10).
Victoria’s case reflects how a candidate with moderate, but not high, existing skill with text-focused instruction and high teacher self-efficacy can leverage self-reflection prompts to drive improvements. We theorize her improvements may have been facilitated by both her productive beliefs about her ability to grow and her substantial room for growth. Her concluding thoughts on the experience reflect her stance that she possesses capabilities that can be unlocked through practice and reflection: “I need to think more about responses to students that would be applicable to multiple texts so students can utilize the feedback I am currently giving them in later lesson[s].”
Coaching and Stagnation: Alex
Alex is a white, female candidate in the elementary program who received coaching but declined by one point between the second and final simulations (7 to 6). She scored well below average on the teacher self-efficacy scale (10th percentile) and the extraversion dimension of the NEO-5 Factor Inventory (26th percentile).
Alex’s performance in her second simulation suggests a reasonably robust understanding of how to support students in making text-focused arguments. She repeatedly probes students for text-based evidence. In one instance, she connects a student’s text-based response to their initial claim: “That’s a good point. Normally when your heart is pounding, it means that you’re worried about something or you’re nervous.” At multiple points, Alex also provides higher-level descriptive feedback, such as, “I like how you all are going back into the text and finding specific parts to emphasize your points.”
In her self-assessment, Alex writes that she did a “nice job asking students to bring examples from the text to support their answers,” but also notes struggles with one student: “I was not sure how to bring him around to the correct answer.” She rates herself a 6/10 on this session, aligned with the research team’s rating (7/10).
In her coaching session, Alex’s coach encourages her to engage with partial student understandings more directly. Alex says she wasn’t sure how to respond to Ethan’s unsupportable claim: “I was trying to figure out how to respond to that without saying ‘that’s not quite true.’” Alex’s coach helps her find textual evidence to redirect Ethan’s thinking and provides an avenue for probing his partial understanding: “You can point Ethan back to the specific place in the text about the lie detector results and have him respond there.” After this interaction, however, Alex largely responds to the coach’s prompts with one-word answers, suggesting some potential discomfort. Though Alex is positive in her post-coaching survey about her coach who “helped [her] figure out how to address Ethan’s misconceptions,” she also notes, “it was hard in the moment . . . to get critiqued and then to know what to do with it.” She goes on to say, “I felt like I needed more time to think about what she was saying and to plan for the next [simulation].”
In her final simulation, Alex leverages the coaching to improve her provision of specific text-based feedback. When Ethan provides the partial understanding that Lisa is excited, instead of simply asking “What makes you think that?,” Alex uses the coach’s technique of pointing to additional evidence that would support stronger inferences: “But in this lie detector section, that suggests maybe Lisa was nervous?” However, she uses less descriptive feedback, instead responding to students with more generic comments, such as, “I like how you all are going back into the text and finding specific parts.” In this final session, she scores a 6/10, a one-point decrease from the prior session. Alex assesses her performance highly (8/10), noting how she probed Ethan for details.
Alex’s performance stagnated after coaching. Though she was able to improve in the area targeted by her coach, she didn’t consistently engage in higher-level descriptive feedback in her final session. Again, we cannot make conclusive claims about why Alex stagnated, but we theorize that her initial high score may have made it difficult to continue improving, representing a potential “ceiling effect” for this teaching practice. Another theory is that Alex’s more introverted personality may have influenced her response to coaching. Research suggests this kind of performative experience might be intimidating for introverted candidates (Asendorpf & Wilpers, 1998; Jones et al., 2014). Feedback without a strong coaching relationship may be challenging for these candidates. Alex’s concluding thoughts reflect her hesitations about coaching: “It’s always hard for me to know what I’m walking into with the coaching in the simulator . . . the situation makes me uncomfortable . . . I liked the coach, but it was also hard to use that feedback immediately after.”
Coaching and Improvement: Liz
Liz is a white, female candidate in the elementary program who had a low second simulation score of 3. Liz scored above average on teacher self-efficacy (58th percentile) and well above average on the extraversion dimension of the NEO-5 Factor Inventory (74th percentile). After coaching, Liz improved her overall quality score by 4 points, an effect size of nearly 4 standard deviations.
In her second simulation, Liz offers largely perfunctory feedback like, “good thinking,” only occasionally asking students to elaborate their reasoning. In her post-simulation survey, Liz writes her simulation went “really well” because she had students “elaborate on all of their answers.” Although the research team assigned a low rating (3/10), Liz rated herself highly (7/10), perhaps reflecting misconceptions about what high-quality feedback entails, but also a positive sense of her skills. In the subsequent coaching session, Liz’s coach suggests, “To make it even stronger next time, really [try] probing for text-based evidence.” The coach underscores students’ need to reference the text and pushes Liz to engage with the text herself. For example, when the coach asks what evidence answers the question about Lisa’s emotions, Liz doesn’t remember. The coach directs Liz to related textual evidence before rehearsing a high-quality response together. In contrast to Alex, who responded monosyllabically to the coach’s prompts, Liz’s responses are lengthy, often followed by requests for specific guidance. The conversation is punctuated by laughter, and Liz notes in her post-coaching survey, “My coach was super nice and made it feel easy and natural.”
Liz seems to draw on the coach’s guidance in the final simulation. She pushes for text evidence when a student misidentifies Lisa’s affect as excited, asking “You think she’s excited? Where in the text does it say she’s excited?” When the student cannot provide evidence, she refers him to the text: “So, what might she be experiencing then? If you look at the text, where it was talking about her reactions to asking about the lie detector, what might she be experiencing?” Here, she probes the student for a complete, text-based response, directing him to specific evidence to support his conclusion. Liz rates her final teaching performance highly (8/10), aligned with the researcher rating of 7/10. She notes she improved by pushing the students to “give me information explicitly from the text that supported their claims.”
We take the case of Liz as an example of a candidate who seems to have entered the simulation with limited knowledge for text-focused instruction; however, her blind spots about text-based responses and appropriate text evidence were seemingly bolstered through directive coaching. We theorize Liz’s improvements may have been facilitated both by her productive beliefs about her potential to improve and her comfort in the coaching session. In her written reflections, Liz rates the experience highly, selecting “strongly disagree” in response to questions about whether it made her nervous. Her concluding thoughts reflect this positive perception without reservation: “I haven’t had the opportunity to do a lesson like this in real life in my [clinical] placements, so the simulator was a really good experience for me to practice with ‘real’ kids, and the coaching was especially helpful.”
Conclusion and Implications
Given the short window for preparing candidates, we need innovative approaches for providing all candidates the support they need to begin teaching poised for success. On average, we find that coaching was more effective than self-reflection at increasing our primary outcome of interest—an observation-based measure of text-focused instructional quality. Counter to a central tenet of most teacher preparation programs (Gay & Kirkland, 2003; Sato, 2014; Schön, 1983), repeated practice with reflection did not yield substantial improvements in candidates’ practice. In exploratory analyses, we tease apart this average effect to elicit new hypotheses about the role of teacher characteristics in moderating the effectiveness of both coaching and self-reflection. Of the available characteristics, we identified lower baseline skills as the best predictor of candidate growth after self-reflection and identified higher extraversion as the best predictor of candidate growth after coaching. In our explanatory qualitative analyses, we identify how a candidate’s perception of their own performance and skills can potentially influence their response to self-reflection and coaching. Finally, across conditions, we find those who start lower grow more on average, but coaching can help candidates push beyond the performance ceiling we observe with self-reflection.
There are some limitations to these analyses. We observe a single teacher preparation program and a particular population of candidates. We need more research with larger and more diverse samples of candidates, working across a range of preparatory contexts. We also recognize that our primary outcome is tightly aligned to the intervention and observed immediately after coaching or self-reflection. Many coaching studies suggest the need for sustained and ongoing support—distinct from the brief, directive coaching employed here—for teachers to change their practice more broadly in the longer-term (Kraft & Hill, 2020). That said, in other research using this directive coaching model, we have found evidence of the persistence of coaching effects months later (Boguslav & Cohen, 2024). Finally, we are underpowered to formally test heterogeneity in self-reflection and coaching effects based on candidate characteristics and baseline skills. However, we explore this variation quantitatively and qualitatively, generating hypotheses to test in future research with larger and more diverse samples.
Supporting students in making text-focused inferences is a complex undertaking for novice and veteran teachers alike (Snow & O’Connor, 2016). Given the typical ways teachers scaffold students during text-focused instruction (Cazden & Beck, 2003), candidates may not observe high-quality text-focused instruction in clinical placements and need other opportunities to develop these skills. Simulations represent one potential opportunity. Our data suggest that simulated practice can support substantial improvement across candidates, at least in the short term, and returns to these experiences are meaningfully augmented with coaching. Preparation programs may want to develop a range of simulations focused on instrumental skills for supporting student learning that are less likely to be demonstrated in clinical placements.
Time is perhaps the scarcest resource in teacher education, and our data also suggest that, on average, coaching is the more efficient method for leveraging simulations as a practice space. Coaching can support improvement for candidates who continue to struggle with text-based feedback after several months of coursework and clinical experience, as well as those with robust skills prior to coaching. That said, we are not arguing that self-reflection has no utility to support teacher learning. Indeed, Victoria demonstrates just how powerful reflection can be. However, those with stronger than average skills may not benefit from self-reflection. Understanding more about the relationship between prior skills and responses to coaching and self-reflection is an important area for future research.
Self-reflection is both less resource-intensive and more commonly used in teacher preparation, so we wanted to understand for whom such opportunities might be particularly beneficial. Though education researchers acknowledge the relationship between individual characteristics, beliefs, and prior performance in children’s learning, we rarely discuss the same interplay in teachers, who are learners, too (Rimm-Kaufman & Hamre, 2010). For example, while Kathleen’s second simulation suggests a strong understanding of how to support students, she does not leverage self-reflection as a tool for improvement (Hatton & Smith, 1995). We theorize that her improvement potential may have been mitigated by her higher initial performance and lower self-efficacy. In contrast, Victoria, who is more efficacious, substantially improves after structured self-reflection. Research has established many benefits of self-efficacy, both for teacher career persistence (Skaalvik & Skaalvik, 2017) and motivation (Roeser et al., 2002). So, too, might efficacy beliefs promote more productive reflection. Some have argued that teacher preparation provides far too few opportunities for candidates to enhance their self-efficacy (Tschannen-Moran et al., 1998), especially as we know such beliefs are most malleable during the first few years of teaching (Tschannen-Moran & Hoy, 2007). Evidence suggests experiences that provide a sense of mastery may well support the evolution of self-efficacy. For candidates like Kathleen, repeated practice in the simulator with outside coaching could enhance self-efficacy, provided it offers the sense of mastery it gave Liz. It might be especially helpful to target coaching resources for candidates like Kathleen, particularly as they develop early mastery experiences.
Second, our findings suggest that extraversion may make it easier for candidates to incorporate coach feedback. While Liz, who is quite extraverted, enjoyed coaching and improved dramatically after it, Alex provides a counterexample of how less extraverted candidates might struggle with the intensity and immediacy of this kind of coaching. We are not arguing that such personality traits preclude improvement. Rather, candidates like Alex may need support expressly targeted at individuals with similar profiles. Programs might consider how to ensure that less extraverted candidates have time to develop substantive relationships with coaches, or have coaches provide written feedback, which may be easier to digest in the moment. Time constraints or performing in front of peers may additionally exacerbate the stress associated with practice opportunities for less extraverted candidates (Paskins & Peile, 2010). These are adjustments that programs would do well to consider in promoting more uniformly positive simulation-based experiences. The intersection of personality and performance is understudied in the teacher preparation literature. Ongoing research on this front could contribute to more tailored support for all candidates.
Finally, we assert that experimental analysis in tandem with nuanced qualitative work will continue to bridge the gap between the potential for candidate growth in preservice spaces and the provision of supports that facilitate growth. Here we demonstrate the potential for investigating not just average effects, but also potential drivers for heterogeneity. Developing theory about mediators and moderators will refine not just the design of supports, but also the conditions necessary for replication in field settings. We see promise for the potential of simulations in teacher education, and only through continuing to surface drivers of differential effects will we fully understand the potential for these technologically mediated practice opportunities.
Supplemental Material
Supplemental material for “Tailoring Teacher Supports: A Mixed-Methods Analysis of Responses to Coaching and Self-Reflection” by Julie Cohen, Kylie Anglin, and Emily Wiseman (sj-docx-1-ero-10.1177_23328584241289876) is available online.
Acknowledgements
We appreciate feedback on earlier versions of this paper from discussants and participants at the Association for Public Policy Analysis and Management (APPAM) annual meeting and from colleagues at the University of Virginia, including Vivian Wong, Jim Wyckoff, and Judy Paulick. Mike Gurlea provided invaluable research assistance.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant #R305B140026 to the Rectors and Visitors of the University of Virginia. In addition, this research was supported by the Jefferson Trust at the University of Virginia and the Spencer Foundation-National Academy of Education postdoctoral fellowship. The opinions expressed are those of the authors and do not represent views of the funders, including the Institute or the U.S. Department of Education.
Open Practices Statement
The data and analysis files for this article can be found at https://deposit.icpsr.umich.edu/deposit/claimOwnership?tenant=openicpsr&claimId=137173
Authors
JULIE COHEN is the Charles S. Robb Associate Professor of Curriculum and Instruction at the University of Virginia School of Education and Human Development. Email:
KYLIE ANGLIN is an Assistant Professor in Research Methods, Measurement, and Evaluation at the University of Connecticut. Email:
EMILY WISEMAN is a director in the education practice at EY-Parthenon. Email: