Abstract
Achieving high-quality preschool at scale is challenging; doing so likely entails a combination of program standards, teacher qualifications and compensation, on-site quality monitoring, and professional development (PD). This study examines the impact of investments in PD within the context of an expansion of universal preschool in one of the nation’s largest school districts. We leverage the opportunity provided by a “natural experiment” to estimate the effects on teacher math practices of a PD program that embeds an evidence-based math curriculum in interdisciplinary units of study with coaching support. A total of 95 schools participated in this study (51 treatment and 44 comparison schools). Treatment sites implemented more teacher-led math activities for a longer period than comparison sites. The magnitude of the impacts of a curriculum and PD program implemented at scale was comparable to results from small-scale efficacy trials.
With growing support among policymakers, universal, publicly funded preschool has expanded over the past decade. In 2016, nationwide, state-funded preschool enrollment reached an all-time high, serving nearly 1.5 million children (Barnett et al., 2017). Proponents of preschool highlight benefits to children’s school readiness and later life outcomes, citing research that preschool produces impressive long-term impacts on educational attainment, criminal activity, and health, decades following participation (Belfield et al., 2006; Campbell et al., 2012).
Among all the factors influencing a preschool’s success, nothing is more important than program quality, namely, teacher-child interactions (Yoshikawa et al., 2013). However, achieving quality at scale is challenging: Evidence indicates that preschool quality may be lower when implemented across a large district, city, or state, especially when compared to those programs implemented under the developers’ close supervision or in the context of small efficacy trials (Dodge, 2009; Dusenbury et al., 2003). High-quality preschool implementation is possible, with a few promising examples in contexts as diverse as Boston, Massachusetts, and Tulsa, Oklahoma (Gormley et al., 2005; Weiland & Yoshikawa, 2013). Results from these studies indicate that assuring high-quality programs at scale entails (1) a professional development (PD) system to support the workforce and (2) the use of an evidence-based curriculum (Gulamhussein, 2013; Weiland et al., 2018). Nonetheless, the effectiveness of this two-pronged approach to improving teacher practices may depend on the quality of programs before implementation.
This study aims to examine improvements to program quality due to PD investments within the context of a universal preschool program serving an ethnically, racially, and linguistically diverse sample of children across a range of auspices (e.g., schools, centers) in the nation’s largest school district. We leverage a “natural experiment” to estimate the first-year effects of a PD program that embeds an evidence-based math curriculum with coaching support. To understand whether differences in program quality may influence PD impacts on teacher practices, we test whether program quality before implementation moderates those impacts. This study’s context (implementation by a major urban school district) represents the real-world conditions of an effectiveness trial and provides evidence regarding whether PD can produce measurable impacts on teacher outcomes amid the complexity of a large public school system. This work is vital in the current early learning policy context, where researchers are examining the best way to scale up early learning interventions and the conditions under which preschool is most effective (e.g., Bloom & Weiland, 2015; Duncan & Vandell, 2012).
Improving Preschool Quality Through PD
For preschool programs to eliminate educational opportunity gaps, the programs must be high-quality (Barnett, 2011). Among all the factors that influence preschool quality, providing PD to improve teachers’ knowledge and practice is particularly important (Hamre et al., 2017). However, the types of PD that most preschool teachers likely receive are not focused on evidence-based teaching practices, are insufficient in terms of duration and intensity, and are not presented in a format that will support sustained changes in teaching practices. Nonetheless, several PD models demonstrated impacts on teaching practice, not only as part of research-led studies but in practice-led, scaled-up implementations as well (e.g., Early et al., 2017). Some examples of PD models tested in smaller efficacy trials and then at scale include Making the Most of Classroom Interactions, My Teaching Partner, and Opening the World of Learning (e.g., Early et al., 2017; Pianta, Mashburn, et al., 2008; Weiland & Yoshikawa, 2013). Implementation and impact results from these studies and others helped define the critical components of a PD program needed to ensure PD is effective.
Specifically, a combination of training and coaching and the use of developmentally appropriate curricula are the key components that produce the largest improvements in teachers’ practices, classroom quality, and a range of child outcomes when expanding a PD program to scale (Sarama et al., 2012; Weiland & Yoshikawa, 2013; Weiland et al., 2018). To improve teaching practices and to support gains in children’s learning, a PD program should target specific teaching practices (Desimone & Garet, 2015; Zaslow et al., 2010) using an evidence-based curriculum. Furthermore, PD should include didactic instruction (e.g., workshops) with weekly or biweekly support from coaches (e.g., Bierman et al., 2008; Clements & Sarama, 2011; Morris et al., 2013). The literature suggests that fidelity to the PD program matters, but it is hard to achieve (Durlak & DuPre, 2008). Three dimensions of PD program fidelity are (1) dosage (an index of the quantity of delivery), (2) quality (a measure of the skill with which teachers deliver the material and interact with children), and (3) content (the extent to which the PD was delivered as prescribed).
We examine a PD program, Explore, which combines curricular materials and supports for teachers and leaders. The curricular materials include an evidence-based math curriculum called Building Blocks (BB; Clements & Sarama, 2008; Sarama et al., 2008) and the research-based Pre-K for All (PKA) Interdisciplinary Units of Study (Units) developed by the Division of Early Childhood Education at New York’s Department of Education (DECE-DOE). The decision to create a PD track that incorporated a math-focused curriculum was based on evidence that (1) preschool children’s math skills are foundational for a broader set of outcomes, including language, reading, and executive function (e.g., Duncan et al., 2007; Watts et al., 2014); (2) preschool math instruction is likely minimal in terms of both the dosage and the quality of instruction (Ginsburg et al., 2008); and (3) preschool children’s math competencies can increase by training teachers using an evidence-based curriculum with supports (Clements & Sarama, 2007; Presser et al., 2012). Furthermore, because a researcher-led efficacy study of BB called Making Pre-K Count (MPC) took place in New York City (NYC) a few years earlier, there was increased interest in expanding to more sites. Several teachers, leaders, and coaches who participated in the researcher-led efficacy trial played a leadership role in developing and implementing the Explore track. Despite the constraints of taking the PD program to scale, many implementation decisions for Explore were based on lessons learned from MPC (e.g., wait until the second year of implementation to focus on math learning trajectories due to the content’s complexity), and implementation was overseen and delivered by coaches who participated in the efficacy trial. As such, this study serves as an example of how to move a PD program from an efficacy trial to implementation at scale.
The Moderating Role of the Classroom Quality
Although research suggests that a comprehensive PD program, including workshops, coaching, and a curriculum, can change teacher practice, an important question is whether the PD program’s effectiveness depends on teachers’ skills before implementation, namely, the classroom quality. There is a set of general domains of classroom quality (or teacher-child interactions) that reflect responsive teaching, including emotional support, classroom organization, and instructional support (Hamre, 2014). Irrespective of the content of instruction, high-quality teachers use these general domains of support to engage with children, recognize their needs, and respond in individualized ways that foster social, behavioral, and academic development (Pianta, La Paro, & Hamre, 2008). Often, the introduction of a curriculum, particularly a content-specific one, assumes teachers are competent in delivering high-quality teacher-child interactions, and instead, the curriculum provides support for teachers to target children’s specific academic or behavioral skills through instruction in small or large groups, or individually (Wasik & Hindman, 2011). Unclear is whether a PD program’s impacts, including a curriculum, vary across initial teacher practice quality (Hamre et al., 2014).
Plausibly, a high-quality classroom before implementing a content-specific curriculum may enhance the quality of the PD program’s outcome (in this study, teachers’ math practices). That is, teachers who had higher quality classrooms before implementing a math-specific PD program may better incorporate the new curriculum into their existing practice, resulting in higher quality math practices at the end of the year (Goble & Pianta, 2017). Conversely, teachers with initially low levels of quality who struggle with providing classroom organization, emotional, or instructional support to children may find it challenging to implement a new content-specific curriculum, resulting in lower quality math practices. For instance, a teacher who struggles to keep children on task during content instruction may not implement the math practices taught at a PD workshop (e.g., Koth et al., 2008). Furthermore, if a teacher lacks general classroom quality skills, then a coach may spend time helping the teacher lay the groundwork (e.g., setting up the classroom) rather than implementing the new practices (e.g., Morris et al., 2013). Thus, in the context of a PD program at scale with potentially limited resources, it is of considerable interest to determine whether a PD program that includes a content-specific curriculum is sufficient across initial levels of classroom quality, both to understand how to support teachers’ practices and to identify who might benefit the most from the PD program.
Challenge of Developing and Evaluating a High-Quality PD System at Scale
Even carefully designed, well-funded PD programs may fail to affect teacher practice when implemented at a large scale (e.g., Markussen-Brown et al., 2017; Piasta et al., 2017). For instance, the results from the experimental study of class size reductions, Tennessee STAR, led to widespread implementation of policies and studies of similar interventions; however, the results from subsequent studies were mixed due to various contextual factors, such as a lack of qualified teachers (Whitehurst & Chingos, 2011). Similarly, studies of school size reductions sparked by notable effects from several small studies did not produce the hoped-for changes in achievement (Leithwood & Jantzi, 2009). Both of these examples suggest that educational reforms may not have the expected impacts when implemented at scale due to factors like poor implementation or differences in contexts.
A second challenge is how to rigorously assess an education intervention’s impact at scale (Murnane & Willett, 2010). The gold standard design for estimating impacts involves random assignment to a treatment or control group, whereby any differences in outcomes between the two groups can be attributed to the intervention. Randomized controlled trials in school districts can be challenging to carry out due to the difficulty of obtaining the consent of participants and educational institutions; thus, randomized designs in large-scale contexts are limited. In recognition of this challenge, researchers investigating preschool impacts use a variety of nonexperimental methods to estimate causal effects, including propensity score matching (e.g., Magnuson et al., 2007) and regression discontinuity (e.g., Gormley et al., 2005; Hustedt et al., 2007; Weiland & Yoshikawa, 2013). Though these nonexperimental designs are not without potential biases (Lipsey et al., 2015), they have the advantage of being recognized as relatively strong designs that are easily applied to programs at scale.
Another compelling strategy, in the absence of randomization and regression discontinuity design opportunities, is to rely on natural experiments, namely, studies in which researchers take advantage of a situation in which two otherwise identical groups are affected differently (i.e., exposed to a treatment and control condition) by a “natural” event that is exogenous to the treatment and the outcome (Murnane & Willett, 2010). The process governing the exposure to the different conditions resembles random assignment and creates otherwise identical pretreatment groups (Lipsey et al., 2015). Numerous research examples use such “natural” variation in policies or events to produce unbiased estimates of the effects. For example, Hoxby (2001) estimated the effects of school vouchers on school choice using school district boundaries determined by streams. As in this study, other researchers have taken advantage of naturally occurring pockets of randomization within school lotteries that mimic random assignment. For instance, if more students apply than there are seats available, within priority groups established by schools or districts, a lottery is used to randomly choose which students are offered a seat (e.g., Dobbie & Fryer, 2011; Lipsey et al., 2018; Unterman et al., 2016). The use of nonexperimental approaches, including natural experiments, constitutes a research design for addressing causal questions while meeting the methodological and ethical bars that are implicit in research studies at scale.
Present Study
The translation from efficacy studies to implementation across large preschool systems often requires adaptations by nonresearchers based on local constraints (e.g., number of PD days; funds) that may not match what was done in research studies. Although there are other studies of the BB curriculum that use more rigorous designs and measure child outcomes, this study is one of the few that improve our understanding of PD impacts on teachers’ math practices as a function of participation in a real-world, large-scale PD effort. Studies like this one are essential in understanding the success, or lack thereof, of educational approaches when adopted and implemented at large scale. We leverage a natural experiment within the NYC system of assigning sites to PD programs that resulted from a delay in funding decisions outside of the control of the DECE-DOE, preschool programs, and researchers. We address the following questions:
Method
Launched in 2014, PKA represents NYC’s commitment to providing free, full-day, high-quality preschool to every 4-year-old (about 68,500 preschool seats in 1,850 sites). As part of their commitment to quality, DECE-DOE began building a system of PD to include the central features linked to quality—training, coaching, and curricula.
Track Assignment Procedures
A set of procedures were put in place that led to the assignment of sites to PD tracks outside of the control of the DECE-DOE, programs, and researchers; this study leverages a natural experiment in the assignment process of sites to PD tracks. In the spring of 2016, there were four PD tracks: (1) Explore, an evidence-based math curriculum, and the Units developed by the DECE-DOE that supports high-quality teacher practice and children’s development across domains by integrating math concepts into the classroom; (2) Create, an arts-based approach to incorporate visual arts, dance, theater, and music into ongoing instruction to promote learning across domains; (3) Thrive, an evidence-based set of strategies for supporting children’s social-emotional development and behavioral regulation, as well as for supporting family engagement; and (4) Inspire, a series of topics aligned with the district’s quality standards that support DECE-DOE instructional and child development goals. The Explore PD track was made possible by a mix of funding sources (public and private/external), which had not yet been finalized at the time of the PD track assignment.
Assignment to PD tracks was based on the following two conditions: (1) site leaders’ rank-order preference for Explore, Create, and Thrive, and (2) recommendations for the PD tracks from a social worker and/or instructional coach working with each site. An algorithm was developed to assign sites to PD tracks. First, the algorithm created six priority groups based on site preference and recommendations for each PD track (see Appendix A for a description of priority groups). A site could be in a different priority group for each track based on its preference and recommendations ranking (e.g., a site could be in priority Group 1 for Explore and priority Group 5 for Create). Second, sites were assigned to a PD track in the order of their priority group until the PD track capacity was met, beginning with the Thrive, Create, and Explore tracks. Each PD track (except Inspire) had a maximum capacity of sites that could be served. When demand for a PD track exceeded the capacity of that PD track (i.e., the track was oversubscribed), the site assignment algorithm randomly assigned sites from within a priority group to either that PD track or the Inspire PD track (essentially a lottery; e.g., if the number of sites in Create’s priority Group 4 exceeded the maximum number of sites allowed in Create, then sites in priority Group 4 were randomly assigned to Create or Inspire). Sites not assigned to Thrive, Create, or Explore due to oversubscription were also placed in the Inspire track.
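The assignment procedure described above can be illustrated with a minimal Python sketch. It is not the DECE-DOE algorithm itself: the function name, the data layout, the fixed random seed, and the rule that sites left over in lower priority groups of a full track remain eligible for later tracks are all assumptions made for illustration. Only the track order, the six priority groups, the capacity limits, and the within-group lottery come from the description.

```python
import random

# Track order as stated in the text; Inspire has no capacity limit.
TRACK_ORDER = ["Thrive", "Create", "Explore"]


def assign_tracks(priority, capacity, seed=0):
    """Hypothetical sketch of the site assignment procedure.

    priority: {site: {track: priority group 1..6}}
    capacity: {track: maximum number of sites}; a track missing from
              this mapping is treated as unavailable (unfunded).
    """
    rng = random.Random(seed)
    assigned = {}
    for track in TRACK_ORDER:
        if track not in capacity:
            continue  # e.g., Explore when funding is not secured
        remaining = capacity[track]
        for group in range(1, 7):
            if remaining == 0:
                break
            pool = sorted(
                s for s in priority
                if s not in assigned and priority[s].get(track) == group
            )
            if len(pool) <= remaining:
                winners = pool
            else:
                # Oversubscribed group: lottery between this track and Inspire.
                winners = rng.sample(pool, remaining)
                for s in pool:
                    if s not in winners:
                        assigned[s] = "Inspire"
            for s in winners:
                assigned[s] = track
            remaining -= len(winners)
    # Assumption: any site still unassigned is placed in Inspire.
    for s in priority:
        assigned.setdefault(s, "Inspire")
    return assigned
```

Calling the function once with an Explore capacity and once without it reproduces, in spirit, the two assignment lists (Scenarios A and B) that the natural experiment relies on.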
How Sites Were Assigned to Tracks in the Spring of 2016
In Spring 2016, with funding for the Explore PD track uncertain, the assignment algorithm was run twice to create two site assignment lists. The first site assignment list (Scenario A) is based on the condition that Explore funding was available and sites could be assigned to Explore, Create, Thrive, and Inspire. The second site assignment list (Scenario B) is based on the condition that Explore funding was not secured and, thus, sites could be assigned only to Create, Thrive, and Inspire (see Figures 1 and 2). Since the site assignment algorithm for Scenario B used the same priority groups from Scenario A, this meant that sites that would have been assigned to Explore in Scenario A tended to be in a lower priority group for Create. Because demand for the Create track, in Scenario B, exceeded the number of spots available, a set of sites were randomly assigned within a priority group to the Create or the Inspire PD track in Scenario B.

Figure 1. Hypothetical track assignment process within priority groups resulting in the natural experiment.

Figure 2. Overview of the process of the natural experiment.
After the site assignment algorithm created the two lists of site assignments (Scenarios A and B) and because funding for the Explore PD track remained uncertain, the DECE-DOE notified sites of their PD track assignment based on the results from Scenario B. However, several weeks later, funding for an additional cohort of Explore was secured. This meant that the sites that would have received Explore in Scenario A needed to now be reassigned from their current track (i.e., the Scenario B track assignment they had already received—Create or Inspire) to the Explore track. Sites already assigned to Create but that would have been assigned to Explore under Scenario A did not change PD track assignment to Explore because of the need for a certain number of sites in the Create track. As a result, only sites assigned to Inspire but that would have been offered Explore under Scenario A were eligible to participate in Explore.
Thus, during the 2016–2017 school year, two groups of sites in the PKA system were eligible for Explore and would have been assigned to Explore, but only one received the offer to participate in Explore (due to late funding). The treatment group included sites that would have been assigned to Explore in Scenario A and were ultimately assigned to receive Explore after all. Our comparison group comprised sites that would have been assigned to Explore in Scenario A but were ultimately placed in Create. This comparison group allowed us to examine what might have happened to our Explore (treatment) group, on average, if they had been assigned to the other tracks.
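The definition of the two analytic groups can be expressed as a small classification rule. The following Python sketch is illustrative only; the function name and the treatment of sites that fall outside the two described cases (returned as `None`, that is, excluded) are assumptions rather than details reported by the study.

```python
def analytic_group(scenario_a_track, scenario_b_track):
    """Classify a site using its two hypothetical assignments.

    Treatment: Explore under Scenario A and Inspire under Scenario B
    (these sites were reassigned to Explore once funding arrived).
    Comparison: Explore under Scenario A and Create under Scenario B
    (these sites stayed in Create).
    Any other combination is assumed to fall outside the analytic sample.
    """
    if scenario_a_track != "Explore":
        return None
    if scenario_b_track == "Inspire":
        return "treatment"
    if scenario_b_track == "Create":
        return "comparison"
    return None


# Example with made-up site identifiers.
sites = {"site_01": ("Explore", "Inspire"), "site_02": ("Explore", "Create")}
groups = {site: analytic_group(a, b) for site, (a, b) in sites.items()}
```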
It is necessary to examine the ways in which the site assignment process may have produced treatment and control groups that differ from one another in observed and unobserved ways. As emphasized by Cook and Campbell (1979) and Dunning (2008), assignment to treatment and control conditions (here, the Explore PD track) must be as if random. This implies that the Explore track assignment is independent of observable and unobservable factors that might influence teachers’ math practices. Furthermore, the treatment and control groups must be balanced on measurable variables that might explain teachers’ math practices. Particularly important, sites should not appear to self-select into their PD track in ways that might be associated with a propensity to teach math. Perhaps most concerning is that our treatment group is made up of sites that would have received Explore in Scenario A but were given the option to be reassigned to Explore; this means that our treatment group is made up of sites that chose to be reassigned to Explore. In contrast, sites that would have gotten Explore in Scenario A but ended up in the Create track (ultimately, our control group) did not have the option to be reassigned because of funding constraints.
Nonetheless, we argue that the site preference data, which predate PD track assignment, reflect sites’ true interest in a track, and as such, sites that ranked Explore as a top choice would have chosen to be reassigned to Explore. Furthermore, in focus groups, site leaders described making decisions about ranking site preference without consulting teachers. This, together with the fact that social workers’ and instructional coordinators’ recommendations played a crucial role in the PD track assignment process, suggests that teachers, who are the actual recipients of the treatment and control conditions, were divided into the two PD tracks most often without their choice or knowledge; thus, teachers were unlikely to self-select into either group in a way that might influence their propensity to implement math. A series of robustness checks was conducted to test these assumptions (Appendix B).
Program Services
Two years of program services were provided. The Explore program consisted of curricula, including BB (Clements & Sarama, 2008) and the Units, training, and on-site coaching. The BB curriculum is a multifaceted learning activity sequence targeting numeric and geometric or spatial topics laid out across 30 weeks in an easy-to-read, scripted manual. Curricular activities are organized based on the natural progressions by which children learn and develop math competencies over time (Clements et al., 2013). The following support was provided: (1) professional learning delivered by the developers of BB and expert facilitators trained in the BB curriculum, in which teachers participated in 4 full-day trainings (6 hours per day) and leaders participated in 3 half-day trainings (4 hours per day) across the school year, and (2) on-site coaching for teachers by external coaches trained by BB certified trainers. On average, coaches observed the teacher in the classroom once a month for 1 hour and employed strategies such as modeling, providing feedback, and discussing implementation with the teachers. Coaches debriefed with leaders during their visits, and they served as the direct point of contact for questions about Explore implementation. The amount of PD offered to teachers and leaders in this study was less than in other BB studies. In another NYC study of BB, teachers participated in 6 days of training in the first year of implementation and 3 hours of in-classroom coaching each week (Morris et al., 2016).
Teachers in the non-Explore track attended 4 full-day trainings and leaders attended 4 half-day trainings across the school year that focused on incorporating visual arts, dance, theater, and music to promote children’s learning. Instructional coordinators and/or social workers supported teachers in the classrooms. The dosage of coaching by instructional coordinators and/or social workers was dependent on need (e.g., low site quality scores), and changed across the year.
Sample
Ninety-five schools participated (51 schools in Explore and 44 in non-Explore; see Table 1) in the 2016–2017 school year (the first year of implementation). The 95 PKA programs included 32 public district schools, seven Administration for Children's Services–NYC Early Education Centers (ACS-NYCEECs), 52 DOE-NYCEECs, and four PreK Centers. District and preK center teachers must have a New York State teaching certification in early childhood along with a bachelor’s degree. NYCEEC teachers must have a teaching license or certificate in ECE/ECE students with disabilities or a bachelor’s degree with a plan for obtaining Early Childhood Education certification. NYCEEC teachers must commit to earning a New York State teaching certification within 3 years of their start date as a pre-K lead teacher.
Table 1. Baseline Equivalence of the Natural Experiment Sample
Note. DOE-NYCEEC = Department of Education–New York City Early Education Center; ACS-NYCEEC = Administration for Children's Services–New York City Early Education Center; CLASS = Classroom Assessment Scoring System.
p < .10. *p < .05. **p < .01. ***p < .001.
The children who attended participating sites were racially, linguistically, and socioeconomically diverse (Table 1). Across the sites, 34% of the children were Hispanic, 28% were Black, 15% were White, 20% were Asian, and 3% were of mixed, or other, race. Fifty-one percent of children were female. Thirty-three percent spoke a language other than English at home. Approximately 53% of children were eligible for free or reduced-price lunch, and 7% had an individualized education plan (IEP).
Classroom Observation Protocol
Ten trained graduate student observers (blinded to sites’ intervention status) collected classroom-level data. Observers underwent training with the classroom observational tool developer, including a practice observation in a site not enrolled in the study. Observational data were collected in one randomly selected classroom per site at one time point in the spring (March–May). Classroom observations were scheduled on days when coaches were not present. Observations were completed “live” on-site over the first 3 hours of the day. When observations were double-coded for reliability purposes, the resulting data use the mean of the two coders’ scores.
Measures
Classroom Outcomes
The Classroom Observation of Early Mathematics–Environment and Teaching (COEMET), a classroom observation tool that measures math instruction, was developed based on the characteristics and teaching strategies of effective teachers of early childhood mathematics (e.g., Clements & members of the Conference Working Group, 2004). The COEMET has two main sections, Classroom Culture and Specific Math Activities (SMA). Assessors complete the Classroom Culture section once to reflect their entire observation. The Classroom Culture section, which assesses teachers’ general approach to math education, includes items on the mathematical environment and interactions (e.g., environment showed signs of math, children’s math work on display, teacher actively interacted) and on the personal attributes of the teacher (e.g., teacher was knowledgeable about math, showed math learning could be enjoyable, showed curiosity for math). Assessors complete an SMA form for each teacher-led formal math activity, defined as an activity led by a teacher that lasted at least 30 seconds; developed math knowledge; had a discernible topic, goal, and task; and involved multiple conversational turns between a teacher and a child. Observers completed a mini SMA form to document each “simple” or “routine” teacher-led math activity (e.g., singing a song about numbers) that did not include an extensive conversation about math content. Interrater reliability for the COEMET, computed via simultaneous classroom visits by pairs of observers (10% of all observations, with pair memberships rotated), was 89%; 96% of the disagreements were of the same polarity (i.e., if one observer rated an item agree, the other rated it strongly agree).
We created two outcome variables to represent the quantity of math instruction observed in each classroom: (1) the count of teacher-led math activities represents the total number of teacher-led math activities observed (i.e., activities recorded on the SMA and mini SMA that were led by teachers) and (2) the minutes of teacher-led math activities represent the total number of minutes of teacher-led math activities observed.
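As a minimal illustration of how these two quantity outcomes could be derived from activity-level records, consider the following Python sketch. The `Activity` record and its field names are hypothetical; the COEMET data files are not described at this level of detail.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    form: str          # "SMA" or "mini SMA" (hypothetical labels)
    teacher_led: bool  # whether a teacher led the activity
    minutes: float     # observed duration in minutes


def quantity_outcomes(activities):
    """Return (count, total minutes) of teacher-led math activities."""
    led = [a for a in activities if a.teacher_led]
    return len(led), sum(a.minutes for a in led)


# Example: one classroom with three recorded activities (made-up values).
observed = [
    Activity("SMA", True, 12.0),
    Activity("mini SMA", True, 2.5),
    Activity("SMA", False, 8.0),
]
count, minutes = quantity_outcomes(observed)
```

A classroom with no teacher-led math activities yields a count of 0 and 0 minutes, so both outcomes are defined for every observed classroom.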
We created three variables that represent the math quality observed: (1) classrooms with at least one observed teacher-led math activity, (2) average math quality scores, and (3) dichotomous moderate math quality score. Ratings on the quality of math instruction are available only in classrooms where math was observed. First, we created a dichotomous variable that indicates whether a classroom had at least one observed teacher-led math activity (0 = no teacher-led math activities were observed; 1 = at least one teacher-led math activity was observed). Second, for classrooms where a teacher-led math activity was observed, the average math quality score was calculated by averaging across the items within each activity and then averaging across math activities. However, since the share of classrooms where at least one teacher-led math activity was observed differed between the treatment and control groups (71% vs. 34%), a comparison of average math quality scores does not represent the true impact; as such, we created a third variable that accounts for the fact that some classrooms are missing a math quality score (because they were not observed implementing a teacher-led math activity). This variable, the dichotomous moderate math quality score, was calculated for all classrooms, regardless of whether they were observed conducting a math activity. Classrooms were considered to have moderate-quality math instruction, and thus were coded 1, if they had an average math quality score at or above a rating of 2 (median split) on a scale from 1 to 5. Classrooms were coded 0, or low-quality math instruction, if they had an average math quality rating below 2 or no quality rating. Finally, the Mathematical Classroom Culture score was computed by summing the items from the Classroom Culture section.
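The construction of the three quality variables can be sketched in Python as follows. The input layout (one list of item ratings per teacher-led activity) and the function and key names are assumptions for illustration; the cutoff of 2 and the coding of classrooms without a rating as 0 follow the description above.

```python
from statistics import mean


def quality_outcomes(activity_item_ratings, threshold=2.0):
    """Compute the three math quality variables for one classroom.

    activity_item_ratings: list with one inner list of item ratings
    (1 to 5) per observed teacher-led math activity; empty if no
    teacher-led math activity was observed.
    """
    any_math = 1 if activity_item_ratings else 0
    if any_math:
        # Average items within each activity, then average across activities.
        avg_quality = mean(mean(items) for items in activity_item_ratings)
    else:
        avg_quality = None  # quality is undefined when no math was observed
    # Classrooms with no rating are coded 0 on the dichotomous variable.
    moderate = 1 if (avg_quality is not None and avg_quality >= threshold) else 0
    return {
        "any_math": any_math,
        "avg_quality": avg_quality,
        "moderate_quality": moderate,
    }
```

Because the dichotomous variable is defined for every classroom, it can be compared across the treatment and control groups without conditioning on whether math was observed.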
Treatment Variable
A dummy variable was created to represent assignment to the treatment or comparison condition (Explore = 1; comparison = 0).
Covariates, Moderators, and Descriptive Characteristics
The covariates were entered at the school level (i.e., at the level of assignment to track), as is recommended in the analysis of cluster-randomized trials and as the data were available from administrative records (Raudenbush et al., 2007).
Classroom quality
To control for the variability in sites’ classroom quality, a baseline measure of the Early Childhood Environment Rating Scale–Revised (ECERS-R; Harms et al., 2003) was included as a covariate. The Classroom Assessment Scoring System (CLASS; Pianta, La Paro, & Hamre, 2008; collected within the past 3 years) was included as both a covariate and a moderator. CLASS observations were conducted once every 3 years. The NYC DOE used a modified version of the CLASS protocol in which the number of classrooms observed and the number of cycles per classroom varied depending on the site’s size. All CLASS scores were analyzed and reported at the site level. Each of the three CLASS domains was examined separately as a moderator in analyses for Research Question 2. The Emotional Support domain reflects the extent to which teachers support the classroom’s emotional and social functioning. The Classroom Organization domain reflects processes related to children’s behavior, time, and attention. The Instructional Support domain refers to how teachers encourage higher order thinking and facilitate children’s use of language. In the current study, Cronbach’s alphas were the following: Emotional Support (.84), Classroom Organization (.87), and Instructional Support (.93).
Administrative data
From records, we obtained information on child demographics (site proportion), gender (school proportion), percentage of children at the site who received free/reduced-price lunch, percentage of children at the site who had an IEP, and percentage of children who came from homes where a language other than English was spoken. We controlled for site characteristics that are considered relevant in the NYC context: borough and site type (district school, DOE-NYCEEC, ACS-NYCEEC, Pre-K Center). We used a vector of dichotomous indicators to represent borough and site type, each coded 1 when the site was located in a particular borough or belonged to a particular site type, and 0 otherwise. These covariates predict children’s early cognitive and educational outcomes in other studies, and there is a consensus in the preschool literature that these should be controlled in impact analyses (Clements et al., 2011; Wong et al., 2008).
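As a brief illustration of the indicator coding described above, the sketch below turns a categorical value into a set of 0/1 indicators. The function name, the variable-name prefixes, and the exact category labels are assumptions for illustration; they are not taken from the study's data files.

```python
# Category labels are illustrative spellings of the boroughs and site types
# named in the text.
BOROUGHS = ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"]
SITE_TYPES = ["District School", "DOE-NYCEEC", "ACS-NYCEEC", "Pre-K Center"]


def indicators(value, categories, prefix):
    """Return one 0/1 indicator per category; exactly one is coded 1."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {f"{prefix}_{c}": int(c == value) for c in categories}


# Example: a hypothetical Pre-K Center located in Queens.
site_row = {
    **indicators("Queens", BOROUGHS, "borough"),
    **indicators("Pre-K Center", SITE_TYPES, "sitetype"),
}
```

In a regression with an intercept, one indicator from each set is normally omitted as the reference category to avoid perfect collinearity; the text does not state which categories served as references.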
Analytic Plan
Our analysis relies on a treatment-on-treated design, effectively comparing sites that were offered and accepted a change of PD track assignment with sites that were not offered a change. As such, the design estimates the impact of receiving Explore PD on the “treated” sites. We approach our first research question, about the impact of Explore on the amount and quality of math instruction, with a series of ordinary least squares regression models, with standard error correction (Huber-White) for clustering of classrooms within sites (an approach more commonly used by economists in cluster-randomized trials; Murnane & Willett, 2010). Consistent with consensus in the field, we interpret effects in the 0.10 to 0.30 range as small, effects in the 0.30 to 0.60 range as moderate, and effects in the 0.60 and higher range as large (Hill et al., 2008). For all analyses, we used the Stata statistical software package. For Model 1:
Yi represents the outcome variables of interest at the classroom level, where i indexes classrooms; DEMOGRAPHICS is a vector of child demographics at the school level (percentage of children receiving free/reduced-price lunch, percentage of children with IEPs, percentage of children who speak a language other than English at home, percentage of children of each race/ethnicity (Black, Hispanic, Asian, White, and multirace) enrolled in each school, and gender); SITETYPE is a set of dummy variables that represent the type of school setting (ACS-NYCEEC, DOE-NYCEEC, District School, or Pre-K Center); BOROUGH is a vector of five dummy variables, one for each of the boroughs in which the programs were located; PRE-TEST QUALITY is a vector of baseline quality scores on the CLASS and ECERS; and RX is the intervention status. Baseline variables are limited to the few static demographic and school characteristics available in the DECE database. Differences between the baseline characteristics of the Explore and non-Explore groups are examined to determine whether the assignment process produced comparable groups.
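Read together, the terms defined above imply a linear specification of roughly the following form. This is a reconstruction sketch rather than the authors’ exact equation: the coefficient symbols ($\beta$, $\boldsymbol{\gamma}$) and the error term $\varepsilon_i$ are assumed notation, and the school-level covariates are written with the classroom index $i$ for simplicity.

```latex
Y_i = \beta_0 + \beta_1\,\mathrm{RX}_i
      + \boldsymbol{\gamma}_1^{\top}\,\mathrm{DEMOGRAPHICS}_i
      + \boldsymbol{\gamma}_2^{\top}\,\mathrm{SITETYPE}_i
      + \boldsymbol{\gamma}_3^{\top}\,\mathrm{BOROUGH}_i
      + \boldsymbol{\gamma}_4^{\top}\,\mathrm{PRETEST\ QUALITY}_i
      + \varepsilon_i
```

Under this form, $\beta_1$ is the covariate-adjusted difference between Explore and non-Explore classrooms on the outcome.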
To address the aims of Research Question 2, we separately reestimated our models (described earlier) and added an interaction term, calculated as the product of the baseline score on each of the CLASS subscales and intervention status (Explore = 1, non-Explore = 0).
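A moderation model of the kind described here can be sketched as follows. The notation is assumed rather than taken from the article: $\mathrm{CLASS}^{(d)}_i$ stands for the baseline score on one CLASS domain $d$, $\mathbf{X}_i$ collects the remaining covariates, and the coefficient symbols are illustrative.

```latex
Y_i = \beta_0 + \beta_1\,\mathrm{RX}_i
      + \beta_2\,\mathrm{CLASS}^{(d)}_i
      + \beta_3\,\bigl(\mathrm{RX}_i \times \mathrm{CLASS}^{(d)}_i\bigr)
      + \boldsymbol{\gamma}^{\top}\mathbf{X}_i
      + \varepsilon_i
```

In this form, a nonzero $\beta_3$ indicates that the difference between Explore and non-Explore sites varies with baseline quality on domain $d$. One such model would be estimated per domain, consistent with the statement that the models were reestimated separately.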
Results
We explored the extent to which the natural experiment sample yielded comparable treatment and control groups. We conducted t tests by treatment condition at the site level on the following child- and site-level pretest characteristics: children’s race, gender, site type, percentage of children with IEPs, language other than English spoken at home, free/reduced-price lunch, and baseline program quality. These analyses yielded only one statistically significant difference at the .05 level across 17 variables tested. Comparison sites had significantly higher ECERS scores than intervention sites; notably, however, this difference should bias estimates of treatment impact downward (given that it suggests that control sites were of slightly higher quality). All other tested baseline characteristics were similar across groups (Table 1). This pattern suggests that the assignment to conditions was successful in producing comparable groups for assessing treatment impact. Tables 2 and 3 show the correlations across study variables.
Correlations Between Study Variables (Total Sample)
Note. ECERS = Early Childhood Environment Rating Scale–Revised; BL = baseline; FU = follow-up; FRL = free/reduced-price lunch; LOTE = language other than English spoken at home; IEP = individualized education plan.
p < .10. *p < .05. **p < .01. ***p < .001.
Correlations Between Study Variables Broken Down by Intervention Group
Note. ECERS = Early Childhood Environment Rating Scale–Revised; BL = baseline; FU = follow-up; FRL = free/reduced-price lunch; LOTE = language other than English spoken at home; IEP = individualized education plan.
p < .10. *p < .05. **p < .01. ***p < .001.
What Is the Impact of Explore on the Amount and Quality of Math Instruction?
Table 4 summarizes the impacts on teachers’ math practices at the end of the first year of implementing Explore. We found positive impacts on five out of the six outcome variables. Intervention effects were positive and statistically significant for count of teacher-led math activities (effect size [ES] = 2.73, p < .001), minutes of teacher-led math activities (ES = 1.78, p < .001), % of classrooms with at least one observed teacher-led math activity (ES = 1.58, p < .001), dichotomous moderate math quality score (ES = 0.95, p < .001), and the mathematical classroom culture scale (ES = 2.15, p < .001). The effect sizes were relatively large (with ESs = 0.95–2.73 across measures). In Explore and non-Explore sites, most of the teacher-led math activities were focused on number concepts. The teaching of operations, geometry, patterning, and measurement was at lower levels in both the Explore and non-Explore sites; on average, less than one activity focused on each of these math areas. Explore teachers were observed to deliver statistically significantly more activities than non-Explore teachers within each of the following math content areas (p < .05): number, operations, patterning, and measurement (see Table 5). The ESs ranged from 0.64 to 1.38.
Primary Classroom-Level Impacts on Math Teaching Practices in the Spring
Note. Effect size is calculated by dividing the impact of the program (the difference between the means for the program group and the control group) by the standard deviation for the control group.
a. Category is in contrast to classrooms with a low-quality score or no math activity observed. For each teacher-led math activity observed, quality was calculated by averaging across six items rated on a scale from 1 (low) to 5 (high). The scale assesses the extent to which the teacher explains the math concept underlying an activity, asks open-ended questions, and builds on children’s answers, ideas, and strategies to extend their mathematical thinking. Scores at or above 2 were classified as having moderate to high quality. Classrooms were coded 0, or low-quality math instruction, if they had an average math quality rating below 2 or no quality rating.
b. For classrooms where a teacher-led math activity was observed, the average math activity quality score is calculated by averaging across nine items and then averaging across math activities for the final score; the score ranges from 1 (strongly disagree) to 5 (strongly agree), and assesses the extent to which teachers expanded children’s conceptual understanding of math and extended children’s mathematical thinking. This does not represent a true impact since the number of classrooms where at least one teacher-led math activity was observed was different between program and control groups (71% vs. 34%).
p < .10. *p < .05. **p < .01. ***p < .001.
Classroom-Level Impacts on the Number of Teacher-Led Math Activities and Informal Math Activities in Different Math Content Areas in the Spring
Note. Effect size is calculated by dividing the impact of the program (the difference between the means for the program group and the control group) by the standard deviation for the control group.
p < .10. *p < .05. **p < .01. ***p < .001.
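The effect size definition in the table notes (the program-control difference divided by the control group’s standard deviation) can be sketched as below, together with the interpretation bands stated in the Analytic Plan. This is a simplified illustration: it uses raw group means, whereas the reported impacts come from covariate-adjusted regression models. The function names, the treatment of band boundaries, and the "negligible" label for effects below 0.10 are assumptions.

```python
from statistics import mean, stdev


def effect_size(program, control):
    """Difference in group means divided by the control group's SD."""
    return (mean(program) - mean(control)) / stdev(control)


def label_effect(es):
    """Map an effect size to the bands cited in the text (Hill et al., 2008).

    Boundary handling is an assumption: each cutoff is assigned to the
    larger band. Effects below 0.10 are labeled "negligible", a label that
    does not appear in the source.
    """
    size = abs(es)
    if size >= 0.60:
        return "large"
    if size >= 0.30:
        return "moderate"
    if size >= 0.10:
        return "small"
    return "negligible"
```

Dividing by the control group’s standard deviation, rather than a pooled one, keeps the yardstick unaffected by any change in variability that the program itself may have produced.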
Does Baseline Classroom Quality Score Moderate the Effects of the Intervention on the Math Practices?
Only two interactions were statistically significant (see Table 6). An interaction was detected between baseline classroom organization and intervention status on the count of math activities (b = 0.28, p < .01). This suggested that Explore sites conducted more teacher-led math activities than non-Explore sites at the end of the first year of implementation regardless of baseline classroom organization, but that the advantage was larger for sites with higher baseline classroom organization (see Figure 3). Inspection of the simple slopes revealed that, for sites with high classroom organization (1 SD above the mean), the average number of teacher-led math activities for those assigned to Explore was 3.44 SD higher than for those in the non-Explore condition. For sites with low classroom organization (1 SD below the mean), the average number of teacher-led math activities for those assigned to Explore was 2.93 SD higher than for those in the non-Explore condition.
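The simple-slopes logic can be sketched in code. In a model with a treatment-by-moderator interaction, the treatment effect at a given moderator value is the treatment coefficient plus the interaction coefficient times that value. The coefficients below are hypothetical placeholders, not the study’s estimates, and the moderator is assumed to be standardized so that +1 and -1 correspond to 1 SD above and below the mean.

```python
def conditional_effect(b_treat, b_interaction, moderator_value):
    """Treatment effect at a given value of the moderator."""
    return b_treat + b_interaction * moderator_value


# Hypothetical coefficients chosen only to show the arithmetic.
B_TREAT = 1.5         # treatment effect when the moderator is at its mean (0)
B_INTERACTION = 0.25  # change in the treatment effect per 1 SD of the moderator

effect_high = conditional_effect(B_TREAT, B_INTERACTION, +1.0)  # 1 SD above the mean
effect_low = conditional_effect(B_TREAT, B_INTERACTION, -1.0)   # 1 SD below the mean
```

With a positive interaction coefficient, both conditional effects can be positive while the effect at high moderator values is the larger of the two, which is the pattern reported for classroom organization.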
Interaction Results
Note. All of these analyses control for baseline classroom quality, child demographics, child gender, % of children who receive free/reduced-price lunch, % of children with an individualized education plan, % of children from homes that speak a language other than English, borough, and site type.
p < .10. *p < .05. **p < .01. ***p < .001.

Interaction effects for the number of teacher-led math activities and baseline classroom organization.
A second interaction was detected between baseline emotional support and intervention status on the mathematical classroom culture score (b = 0.83, p < .01). This suggested that baseline emotional support moderated the relation between intervention status and mathematical classroom culture. Figure 4 illustrates that receiving the Explore PD was significantly related to higher mathematical classroom culture among sites with higher baseline emotional support (0.46 SD higher than those in the non-Explore condition). In contrast, for sites with lower emotional support, receiving the Explore intervention did not significantly predict the mathematical classroom culture scores (simple slope = ns).

Interaction effects for the mathematical classroom culture and baseline emotional support.
Discussion
This study investigated the impacts of an at-scale PD program on preschool teachers’ practices. We examined such impacts within the context of a natural experiment, with PD implemented under real-world conditions (Institute of Education Sciences & National Science Foundation, 2013). The advantage of this approach is that we determined whether an at-scale, district-sponsored PD, as authentically implemented, resulted in the intended outcomes. The results are essential to consider, within the context of PD at scale, given that the PD model was developed following research-based recommendations for effective PD and given the similar investments (e.g., time, financial) currently being made in PD programs across the country.
We found impacts on the number of minutes (13.26 more minutes in Explore) and the count of math activities (1.55 more math activities in Explore), which were substantially larger than those seen in previous studies of BB, where the program group typically spent 2 to 5 more minutes on math instruction than the control group (Clements & Sarama, 2008; Morris et al., 2016). The magnitude of our impacts is comparable to that found in the MPC study, which took place in NYC in 2013–2015 and utilized a similar observation procedure. However, the mean number of activities and minutes of math were lower for both the Explore and non-Explore groups (2.53 activities, 19.67 minutes in Explore; 0.98 activities, 6.09 minutes in non-Explore) compared to the intervention and control groups in MPC (5.94 activities, 46.80 minutes in the treatment group; 4.37 activities, 34.85 minutes in the control group). Our findings suggest that the PD supports did not reach the level of implementation seen in an effectiveness trial within the same school system at scale.
Nonetheless, the size of the impacts is impressive, considering the Explore track had fewer workshops, fewer coaching sessions, and lower attendance rates than other studies of BB. For instance, participating teachers in the MPC project received 2 more training days: 6 days of training in MPC versus 4 days in this study. The dosage for in-classroom coaching was much more substantial in the MPC project (3-hour, in-classroom coaching every week) compared to this study (1-hour, in-classroom coaching once a month). Moreover, teacher attendance at the Explore PD sessions was low: Only 18% of Explore sites had the expected number of teachers attending all four PL sessions. This attendance rate was much lower than in prior efficacy studies of BB; in the MPC study, teacher attendance at training sessions was high (87%, on average). Such less-than-ideal PD attendance is similar to what has been seen in other studies and, perhaps, unsurprising for a PD program at scale (Piasta et al., 2017).
The results (including the robustness checks) suggest that Explore’s impact on the quality of instruction was mixed. Although teachers in Explore sites provided slightly higher quality math instruction than teachers in non-Explore sites, the differences were not always statistically significant. The lack of consistent statistically significant impacts on the quality scores suggests that the difference in observed math quality may be driven by the difference in the presence of math instruction across the two groups of classrooms. However, in both groups, the degree to which teachers consistently used high-quality instructional strategies during math activities was relatively low overall (below a rating of 2), meaning that teachers employed these strategies only some of the time.
It is likely that these limited effects on quality are driven by the fact that improving early math instruction can be difficult for teachers (Lee & Ginsburg, 2009); it requires teachers to know the content, understand children’s thinking, engage in pedagogical practices that support learning, and see themselves as capable math teachers (Lee et al., 2009). Most PD studies do not assess impacts until the second year of implementation to allow teachers to learn and immerse themselves in the first year’s curriculum (e.g., Morris et al., 2014). This article reports the results after the first year of a 2-year intervention; thus, we suspect we might not see impacts on math quality until the end of the second year of implementation.
Nonetheless, Explore had a positive and statistically significant impact on the classroom culture score, which captures teachers’ general approach to mathematics education. Explore’s impact on the classroom culture score suggests that the program successfully altered teachers’ beliefs and dispositions beyond specific curriculum practices. Furthermore, a previous study of BB found that mathematical classroom culture mediated the relationship between receipt of the intervention and child outcomes (Clements et al., 2011). This is consistent with the literature supporting the connection between academic performance and general classroom features, including signs of mathematical activity, teachers who are knowledgeable and enthusiastic about mathematics, and teachers who frequently interact with and respond to children (Clarke & Clarke, 2004; Clements & Sarama, 2007).
The robustness results (see Appendix B) build further confidence that any observed baseline nonequivalence in the demographic composition and baseline quality measures of Explore and non-Explore sites is unlikely to be biasing the estimated impacts of Explore on teacher-led math practices as reported. Analyzing Explore’s impact within a public school subsample had no appreciable effect on the pattern, magnitude, or significance of the observed findings. When we restricted the sample to sites with overlapping preference sets and to those that ranked Explore as “1,” we found comparable results for the amount of math instruction but mixed results on the quality of math instruction.
Classroom Quality as a Moderator
Baseline classroom instructional support did not moderate the relation between Explore and the amount and quality of math instruction. This was unexpected because the COEMET assesses the degree to which teachers use such instructional strategies as (1) asking open-ended questions, (2) formally extending children’s math learning, and (3) explaining the math concept during activities, all of which fall within the CLASS instructional support domain. We suspect this could be due to instructional support in this sample, as in other samples across the United States, being relatively low (Burchinal, 2018).
The degree to which Explore affected the count of math activities depended on the level of classroom management before Explore. Implementing Explore, regardless of sites’ baseline classroom management, increased the number of math activities compared to non-Explore sites. However, sites with high classroom management were able to conduct more math activities. This suggests that, perhaps, sites with higher classroom management skills before Explore were better equipped to implement it. That is, teachers in well-organized classrooms that provided a structured environment where expectations and routines were delineated were able to implement more math activities (Bulotsky-Shearer et al., 2014). The BB curriculum is structured around weekly lesson plans consisting of four main instructional components: Whole Group, Small Group, Hands-On Math Centers, and Computer (Clements & Sarama, 2007). A higher degree of classroom organization likely facilitated teachers’ ability to set up and implement multiple BB components throughout the preschool day.
Receiving the Explore PD was significantly related to higher mathematical classroom culture among sites with higher baseline emotional support. The mathematical classroom culture included items about the environment (e.g., math signs), teacher-child interactions (e.g., teachers actively interacting), and personal attributes of the teacher (e.g., showed curiosity about math ideas). Potentially, teachers who were more attuned to their children’s needs were able to create a classroom that made math fun and to engage children positively in math activities. Improving the mathematical classroom culture is vital because it represents not adherence to the curriculum but rather the development of an environment that infuses math at every opportunity (Clements & Sarama, 2007). These moderation findings suggest that teachers with higher emotional support and classroom organization are better equipped to implement a math-specific PD program.
Limitations
Though this study is marked by numerous strengths, including the observations of classroom quality, rigorous design, and focus on a program at scale, there are several limitations. Most important, random assignment was not used as the method for allocating sites to tracks. Because we are relying on a natural experiment, we conducted several robustness checks to adjust for selection bias, but there is still the potential that our results are not internally valid. Similarly, this study’s results may have limited generalizability to the broader set of sites across NYC’s PKA system. Finally, we conducted our data collection and analyses, and made inferences, at the site level. Because sites were assigned to PD track at the site level and because of internal DECE-DOE processes for collecting data, we could not account for the multilevel nature of the sites, in which teachers and classrooms are clustered. In particular, we have observational data from only one classroom per site, which could lead to an over- or underestimation of effects.
Conclusion
Our results provide further evidence of the ability of a comprehensive PD program to improve teachers’ implementation of math practices, despite the dearth of math instruction and preschool teachers’ reported fear of math (e.g., Lee & Ginsburg, 2009). Our findings parallel other PD studies at scale (albeit mostly language- and literacy-focused PD) concerning the dosage of PD offered and impacts on teacher practice (e.g., Piasta et al., 2017; Weiland & Yoshikawa, 2013). Since child outcomes were not measured in this study, we do not know whether the Explore PD made a substantial enough impact on teacher outcomes to yield effects at the child level. Nonetheless, when interpreted within the extant literature, our findings raise some important questions regarding the field’s approach to PD that have yet to be addressed. Specifically, the current PD system was designed to reflect recommendations for effective PD at scale (Hamre et al., 2017), yet our findings and others in the PD literature suggest that adhering to these general principles may not be sufficient to improve the quality of teacher practices (e.g., Piasta et al., 2017). More research is needed to understand the necessary infrastructure, resources, and other supports to improve the quality of teacher practice.
Appendix A
Priority Groups for Assigning Sites to PD Tracks
| Priority group | How site ranked PD track | Recommended for PD track |
|---|---|---|
| 1st priority | 1st choice | Yes |
| 2nd priority | 2nd choice | Yes |
| 3rd priority | 3rd choice | Yes |
| 4th priority | 1st choice | No |
| 5th priority | 2nd choice | No |
| 6th priority | 3rd choice | No |
Note. PD = professional development.
Appendix B
Acknowledgements
Thank you to the following colleagues for their support in collecting data: Damaris Rothe, Natalie Spitzer, and Parisa Spitaleri. Finally, we would like to thank the children, teachers, and directors who participated and made this research possible.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by grants from the Spencer Foundation.
Authors
NATALIA M. ROJAS is a National Science Foundation postdoctoral research fellow at New York University School of Medicine. She studies system-level supports for classroom quality with a specific interest in dual language learners.
PAMELA MORRIS is a professor of applied psychology at New York University’s Steinhardt School of Culture, Education, and Human Development. Having spent a decade in policy research at MDRC before joining the faculty at New York University, Dr. Morris has spent two decades working at the intersection of social policy, practice, developmental psychology, and education.
AMUDHA BALARAMAN is the director of research and evaluation at the New York City Department of Education. She is interested in increasing school readiness among children.
