Sage Journals: Discover world-class research

Abstract

Only limited evidence exists on how to best make the effective yet broad concept of formative assessment (FA) accessible to teachers. Thus, this study investigated the effects of two differently structured FA approaches (curriculum-embedded assessment [CE] vs. planned-for-interaction assessment [PI]) on implementation outcomes over time. A total of N = 118 mathematics teachers participated in a six-session professional development program offered over one school year and implemented one of the two approaches in their classrooms. Implementation success was assessed via teachers’ self-reports. Hierarchical linear models for repeated measurement revealed higher ratings for CE for the initial phase of implementation. Over time, differences between the approaches decreased for feasibility and cooperation, but remained quite constant in terms of acceptability, fidelity and perceived learning outcome. The approaches did not significantly differ regarding sustainability. Thus, for implementing FA, it seems worthwhile to provide teachers with clear guidelines and an explicit structure.

Keywords

assessment evaluations mathematics teacher education professional development elementary teacher education implementation formative assessment

Introduction

Educational research has repeatedly proven that several theoretical concepts have the potential to improve teaching and learning (Hattie, 2023; Slavin et al., 2021). One of the most prominent theoretical frameworks for evidence-based innovations is formative assessment (FA), which aims to provide assessment-based, individualized teaching (Lee et al., 2020). Even though the general structure of FA—which involves cycles of assessment, feedback and tailored instruction—seems simple, reviews on FA have concluded that although positive effects are well documented, implementation in the classroom is challenging, and teachers need support in translating the broad concept of FA into instructional action (Klute et al., 2017; Lane et al., 2019; Lee et al., 2020). However, little is known about the most effective ways to make this broad concept accessible to teachers. In particular, to our knowledge, no direct comparison of different FA approaches has been conducted yet that would provide valid insights into how they can best be designed for meaningful and sustainable implementation.

Within the different conceptualizations of FA (Bennett, 2011; Shavelson, 2006; van der Kleij et al., 2015), a key aspect that differs between FA approaches is the degree of formalization and structure of the FA cycle. For example, Shavelson (2006) differentiates between the clearly structured approach of curriculum-embedded assessment (CE) and the more flexible concept of planned-for-interaction assessment (PI). As such, introducing these differently structured FA approaches to teachers might offer specific advantages: Whereas highly structured approaches with clear guidelines for classroom activities could help teachers implement FA initially (Andersson & Palm, 2018; Lane et al., 2019), approaches that are easier to adapt and more flexible to use may be better suited for long-term implementation of FA (Goertzen et al., 2023; Yin et al., 2008).

Thus, this study compared, from the teachers’ perspective, the short- and long-term implementation success of two differently structured FA approaches. To this end, over the course of one school year, N = 118 first- and second-grade teachers took part in a professional development (PD) program aimed at preparing them to implement FA. In a quasi-experimental design, the two conditions only differed with respect to the FA approaches’ degree of structure and formalization and, accordingly, the material provided.

Different Approaches to FA

FA is an effective framework for promoting school learning (Klute et al., 2017; Lane et al., 2019; Lee et al., 2020). The main idea of FA is to use assessment information about a student’s learning level and progress to subsequently inform future decisions in the learning process in terms of individualized feedback and adaptive teaching (Wiliam & Thompson, 2008). Meta-analyses (Klute et al., 2017; Lane et al., 2019; Lee et al., 2020) suggest that the effectiveness of FA strongly depends on how teachers implement it in their day-to-day teaching.

To evaluate how teachers can best be supported in implementing FA successfully and sustainably in their classes, we find it reasonable to differentiate FA approaches according to their degree of formalization and structure. As such, we drew on the classification proposed by Shavelson (2006), who described three anchor points of structuring along a continuum from highly structured to spontaneous. The most structured anchor point with the highest level of prior planning is CE, where assessments are used to elicit “goal-directed teachable moments” (Shavelson et al., 2008, p. 301). While different forms of implementation exist, the main idea is to integrate assessments into the curriculum at prespecified junctures of knowledge acquisition (Heritage, 2007; Shavelson, 2006). The second anchor point, namely PI, is slightly more flexible, as it primarily takes place during the regular course of instruction. Here, teachers deliberately plan which questions to ask during classroom discussions that will best contribute to obtaining valuable information about a student’s learning level (Heritage, 2007; Shavelson, 2006). The third point is described as spontaneous on-the-fly assessment; this, however, is difficult to evaluate due to its unsystematic application.

Measuring Implementation Success From the Teacher’s Perspective

To evaluate the implementation of the differently structured approaches, we oriented toward Proctor et al.’s (2011) conceptual model of implementation research that was adapted for educational contexts by Baffsky et al. (2022). This model suggests that crucial implementation variables include cooperation, feasibility, acceptability, fidelity, sustainability, and perceived learning outcome, which is supported by empirical evidence. For instance, studies have shown that collaboration and cooperation among colleagues constitute a key strategy for successful and sustainable implementation processes (Andersson & Palm, 2018; Hebbecker & Souvignier, 2018; Prenger et al., 2022). Similarly, the acceptability of an innovation is closely linked to teachers’ intentions to implement it (Renko et al., 2020), the frequency and quality of implementation (Hondrich et al., 2016) and actual innovation use (Lakin & Shannon, 2015). Furthermore, higher perceptions of feasibility are associated with increased innovation use (Andersson & Palm, 2018; Hebbecker et al., 2022; State et al., 2017). Fidelity of implementation, however, is related to better student outcomes (Crawford et al., 2012; Pas et al., 2019) and discussed as an explanation for varying or absent effects on student outcomes in the context of FA (Boström & Palm, 2023; Yin et al., 2008).

Implementing FA Into School Practice Needs Support for Teachers

Successfully and sustainably implementing FA into the classroom means that teachers must acquire the theoretical concept, put it into practice, and then reflect on their experiences (Figure 1). In this context, theoretical models of teacher professional growth emphasize that, fundamentally, teachers need to have positive experiences when applying new concepts (Clarke & Hollingsworth, 2002). Furthermore, the long-term goal of implementing a new concept is for teachers to gradually internalize it—adopting and applying it competently and flexibly, while considering the specific teaching context (e.g., Duffy, 1993).

Figure 1.

Illustration of the Implementation Process

Research suggests that for successfully implementing FA, teachers should be provided with support, such as accompanying PD and prepared materials (Andersson & Palm, 2018; Heitink et al., 2016; Lane et al., 2019; Lee et al., 2020; Yin et al., 2008). To ensure teachers feel confident and able to put FA into practice, PD activities should be designed such they can be immediately applied in the classroom while they explain the underlying theoretical mechanisms of FA (Andersson & Palm, 2018). Giving teachers scaffolding in the form of concrete classroom practices, such as ready-to-use material, can encourage teachers to adopt new behaviors (Heitink et al., 2016; Lane et al., 2019; Yin et al., 2008).

However, it lacks evidence of how formalized FA cycles should be to best make the concept accessible to teachers. While highly structured approaches that tightly specify processes and procedures in class could facilitate implementation by making teachers feel competent to carry out FA (Andersson & Palm, 2018; Lane et al., 2019), less structured approaches may allow teachers to adapt the concepts to their own routines and reflect on the core principles of FA, eventually leading to a more intense internalization of the concept (Goertzen et al., 2023; Yin et al., 2008). When studying the implementation success of these approaches, teachers’ perceptions are likely to be influenced by the state of implementation (whether FA is new or familiar to them) and the extent to which they have internalized the concept, making a process perspective particularly relevant.

Expected Short-Term Effects

Focusing on initial implementation processes, an intervention’s higher degree of structure should have a facilitating effect. For instance, innovations that are less complex, easy to handle, and offer a high degree of clarity are more likely to be implemented (Schrader et al., 2020). Furthermore, the less time an intervention requires, the more acceptable and feasible (State et al., 2017) teachers perceive it to be. Focusing on FA elements, it has been shown that, compared with less structured support and feedback materials, a rather structured, online-based diagnostic tool led to comparable levels of acceptability ratings (Ehlert et al., 2022; Hebbecker et al., 2022) but higher feasibility ratings (Hebbecker et al., 2022) as well as greater innovation use (Hebbecker & Souvignier, 2018). Providing ready-to-use materials that can be immediately implemented in the classroom can also help improve implementation fidelity (Hondrich et al., 2016). Likewise, teachers were found to highly value concrete FA teaching activities that made them feel confident. Consequently, activities that were perceived as viable and less costly were more frequently implemented (Andersson & Palm, 2018). Thus, highly structured concepts comprising specific guidelines for realizing FA in the classroom may help teachers act confident in the classroom and feel secure in adopting FA (Andersson & Palm, 2018; Lane et al., 2019).

Expected Long-Term Effects

The ultimate goal of such an implementation effort is to have teachers use an innovation over the long term (Petermann, 2014; Proctor et al., 2011). Sustainable change appears to be challenging, especially in the context of FA (Lane et al., 2019; Lee et al., 2020). From this view, an FA that is more flexible, such as the PI approach, might pay off. An innovation’s adaptability and flexibility are considered key facilitating factors for successful implementation (Schrader et al., 2020). In the context of innovations and school development, the high degree of autonomy and, with greater practice, possibly also an increased experience of competence, appear to be particularly significant (Nguyen et al., 2021; Schellenbach-Zell & Gräsel, 2010). Furthermore, an effective long-term use of FA may benefit from teachers reflecting on the fit between the proposed concept and their own routines, adapting practices flexibly to the respective circumstances at school and ultimately internalizing FA (Bennett, 2011; Goertzen et al., 2023; Yin et al., 2008). Thus, teachers may continue to use FA practices that are presented as more flexible.

The Present Study

Our goal was to examine the effects of two differently structured FA approaches on different outcome variables over the implementation period of one school year. More specifically, we compared the effects of the more structured curriculum-embedded (CE) assessment approach and the less structured planned-for-interaction (PI) assessment approach at the teacher level.

Teachers in both groups implemented one of the two FA approaches in their mathematics classrooms. Both approaches were based on “learning progressions” (Duncan & Rivet, 2018; Furtak, 2012) for primary school mathematics education. These learning progressions represent typical pathways that students usually undergo when acquiring basic mathematical content, and they represent the groundwork upon which students’ competencies were systematically assessed. To effectively prepare teachers for translating the FA cycle into concrete action, they participated in a PD program on assessment-based, individualized mathematics instruction covering six sessions throughout the school year. While this PD program was parallel in both groups, the approaches differed regarding the formalization of the FA cycle. Specifically, in the CE group, the classroom implementation was tightly guided (using prepared, written task sets for assessment that were identical for all students), whereas in the PI group, teachers were given more flexibility in realizing the concept in their mathematics classrooms (holding planned, focused conversations for assessment with students they selected).

Teachers rated the implementation success at six measurement points (for the variables acceptability, feasibility) and five measurement points (for the variables fidelity, cooperation, perceived learning outcome) throughout the school year, and at the end of the year, they also rated the sustainability of the implementation (Figure 2).

Figure 2.

Study Design and Measurement Time Points

Research Question and Hypotheses

This study examines how two differently structured FA approaches (CE vs. PI) affect different implementation outcomes over the implementation process of one school year. The following hypotheses were posed:

Hypothesis 1: For the initial phase of implementation, we expected that effects for all implementation variables would be higher for the more structured CE approach compared with the less structured PI approach. We defined the initial phase as the two first points of measurement (acceptability and feasibility at T1 and T2 and all other variables at T2 and T3).

Hypothesis 2: We expected that the effects of the two approaches on the implementation variables would equalize over time, that is, that teachers in PI would eventually rate the implementation variables more positively. To analyze this change over time, we focused on all points of measurement after the initial phase (acceptability and feasibility at T3–T6 and all other variables at T4–T6).

Hypothesis 3: As a result of Hypothesis 2, we expected no differences for sustainability ratings between CE and PI at the end of the school year (T6).

Method

Sample

Initially, 125 first- and second-grade teachers from public schools in both urban and rural areas in mid-western Germany participated in the study. Only data from teachers who attended at least two PD sessions were analyzed. Therefore, the final sample consisted of N = 118 teachers from 37 schools. Teachers were on average M = 40.95 years old (SD_Age = 10.06, Min_Age = 25, Max_Age = 62), had an average teaching experience of M = 13.14 years (SD_{teaching experience}= 9.34, Min_{teaching experience} = 0, Max_{teaching experience} = 36), and were predominantly female (95.6%). Of the sample, 97.5% had studied elementary education with 87.3% having studied mathematics as a subject; 2.5% (N = 3) were social educationalists. In total, 39.0% of the participants taught in the first grade, 48.3% in the second grade, 5.1% in inter-year grades, and 7.6% could not be assigned to a specific grade. An overview of the sociodemographic data, separated by groups, as well as a comprehensive classification of our research methodology according to Saunders et al. (2009) can be found in the Supplemental Material in the Open Science Framework (https://osf.io/5fd7h/).

Design

The study design was a quasi-experimental intervention study with repeated measurements (Figure 2). The recruitment of participants was carried out in close cooperation with two district governments. These governments informed the school administrations in their respective districts about the project and informed us which schools would be participating. Guidelines regarding the number of participating teachers were established in advance, ensuring an equal number of teachers participated across both district governments. The selection of teachers was the responsibility of the school administrations, with the restriction that the teachers had to be teaching a first- or second-grade class during the implementation period or worked as social educationalists. Subsequently, teachers were randomly assigned at the school level to one of the two conditions (CE vs. PI). To increase practicability and avoid teachers exchanging ideas and materials of the different approaches, all teachers at one school were in the same condition (CE: 57 teachers from 18 schools, PI: 61 teachers from 19 schools, average number of teachers per school: 2.97). Teachers of both conditions participated in a six-session PD program lasting throughout the school year. Between the PD sessions, they implemented the respective FA approach in their classes (see Treatment section for details).

First, at T1, teachers anticipated the acceptability and feasibility of the FA. Then, at the beginning of each PD session, they evaluated the implementation of their FA in the recent practice phase (T2–T6). In that way, we continuously measured implementation success at five or six measurement points (five measurement points for fidelity, cooperation, perceived learning outcome; six points for acceptability and feasibility). At the last measurement point, teachers indicated how sustainable they thought it would be to continue implementing the FA in their classes.

Treatment

Teachers were supported in implementing FA in their classes via the PD program and prepared material. Between the PD sessions, they implemented the respective FA approach (CE: using written task sets for assessment, PI: conducting focused conversations for assessment). While the PD sessions were mostly parallel, the FA approaches differed in their structure and materials provided. Whereas the CE approach was highly structured and directive in how to realize FA in the classroom, the PI approach provided teachers more flexibility regarding the design of the classroom activities. A summary of the FA approaches’ similarities and differences is provided in the appendix.

PD Program

Throughout the school year, teachers of both groups participated in a PD program (six sessions, total contact time: 24 hr). While Sessions 1, 2, and 6 were held face-to-face; Sessions 3, 4, and 5 were realized in a digital format due to the COVID-19 pandemic. For organizational and geographical reasons, two identical PD trainings were held for each condition. Trainings were given by tandems of trained mathematics educationalists. The PD sessions focused on two main aspects: the mathematical content, covering arithmetic topics for the first and second school grade (understanding numbers, understanding operations, process-related competencies), and the content essential to implementing FA (e.g., providing assessment-based differentiated instruction, adapting tasks, providing feedback). In both groups, activities alternated between case-based inputs, practice phases, and reflection phases.

Teachers were also introduced to the assessment concept they were to implement in their classes. In both groups, teachers prepared for implementing the assessment methods in their classrooms and reflected on the suitability and possible adjustments they could make in the next session. For this, teachers in CE were trained in selecting, adapting and using different assessment tasks and interpreting written documents to systematically assess all students’ learning levels, whereas teachers in PI learned how to flexibly conduct focused conversations for assessment, including reflecting on a specific tasks’ diagnostic value, anticipating student responses and developing further prompts.

Material

To support teachers in implementing FA, they were given material to assess their students’ competencies and tailor their instruction to students’ needs. Teachers in CE received prepared written task sets for assessment. These task sets consisted of tasks covering different competence domains (e.g., recognizing numbers on the number line) within one key topic (e.g., understanding numbers). Different task sets existed for first grade (number range of up to 20) and second grade (number range of up to 100). Task sets were compiled and adjusted by the teacher and then equally delivered to all students at the same time. The teacher then evaluated the results using a prepared evaluation table and competency grid.

In contrast, teachers in PI were given a set of cards for assessment to support them in conducting focused conversations for assessment. Each card comprised a list of content-related competencies the students should have, a task of high diagnostic value to start the conversation and suggestions for valuable observations and specific prompts the teacher could give. Sets of cards were available for the same key topics and competence domains as in CE. Teachers planned the conversations in advance. In class, all students could work on the same basic task while the teachers entered into deeper conversations with individual students or a group of students they selected.

Teachers in both groups used the assessment results for adaptive instruction. To provide instruction that was closely aligned with each student’s achievement level, teachers were provided with support suggestions (the same for both groups) that closely matched the assessed competence domains. Teachers could decide on the support setting: Most teachers integrated the suggestions into their regular math classes; that is, based on the assessed competencies, they selected support activities that were applicable for small groups or the whole class. If organizational conditions allowed, teachers also used the support suggestions in remedial teaching, especially with low-achieving students.

Measures

To examine the effects of the different FA approaches at the teacher level over the course of the school year, we used self-reports in which teachers evaluated their FA implementation in the recent practice phase (T2–T6). Acceptability and feasibility were additionally rated at T1 in an anticipatory fashion, such that teachers had been introduced to the FA concept and the material but had not yet tried out the concept in class. Sustainability was rated at T6. All items were rated on a 6-point Likert-type scale ranging from 1 = “does not apply at all” to 6 = “applies exactly.” Apart from small linguistic adjustments at T1, items were identical at all measurement points. Items were adapted for mathematics instructions and often referred to the concept of providing assessment-based, differentiated mathematics instruction using the provided material. In both groups, this was referred to as the “FÖDIMA concept,” regardless of the FA approach the teachers were implementing. By using this broad term as well as encouraging teachers to refer to the recent practice phase, we assume that the items reflect both teachers’ comprehensive evaluation of the concept and their specific reaction to the particular content and material. Because teachers completed the surveys during the PD sessions, we decided to use short scales. An overview of the instruments is displayed in Table 1.

Table 1.

Overview of the Instruments.

Scale	No. of items	α	Time points	Sample item
Acceptability (Meudt et al., 2020)	3	.86 to .96	T1–T6	I like the approach of the “FÖDIMA concept.”
Feasibility (Meudt et al., 2020)	4	.84 to .89	T1–T6	The necessary preparations (copies, etc.) for implementing the “FÖDIMA concept” were feasible.
Fidelity (own scale)	4	.76 to .91	T2–T6	I used the assessment results to provide differentiated support.
Cooperation (Meudt et al., 2020)	3	.78 to .90	T2–T6	The collegial discussions on the implementation of the “FÖDIMA concept” worked smoothly.
Perceived learning outcome (Meudt et al., 2020)	3	.78 to .89	T2–T6	Working with the “FÖDIMA concept” has led to great improvement in the students’ achievement.
Sustainability (own scale, adapted from Hebbecker & Souvignier, 2018)	3	.90	T6	I will continue implementing the “FÖDIMA concept” in the next school year.

Data Analysis

For Hypotheses 1 and 2, a hierarchical linear model for repeated measures (with measurement time points nested in persons) was estimated for each outcome variable using the lme4 package (Bates et al., 2015) in R. The treatment variable was dummy coded with the CE group, which represented the reference category (0 = CE, 1 = PI). We specified contrast effects in which the mean values at each measurement point were compared with the baseline. Therefore, measurement time points were also dummy coded using the first measurement point as reference category. Group differences were tested including the group variable (T1/T2) and the cross-level interactions between treatment variable and measurement time points (T2-T6/T3-T6) in the model. Random intercepts were specified. The model in variable notation (exemplary for acceptability) accordingly was as follows:

\begin{array}{l} Y_{i j} = γ_{00} + γ_{10} x D_{1 i} + γ_{20} x D_{2 i} + γ_{30} x D_{3 i} + \\ γ_{40} x D_{4 i} + γ_{50} x D_{5 i} + γ_{01} x G_{j} + γ_{11} x D_{1 i} x G_{j} + \\ γ_{21} x D_{2 i} x G_{j} + γ_{31} x D_{3 i} x G_{j} + \\ γ_{41} x D_{4 i} x G_{j} + γ_{51} x D_{5 i} x G_{j} + \\ μ_{0 j} + ε_{i j} \end{array}

Regression weights at baseline (γ₀₀) refer to the CE teachers’ responses at the first measurement point. The regression weights T2 to T6 (γ₁₀ – γ₅₀) or T3 to T6 (γ₁₀ – γ₄₀), respectively, refer to the difference in the respective implementation variables compared with the baseline in CE for the specific measurement points. “Group” variables (γ₀₁) refer to the difference between CE and PI at baseline. All interaction variables (γ₁₁ – γ₅₁) refer to the difference in change between CE and PI at a specific measurement point. Thus, regression weights for change in PI can be obtained by adding the weights of CE (γ₁₀ – γ₅₀) and the weights of PI for the matching interaction variables (γ₁₁ – γ₅₁).

For sustainability (H₃), a Wilcoxon rank-sum test was conducted, as the data were non-normally distributed (as indicated by a significant p value <.001 in the Shapiro–Wilk normality test).

A substantial amount of missing data emerged for all outcome variables at all measurement points (acceptability: T1: 11.9%, T2: 28.8%, T3: 39.8%, T4: 39.8%, T5: 49.2%, T6: 31.4%; feasibility: T1: 11.9%, T2: 28.0%, T3: 40.7%, T4: 40.7%, T5: 49.2%, T6: 31.4%; fidelity: T2: 29.7%, T3: 42.4%, T4: 39.8%, T5: 50.0%, T6: 32.2%; cooperation: T2: 31.4%, T3: 40.7%, T4: 39.8%, T5: 50.0%, T6: 32.2%; perceived learning outcome: T2: 39.8%, T3: 43.2%, T4: 42.4%, T5: 50.8%, T6: 33.1%, sustainability: T6: 33.1%). As missing data could have arisen due to random factors, such as teachers being ill or participating in other school events, we decided to use latent factor scores for scale building employing the full information maximum likelihood (FIML) procedure. Latent factor scores were grand-mean z-standardized. Due to estimation difficulties, factor scores were built based on mean values (implying all items contribute equally to the respective factor score) and restricting all intercepts to zero.¹ Factor determinacy indices (FDIs) were adequate for research purposes (Gorsuch, 1983). However, to test the results’ robustness given the considerable estimation difficulties, we conducted the analyses with different samples; that is, we analyzed data from teachers that attended at least three or four PD sessions. Apart from slight differences in p-values and small deviations regarding perceived learning outcome, results were stable. We provide the results for the additional analyses in the Supplemental Material.

We report Hedges’s g $\begin{array}{l} [{Cohen}^{'} s d x (1 - \frac{3}{4 (n 1 + n 2) - 9})] \end{array}$ as a measure of effect size for group differences and gav $(\frac{M d i f f}{\frac{S D 1 + S D 2}{2}})$ when calculating effect sizes for repeated measurement² (Lakens, 2013). Effects are interpreted according to Cohen (1988)³: small: g = .20, moderate: g = .50, large: g = .80. For Hypotheses 1 and 2, we applied one-tailed testing, and for Hypothesis 3 we applied two-tailed testing (α = .05). Due to the multiple comparisons conducted, the α-level was adjusted according to Bonferroni. For the five group comparisons at baseline (γ₀₁), we thus tested at α = .01. The same level applied to the comparisons regarding the development of CE (γ₁₀ – γ₅₀) and interaction terms (γ₁₁ – γ₅₁) for acceptability and feasibility (five comparisons each). For fidelity, cooperation and perceived learning outcome, we tested at α = .0125 (γ₁₀ – γ₄₀ and γ₁₁ – γ₄₁, four comparisons each).

Results

Mean values and standard deviations per group for all outcome variables are displayed in Table 2. An illustration of the ratings for each implementation variable over time for CE versus PI is provided in Figure 3.

Table 2.

Means and Standard Deviations for All Implementation Variables.

	CE				PI
	N	M	SD	g_av [95% CI]	N	M	SD	g_av [95% CI]	g_group [95% CI]
Acceptability
T1	52	5.37	0.62		52	4.53	0.92		-1.06 [-1.46, -.65]
T2	47	4.86	0.92	-.66 [-.73, -.29]	37	4.41	1.16	-.11 [-.43, .19]	-.43 [-.86, .01]
T3	35	5.17	0.88	-.27 [-.43, .03]	36	4.32	0.94	-.22 [-.49, .07]	-.92 [-1.40, -.43]
T4	35	4.85	0.95	-.66 [-.76, -.28]	36	4.12	0.99	-.43 [-.70, -.12]	-.74 [-1.21, -.26]
T5	30	4.84	1.09	-.62 [-.81, -.25]	30	3.68	1.20	-.80 [-1.18, -.52,]	-1.01 [-1.54, -.47]
T6	41	4.99	0.98	-.47 [-.62, -.14]	40	4.38	1.10	-.15 [-.45, .15]	-.58 [-1.02, -.14]
Feasibility
T1	52	4.77	0.95		52	3.81	1.04		-.96 [-1.36, -.55]
T2	47	4.32	1.13	-.43 [-.74, -.16]	38	3.15	1.19	-.59 [-.99, -.33]	-1.00 [-1.45, -.55]
T3	34	3.89	1.03	-.88 [-1.18, -.58]	36	3.22	1.08	-.55 [-.91, -.27]	-.63 [-1.11, -.16]
T4	35	3.90	1.26	-.78 [-1.21, -.53]	35	3.06	0.91	-.77 [-1.04, -.46]	-.75 [-1.23, -.27]
T5	30	3.02	1.18	-1.64 [-2.08, -1.42]	30	2.98	1.19	-.74 [-1.18, -.48]	-.04 [-.54, .46]
T6	41	3.61	1.01	-1.18 [-1.45, -.87,]	40	3.28	1.16	-.48 [-.85, -.21]	-.31 [-.74, .13]
Fidelity
T2	47	4.44	0.92		36	3.88	1.34		-.50 [-.94, -.06]
T3	34	4.49	1.06	.05 [-.26, .36]	34	4.01	0.84	.12 [-.25, .51]	-.50 [-.97, -.02]
T4	35	4.22	1.09	-.22 [-.53, .09]	36	3.69	1.02	-.16 [-.59, .21]	-.50 [-.97, -.03]
T5	30	4.01	1.30	-.39 [-.79, .07]	29	3.53	1.17	-.28 [-.79, .09]	-.38 [-.89, .13]
T6	41	4.68	0.77	.28 [-.01, .49]	39	4.15	1.01	.23 [-.12, .66]	-.59 [-1.03, -.14]
Cooperation
T2	46	3.99	1.32		35	3.23	1.47		-.54 [-.98, -.09]
T3	34	4.07	1.21	.06 [-.32, .48]	36	3.06	1.26	-.12 [-.63, .29]	-.81 [-1.29, -.33]
T4	35	3.55	1.29	-.34 [-.85, -.03]	36	3.15	1.49	-.05 [-.58, .42]	-.28 [-.74, .18]
T5	30	3.38	1.69	-.40 [-1.10, -.12]	29	3.11	1.43	-.08 [-.63, .39]	-.17 [-.67, .34]
T6	41	3.84	1.45	-.11 [-.57, .27]	39	3.54	1.45	.21 [-.17, .79]	-.20 [-.64, .23]
Perceived learning outcome
T2	39	3.71	1.06		32	3.36	1.12		-.32 [-.78, .15]
T3	34	4.17	0.70	.52 [.16, .76]	33	3.70	0.88	.34 [-.01, .69]	-.58 [-1.07, -.10]
T4	34	3.86	1.09	.14 [-.20, .50]	34	3.39	0.97	.03 [-.33, .39]	-.45 [-.93, .03]
T5	30	3.83	1.15	.11 [-.26, .50]	28	3.32	1.17	-.03 [-.46, .38]	-.43 [-.94, .09]
T6	40	4.17	1.11	.42 [.12, .80]	39	3.65	1.00	.27 [-.07, .65]	-.49 [-.93, -.05]
Sustainability
T6	40	5.00	0.91		39	4.65	0.97		-.37 [-.81, .07]

Note. CE = curriculum-embedded assessment, PI = planned-for-interaction assessment. g_av refers to effect sizes for repeated measurement comparing means of each measurement point with the baseline within one group; g_groups refers to the comparison between CE and PI for each measurement point.

Figure 3

Ratings for the Implementation Variables in CE and PI for All Points of Measurement

Hierarchical linear regression analyses were conducted for each implementation variable. Complete model results are provided in Table 3 (acceptability and feasibility; T1–T6) and Table 4 (fidelity, cooperation, and perceived learning outcome; T2–T6).

Table 3.

Model Results for Acceptability and Feasibility.

	Acceptability				Feasibility
	β	SE	t	p	β	SE	t	p
Baseline (T1)	.826*	.12	7.03	<.001	1.090*	.11	9.59	<.001
T2	-.520*	.10	-5.41	<.001	-.350*	.11	-3.18	.001
T3	-.359*	.10	-3.73	<.001	-.800*	.11	-7.27	<.001
T4	-.593*	.10	-6.16	<.001	-.887*	.11	-8.06	<.001
T5	-.619*	.10	-6.44	<.001	-1.534*	.11	-13.94	<.001
T6	-.420*	.10	-4.36	<.001	-1.063*	.11	-9.66	<.001
group (T1)	-.757*	.16	-4.63	<.001	-.770*	.16	-4.87	<.001
group*T2	.204	.13	1.53	.064	-.198	.15	-1.30	.098
group*T3	-.092	.13	-.69	.246	.035	.15	.23	.410
group*T4	-.042	.13	-.32	.377	.190	.15	1.24	.108
group*T5	-.231	.13	-1.73	.042	.504*	.15	3.30	<.001
group*T6	-.024	.13	-.18	.428	.402*	.15	2.63	.004

Note. One-tailed testing was applied; Baseline (T1) = γ₀₀; T2–T6 = γ₁₀ – γ₅₀; group (T1) = γ₀₁; group*T2–group*T6 = γ₁₁ – γ₅₁.

p < .01 (adjustment according to Bonferroni, see method section).

Table 4.

Model Results for Fidelity, Cooperation, and Perceived Learning Outcome.

	Fidelity				Cooperation				Perceived learning outcome
	β	SE	t	p	β	SE	t	p	β	SE	t	p
Baseline (T2)	.636*	.12	5.21	<.001	.434*	.13	3.37	<.001	.211	.13	1.62	.054
T3	-.280^†	.11	-2.61	.005	-.071	.10	-.69	.245	.148	.10	1.49	.069
T4	-.670^†	.11	-6.26	<.001	-.338^†	.10	-3.30	<.001	-.188	.10	-1.89	.030
T5	-.918^†	.11	-8.58	<.001	-.598^†	.10	-5.86	<.001	-.256^†	.10	-2.57	.005
T6	-.130	.11	-1.21	.114	-.125	.10	-1.22	.112	.258^†	.10	2.59	.005
group (T2)	-.517*	.17	-3.06	.001	-.500*	.18	-2.80	.003	-.418	.18	-2.31	.011
group*T3	.072	.15	.49	.314	-.194	.14	-1.37	.086	.074	.14	.54	.296
group*T4	.030	.15	.20	.421	.071	.14	.50	.308	-.011	.14	-.08	.469
group*T5	.112	.15	.75	.226	.443^†	.14	3.13	.001	.058	.14	.42	.336
group*T6	.103	.15	.70	.243	.186	.14	1.31	.095	.003	.14	.02	.492

Note. One-tailed testing was applied; Baseline (T2) = γ₀₀; T3–T6 = γ₁₀ – γ₄₀; group (T2) = γ₀₁; group*T3–group*T6 = γ₁₁–γ₄₁;

p < .01. ^†p < .0125 (adjustment according to Bonferroni, see method section).

Regarding the initial implementation phase (H₁), effects were significantly higher for CE compared with PI at baseline for all implementation variables (acceptability: β = −.757, p <.001; feasibility: β = −.770, p < .001; fidelity: β = −.517, p = .001; cooperation: β = −.500, p = .003), except for perceived learning outcome (β = −.418, p = .011). This advantage of CE was also stable for the second measurement point. While at the second measurement point both groups showed a lower rating on all variables (except for perceived learning outcome), these declines did not significantly differ between the groups.

Effects in both groups were then examined for the later measurement points (H₂). The development patterns were, in part, specific to the different implementation variables, but they also indicated a general trend: For all variables, significant decreases were found in the CE group for most of the measurement points. These decreases were most pronounced at T4 and T5, and less so at T6. For perceived learning outcome, however, we found a significant increase at T6 (β = .258, p = .005).

Comparing the development of CE and PI, only a partially consistent pattern emerged. For feasibility, fidelity, cooperation and perceived learning outcome, decreases tended to be lower for PI than CE. However, these effects were only significant for feasibility at T5 (β = .504, p < .001) and T6 (β = .402, p = .004) and cooperation at T5 (β = .443, p = .001). For some measurement points, we found an even greater (but nonsignificant) decrease for PI (feasibility: T2, cooperation: T3, perceived learning outcome: T4). Differences for perceived learning outcome were almost negligible. For acceptability, the decrease was slightly higher for PI than CE, but no significant effects were found.

For sustainability (H₃), mean values were higher for CE (M = 5.00) than PI (M = 4.65). However, the Wilcoxon rank sum test revealed no significant differences between the approaches (z = 1.69, p = .091, r = .16).

As the results not only represent a comprehensive evaluation of the concept but also relate to the specific content and material, we extended our analyses by averaging effect sizes for group comparisons for the first two measurement points and then also for the last two measurement points. In doing so, we aimed to get a better understanding of how teachers’ evaluations developed as variations caused by the specific topics were compensated for. In general, results suggest constant effects in favor of CE across time for acceptability (T1/T2: g = −.75, T3: g = −.92, T4: g = −.74, T5/T6: g = −.80), fidelity (T2/T3: g = −.50, T4: g = −.50, T5/T6: g = −.49), and perceived learning outcome (T2/T3: g = −.45, T4: g = −.45, T5/T6: g = −.46). In contrast, differences between CE and PI decreased for feasibility (T1/T2: g = −.98, T3: g = −.63, T4: g = −.75, T5/T6: g = −.18) and cooperation (T2/T3: g = −.68, T4: g = −.28, T5/T6: g = −.19).

Discussion

In this study, we aimed to investigate how FA can best be made accessible to teachers. Specifically, we compared the effects of two differently structured FA approaches (CE vs. PI) at the teacher level over the course of one school year. When implementing the approaches in their math classes, teachers were supported with ongoing PD, which was parallel between conditions but differed somewhat because of the approaches’ differences in structure. Throughout the school year, teachers rated implementation success at various measurement points. In line with Hypothesis 1, effects on all outcome variables (acceptability, feasibility, fidelity, cooperation), except for perceived learning outcome, were higher for CE in the initial phase of implementation. Hypothesis 2 was only partly confirmed: While teachers eventually rated the implementation variables for PI slightly better (compared with initial ratings) on a descriptive level, these changes were small, rarely significant and seemed to reflect a smaller decrease in ratings on the outcome variables. Here, when compensating for some situational instabilities, some outcome-specific trends emerged, representing rather constant differences between PI and CE for acceptability, fidelity, and perceived learning outcome but decreasing differences for feasibility and cooperation. Sustainability (measured at T6) did not significantly differ between the groups, thereby confirming Hypothesis 3.

Despite the partly outcome-specific development patterns, our results indicate some general trends: (a) in a head-to-head comparison, the more structured approach showed advantages, especially at the beginning of the implementation process, and (b) the teachers’ perceptions of the two FA approaches only weakly converged. We end by carefully discussing the slightly negative trend for both approaches in view of our study design.

First, regarding short-term effects, our results clearly highlight the benefits of highly structured FA approaches when implementing them into school practice. For the initial phase of implementation, the tighter CE guidelines led to better outcomes in various domains, and, thus, teachers rated this approach as having better acceptance and feasibility and affected actual teacher behavior (fidelity), and cooperation among colleagues. Our results, thereby, reflect and extend previous evidence on how to design effective FA approaches (Andersson & Palm, 2018; Hebbecker et al., 2022; Hebbecker & Souvignier, 2018; Hondrich et al., 2016). The provided CE-based assessments seem to have helped teachers act and feel confident in the classroom, thereby fulfilling the goal that concepts and teaching material are linked to everyday practice and ready to use (Heitink et al., 2016). In contrast, the PI-based approach involves more abstraction and may, therefore, require even more practical guidance to ensure that teachers perceive this approach positively and use it with high levels of fidelity. Practical teaching considerations might also be pivotal here. While PI-based assessments require individual or small group settings, CE seems better applied to lessons involving the whole class and, thus, is less time-consuming (State et al., 2017).

Regarding long-term effects, the hypothesized responses we expected teachers to have in their perceptions of PI and CE were only partially verified, and they tend to refer to smaller decreases in the outcome variables for PI rather than better PI ratings over time. However, as studies have shown that more flexible approaches do have some specific advantages (Goertzen et al., 2023; Schellenbach-Zell & Gräsel, 2010; Schrader et al., 2020) and because teachers in our study also wished to sustainably implement both approaches, it is worth examining the interplay between the approach’s specific design and implementation conditions. As such, and probably most important, is that one school year might be too short for teachers to comprehensively evaluate the success of implementation. Research on PD has yielded evidence that implementation processes may take years to achieve the desired effects (Desimone, 2002). When it comes to comprehensive frameworks such as FA, where implementation requires substantial changes in teaching activities and attitudes toward learning, this might be particularly true. In this context, Marshall and Drummond (2006) also spoke of adopting the “spirit of FA” which involves considering FA as an integral part of school practice rather than simply embedding a set of assessments (DeLuca et al., 2019; Gotwals & Cisterna, 2022). However, meeting this expectation within one school year may be difficult. Specifically, internalizing the concept can only occur once teachers translate the concept into concrete action and gradually gain confidence with applying it in the classroom. Here, it is possible that by embedding assessment tasks, teachers in CE had their first positive experiences with FA at an early stage of implementation, resulting in early reflection processes, whereas these changes may have taken longer to manifest in the PI approach. In other words, applying a specific instrument (here: providing students with the written task sets for assessment) may be easier than establishing instructional activities that strongly depend on the teacher’s knowledge and the quality of implementation, as is needed for PI-based questioning and discussion practices (Desimone & Garet, 2015). Similarly, teachers in the PI group may not have had sufficient resources or were not sufficiently aware of the effort required to adapt the approach to their specific circumstances. Thus, future PD programs could explicitly discuss ways to adapt the FA approach to local conditions (e.g., students’ needs) and integrate it into one’s own routine (Penuel et al., 2007). Overall though, for the PI approach’s flexibility to pay off, a longer implementation period might be needed. Thus, the positive development of feasibility and cooperation that we found for the PI approach at the end of the implementation period may indicate the start of a process in which these benefits gradually become evident.

Regardless of the extent of structuring, both approaches tended to show a negative trend for almost all outcome variables. Although these changes occurred on an acceptable to high level, these results highlight the challenge of sustainably disseminating innovations in the school context (Slavin et al., 2021) and suggest that continuous support is needed so that teachers not only perceive an innovation to be acceptable and feasible at the outset, but also that they gain positive practical experiences in class which then reinforce their positive attitudes (Clarke & Hollingsworth, 2002). Nonetheless, characteristics of the study’s design should also be acknowledged when interpreting the results.

As such, although we implemented a comprehensive concept, the material provided and the topics covered in PD were inherently content specific, which may have affected teachers’ perceptions at specific points of measurement. For instance, when participating in PD, teachers highly value when the provided material matches the topics they will teach in class, and they perceive any lack of congruence (for instance, as induced by mandatory curriculum pacing) as a barrier to implementation (Buczynski & Hansen, 2010; Desimone & Garet, 2015). Thus, some of the content in this study simply may not have met teachers’ expectations and needs, thus leading to less positive attitudes at respective points of measurement. As such, occasional deviations in the outcome variables’ development (for instance, the relatively low ratings at T5) may occur without severe consequences for the entire implementation process. Furthermore, for acceptability and feasibility, ratings at T1 represent an anticipatory evaluation before implementing the concept in the classroom and may, therefore, represent a very high standard of comparison. When teachers start to become familiar with a plausible and evidence-based concept, they may show initially positive attitudes. Yet, applying this concept in the “real world” and translating the idea into concrete teaching activities may entail some difficulties, resulting in somewhat lower ratings for the subsequent measurement points.

Limitations and Implications for Future Research

This study’s strength lies in its high ecological validity, as we were able to longitudinally evaluate the intervention’s implementation success under the real conditions of school practice. However, investigating effects in the field comes along with challenges. For instance, we had to deal with high rates of missing data mainly caused by illness, other school events or dropout. Although latent modeling allows for estimating effects for incomplete data, effects should be interpreted with caution given the large standard deviations. Furthermore, our study design only allowed for assessing implementation success via teachers’ self-reports, which can, of course, be biased due to social desirability, or retrospective recall, among others. While the assessment of teachers’ perceptions (e.g., acceptability, feasibility) inherently relies on self-reports, especially for assessing implementation fidelity, however, more objective assessment strategies would be desirable.

When drawing conclusions on the design of effective FA approaches, some limitations must be considered. For example, when teachers evaluate an innovation’s implementation success, they are likely to simultaneously evaluate the concept and the circumstances under which it is implemented, such that these evaluations may interfere. Thus, it may be fruitful to examine the relationships between outcome variables and other context factors, that is, collective self-efficacy. Given the varying conditions facing schools as well as the specific characteristics of our study (e.g., COVID-19 pandemic), generalizing the effects to other settings appears limited, although our sample size can be considered substantial compared with other longitudinal intervention studies. Furthermore, in our study, we focused on teachers’ overall assessment of two differently structured FA approaches, omitting possible interactions between the teachers’ individual prerequisites (i.e., teaching experience, attitudes, knowledge) and the innovation offered. Yet, the more expertise the teacher has, the more they may value the autonomy of the PI approach (Desimone, 2002). Thereby, a worthwhile pathway for future research and practice may be to study differential effects and, if applicable, offer “differentiated PD” (e.g., Antoniou & Kyriakides, 2013; Desimone & Garet, 2015).

The present study was designed to provide a direct comparison of approaches with varying degrees of structure. It seems possible—if not certain—that ultimately a combination of highly structured and more flexible elements may be key to the long term, sustainable implementation of FA. The design of such a combination (e.g., whether different components should be introduced sequentially) presents a valuable perspective for future research. While the current findings suggest that approaches with a higher degree of formalization should be introduced first, followed by increased flexibility, further testing of this assumption is still needed.

Practical Implications

Successful PD should help teachers become familiar with the theoretical principles of FA, translate the concept into instructional action and, ultimately, facilitate long-term implementation. The results of our study suggest that when designing PD, it might be worthwhile to initially implement FA approaches with a higher level of formalization. When directly compared, the highly structured concept was more positively perceived by teachers, thus facilitating instructional changes. For teachers to change their instructional behavior and use FA in practice, providing them with explicit guidelines and ready-to-use material seems pivotal. Offering intensive and continuous PD with ongoing reflection processes can further help teachers gain positive practical experiences. For an FA approach with more flexibility to pay off, an even longer implementation period may be needed. Our results suggest that teacher PD should account for that fact that FA entails a long, multi-stage process in which teachers first need an explicit structure to initiate changes in classroom behavior and then gradually become confident to flexibly adapt the concept to their specific conditions.

Supplemental Material

sj-docx-1-jte-10.1177_00224871251350680 – Supplemental material for Implementing Formative Assessment Into School Practice: A Matter of Structuring the Intervention?

Supplemental material, sj-docx-1-jte-10.1177_00224871251350680 for Implementing Formative Assessment Into School Practice: A Matter of Structuring the Intervention? by Larissa Aust, Jeanne-Celine Linker, Luise Eichholz, Jana Schiffer, Marcus Nührenbörger, Christoph Selter and Elmar Souvignier in Journal of Teacher Education

Footnotes

Appendix

Similarities and Differences Between the Two Groups.

	CE	PI
Similarities
PD sessions	Six PD sessions with a total of 24 contact hours over the course of one school year PD sessions were held by different tandems of trained mathematics educationalists Three online and three face-to-face sessions Two identical trainings for each group
PD content	Arithmetic in first and second grade (number range of up to 20/100) -Understanding numbers -Understanding operations (addition, subtraction, multiplication, division) -Process-related competencies Formative assessment -Providing assessment-based, differentiated instruction -Adapting tasks -Promoting student communication -Providing feedback
PD design features	Case-based input, practice and reflection phases Preparation of and reflection on the practical tasks’ implementation
Support suggestions	Support suggestions for all assessed competence domains For class, small group and individual settings Example: The number space is reduced. The student is asked to sort (a selection of) number cards according to size.
Differences
Assessment strategy	Systematic assessment of all students’ learning level using written task sets for assessment at specific critical junctures in knowledge acquisition	Deliberate, interactive assessment of a/some students’ learning level using focused conversations for assessment
Target group	Identical provision of task sets for all students at the same time	Teacher selects student(s) and tasks for the focused conversations
Teachers’ tasks	Selecting, adapting and using prepared written task sets for assessment Interpreting written documents Providing adaptive instruction	Planning focused conversations and anticipating student responses Observing students while completing the tasks, giving specific prompts, interpreting student answers Providing adaptive instruction
Provided assessment material	Prepared task sets for assessment, evaluation table, competency grid -Assessment of different competence domains (e.g., recognizing numbers on the number line, identifying numbers in a structured way) within one task set for assessment -Example: Fill in the missing numbers in the boxes on the number line.	Set of cards for assessment -One card per competence domain (e.g., recognizing numbers on the number line) -Example: List of most important content-related competencies: I can read numbers from the number line. / I can name predecessor and successor. Basic diagnostic task: Fill in the missing numbers in the boxes on the number line. Possible observations: When counting, does the student always count from the beginning? Prompts for a more differentiated assessment: Which number comes before 7?

Note. Examples from the material are written in italics.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the German Federal Ministry of Education and Research [grant numbers 01NV2102A and 01NV2102B]. Funding source had no other involvement.

Note to Open Science

The study’s analysis plan was preregistered in the Open Science Framework (https://osf.io/pzfd7). The data, reproducible codes and the supplemental material are available via the following link:

ORCID iDs

Larissa Aust

Jeanne-Celine Linker

Luise Eichholz

Jana Schiffer

Marcus Nührenbörger

Christoph Selter

Elmar Souvignier

Supplemental Material

Supplemental material for this article is available online.

Notes

Author Biographies

Larissa Aust is a research associate at the Institute of Psychology in Education at the University of Münster. Her research interests include formative assessment, teacher professionalization, and the implementation of evidence-based concepts into school practice.

Jeanne-Celine Linker is a research associate at the Institute for Development and Research in Mathematics Education at TU Dortmund University. After completing her training as a primary school teacher, she has been working on research projects on mathematics teaching and the professional development of teachers since 2021.

Luise Eichholz is a university lecturer in mathematics education specializing in primary education at the Institute for Development and Research in Mathematics Education at TU Dortmund University. Before this role, she spent 18 years working as a primary school teacher in Germany. Her research focuses on the professional development of in-service mathematics teachers at the primary school level.

Jana Schiffer is a research associate at the Institute for Basic and Inclusive Mathematics Education at the University of Münster. After completing her training as a primary school teacher, she has been working on research projects on inclusive mathematics teaching and the professional development of teachers since 2021.

Marcus Nührenbörger is professor for mathematics education specializing in primary and inclusive education at the Institute for Basic and Inclusive Mathematics Education at the University of Münster. His research interests lie in subject-related development and transfer research on mathematical teaching and learning processes, with a focus on fostering potential as well as collaboration.

Christoph Selter is professor for mathematics education specializing in primary education at the Institute for Development and Research in Mathematics Education at TU Dortmund University. His research interests lie in subject-related developmental research, research on professionalization of teachers and transfer research.

Elmar Souvignier is professor at the Institute of Psychology in Education at the University of Münster. His research interests include assessment and instruction in reading and mathematics, data-based decision-making, teacher professionalization, and implementation.

References

Andersson

Palm

(2018). Reasons for teachers’ successful development of a formative assessment practice through professional development: A motivation perspective. Assessment in Education: Principles, Policy & Practice, 25(6), 576–597. https://doi.org/10.1080/0969594X.2018.1430685

Antoniou

Kyriakides

(2013). A dynamic integrated approach to teacher professional development: Impact and sustainability of the effects on improving teacher behaviour and student outcomes. Teaching and Teacher Education, 29, 1–12. https://doi.org/10.1016/j.tate.2012.08.001

Baffsky

Ivers

Cullen

Batterham

P. J.

Toumbourou

Calear

A. L.

Werner-Seidler

McGillivray

Torok

(2022). A cluster randomised effectiveness-implementation trial of an intervention to increase the adoption of PAX Good Behaviour Game, a mental health prevention program, in Australian primary schools: Study protocol. Contemporary Clinical Trials Communications, 28, 100923. https://doi.org/10.1016/j.conctc.2022.100923

Bates

Mächler

Bolker

Walker

(2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01

Bennett

R. E.

(2011). Formative assessment: A critical review. Assessment in Education: Principles, Policy & Practice, 18(1), 5–25. https://doi.org/10.1080/0969594X.2010.513678

Boström

Palm

(2023). The effect of a formative assessment practice on student achievement in mathematics. Frontiers in Education, 8, Article 1101192. https://doi.org/10.3389/feduc.2023.1101192

Buczynski

Hansen

C. B.

(2010). Impact of professional development on teacher practice: Uncovering connections. Teaching and Teacher Education, 26(3), 599–607. https://doi.org/10.1016/j.tate.2009.09.006

Clarke

Hollingsworth

(2002). Elaborating a model of teacher professional growth. Teaching and Teacher Education, 18(8), 947–967. https://doi.org/10.1016/S0742-051X(02)00053-7

Cohen

(1988). Statistical power analysis for the behavioral sciences (2nd ed.). Erlbaum. https://doi.org/10.4324/9780203771587

10.

Crawford

Carpenter

D. M.

II Wilson

M. T.

Schmeister

McDonald

(2012). Testing the relation between fidelity of implementation and student outcomes in math. Assessment for Effective Intervention, 37(4), 224–235. https://doi.org/10.1177/1534508411436111

11.

DeLuca

Chapman-Chin

Klinger

D. A.

(2019). Toward a teacher professional learning continuum in assessment for learning. Educational Assessment, 24(4), 267–285. https://doi.org/10.1080/10627197.2019.1670056

12.

Desimone

L. M.

(2002). How can comprehensive school reform models be successfully implemented? Review of Educational Research, 72(3), 433–497. https://doi.org/10.3102/00346543072003433

13.

Desimone

L. M.

Garet

M. S.

(2015). Best practices in teachers’ professional development in the United States. Psychology, Society, & Education, 7(3), 252–263.

14.

Duffy

G. G.

(1993). Teachers’ progress toward becoming expert strategy teachers. The Elementary School Journal, 94(2), 109–120. https://doi.org/10.1086/461754

15.

Duncan

R. G.

Rivet

A. E.

(2018). Learning progressions. In Fischer

Hmelo-Silver

C. E.

Goldman

S. R.

Reimann

(Eds.), International handbook of the learning sciences (pp. 422–432). Routledge.

16.

Ehlert

Hebbecker

Souvignier

(2022). Implementing evidence-based practices in reading classrooms – Longitudinal interrelations between teaching-related attitudes, innovation-specific perceptions, and perceived implementation success. Studies in Educational Evaluation, 75, 101203. https://doi.org/10.1016/j.stueduc.2022.101203

17.

Furtak

E. M.

(2012). Linking a learning progression for natural selection to teachers’ enactment of formative assessment. Journal of Research in Science Teaching, 49(9), 1181–1210. https://doi.org/10.1002/tea.21054

18.

Goertzen

Schils

Heeneman

(2023). Co-designing formative assessment practices: A collaboration between elementary school teachers and researchers to conceptualize and implement formative assessment as a unified practice. Teaching and Teacher Education, 134, 104306. https://doi.org/10.1016/j.tate.2023.104306

19.

Gorsuch

(1983). Factor analysis (2nd ed.). Lawrence Erlbaum Associates.

20.

Gotwals

A. W.

Cisterna

(2022). Formative assessment practice progressions for teacher preparation: A framework and illustrative case. Teaching and Teacher Education, 110, 103601. https://doi.org/10.1016/j.tate.2021.103601

21.

Hattie

(2023). Visible learning: The sequel: A synthesis of over 2,100 meta-analyses relating to achievement. Routledge. https://doi.org/10.4324/9781003380542

22.

Hebbecker

Meudt

S.‑I.

Schütze

Souvignier

(2022). Der Transfer von formativem Assessment in den Leseunterricht der Grundschule – die vermittelnde Rolle von Akzeptanz, Machbarkeit und Kooperation [Transfer of formative assessment in primary school reading lessons – the mediating role of acceptability, feasibility and cooperation]. Zeitschrift für Pädagogische Psychologie, 36(4), 248–262. https://doi.org/10.1024/1010-0652/a000279

23.

Hebbecker

Souvignier

(2018). Formatives Assessment im Leseunterricht der Grundschule – Implementation und Wirksamkeit eines modularen, materialgestützten Konzepts [Formative Assessment in third-grade reading instruction—Implementation and effectiveness of a modular concept including prepared teaching-material]. Zeitschrift für Erziehungswissenschaft, 21, 735–765. https://doi.org//10.1007/s11618-018-0834-y

24.

Heitink

M. C.

van der Kleij

F. M.

Veldkamp

B. P.

Schildkamp

Kippers

W. B.

(2016). A systematic review of prerequisites for implementing assessment for learning in classroom practice. Educational Research Review, 17, 50–62. https://doi.org/10.1016/j.edurev.2015.12.002

25.

Heritage

(2007). Formative Assessment: What do teachers need to know and do? Phi Delta Kappan, 89(2), 140–145. https://doi.org/10.1177/003172170708900210

26.

Hondrich

A. L.

Hertel

Adl-Amini

Klieme

(2016). Implementing curriculum-embedded formative assessment in primary school science classrooms. Assessment in Education: Principles, Policy & Practice, 23(3), 353–376. https://doi.org/10.1080/0969594X.2015.1049113

27.

Klute

Apthorp

Harlacher

Reale

(2017). Formative assessment and elementary school student academic achievement: A review of the evidence (REL 2017-259). Regional Educational Laboratory Central. https://files.eric.ed.gov/fulltext/ED572929.pdf

28.

Lakens

(2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, Article 863. https://doi.org/10.3389/fpsyg.2013.00863

29.

Lakin

J. M.

Shannon

D. M.

(2015). The role of treatment acceptability, effectiveness, and understanding in treatment fidelity: Predicting implementation variation in a middle school science program. Studies in Educational Evaluation, 47, 28–37. https://doi.org/10.1016/j.stueduc.2015.06.002

30.

Lane

Parrila

Bower

Bull

Cavanagh

Forbes

Jones

Leaper

Khosronejad

Pellicano

Powell

Ryan

Skrebneva

(2019). Literature review: Formative assessment evidence and practice. AITSL.

31.

Lee

Chung

H. Q.

Zhang

Abedi

Warschauer

(2020). The effectiveness and features of formative assessment in US K-12 education: A systematic review. Applied Measurement in Education, 33(2), 124–140. https://doi.org/10.1080/08957347.2020.1732383

32.

Marshall

Drummond

M. J.

(2006). How teachers engage with Assessment for Learning: Lessons from the classroom. Research Papers in Education, 21(2), 133–149. https://doi.org/10.1080/02671520600615638

33.

Meudt

S.‑I.

Zeuch

Neuber

Souvignier

(2020). Kurzskalen zur Erfassung des Transfererfolgs von Leseförderkonzepten in der Schulpraxis [Short scales for assessing the transfer success of reading promotion concepts in school practice]. In Mackowiak

Beckerle

Gentrup

Titz

(Eds.), Forschungsinstrumente im Kontext institutioneller (schrift-)sprachlicher Bildung (pp. 141–157). Verlag Julius Klinkhardt. https://doi.org/10.35468/5801_08

34.

Nguyen

Pietsch

Gümüş

(2021). Collective teacher innovativeness in 48 countries: Effects of teacher autonomy, collaborative culture, and professional learning. Teaching and Teacher Education, 106, 103463. https://doi.org/10.1016/j.tate.2021.103463

35.

Pas

E. T.

Johnson

S. R.

Debnam

K. J.

Hulleman

C. S.

Bradshaw

C. P.

(2019). Examining the relative utility of PBIS implementation fidelity scores in relation to student outcomes. Remedial and Special Education, 40(1), 6–15. https://doi.org/10.1177/0741932518805192

36.

Penuel

W. R.

Fishman

B. J.

Yamaguchi

Gallagher

L. P.

(2007). What makes professional development effective? Strategies that foster curriculum implementation. American Educational Research Journal, 44(4), 921–958. https://doi.org/10.3102/0002831207308221

37.

Petermann

(2014). Implementationsforschung: Grundbegriffe und Konzepte [Implementation research: Basic terms and concepts]. Psychologische Rundschau, 65(3), 122–128. https://doi.org/10.1026/0033-3042/a000214

38.

Prenger

Tappel

A. P. M.

Poortman

C. L.

Schildkamp

(2022). How can educational innovations become sustainable? A review of the empirical literature. Frontiers in Education, 7, Article 970715. https://doi.org/10.3389/feduc.2022.970715

39.

Proctor

Silmere

Raghavan

Hovmand

Aarons

Bunger

Griffey

Hensley

(2011). Outcomes for implementation research: Conceptual distinctions, measurement challenges, and research agenda. Administration and Policy in Mental Health, 38(2), 65–76. https://doi.org/10.1007/s10488-010-0319-7

40.

Renko

Knittle

Palsola

Lintunen

Hankonen

(2020). Acceptability, reach and implementation of a training to enhance teachers’ skills in physical activity promotion. BMC Public Health, 20(1), Article 1568. https://doi.org/10.1186/s12889-020-09653-x

41.

Saunders

Lewis

Thornhill

(2009). Research methods for business students (5th ed.). Pearson Education.

42.

Schellenbach-Zell

Gräsel

(2010). Teacher motivation for participating in school innovations–supporting factors. Journal for Educational Research Online, 2(2), 34–54. https://doi.org/10.25656/01:4574

43.

Schrader

Hasselhorn

Hetfleisch

Goeze

(2020). Stichwortbeitrag Implementationsforschung: Wie Wissenschaft zu Verbesserungen im Bildungssystem beitragen kann [Implementation research: How science can contribute to improvements in the education system]. Zeitschrift für Erziehungswissenschaft, 23, 9–59. https://doi.org/10.1007/s11618-020-00927-z

44.

Shavelson

R. J.

(2006). On the integration of formative assessment in teaching and learning: Implications for new pathways in teacher education. In Oser

F. K.

Achtenhagen

Renold

(Eds.), Competence-oriented teacher training: Old research demands and new pathways (pp. 63–78). Sense Publishers. https://doi.org/10.1163/9789087903374_006

45.

Shavelson

R. J.

Young

D. B.

Ayala

C. C.

Brandon

P. R.

Furtak

E. M.

Ruiz-Primo

M. A.

Tomita

M. K.

Yin

(2008). On the impact of curriculum-embedded formative assessment on learning: A collaboration between curriculum and assessment developers. Applied Measurement in Education, 21(4), 295–314. https://doi.org/10.1080/08957340802347647

46.

Slavin

R. E.

Cheung

A. C. K.

Zhuang

(2021). How could evidence-based reform advance education? ECNU Review of Education, 4(1), 7–24. https://doi.org/10.1177/2096531120976060

47.

State

T. M.

Harrison

J. R.

Kern

Lewis

T. J.

(2017). Feasibility and acceptability of classroom-based interventions for students with emotional/behavioral challenges at the high school level. Journal of Positive Behavior Interventions, 19(1), 26–36. https://doi.org/10.1177/1098300716648459

48.

van der Kleij

F. M.

Vermeulen

J. A.

Schildkamp

Eggen

T. J. H. M

. (2015). Integrating data-based decision making, Assessment for Learning and diagnostic testing in formative assessment. Assessment in Education: Principles, Policy & Practice, 22(3), 324–343. https://doi.org/10.1080/0969594X.2014.999024

49.

Wiliam

Thompson

(2008). Integrating assessment with learning: What will it take to make it work? In Dwyer

C. A.

(Ed.), The future of assessment. Shaping teaching and learning (pp. 53–84). Lawrence Erlbaum Associates. https://doi.org/10.4324/9781315086545-3

50.

Yin

Shavelson

R. J.

Ayala

C. C.

Ruiz-Primo

M. A.

Brandon

P. R.

Furtak

E. M.

Tomita

M. K.

Young

D. B.

(2008). On the impact of formative assessment on student motivation, achievement, and conceptual change. Applied Measurement in Education, 21(4), 335–359. https://doi.org/10.1080/08957340802347845

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.05 MB